Reading Notes (10): Scraping Company Information from Qichacha with Python and Saving It in Excel Format

Date: 2019-11-12

Today's small scraper is one I wrote at a friend's request. Its goal is to scrape company information from the Qichacha website.

The most important step in programming is setting up the environment. Here our environment is:

Python 3.6
BeautifulSoup module
lxml module
requests module
xlwt module
Geany

First, analyze the page we need:

http://www.qichacha.com/search?key=婚慶

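We can see that the information we want to extract is the company name, legal representative, registered capital, founding date, phone number, email and address. Next, open Firebug and look at where each item sits in the page: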

We can see that these items are located at:

#Company name----------<a class="ma_h1" href="/firm_8c640ea3b396783ab4e013ea5f7f295e.html" target="_blank">
#               昆明嘉馨
#               <em>
#               有限公司
#               </a>
#Legal representative--<p class="m-t-xs">
#                 法定代表人：
#                 <a class="a-blue" href="********">鄢顯莉</a>
#Registered capital--       <span class="m-l">註冊資本：100萬</span>
#Founding date-------       <span class="m-l">成立時間：2002-05-20</span>
#            </p>
#            <p class="m-t-xs">
#Phone---------------       電話：13888677871
#Company email-------       <span class="m-l">郵箱：-</span>
#            </p>
#Company address-----  <p class="m-t-xs"> 地址：昆明市南屏街88號世紀廣場B2幢12樓A+F號 </p>
#            <p></p>
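
To make the idea of "locating a field by its tag and class" concrete, here is a minimal sketch. It deliberately uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The SAMPLE string is a simplified, hypothetical stand-in for one search result (not the live page), and the names ClassTextCollector and collect are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for one search result (not the live page).
SAMPLE = (
    '<a class="ma_h1" href="#" target="_blank">昆明嘉馨<em></em>有限公司</a>'
    '<p class="m-t-xs">法定代表人：<a class="a-blue" href="#">鄢顯莉</a>'
    '<span class="m-l">註冊資本：100萬</span>'
    '<span class="m-l">成立時間：2002-05-20</span></p>'
)


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> element that carries a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.buffer = []    # text pieces of the element being read
        self.results = []   # finished texts, in document order

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
        elif tag == self.tag:
            classes = (dict(attrs).get('class') or '').split()
            if self.css_class in classes:
                self.depth = 1
                self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def collect(html, tag, css_class):
    """Return the texts of all <tag class="css_class"> elements in html."""
    parser = ClassTextCollector(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results


company_names = collect(SAMPLE, 'a', 'ma_h1')
details = collect(SAMPLE, 'span', 'm-l')
print(company_names)
print(details)
```

The company name comes from the a tag with class ma_h1, while registered capital and founding date come from the span tags with class m-l, which matches the positions listed above.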

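But a big problem stands in our way. After you click the search button, Qichacha does show part of the results, but the first thing you run into is a login page. Without logging in, what the scraper actually receives is a page containing only the first five companies plus a login window.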

If we do not deal with this login page, then unfortunately the scraping ends right here.

So we have to solve this problem. First, register an account on Qichacha (registration steps omitted). The login can generally be handled by:

Constructing request headers and configuring cookies
Using selenium
Submitting the username, password and so on with requests.post

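selenium drives a real browser to visit the page, but it is slow, has to wait for the page to finish loading, and is error-prone, so I dropped it right away.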

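The requests.post approach might work, but I did not look into it closely, because the Qichacha login involves three inputs: a phone number, a password, and a slider. The slider would probably require constructing a True value or something similar.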

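So the first thing to try is constructing request headers and configuring a cookie. Here I should point out a mistake I made: I wrote User-Agent as User_agent, so my request headers were wrong and what I got back was a page blocked by the firewall, as follows: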

# -*- coding: utf-8 -*-
import requests
import lxml  # not called directly; BeautifulSoup's 'lxml' parser needs it installed
from bs4 import BeautifulSoup

def craw(url):
    # Identify ourselves as a desktop Firefox browser
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    if response.status_code != 200:
        # Report the failure, but still parse the body so the error page can be inspected
        print(response.status_code)
        print('ERROR')

    soup = BeautifulSoup(response.text, 'lxml')
    print(soup)

if __name__ == '__main__':
    url = r'http://www.qichacha.com/search?key=%E5%A9%9A%E5%BA%86'
    craw(url)

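The code only prints soup, which makes debugging easier. The request returned status 405, and the page that came back (an Aliyun firewall page saying the access was blocked because the requested URL might pose a security threat to the site) is as follows: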

<!DOCTYPE html>
<html lang="zh-cn">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="a3c0e" name="data-spm"/>
<title>405</title>
<style>
            html, body, div, a, h2, p { margin: 0; padding: 0; font-family: 微軟雅黑; }
            a { text-decoration: none; color: #3b6ea3;  }
            .container { width: 1000px; margin: auto; color: #696969; }


            .header { padding: 50px 0; }
            .header .message { height: 36px; padding-left: 120px; background: url(https://errors.aliyun.com/images/TB1TpamHpXXXXaJXXXXeB7nYVXX-104-162.png) no-repeat 0 -128px; line-height: 36px; }

            .main { padding: 50px 0; background: #f4f5f7; }
            .main img { position: relative; left: 120px; }

            .footer { margin-top: 30px; text-align: right; }
            .footer a { padding: 8px 30px; border-radius: 10px; border: 1px solid #4babec; }
            .footer a:hover { opacity: .8; }

            .alert-shadow { display: none; position: absolute; top: 0; left: 0; width: 100%; height: 100%; background: #999; opacity: .5; }
            .alert { display: none; position: absolute; top: 200px; left: 50%; width: 600px; margin-left: -300px; padding-bottom: 25px; border: 1px solid #ddd; box-shadow: 0 2px 2px 1px rgba(0, 0, 0, .1); background: #fff; font-size: 14px; color: #696969; }
            .alert h2 {  margin: 0 2px; padding: 10px 15px 5px 15px; font-size: 14px; font-weight: normal; border-bottom: 1px solid #ddd; }
            .alert a { display: block; position: absolute; right: 10px; top: 8px; width: 30px; height: 20px; text-align: center; }
            .alert p {  padding: 20px 15px; }
        </style>
</head>
<body data-spm="7663354">
<div data-spm="1998410538">
<div class="header">
<div class="container">
<div class="message">很抱歉，因爲您訪問的URL有可能對網站形成安全威脅，您的訪問被阻斷。</div>
</div>
</div>
<div class="main">
<div class="container">
<img src="https://errors.aliyun.com/images/TB15QGaHpXXXXXOaXXXXia39XXX-660-117.png"/>
</div>
</div>
<div class="footer">
<div class="container">
<a data-spm-click="gostr=/waf.123.123;locaid=d001;" href="javascript:;" id="report" target="_blank">誤報反饋</a>
</div>
</div>
</div>
<div class="alert-shadow" id="alertShadow"></div>
<div class="alert" id="alertContainer">
<h2>提示：<a href="javascript:;" id="closeAlert" title="關閉">X</a></h2>
<p>感謝您的反饋，應用防火牆會盡快進行分析和確認。</p>
</div>
<script>
             function show() {

                var g = function(ele) { return document.getElementById(ele); };
                var reportHandle = g('report');
                var alertShadow = g('alertShadow');
                var alertContainer = g('alertContainer');
                var closeAlert = g('closeAlert');

                var own = {};

                own.report = function() {
                    // SPM
                    own.alert();
                };

                own.alert = function() {
                    alertShadow.style.display = 'block';
                    alertContainer.style.display = 'block';
                };

                own.close = function() {
                    alertShadow.style.display = 'none';
                    alertContainer.style.display = 'none';
                };

            };
        </script>
<script charset="utf-8" src="https://errors.aliyun.com/error.js?s=3" type="text/javascript"></script>
</body>
</html>

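This error also shows how important the request headers are: the server generally uses them to make a rough judgment about whether you are an attacker, a crawler, or a normal visitor. So I simply copied the browser's entire set of request headers.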
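
A small sketch of why the User_agent typo matters. It uses the standard library's urllib.request.Request instead of requests, purely because building a Request object sends nothing over the network; the variable names are illustrative. The point is that a misspelled header name is simply a different header, so the real User-Agent is never set and the HTTP client falls back to its own default (for requests, a value of the form python-requests/<version>).

```python
from urllib.request import Request

UA = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
URL = 'http://www.qichacha.com/'

# Building a Request object does not send anything over the network.
good = Request(URL, headers={'User-Agent': UA})
bad = Request(URL, headers={'User_agent': UA})  # underscore instead of hyphen

# urllib stores header names in capitalized form, i.e. 'User-agent'.
good_has_ua = good.has_header('User-agent')
bad_has_ua = bad.has_header('User-agent')
print(good_has_ua, bad_has_ua)
```

Only the first request carries a User-Agent header of our choosing; the second one carries an unrelated header named User_agent.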

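One more thing to note: the browser you use must have cookies enabled, and it must not clear cookies, automatically or manually, when it closes. Otherwise you will still get only the first five companies, and the rest will be the login prompt.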
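
As a side note on handling the copied cookie: the script in this post pastes the raw string into the Cookie header, but the string can also be split into name/value pairs. The sketch below does this with the standard library's http.cookies.SimpleCookie. The cookie string is a made-up placeholder, not a real Qichacha cookie; only the UM_distinctid name is taken from this post.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie string; a real one is copied from the browser's request headers.
raw_cookie = 'UM_distinctid=abc123; sessionid=xyz789'

jar = SimpleCookie()
jar.load(raw_cookie)

# Turn the parsed morsels into a plain dict of name -> value.
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)
```

A dict like this can be passed to requests.get through its cookies parameter instead of being written into the headers by hand.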

Here is the code, which I will then go through piece by piece:

# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import xlwt


def craw(url):
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
    headers = {
'Host':'www.qichacha.com',
'User-Agent':r'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0',
'Accept':'*/*',
'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding':'gzip, deflate',
'Referer':'http://www.qichacha.com/',
'Cookie':r'UM_distinctid***************',
'Connection':'keep-alive',
'If-Modified-Since':'Wed, 30 **********',
'If-None-Match':'"59*******"',
'Cache-Control':'max-age=0',

}
    response = requests.get(url,headers = headers)
    response.encoding = 'utf-8'
    if response.status_code != 200:
        # Report the status code when the request did not succeed
        print(response.status_code)
        print('ERROR')
    soup = BeautifulSoup(response.text,'lxml')
    #print(soup)
    com_names = soup.find_all(class_='ma_h1')
    #print(com_names)
    #com_name1 = com_names[1].get_text()
    #print(com_name1)
    peo_names = soup.find_all(class_='a-blue')
    #print(peo_names)
    peo_phones = soup.find_all(class_='m-t-xs')
    #tags = peo_phones[4].find(text = True).strip()
    #print(tags)
    #tttt = peo_phones[0].contents[5].get_text()
    #print (tttt)
    #else_comtent = peo_phones[0].find(class_='m-l')
    #print(else_comtent)
    global com_name_list
    global peo_name_list
    global peo_phone_list
    global com_place_list
    global zhuceziben_list
    global chenglishijian_list
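    print('Scraping data, please do not open the Excel file')
    for i in range(0,len(com_names)):
        # Each company has three <p class="m-t-xs"> blocks in a row:
        # index 3*i holds the legal representative, registered capital and founding date,
        # index 3*i+1 the phone and email, index 3*i+2 the address
        n = 1+3*i
        m = i+2*(i+1)
        peo_phone = peo_phones[n].find(text = True).strip()  # phone
        com_place = peo_phones[m].find(text = True).strip()  # address
        zhuceziben = peo_phones[3*i].find(class_='m-l').get_text()  # registered capital
        chenglishijian = peo_phones[3*i].contents[5].get_text()  # founding date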
        peo_phone_list.append(peo_phone)
        com_place_list.append(com_place)   
        zhuceziben_list.append(zhuceziben)
        chenglishijian_list.append(chenglishijian)
    for com_name,peo_name in zip(com_names,peo_names):
        com_name = com_name.get_text()
        peo_name = peo_name.get_text()
        com_name_list.append(com_name)
        peo_name_list.append(peo_name)
        
    

        
if __name__ == '__main__':
    com_name_list = []
    peo_name_list = []
    peo_phone_list = []
    com_place_list = []
    zhuceziben_list = []
    chenglishijian_list = []
    key_word = input('Enter the keyword to search for: ')
    print('Searching, please wait')
    for x in range(1,11):
        url = r'http://www.qichacha.com/search?key={}#p:{}&'.format(key_word,x)
        s1 = craw(url)
    workbook = xlwt.Workbook()
    # Create the sheet object (a new sheet)
    sheet1 = workbook.add_sheet('xlwt', cell_overwrite_ok=True)
    # --- Set the Excel style ---
    # Initialize the style
    style = xlwt.XFStyle()
    # Create the font style
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True  # bold
    # Apply the font to the style
    style.font = font
    # Write data using the style
    # sheet.write(0, 1, "xxxxx", style)
    print('Saving data, please do not open the Excel file')
    # Write data into the sheet
    name_list = ['Company name','Legal representative','Phone','Registered capital','Founding date','Company address']
    for cc in range(0,len(name_list)):
        sheet1.write(0,cc,name_list[cc],style)
    for i in range(0,len(com_name_list)):
        sheet1.write(i+1,0,com_name_list[i],style)  # company name
        sheet1.write(i+1,1,peo_name_list[i],style)  # legal representative
        sheet1.write(i+1,2,peo_phone_list[i],style)  # phone
        sheet1.write(i+1,3,zhuceziben_list[i],style)  # registered capital
        sheet1.write(i+1,4,chenglishijian_list[i],style)  # founding date
        sheet1.write(i+1,5,com_place_list[i],style)  # company address
    # Save the Excel file; an existing file with the same name is overwritten
    workbook.save(r'F:\work\2017_08_02\xlwt.xls')
    print('the excel save success')

First we import the four modules requests, BeautifulSoup, lxml and xlwt.

# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import xlwt

A brief description of the four modules:

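requests is a third-party module whose source code is hosted on GitHub. It is more user-friendly than urllib/httplib and is generally the recommended choice. requests supports multiple request methods.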

import requests
r1 = requests.get(r'http://www.baidu.com')
postdata = {'key':'value'}
r2 = requests.post(r'http://www.xxx.com/login',data=postdata)
r3 = requests.put(r'http://www.xxx.com/put',data={'key':'value'})
r4 = requests.delete(r'http://www.xxx.com/delete')
r5 = requests.head(r'http://www.xxx.com/get')
r6 = requests.options(r'http://www.xxx.com/get')

One more thing worth explaining is the response encoding:

import requests
r = requests.get(r'http://www.baidu.com')
print(r.content)  # the body as bytes
print(r.text)  # the body as text
print(r.encoding)  # the page encoding guessed from the HTTP headers; can be changed by direct assignment
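
The difference between bytes and text can be shown without any network access. The sketch below uses a plain bytes object as a stand-in for r.content and decodes it twice; the sample string is arbitrary. As far as I know, requests falls back to ISO-8859-1 for text responses whose headers carry no charset, which is why assigning r.encoding = 'utf-8' can be necessary for Chinese pages.

```python
original = '婚庆公司'

body = original.encode('utf-8')          # bytes, comparable to r.content
as_utf8 = body.decode('utf-8')           # decoded with the right encoding
as_latin1 = body.decode('iso-8859-1')    # decoded with the wrong encoding: garbled

print(len(body))             # each of these characters takes 3 bytes in UTF-8
print(as_utf8 == original)
print(as_latin1 == original)

# The bytes themselves are intact; re-decoding them correctly recovers the text.
recovered = as_latin1.encode('iso-8859-1').decode('utf-8')
```

Decoding with the wrong encoding does not lose data, it only misreads it, which is why changing the encoding and reading the text again is enough to fix garbled output.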

I will add more about requests later when I get the chance.

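BeautifulSoup is a Python library for extracting data from HTML or XML files. It converts the HTML into a document tree, and since it is a tree it has the notion of nodes, which makes its search and extraction features convenient in a scraper. These features are generally used in two ways: 1. methods such as find and find_all; 2. CSS selectors.
The lxml module is a library for querying and processing HTML/XML documents using XPath. It only traverses part of the document, so it is faster and uses less memory.
The xlwt module writes Excel files, but it can only generate a new Excel file: if a file with that name already exists at the path, it is simply overwritten, and the module does not support reading. If you need to read files, use xlrd; for writing, you can use the xlutils module together with xlrd. For details I recommend the blog post 《Python excel讀寫》 (Python Excel reading and writing).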

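What follows is simple: define a function, construct the request headers, and fetch the page with requests. If the response status code is not 200, print the status code and 'ERROR'. Parse the page with BeautifulSoup and lxml, pick out the information we need, declare six global list variables, and add the scraped data to them with the list method append.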

def craw(url):
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
    headers = {
'Host':'www.qichacha.com',
'User-Agent':r'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0',
'Accept':'*/*',
'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding':'gzip, deflate',
'Referer':'http://www.qichacha.com/',
'Cookie':r'UM_di**********1',
'Connection':'keep-alive',
'If-Modified-Since':'Wed, *********',
'If-None-Match':'"59****"',
'Cache-Control':'max-age=0',

}
    response = requests.get(url,headers = headers)
    response.encoding = 'utf-8'
    if response.status_code != 200:
        # Report the status code when the request did not succeed
        print(response.status_code)
        print('ERROR')
    soup = BeautifulSoup(response.text,'lxml')
    #print(soup)
    com_names = soup.find_all(class_='ma_h1')
    #print(com_names)
    #com_name1 = com_names[1].get_text()
    #print(com_name1)
    peo_names = soup.find_all(class_='a-blue')
    #print(peo_names)
    peo_phones = soup.find_all(class_='m-t-xs')
    #tags = peo_phones[4].find(text = True).strip()
    #print(tags)
    #tttt = peo_phones[0].contents[5].get_text()
    #print (tttt)
    #else_comtent = peo_phones[0].find(class_='m-l')
    #print(else_comtent)
    global com_name_list
    global peo_name_list
    global peo_phone_list
    global com_place_list
    global zhuceziben_list
    global chenglishijian_list
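    print('Scraping data, please do not open the Excel file')
    for i in range(0,len(com_names)):
        # Each company has three <p class="m-t-xs"> blocks in a row:
        # index 3*i holds the legal representative, registered capital and founding date,
        # index 3*i+1 the phone and email, index 3*i+2 the address
        n = 1+3*i
        m = i+2*(i+1)
        peo_phone = peo_phones[n].find(text = True).strip()  # phone
        com_place = peo_phones[m].find(text = True).strip()  # address
        zhuceziben = peo_phones[3*i].find(class_='m-l').get_text()  # registered capital
        chenglishijian = peo_phones[3*i].contents[5].get_text()  # founding date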
        peo_phone_list.append(peo_phone)
        com_place_list.append(com_place)   
        zhuceziben_list.append(zhuceziben)
        chenglishijian_list.append(chenglishijian)
    for com_name,peo_name in zip(com_names,peo_names):
        com_name = com_name.get_text()
        peo_name = peo_name.get_text()
        com_name_list.append(com_name)
        peo_name_list.append(peo_name)
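
The index arithmetic in the first loop (3*i, 1+3*i and i+2*(i+1)) relies on every company contributing exactly three m-t-xs blocks in a row. A minimal sketch of the same idea, using plain strings in place of the parsed tags; the function name split_company_blocks and the placeholder values are invented for this illustration.

```python
def split_company_blocks(blocks, per_company=3):
    """Yield one tuple of consecutive blocks per company from a flat list."""
    usable = len(blocks) - len(blocks) % per_company  # ignore an incomplete tail
    for start in range(0, usable, per_company):
        yield tuple(blocks[start:start + per_company])


# Placeholder strings standing in for the <p class="m-t-xs"> tags of two companies.
blocks = ['info-0', 'contact-0', 'address-0',
          'info-1', 'contact-1', 'address-1']
triples = list(split_company_blocks(blocks))

# The tuples line up with the positions used by the original arithmetic.
for i, (info, contact, address) in enumerate(triples):
    assert info == blocks[3 * i]
    assert contact == blocks[1 + 3 * i]        # n = 1+3*i
    assert address == blocks[i + 2 * (i + 1)]  # m = i+2*(i+1), which equals 3*i+2

print(triples)
```

Grouping the blocks first makes the assumption explicit: if the page ever returns a different number of blocks per company, the grouping is the one place that needs to change.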

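By calling craw repeatedly we keep adding data to the lists. Because non-member users of Qichacha can only view ten pages of results, we only need to repeat this ten times. One thing to note about range(): it normally starts from 0, but the first page of the site is page 1 (compare the site's URLs, especially those of the first and second pages, and this is easy to see). So to loop ten times we start from 1; 10 is the last value and 11 is the stop value, which means it has to be written as range(1,11). After that we create the sheet object, define the sheet's style, write the data into Excel in a for loop, and finally save the file to a path with the save() method.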

if __name__ == '__main__':
    com_name_list = []
    peo_name_list = []
    peo_phone_list = []
    com_place_list = []
    zhuceziben_list = []
    chenglishijian_list = []
    key_word = input('Enter the keyword to search for: ')
    print('Searching, please wait')
    for x in range(1,11):
        url = r'http://www.qichacha.com/search?key={}#p:{}&'.format(key_word,x)
        s1 = craw(url)
    workbook = xlwt.Workbook()
    # Create the sheet object (a new sheet)
    sheet1 = workbook.add_sheet('xlwt', cell_overwrite_ok=True)
    # --- Set the Excel style ---
    # Initialize the style
    style = xlwt.XFStyle()
    # Create the font style
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True  # bold
    # Apply the font to the style
    style.font = font
    # Write data using the style
    # sheet.write(0, 1, "xxxxx", style)
    print('Saving data, please do not open the Excel file')
    # Write data into the sheet
    name_list = ['Company name','Legal representative','Phone','Registered capital','Founding date','Company address']
    for cc in range(0,len(name_list)):
        sheet1.write(0,cc,name_list[cc],style)
    for i in range(0,len(com_name_list)):
        sheet1.write(i+1,0,com_name_list[i],style)  # company name
        sheet1.write(i+1,1,peo_name_list[i],style)  # legal representative
        sheet1.write(i+1,2,peo_phone_list[i],style)  # phone
        sheet1.write(i+1,3,zhuceziben_list[i],style)  # registered capital
        sheet1.write(i+1,4,chenglishijian_list[i],style)  # founding date
        sheet1.write(i+1,5,com_place_list[i],style)  # company address
    # Save the Excel file; an existing file with the same name is overwritten
    workbook.save(r'F:\work\2017_08_02\xlwt.xls')
    print('the excel save success')
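
To look more closely at the ten URLs the loop produces, here is a small sketch that builds them with the same format string and inspects them with the standard library's urllib.parse. The keyword is just an example. One thing it makes visible: in this URL format the page number sits after the #, which urlsplit reports as the fragment, and fragments are normally handled by the browser rather than sent in the HTTP request, so it may be worth confirming in the browser's network panel which request actually loads the next page.

```python
from urllib.parse import quote, urlsplit

key_word = '婚庆'  # example keyword

# range(1, 11) yields the page numbers 1 through 10.
urls = ['http://www.qichacha.com/search?key={}#p:{}&'.format(quote(key_word), x)
        for x in range(1, 11)]

first, last = urlsplit(urls[0]), urlsplit(urls[-1])
print(len(urls))                        # number of URLs built
print(first.query)                      # the query string
print(first.fragment, last.fragment)    # the part after '#'
```

The query string is identical for all ten URLs; only the fragment changes from page to page.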

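That is basically the end of the code, and the scraping works reasonably well. Earlier I had only a half-finished version that handled a single page and did not implement the full feature set; later I added pagination and completed the storage part.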