A First Look at Python Web Scraping: Using selenium + beautifulsoup4 + chromedriver to Scrape Pages That Require Login

Date: 2019-11-14


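Goal

An auto-reply bot I built earlier needs information from an internal web page in order to answer certain questions, but the page has no corresponding query API. I therefore decided to have a script imitate a browser, visit the site, scrape the content, and return it to the user. This post walks through the pitfalls I ran into on my first attempt at a Python scraper.

Preparation

The requests module sends HTTP requests to the site, and the BeautifulSoup module extracts the data we want from static HTML text. Dynamically loaded pages need something more advanced: a webdriver that simulates a real browser visit and lets us parse the rendered content.

I recommend the Anaconda distribution for scientific computing, mainly because it ships with its own package manager, which can resolve some package installation errors.

Install requests (bundled with Anaconda), selenium, and beautifulsoup4 as follows: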

pip install selenium
conda install beautifulsoup4
conda install lxml

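If you are on Python 3.5, running pip install beautifulsoup4 directly reported an error for me (which is why I recommend Anaconda); the original post links to an installation tutorial.

You may also need to install lxml, a parser that BeautifulSoup can use to parse HTML before extracting content.

If lxml is not installed, BeautifulSoup can fall back to Python's built-in parser. lxml is used here because it is fast.

Reference: Python爬蟲小白入門（三）BeautifulSoup庫

https://www.cnblogs.com/Albert-Lee/p/6232745.html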
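
The parser choice described above can be sketched roughly as follows. This is only a minimal sketch, assuming you prefer lxml when it is installed and Python's built-in html.parser otherwise; make_soup is a hypothetical helper name that does not appear in the original script.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when it is available, else with the built-in parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(html, 'html.parser')


soup = make_soup('<p class="greeting">hello</p>')
print(soup.find('p').text)
```

Naming the parser explicitly keeps the behaviour the same across machines, since different parsers can build slightly different trees from malformed HTML.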

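Many older posts pair the webdriver as selenium + PhantomJS, but Selenium no longer supports PhantomJS (using it produced a syntax error for me, and it took a long time before I found out why). You have to use the driver for Chrome or Firefox instead. This post uses chromedriver; the Firefox driver works as well. Next, installing chromedriver.

Download the chromedriver build that matches your local Chrome browser from http://chromedriver.storage.googleapis.com/index.html. The Chrome version can be looked up via the top-right corner of the browser window.

For the mapping between driver versions and Chrome versions, see the post chromedriver與chrome各版本及下載地址.

Unzip the downloaded chromedriver into the Chrome installation directory (right-click the Chrome shortcut and open Properties to find it), add that directory to the Path environment variable, and refresh the Path in cmd so the change takes effect. Instructions for these steps are easy to find with a web search.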
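
A quick way to confirm that the Path change took effect is to ask Python where it would find the driver. This is a small sketch using only the standard library; find_driver is a hypothetical helper name, and the executable name chromedriver is an assumption about how the unzipped file is named.

```python
import shutil


def find_driver(name='chromedriver'):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print('chromedriver is not on PATH yet')
else:
    print('chromedriver found at', path)
```

If this prints that the driver is not on PATH, open a new terminal (or restart the editor) and check the environment variable again before debugging selenium itself.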

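Getting started

Based on the user's input, the program looks up the matching product URL suffix on a landing page and appends it after ?product= in the query string of the URL. The resulting URL is the one visited afterwards.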
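
The URL construction described above can also be done with the standard library so that special characters in the product name are percent-encoded. This is a sketch under assumptions: build_product_url is a hypothetical helper, and https://www.example.com/ stands in for the real (internal) site.

```python
from urllib.parse import urlencode


def build_product_url(base_url, product):
    """Append ?product=<value> to base_url, encoding the value for a query string."""
    return base_url + '?' + urlencode({'product': product})


print(build_product_url('https://www.example.com/', 'abc'))
print(build_product_url('https://www.example.com/', 'a b'))
```

Plain string concatenation, as used in the script, is fine as long as the option values contain only URL-safe characters.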

Imported modules

First, the modules the script imports:

import requests  # send requests for static URLs
from bs4 import BeautifulSoup  # HTML parsing
import re  # regular-expression matching
from selenium import webdriver  # the imports below are needed by webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time  # dynamically loaded pages need a delay before reading data

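Matching search

This function performs the lookup. If the user's input ('a') is contained in one of the items of the search list (['bat', 'link', 'you']), for example 'a' is contained in 'bat', the function returns that item ('bat'). If nothing matches, it returns the string 'None'. The matching is done with a compiled re pattern.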

def branchFinder(userInput, collection):
    regex = re.compile(userInput)     # Compiles a regex.
    for item in collection:
        match = regex.search(item)  # Checks if the current item matches the regex.
        if match:
            return item
    return 'None'
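
Because the user's input is compiled directly as a regular expression, characters such as + or ( are interpreted as regex syntax rather than literal text. The variant below is a sketch of one way to avoid that, assuming literal, case-insensitive matching is what is wanted; find_branch is a hypothetical name, and unlike the function above it returns the None object instead of the string 'None'.

```python
import re


def find_branch(user_input, collection):
    """Return the first item containing user_input literally, ignoring case."""
    pattern = re.compile(re.escape(user_input), re.IGNORECASE)
    for item in collection:
        if pattern.search(item):
            return item
    return None  # no item contains the input


print(find_branch('a', ['bat', 'link', 'you']))      # bat
print(find_branch('c++', ['python', 'C++ build']))   # C++ build
print(find_branch('zzz', ['bat', 'link', 'you']))    # None
```

Without re.escape, the input 'c++' would raise re.error because the second + has nothing to repeat.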

For fuzzy search, see http://www.mamicode.com/info-detail-1601799.html

Fetching and parsing the static page

requests.get fetches the static page and BeautifulSoup processes it.

    url1 = 'https://www.xxxx.com/?&product='

    r = requests.get(url1, cookies=cookies, headers=headers)
    # with open('main.html', 'wb+') as f:
    #     f.write(r.content)

    soup = BeautifulSoup(r.content, 'lxml')  # create the BeautifulSoup object
    findResult = soup.find_all('option')  # find every option tag on the page
    optionList = []  # value attributes of all options
    for i in range(1, len(findResult)):  # the first option is the default selected one and is empty, so it is skipped
        optionList.append(findResult[i]['value'])
    # print(optionList)
    # optionList now holds the values from the main page

    # look up the branch option that matches the keyword and build the new URL
    branch = branchFinder(userInput, optionList)
    if branch == 'None':
        return 'Not Found. Please check your input.'  # the return works because all of this code actually lives inside one function
    print(branch + '\n')
    url2 = url1 + branch  # the new URL to visit (reuses branch instead of searching a second time)

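Here headers carries the browser information sent with the request, and cookies carries the login information, which takes care of access rights when the page requires login. For example: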

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

cookies = {'cookie': 'has_js=1; xxxxx=yyyyy'}

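To find these values, open the url1 link in a browser, log in, and press F12 to open the developer tools. Select the first entry in the Name column and copy the cookie and user-agent values from its request headers into the example above.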
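
The cookie string copied from the browser has the form name1=value1; name2=value2. requests also accepts cookies as a dict with one entry per cookie, and the webdriver step later in this post needs one name/value pair per cookie, so splitting the string once can be handy. The following is a sketch under assumptions: both helper names are hypothetical, and the parsing is deliberately simple (no quoting or attribute handling).

```python
def parse_cookie_header(cookie_header):
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_header.split(';'):
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def to_selenium_cookies(cookies):
    """Convert a name -> value dict into the dicts driver.add_cookie() accepts."""
    return [{'name': name, 'value': value} for name, value in cookies.items()]


cookie_dict = parse_cookie_header('has_js=1; xxxxx=yyyyy')
print(cookie_dict)                       # {'has_js': '1', 'xxxxx': 'yyyyy'}
print(to_selenium_cookies(cookie_dict))
```

The resulting dict can be passed as cookies=cookie_dict to requests.get, and each entry of the list can be passed to driver.add_cookie.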

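For BeautifulSoup's functions for retrieving and processing page content, see Python爬蟲小白入門（三）BeautifulSoup庫. That completes the static scraping attempt. When I tried the same approach on url2, however, BeautifulSoup never returned the table data I wanted: the <table> tag contained only a <thead> and no <tbody>. Inspecting the page that requests had fetched confirmed there was no tbody at all. Opening url2 in the browser with F12 showed that the table uses datatable. Having worked on a project that used datatable before, I suspected that the tbody data is loaded dynamically, so a static request cannot see it and the page has to be loaded dynamically.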
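
The diagnosis above (a table with a thead but no tbody in the statically fetched HTML) can be turned into a small check. This is a sketch, assuming the built-in html.parser is good enough for the page; table_has_rows and the two sample HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def table_has_rows(html):
    """Return True if the first <table> in html has at least one <tbody> row."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return False
    tbody = table.find('tbody')
    return tbody is not None and tbody.find('tr') is not None


# What a static request might return: header only, rows filled in later by JavaScript.
static_html = '<table><thead><tr><th>ID</th></tr></thead></table>'
# What the browser shows after the script has populated the table.
rendered_html = ('<table><thead><tr><th>ID</th></tr></thead>'
                 '<tbody><tr><td>123</td></tr></tbody></table>')
print(table_has_rows(static_html))    # False
print(table_has_rows(rendered_html))  # True
```

When the check returns False for r.content but the table is populated in the browser, the rows are most likely added by JavaScript after the page loads.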

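Handling the dynamically loaded page

After the failed attempt with selenium + PhantomJS, I switched to selenium + headless Chrome, which I only learned about from the message printed when I ran the PhantomJS code.

The driver makes page operations very convenient: besides locating elements, it can also simulate actions such as click. For details, see WebDriver--定位元素的8種方式, 【Selenium2+Python】常用操作, and webdriver（python）學習筆記一.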

    chrome_options = Options()
    chrome_options.add_argument('user-agent=' + headers['User-Agent'])  # set the browser User-Agent; the original went through PhantomJS capabilities (phantomjs.page.settings.userAgent), which does not set Chrome's User-Agent
    chrome_options.add_argument('--no-sandbox')  # works around the "DevToolsActivePort file doesn't exist" error
    chrome_options.add_argument('--window-size=1920,3000')  # browser window size (width,height)
    chrome_options.add_argument('--disable-gpu')  # the Google docs mention this flag as a workaround for a bug; without it Chrome reports that the GPU failed to start
    chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
    chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
    chrome_options.add_argument('--headless')  # no visible browser window. On Linux without a display, startup fails without this flag; on Windows, omitting it opens the browser GUI, no handle is returned, and the code after it never runs
    chrome_options.binary_location = r"C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"  # path to the browser binary

    # driver = webdriver.Chrome(options=chrome_options)
    # driver.get('https://www.baidu.com')
    # print('hao123' in driver.page_source)
    driver = webdriver.Chrome(options=chrome_options)  # start Chrome with the options above
    driver.get(url2)  # open the URL
    # add the cookie; the format differs from before: the earlier cookie string was xxxxx=yyyyy, here name is the part before "=" (xxxxx) and value is yyyyy
    driver.add_cookie({'name': 'xxxxx', 'value': 'yyyyy'})
    driver.refresh()  # reload so the login takes effect
    driver.implicitly_wait(1)  # element lookups now wait up to 1s for the element to appear; raise this if 1s is not enough (see the polling loop below)
    time.sleep(0.1)
    # the Product summary page is now shown
    print('Product summary.\n')
    # click the latest Build link
    # find_element_by_* is the Selenium 3 API used in this post; Selenium 4 uses find_element(By.TAG_NAME, ...)
    driver.find_element_by_tag_name("tbody").find_element_by_tag_name("a").click()  # lookups can be chained to find a tag inside another tag
    # now on the Build Viewer page
    print('Build Viewer.\n')
    # click Tests
    driver.find_element_by_id('1').click()  # find by id and click
    print(driver.find_element_by_class_name('table-responsive').text)  # expected to be empty because the data has not loaded yet
    result = ''
    for i in range(20):  # poll for up to about 20s for the table data; the lookups take time too, so the total can be longer
        time.sleep(1)  # pause between polls; implicitly_wait only sets the lookup timeout and does not pause
        result = driver.find_element_by_class_name('table-responsive').text
        if result == '':  # keep polling until the text is non-empty
            print('Waiting ' + str(i + 1) + 's...')
        else:
            break

    driver.quit()  # close the browser

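At first my driver.implicitly_wait(1) wait was very short and I still got the page content, but only because I was stepping through with breakpoints in the debugger, so the page actually had far longer to load than the value I had set. Once I left debug mode and ran the script directly, even 5s was sometimes not enough to get the data. Discovering this pitfall was quite a shock. Fortunately, a polling loop can be used to tell when the data has finished loading.

Before I enabled headless mode, launching from cmd could open the Chrome GUI, but launching from vscode could not. It turned out that administrator privileges were required, so running vscode as administrator solved it.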
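
The polling idea used to wait for the table data can be written as a small reusable helper. This is a generic sketch that needs no browser: wait_until is a hypothetical name, and the lambda stands in for reading the table text from the driver. Selenium also ships its own explicit-wait helper, WebDriverWait, which serves the same purpose.

```python
import time


def wait_until(get_value, timeout=20.0, interval=1.0):
    """Call get_value() until it returns a non-empty result or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = get_value()
        if value:
            return value
        if time.monotonic() >= deadline:
            return ''  # gave up: nothing loaded within the timeout
        time.sleep(interval)


# Stand-in for reading the 'table-responsive' element's text from the driver:
# it returns '' twice, then the table text.
responses = iter(['', '', 'ID Job\n123 aaa'])
print(wait_until(lambda: next(responses), timeout=5, interval=0.01))
```

The helper sleeps between attempts, so the timeout is measured in real time regardless of how fast each lookup returns.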

Output

aiodnwebg


DevTools listening on ws://127.0.0.1:12133/devtools/browser/xxxx
Product summary.

Build Viewer.


Waiting 1s...
Waiting 2s...
Waiting 3s...
ID Job
123 aaa
245 bbb

Full code

import requests
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

###################User Input##########################
userInput = '=====your input======'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

cookies = {'cookie': 'has_js=1; xxxxx=yyyyy'}
###################User Input##########################


def branchFinder(userInput, collection):
    regex = re.compile(userInput)     # Compiles a regex.
    for item in collection:
        match = regex.search(item)  # Checks if the current item matches the regex.
        if match:
            return item
    return 'None'


def getResult(userInput):
    url1 = 'https://www.xxx.com/?&product='

    r = requests.get(url1, cookies=cookies, headers=headers)
    # with open('main.html', 'wb+') as f:
    #     f.write(r.content)

    soup = BeautifulSoup(r.content, 'lxml')  # create the BeautifulSoup object
    findResult = soup.find_all('option')  # find the option tags
    optionList = []
    for i in range(1, len(findResult)):  # skip the first (empty, default) option
        optionList.append(findResult[i]['value'])
    # print(optionList)
    # optionList now holds the values from the main page

    # look up the branch that matches the keyword and build the new URL
    branch = branchFinder(userInput, optionList)
    if branch == 'None':
        return 'Not Found. Please check your input.'
    print(branch + '\n')
    url2 = url1 + branch

    chrome_options = Options()
    chrome_options.add_argument('user-agent=' + headers['User-Agent'])  # set the browser User-Agent (the PhantomJS capability in the original does not set it for Chrome)
    chrome_options.add_argument('--no-sandbox')  # works around the "DevToolsActivePort file doesn't exist" error
    chrome_options.add_argument('--window-size=1920,3000')  # browser window size (width,height)
    chrome_options.add_argument('--disable-gpu')  # the Google docs mention this flag as a workaround for a bug
    chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
    chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
    chrome_options.add_argument('--headless')  # no visible browser window; on Linux without a display, startup fails without this flag
    chrome_options.binary_location = r"C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"  # path to the browser binary

    # driver = webdriver.Chrome(options=chrome_options)
    # driver.get('https://www.baidu.com')
    # print('hao123' in driver.page_source)
    driver = webdriver.Chrome(options=chrome_options)  # start Chrome with the options above
    driver.get(url2)

    driver.add_cookie({'name': 'xxxxx', 'value': 'yyyyy'})
    driver.refresh()  # reload so the login takes effect
    driver.implicitly_wait(1)  # element lookups wait up to 1s for the element to appear
    time.sleep(0.1)
    # the Product summary page is now shown
    print('Product summary.\n')
    # click the latest Build link
    # find_element_by_* is the Selenium 3 API used in this post; Selenium 4 uses find_element(By.TAG_NAME, ...)
    driver.find_element_by_tag_name("tbody").find_element_by_tag_name("a").click()
    # now on the Build Viewer page
    print('Build Viewer.\n')
    # click Tests
    driver.find_element_by_id('1').click()
    print(driver.find_element_by_class_name('table-responsive').text)
    result = ''
    for i in range(20):  # poll for up to about 20s for the table data
        time.sleep(1)  # pause between polls; implicitly_wait only sets the lookup timeout and does not pause
        result = driver.find_element_by_class_name('table-responsive').text
        if result == '':
            print('Waiting ' + str(i + 1) + 's...')
        else:
            break
    driver.quit()
    return result


finalResult = getResult(userInput)
print(finalResult)