Python Crawler's First Love: selenium

Date: 2019-11-16


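selenium is a web application testing tool that can genuinely simulate a person operating a browser.
Scraping data with it is intuitive and flexible. Unlike a traditional crawler,
it really does open a browser, fill in forms, click buttons, simulate a login, and collect the data. You do not have to reverse-engineer asynchronous requests, because the page is rendered by a real browser: what you see is what you get.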

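On the language side, selenium supports Java and Python, among others; on the browser side, it supports the mainstream browsers such as Chrome, Firefox, and IE. I use the combination of Python 3.6 + chrome.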

chrome

Before writing the Python crawler, you need to prepare two things:

1. chrome, the browser:              https://www.google.cn/chrome/
2. chromedriver, the browser driver: http://chromedriver.storage.googleapis.com/index.html

The browser and driver versions must match quite strictly: different browser versions need different driver versions. My versions:

chrome info: chrome=66.0.3359.139
 Driver info: chromedriver=2.37.544315

Other version mappings:

chromedriver version	Chrome version
v2.37	v64-66
v2.36	v63-65
v2.34	v61-63
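The table above can also be written as a small lookup, which is handy when checking a machine's setup. This is only a sketch: `COMPAT` and `drivers_for` are hypothetical names, and the ranges are copied from the three rows shown here, not from an official source.

```python
# Rows copied from the table in this article:
# (chromedriver version, lowest Chrome major version, highest Chrome major version)
COMPAT = [
    ("2.37", 64, 66),
    ("2.36", 63, 65),
    ("2.34", 61, 63),
]


def drivers_for(chrome_version):
    """Return the chromedriver versions from COMPAT that cover this Chrome version."""
    major = int(chrome_version.split(".")[0])  # "66.0.3359.139" -> 66
    return [driver for driver, low, high in COMPAT if low <= major <= high]
```

For example, `drivers_for("66.0.3359.139")` looks up the Chrome version I listed above; a version outside the table simply yields an empty list.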

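The chrome browser
Note that if you need to switch to a matching version of Chrome: to move to a higher version, simply upgrade in place; to move to a lower version, first uninstall the current one completely (completely!), including the Google updater, registry entries, and leftover files, and only then install. Otherwise the crawler will not be able to launch the browser.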

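The chromedriver browser driver
Where you put chromedriver also matters. The most convenient option is to place chromedriver next to the .py file we are about to write. You can put it elsewhere, but then you have to configure the path to chromedriver; I won't cover that approach here.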
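For readers who do keep chromedriver somewhere else, here is a minimal sketch of locating it. `find_chromedriver` is a hypothetical helper, not part of selenium; it assumes the binary is named `chromedriver` (or `chromedriver.exe` on Windows).

```python
import os
import shutil


def find_chromedriver(script_dir):
    """Look for chromedriver next to the script first, then on the PATH.

    Returns the path as a string, or None if nothing was found.
    """
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = os.path.join(script_dir, name)
        if os.path.isfile(candidate):
            return candidate
    # shutil.which searches the directories listed in the PATH environment variable
    return shutil.which("chromedriver")
```

The returned path can then be handed to selenium: Selenium 3 releases accept `webdriver.Chrome(executable_path=path)`, while Selenium 4 expects a `Service(path)` object passed as `service=`.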

Firefox driver download: https://github.com/mozilla/ge...

python

Finally, time to write some code.

Opening a website

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://segmentfault.com/")

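These three lines of code launch Chrome, type in the URL, and press Enter, all automatically.
A banner reading "Chrome is being controlled by automated test software" appears below the address bar of the window.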
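The snippet above leaves the browser window open when the script ends. A minimal sketch of closing it reliably follows; `fetch_title` and `demo` are hypothetical names, and `demo()` assumes selenium and a matching chromedriver are installed.

```python
def fetch_title(browser, url):
    """Open url in the given browser and return the page title."""
    browser.get(url)
    return browser.title


def demo():
    """Run against a real Chrome; call this yourself once the setup is in place."""
    from selenium import webdriver  # imported here so fetch_title works without selenium

    browser = webdriver.Chrome()
    try:
        print(fetch_title(browser, "https://segmentfault.com/"))
    finally:
        browser.quit()  # always close the browser, even if get() raised
```

Putting `quit()` in a `finally` block means a failed page load does not leave stray Chrome processes behind.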

Submitting a form

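Next, let's drive the browser to type a keyword, search, and find this very article.
First open the segmentfault site and press F12 to inspect the search box element: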

<input id="searchBox" name="q" type="text" placeholder="搜索問題或關鍵字" class="form-control" value="">

It turns out to be an input tag whose id is searchBox. OK:

from selenium import webdriver
browser = webdriver.Chrome()   # launch the browser
browser.get("https://segmentfault.com/")   # go to the URL

searchBox = browser.find_element_by_id("searchBox")  # get the form element by id
searchBox.send_keys("python爬蟲之初戀 selenium")   # type text into the form field
searchBox.submit()    # submit the form

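find_element_by_id(): gets the element with the given id.
There are other, similar methods, for example:

find_element_by_xpath()	Select an element by path (XPath)
find_element_by_tag_name()	Get an element by tag name
find_element_by_css_selector()	Select an element by CSS selector
find_element_by_class_name()	Get an element by class
find_elements_by_class_name()	Get all elements with a class; every variant spelled "elements" (with an s) returns a list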
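Newer Selenium 4 releases removed the `find_element_by_*` helpers listed above; what remains is the generic `find_element(by, value)` / `find_elements(by, value)` pair, which Selenium 3 also provides. The sketch below maps the old names onto that pair. `LOCATORS` and `find` are hypothetical names, and the strategy strings are the values of the constants on `selenium.webdriver.common.by.By`.

```python
# Old helper suffix -> locator strategy string.
# These strings are the values of the constants on selenium.webdriver.common.by.By
# (By.ID == "id", By.CSS_SELECTOR == "css selector", and so on).
LOCATORS = {
    "id": "id",
    "xpath": "xpath",
    "tag_name": "tag name",
    "css_selector": "css selector",
    "class_name": "class name",
}


def find(driver, how, value, many=False):
    """Locate one element, or a list of elements when many=True."""
    by = LOCATORS[how]
    if many:
        return driver.find_elements(by, value)
    return driver.find_element(by, value)
```

With this helper, `find(browser, "id", "searchBox")` does the same lookup as `browser.find_element_by_id("searchBox")`. In real code you would normally import `By` and write `browser.find_element(By.ID, "searchBox")` directly.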

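For example:

1.find_elements_by_css_selector("tr[bgcolor='#F2F2F2']>td")
  Gets the td child elements of every tr whose bgcolor attribute is '#F2F2F2'.

2.find_element_by_xpath("/html/body/div[4]/div/div/div[2]/div[3]/div[1]/div[2]/div/h4/a")
  Gets the a element at this path.
  For find_element_by_xpath, Chrome's F12 developer tools give you an exact path quickly: right-click the element and choose Copy -> Copy XPath. Very handy; try it and you'll see.

Items 3 to 5 are called on an element you have already located:
3.find_element_by_xpath("..") gets the parent element
4.find_element_by_xpath("following-sibling::*[1]") gets the next sibling element
5.find_element_by_xpath("preceding-sibling::*[1]") gets the previous sibling element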

Scraping data

Once you have an element, its .text attribute gives you the element's content.
Let's try fetching this article's summary:

from selenium import webdriver
browser = webdriver.Chrome()   # launch the browser

browser.get("https://segmentfault.com/")   # go to the URL
searchBox = browser.find_element_by_id("searchBox")  # get the form element by id
searchBox.send_keys("python爬蟲之初戀 selenium")   # type text into the form field
searchBox.submit()                                # submit the form

text = browser.find_element_by_xpath("//*[@id='searchPage']/div[2]/div/div[1]/section/p[1]").text
print(text)
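One caveat with the script above: right after `submit()`, the results page may not have finished loading, in which case the XPath lookup raises an error. A common remedy is to retry for a few seconds. The helper below is a hand-rolled sketch of that idea (`wait_for` is a hypothetical name). selenium ships its own `WebDriverWait` in `selenium.webdriver.support.ui`, which is generally the better choice in real code.

```python
import time


def wait_for(action, timeout=10.0, interval=0.5):
    """Call action() repeatedly until it stops raising, or timeout seconds pass.

    Returns whatever action() returns; re-raises the last error on timeout.
    For brevity this catches every Exception; a real script would catch only
    selenium's NoSuchElementException.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

Usage would look like `text = wait_for(lambda: browser.find_element_by_xpath("//*[@id='searchPage']/div[2]/div/div[1]/section/p[1]").text)`.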

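Besides locating elements, there are other methods and attributes:

refresh()	Refresh the page
close()	Close the current tab (closes the browser if it is the only tab)
quit()	Quit the browser
title	Title of the current page
window_handles	List of the ids (handles) of all open windows/tabs
current_window_handle	Id (handle) of the current window/tab
switch_to.window(handle)	Switch to the tab with the given handle
switch_to.frame("iframeId")	Switch into an iframe
execute_script('window.open("https://www.segmentfault.com")')	Run JavaScript (here: open a new tab)
maximize_window()	Maximize the window
get_screenshot_as_file()	Take a screenshot (argument: save path + file name + extension)
set_page_load_timeout(30)	Set the page-load timeout, in seconds
ActionChains(driver).move_to_element(ele).perform()	Hover the mouse over element ele
value_of_css_property()	Get an element's CSS property value (whether set inline or in a style block)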
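Several of these are typically used together. As a sketch, opening a link in a new tab and moving the driver to it combines `window_handles`, `execute_script`, and `switch_to.window`. `open_in_new_tab` is a hypothetical helper name; it assumes the script opens exactly one new tab.

```python
def open_in_new_tab(driver, url):
    """Open url in a new tab and switch the driver to it. Returns the new tab's handle."""
    before = set(driver.window_handles)
    # Extra arguments to execute_script are exposed to the script as arguments[0], ...
    driver.execute_script("window.open(arguments[0]);", url)
    new_handles = [h for h in driver.window_handles if h not in before]
    driver.switch_to.window(new_handles[0])
    return new_handles[0]
```

Comparing the handle list before and after the call avoids guessing which entry in `window_handles` belongs to the new tab.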

Adding arguments before launch

from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--proxy-server=http://101.236.23.202:8866")  # proxy
chromeOptions.add_argument("--headless")   # headless mode: no browser window is shown
browser = webdriver.Chrome(chrome_options=chromeOptions)
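When these switches vary between runs (proxy on or off, headless or not), it can help to build the argument list separately from the browser launch. This is only a sketch: `chrome_arguments` and `make_options` are hypothetical names, and `make_options` assumes selenium is installed.

```python
def chrome_arguments(proxy=None, headless=False):
    """Build the list of Chrome command-line switches for the given settings."""
    args = []
    if proxy:
        args.append("--proxy-server=" + proxy)
    if headless:
        args.append("--headless")
    return args


def make_options(proxy=None, headless=False):
    """Create a ChromeOptions object carrying those switches; needs selenium installed."""
    from selenium import webdriver  # imported here so chrome_arguments works without selenium

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(proxy, headless):
        options.add_argument(arg)
    return options
```

Keeping `chrome_arguments` free of selenium calls makes the switch logic easy to check without launching a browser.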

Launching without loading images

from selenium import webdriver


def openDriver_no_img():
    """Launch Chrome with image loading turned off."""
    options = webdriver.ChromeOptions()
    prefs = {
        'profile.default_content_setting_values': {
            'images': 2  # 2 = block images
        }
    }
    options.add_experimental_option('prefs', prefs)
    browser = webdriver.Chrome(chrome_options=options)
    return browser