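(6) Web scraping with selenium

  selenium is a browser automation tool used mainly for testing web applications. In a Python crawler it can be used to scrape dynamic pages: it drives a real browser, so the page's JavaScript is executed and rendered before you read the content. These are summary notes for later study.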

1. Basic usage of selenium

  Install: pip install selenium

  Check the supported browsers: after installing, run the following in a Python shell to see which web browsers selenium supports.

    from selenium import webdriver

    help(webdriver)

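  Simple usage: Firefox is used here. First download geckodriver, the Firefox driver, and add its location to the PATH environment variable. A minimal example:

    from selenium import webdriver

    browser = webdriver.Firefox()  # Firefox driver
    browser.get("https://www.zhihu.com/signup?next=%2F")  # open the page in the browser
    print(browser.page_source)  # print the page source
    browser.close()  # close the browser

  (If you get the error Message: 'geckodriver' executable needs to be in PATH, download geckodriver and add its location to the PATH environment variable.)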

  Basic methods:

    # basic methods
    driver = webdriver.Firefox()  # start the browser
    driver.get("http://www.example.com")  # open a page
    driver.forward()  # go forward
    driver.back()  # go back
    driver.close()  # close the browser

  Using cookies:

    # Go to the correct domain
    driver.get("http://www.example.com")

    # Now set the cookie. This one's valid for the entire domain
    cookie = {'name': 'foo', 'value': 'bar'}
    driver.add_cookie(cookie)  # add a cookie that applies to every URL under this domain

    # And now output all the available cookies for the current URL
    driver.get_cookies()  # get the cookies
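A common follow-up is to reuse the browser's cookies in a plain HTTP client once selenium has logged in. The sketch below assumes the list-of-dicts shape that `driver.get_cookies()` returns (each dict has at least `name` and `value`). The helper name `cookies_to_header` and the sample data are made up for illustration.

```python
def cookies_to_header(cookies):
    """Join selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Hypothetical sample in the shape returned by driver.get_cookies()
sample = [
    {"name": "foo", "value": "bar", "domain": "www.example.com"},
    {"name": "session", "value": "abc123", "domain": "www.example.com"},
]

header = cookies_to_header(sample)
print(header)
```

The resulting string can be sent as the `Cookie` request header by whatever HTTP library you use.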

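2. Locating elements

  Once webdriver has opened a page, you can look up one or more elements in it and operate on them. The commonly used methods are listed below.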

https://selenium-python.readthedocs.io/locating-elements.html
    # single element
    find_element_by_id()                 # by id attribute
    find_element_by_name()               # by name attribute
    find_element_by_xpath()              # XPath selector
    find_element_by_link_text()          # by the full text of a hyperlink
    find_element_by_partial_link_text()  # by part of the text of a hyperlink
    find_element_by_tag_name()           # by tag name
    find_element_by_class_name()         # by class name
    find_element_by_css_selector()       # CSS selector

    # multiple elements (each returns a list)
    find_elements_by_name
    find_elements_by_id
    find_elements_by_xpath
    find_elements_by_link_text
    find_elements_by_partial_link_text
    find_elements_by_tag_name
    find_elements_by_class_name
    find_elements_by_css_selector

  Finding by the element's name attribute:

    <html>
     <body>
      <form id="loginForm">
       <input name="username" type="text" />
       <input name="password" type="password" />
       <input name="continue" type="submit" value="Login" />
       <input name="continue" type="button" value="Clear" />
      </form>
     </body>
    </html>

    username = driver.find_element_by_name('username')
    password = driver.find_element_by_name('password')

  Finding by hyperlink text:

    <html>
     <body>
      <p>Are you sure you want to do this?</p>
      <a href="continue.html">Continue</a>
      <a href="cancel.html">Cancel</a>
     </body>
    </html>

    continue_link = driver.find_element_by_link_text('Continue')  # 'Continue' is the text of the a tag
    continue_link = driver.find_element_by_partial_link_text('Conti')

  Finding with an XPath selector:

    <html>
     <body>
      <form id="loginForm">
       <input name="username" type="text" />
       <input name="password" type="password" />
       <input name="continue" type="submit" value="Login" />
       <input name="continue" type="button" value="Clear" />
      </form>
     </body>
    </html>

    # find the form element
    login_form = driver.find_element_by_xpath("/html/body/form[1]")
    login_form = driver.find_element_by_xpath("//form[1]")
    login_form = driver.find_element_by_xpath("//form[@id='loginForm']")

    # find the username input inside the form
    # note: this first expression matches the form that contains such an input, not the input itself
    username = driver.find_element_by_xpath("//form[input/@name='username']")
    username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
    username = driver.find_element_by_xpath("//input[@name='username']")

    # find the Clear button inside the form
    clear_button = driver.find_element_by_xpath("//input[@name='continue'][@type='button']")
    clear_button = driver.find_element_by_xpath("//form[@id='loginForm']/input[4]")
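To try expressions like these without starting a browser, the sample form can be parsed with the standard library. This is only a rough sketch: `xml.etree.ElementTree` supports a limited subset of XPath and needs well-formed markup, so it is a quick check rather than a stand-in for what the browser evaluates. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The sample page from above, written as well-formed XML
PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree paths are relative to the root element, hence the leading "."
form = root.find(".//form[@id='loginForm']")
username = root.find(".//input[@name='username']")
clear_button = root.find(".//form[@id='loginForm']/input[4]")

print(form.get("id"))
print(username.get("type"))
print(clear_button.get("value"))
```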

  Finding with a CSS selector:

    <html>
     <body>
      <p class="content">Site content goes here.</p>
     </body>
    </html>

    content = driver.find_element_by_css_selector('p.content')

  You can also use the two generic methods find_element() and find_elements(). In Selenium 4.3 and later the find_element_by_* helpers were removed, so this is the form to use there:

    from selenium.webdriver.common.by import By

    driver.find_element(By.XPATH, '//button[text()="Some text"]')
    driver.find_elements(By.XPATH, '//button')

    # strategies supported by By (for example By.ID) and their string values
    ID = "id"
    XPATH = "xpath"
    LINK_TEXT = "link text"
    PARTIAL_LINK_TEXT = "partial link text"
    NAME = "name"
    TAG_NAME = "tag name"
    CLASS_NAME = "class name"
    CSS_SELECTOR = "css selector"
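Since every find_element_by_* helper corresponds to one By strategy, a locator is really just a `(strategy, value)` pair. The sketch below maps an old helper name to such a pair using the strategy strings listed above. `BY_STRATEGIES` and `to_locator` are hypothetical names for illustration and are not part of selenium.

```python
# Strategy strings as listed above (the values of the By constants)
BY_STRATEGIES = {
    "id": "id",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_locator(helper_name, value):
    """Turn e.g. ('find_element_by_id', 'loginForm') into ('id', 'loginForm')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if helper_name.startswith(prefix):
            suffix = helper_name[len(prefix):]
            return (BY_STRATEGIES[suffix], value)
    raise ValueError(f"not a find_element_by_* helper: {helper_name}")


locator = to_locator("find_element_by_css_selector", "p.content")
print(locator)
```

A pair like this can be unpacked straight into the generic call, as in `driver.find_element(*locator)`.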

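3. Interacting with elements

   After locating an element you can read its properties and act on it. Common methods:

    # element actions
    element.click()                 # click the element
    element.clear()                 # clear the element's content
    element.send_keys(value)        # type a value into the element
    element.get_attribute("class")  # get an attribute of the element

    # element properties
    element.text      # text of the element
    element.id        # selenium's internal id for the element (not the HTML id attribute)
    element.location  # position of the element on the page
    element.tag_name  # tag name of the element
    element.size      # size of the element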

  For select tags in a form, selenium can select and deselect options and read the available options:

https://blog.csdn.net/huilan_same/article/details/52246012
https://selenium-python.readthedocs.io/navigating.html
    # select an option
    from selenium.webdriver.support.ui import Select

    select = Select(driver.find_element_by_name('name'))  # wrap the select element
    select.select_by_index(index)  # by option index; the first option has index 0
    select.select_by_visible_text("text")  # by the text shown in the drop-down
    select.select_by_value(value)  # by the option's value attribute

    # deselect (only valid for selects that allow multiple selections)
    select = Select(driver.find_element_by_id('id'))
    select.deselect_all()
    select.deselect_by_index(index)
    select.deselect_by_value(value)
    select.deselect_by_visible_text(text)

    # inspect the options
    select = Select(driver.find_element_by_xpath("//select[@name='name']"))
    all_selected_options = select.all_selected_options  # all selected options
    select.first_selected_option  # the first selected option
    options = select.options  # all options

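4. Switching between windows and frames

   Large pages often contain frames inside the window. selenium can switch between windows, between frames, and between a window and its frames:

    # switch windows
    # HTML that opens the window:
    # <a href="somewhere.html" target="windowName">Click here to open a new window</a>
    driver.switch_to_window("windowName")  # switch to the window named windowName
    driver.switch_to_default_content()  # switch back to the main document

    # switch to a frame
    driver.switch_to_frame("frameName")  # switch to the frame
    driver.switch_to.frame("app_canvas_frame")  # switch to the frame, same as switch_to_frame()
    driver.switch_to.parent_frame()  # switch back to the parent frame

    # switch to an alert pop-up
    alert = driver.switch_to_alert()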

  selenium can also drag an element to a given position, mainly through the drag_and_drop() method:

    from selenium.webdriver import ActionChains

    element = driver.find_element_by_name("source")
    target = driver.find_element_by_name("target")

    action_chains = ActionChains(driver)  # create an action chain
    action_chains.drag_and_drop(element, target).perform()  # drag element onto target

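5. Explicit waits and implicit waits

       Because of ajax, some elements only appear on the page after their data has been fetched, which takes time after the page opens. When locating elements with selenium you therefore have to wait for a while, otherwise an error is raised (for example ElementNotVisibleException). selenium offers two kinds of waits: explicit waits and implicit waits.

  Explicit waits: the driver waits until a given condition is met. If the condition is still not met when the timeout expires, an error is raised; if the condition is met earlier, the wait ends immediately. Usage:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get("http://somedomain/url_that_delays_loading")
    try:
        # wait up to 10 s for the element with ID myDynamicElement to be present
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "myDynamicElement"))
        )
    finally:
        driver.quit()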
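The idea behind an explicit wait can be sketched in plain Python: call a condition repeatedly until it returns something truthy or the timeout runs out. This is not selenium's implementation, only an illustration of the polling behaviour. `wait_until` and `ready_on_third_call` are hypothetical names, and the 0.5 s default poll interval mirrors WebDriverWait's default.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met early: stop waiting right away
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical condition that only becomes true on the third check
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else False


found = wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
print(found, calls["n"])
```

As with `until()`, the value returned by the condition is handed back to the caller, which is why conditions usually return the element they were waiting for.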

  selenium also provides the following wait conditions:

    # wait conditions
    title_is
    title_contains
    presence_of_element_located
    visibility_of_element_located
    visibility_of
    presence_of_all_elements_located
    text_to_be_present_in_element
    text_to_be_present_in_element_value
    frame_to_be_available_and_switch_to_it
    invisibility_of_element_located
    element_to_be_clickable
    staleness_of
    element_to_be_selected
    element_located_to_be_selected
    element_selection_state_to_be
    element_located_selection_state_to_be
    alert_is_present

    # usage example
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, 10)
    element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))

  Custom wait conditions:

    class element_has_css_class(object):
        """An expectation for checking that an element has a particular css class.

        locator - used to find the element
        returns the WebElement once it has the particular css class
        """

        def __init__(self, locator, css_class):
            self.locator = locator
            self.css_class = css_class

        def __call__(self, driver):
            element = driver.find_element(*self.locator)  # Finding the referenced element
            if self.css_class in element.get_attribute("class"):
                return element
            else:
                return False

    # Wait until an element with id='myNewInput' has class 'myCSSClass'
    wait = WebDriverWait(driver, 10)
    element = wait.until(element_has_css_class((By.ID, 'myNewInput'), "myCSSClass"))
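One detail of the condition above: `self.css_class in element.get_attribute("class")` is a substring test, so `myCSSClass` would also match an element whose class is `myCSSClassExtra`. If that matters, the class attribute can be split into individual names first. A minimal sketch, where `has_css_class` is a hypothetical helper and not part of selenium:

```python
def has_css_class(class_attr, css_class):
    """Return True if css_class is one of the space-separated class names."""
    # get_attribute("class") can return None when the attribute is absent
    return css_class in (class_attr or "").split()


print(has_css_class("btn myCSSClass active", "myCSSClass"))
print(has_css_class("myCSSClassExtra", "myCSSClass"))
print("myCSSClass" in "myCSSClassExtra")  # the plain substring test, for comparison
```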

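   Implicit waits: whenever an element being looked up is not immediately available, the driver keeps polling the DOM for up to the configured time before raising an error. The default wait time is 0. Usage:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.implicitly_wait(10)  # set the implicit wait to 10 s
    driver.get("http://somedomain/url_that_delays_loading")
    myDynamicElement = driver.find_element_by_id("myDynamicElement")  # waits up to 10 s for the element to appear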

 Besides the methods covered above, selenium offers many more API methods; see https://selenium-python.readthedocs.io/api.html

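6. Scraping Qzone (QQ空間) with selenium

   As practice with selenium, scrape all of a friend's Qzone posts (說說) to a local file and generate a word cloud from them. The overall approach:

    1. Log in to the web version of QQ: selenium enters the username and password and clicks the login button

    2. Visit the friend's Qzone and scrape the posts: selenium scrolls the page down and clicks through the pages

    3. Use the wordcloud module to turn the scraped posts into a word cloud

   The scraping code: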

    # coding: utf-8
    # Log in to QQ, open a friend's Qzone posts page and scrape all of their posts
    from selenium import webdriver
    import time
    from lxml import html
    from word_cloud import generate_wd


    def login_qq(browser, qq, password):
        browser.get("https://i.qq.com/")
        # switch into the login frame (from the parent window into the iframe)
        browser.switch_to_frame("login_frame")
        # time.sleep(3)  # wait for the frame to load fully if you get selenium.common.exceptions.ElementNotInteractableException
        browser.find_element_by_id("switcher_plogin").click()  # choose account/password login
        browser.find_element_by_id("u").send_keys(str(qq))
        browser.find_element_by_id("p").send_keys(password)
        browser.find_element_by_id("login_button").click()  # click the login button
        time.sleep(3)  # wait 3 s for the page to load after logging in
        browser.switch_to_default_content()  # back to the main document


    def crawel_qqzone(browser, friend_qq):
        # open the friend's posts page; the trailing 311 selects the posts view
        browser.get("https://user.qzone.qq.com/%s/311" % friend_qq)
        r = open("qq_content.txt", "a+", encoding="utf-8")
        page = 0
        while True:
            for i in range(1, 6):
                # document.body.scrollHeight gives the page height in pixels
                height = 2000 * i
                movedown = "window.scrollBy(0," + str(height) + ")"  # scroll down by this many pixels
                browser.execute_script(movedown)
                time.sleep(3)
            # switch into the frame that holds the posts
            browser.switch_to.frame("app_canvas_frame")
            docu = browser.page_source
            s = html.fromstring(docu)
            li_tags = s.xpath("//ol[@id='msgList']/li")
            # print(li_tags)
            for li_tag in li_tags:
                qq_name = li_tag.xpath("./div[3]/div[2]/a/text()")
                qq_content = li_tag.xpath("./div[3]/div[2]/pre/text()")
                qq_time = li_tag.xpath("./div[3]/div[4]/div[1]/span[1]/a/text()")
                qq_name = qq_name[0] if len(qq_name) > 0 else ""
                qq_content = qq_content[0] if len(qq_content) > 0 else ""
                qq_time = qq_time[0] if len(qq_time) > 0 else ""
                print(qq_name, qq_content, qq_time)
                r.write(qq_content + "\n")
            # on the last page the next-page link has no id
            if docu.find("pager_next_%s" % page) == -1:
                break
            # find the next-page link and click it
            browser.find_element_by_id("pager_next_%s" % page).click()
            page = page + 1
            browser.switch_to.parent_frame()  # back to the parent frame so the next loop can scroll the page
        r.close()
        browser.close()


    if __name__ == "__main__":
        browser = webdriver.Firefox()
        login_qq(browser, 1368884216, "password")
        crawel_qqzone(browser, 993342902)
        # generate_wd("qq_content.txt")  # generate a word cloud from the scraped posts

     The word cloud code:

    # coding: utf-8

    from wordcloud import WordCloud
    from matplotlib import pyplot as plt
    import jieba
    import chardet
    import os


    # segment the text with jieba so that Chinese words are rendered correctly
    def generate_wd(filename):
        with open(filename, encoding="utf-8") as f:
            text = f.read()
        wordlist = jieba.cut(text, cut_all=True)
        text = " ".join(wordlist)
        wc = WordCloud(
            background_color="white",
            max_words=2000,
            font_path=r"C:\Windows\Fonts\STFANGSO.TTF",  # a Chinese font file is needed to display Chinese
            height=1200,
            width=1600,
            max_font_size=100,
            random_state=30,
        )
        myword = wc.generate(text)
        plt.imshow(myword)
        plt.axis("off")
        plt.show()
        dirname, basename = os.path.split(filename)
        fn, ext = os.path.splitext(basename)
        wc.to_file(os.path.join(dirname, fn + ".png"))


    # alternative: detect the file's encoding to avoid garbled Chinese
    # def generate_wd(filename):
    #     with open(filename, "rb") as f:
    #         text = f.read()
    #     code = chardet.detect(text)  # detect the file's encoding
    #     wc = WordCloud(
    #         background_color="white",
    #         max_words=2000,
    #         font_path=r"C:\Windows\Fonts\STFANGSO.TTF",  # a Chinese font file is needed to display Chinese
    #         height=1200,
    #         width=1600,
    #         max_font_size=100,
    #         random_state=30,
    #     )
    #     myword = wc.generate(text.decode(code["encoding"]))  # Chinese text must be passed as unicode
    #     plt.imshow(myword)
    #     plt.axis("off")
    #     plt.show()
    #     dirname, basename = os.path.split(filename)
    #     fn, ext = os.path.splitext(basename)
    #     wc.to_file(os.path.join(dirname, fn + ".png"))


    if __name__ == "__main__":
        generate_wd("qq_content.txt")
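Before rendering the image it can help to check which words will dominate the cloud. The sketch below counts tokens in text that has already been segmented and joined with spaces, as the `" ".join(wordlist)` step does. `word_frequencies`, the `min_length` cutoff and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def word_frequencies(segmented_text, min_length=2):
    """Count space-separated tokens, skipping tokens shorter than min_length."""
    words = (w for w in segmented_text.split() if len(w) >= min_length)
    return Counter(words)


# Hypothetical text, already segmented into space-separated words
sample = "今天 天氣 很 好 今天 開心 天氣"
freqs = word_frequencies(sample)
print(freqs.most_common(2))
```

If the counts look reasonable, a mapping like this can be passed to WordCloud's `generate_from_frequencies` method in place of calling `generate` on the raw text.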

 Reference: https://www.cnblogs.com/zhaof/p/6953241.html
