[Python Crawler] Using Selenium to Get the InfoBox of Tourist Attractions on Baidu Baike

Date: 2019-12-11

Tags: python, crawler, selenium, Baidu Baike tourist attractions, infobox


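I previously described how to get Wikipedia infoboxes with BeautifulSoup; website content can likewise be fetched with a spider. Having recently learned Selenium + PhantomJS, I decided to use them to get the InfoBox of tourist attractions on Baidu Baike. This is also early corpus preparation for my graduation project on entity alignment and attribute alignment. I hope this article helps you~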

Source code

# coding=utf-8
"""
Created on 2015-09-04 @author: Eastmount
"""

import time
import os
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui

# Open PhantomJS (raw string keeps the backslashes of the Windows path literal)
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
#driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)
info = None  # global handle of the output file

# Get the infobox of a 5A tourist spot
def getInfobox(name):
    global info
    try:
        # create the output directory and txt file
        basePathDirectory = "Tourist_spots_5A"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "BaiduSpider.txt")
        if info is None:  # open the file once and reuse the handle on later calls
            if not os.path.exists(baiduFile):
                info = codecs.open(baiduFile, 'w', 'utf-8')
            else:
                info = codecs.open(baiduFile, 'a', 'utf-8')

        # locate the search input  notice: 1. visit url by unicode 2. write files
        print name.rstrip('\n')  # delete char '\n'
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        info.write(name.rstrip('\n') + '\r\n')  # codecs does not translate '\n', so write '\r\n'
        #print driver.current_url
        time.sleep(5)

        # load infobox: <dt> holds the attribute name, <dd> the attribute value
        elem_name = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dt")
        elem_value = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dd")

        # create dictionary key-value
        # A dict is a hash table: entries are placed by hash and the original
        # insertion order is not recorded; use a list of tuples if order matters
        elem_dic = dict(zip(elem_name, elem_value))
        for key in elem_dic:
            print key.text, elem_dic[key].text
            info.write(key.text + " " + elem_dic[key].text + '\r\n')
        time.sleep(5)

    except Exception as e:  # e.g. 'utf8' codec can't decode byte
        print "Error: ", e
    finally:
        print '\n'
        if info is not None:  # the file may not have been opened if an early step failed
            info.write('\r\n')

# Main function
def main():
    # read the spot names and get the information for each one
    source = open("Tourist_spots_5A_BD.txt", 'r')
    for name in source:
        name = unicode(name, "utf-8")
        if u'故宮' in name:  #else add a '?'
            name = u'北京故宮'
        getInfobox(name)
    print 'End Read Files!'
    source.close()
    if info is not None:
        info.close()
    driver.quit()  # quit() also shuts down the PhantomJS process

main()

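Results
The script reads the names of national 5A-level scenic spots from a txt file on drive F, then drives the PhantomJS browser (phantomjs.exe) to visit each one in turn and get its InfoBox values. If you hit the encoding error "'ascii' codec can't encode characters", you can set the interpreter's default encoding to utf-8 with the code below: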

# Set the default encoding to utf-8 (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Show the current default encoding
print sys.getdefaultencoding()
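
As an alternative to changing the interpreter-wide default, the conversion can be made explicit at the point where text leaves the program. The sketch below is written for Python 3 (the article's own code is Python 2) and uses a hypothetical helper, `to_utf8_bytes`, that is not part of the original script. It reproduces the "'ascii' codec can't encode" error and shows the explicit UTF-8 encode that avoids it.

```python
# Sketch (Python 3): reproduce the "'ascii' codec can't encode" error and
# avoid it by encoding explicitly. `to_utf8_bytes` is a hypothetical helper.

def to_utf8_bytes(text):
    """Encode text as UTF-8 explicitly instead of relying on a default codec."""
    return text.encode("utf-8")

name = "北京故宮"

try:
    name.encode("ascii")      # what an implicit ASCII conversion attempts
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("Error:", exc)

data = to_utf8_bytes(name)    # succeeds for any text
print(len(name), "characters ->", len(data), "bytes")
```

Encoding explicitly keeps the fix local to the code that writes text, whereas `sys.setdefaultencoding` changes behaviour for the whole process.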

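Corresponding page source
The corresponding Baidu Baike InfoBox source is shown in the figure below. For the basics used in the code, see my earlier posts or my Python crawler column. Selenium is not only good for automated testing, it is equally suited to simple crawlers.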
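
To make the structure behind the XPath expressions `//div[@class='basic-info']/dl/dt` and `.../dd` concrete, here is a minimal offline sketch in Python 3 using only the standard library. The HTML fragment is a simplified, made-up stand-in for the real page, and `InfoBoxParser` is a hypothetical class, not something from the original script. It pairs each `<dt>` (attribute name) with the `<dd>` (attribute value) that follows it.

```python
from html.parser import HTMLParser

# Hypothetical, simplified fragment; the real Baidu Baike markup is richer.
SAMPLE_HTML = """
<div class="basic-info">
  <dl>
    <dt>Name</dt><dd>Example Spot</dd>
    <dt>Location</dt><dd><a href="#">Example City</a></dd>
  </dl>
</div>
"""

class InfoBoxParser(HTMLParser):
    """Collect (name, value) pairs from <dt>/<dd> elements in document order.

    For brevity this does not check that the elements sit inside
    div.basic-info, which the XPath in the article does.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []      # list of (name, value) tuples, order preserved
        self._tag = None     # "dt" or "dd" while inside one, else None
        self._name = None    # text of the most recent <dt>
        self._buf = []       # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return           # closing tag of a nested or unrelated element
        text = "".join(self._buf).strip()
        if tag == "dt":
            self._name = text
        elif self._name is not None:
            self.pairs.append((self._name, text))
            self._name = None
        self._tag = None

parser = InfoBoxParser()
parser.feed(SAMPLE_HTML)
for name, value in parser.pairs:
    print(name, value)
```

Selenium is still needed to load and search the live page; this sketch only illustrates how the name and value elements line up once the HTML is available.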

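Encoding problem
You may still run into the "'ascii' codec can't encode characters" error at this point.

The cause is that a txt file opened the default way encodes text as ascii on write, while your text contains characters that need 'utf-8', so the file has to be opened with an explicit encoding as follows.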

import codecs

# codecs.open lets you specify the encoding of the file being opened;
# text is converted to and from unicode automatically on read and write
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile,'w','utf-8')
else:
    info = codecs.open(baiduFile,'a','utf-8')

# codecs.open works in binary mode and does not translate '\n', so write '\r\n' explicitly
info.write(key.text+":"+elem_dic[key].text+'\r\n')
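
The same append-as-UTF-8 step can also be written with `io.open`, which exists in Python 2.6+ and is the built-in `open` in Python 3. This is a minimal sketch rather than the author's method: `append_pairs`, the temporary file, and the sample values are all assumptions made for the demo. Mode `'a'` creates the file when it is missing, and `newline='\r\n'` makes the file object translate each `'\n'` into `'\r\n'` on write.

```python
import io
import os
import tempfile

def append_pairs(path, pairs):
    """Append 'name:value' lines to a UTF-8 text file (created if missing)."""
    with io.open(path, "a", encoding="utf-8", newline="\r\n") as f:
        for name, value in pairs:
            f.write(u"%s:%s\n" % (name, value))   # '\n' becomes '\r\n' on disk

# Demo with made-up values written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "BaiduSpider.txt")
append_pairs(demo_path, [(u"中文名", u"北京故宮")])
append_pairs(demo_path, [(u"Name", u"Example")])

# Read the raw bytes back to inspect the encoding and line endings
with io.open(demo_path, "rb") as f:
    raw = f.read()
print(repr(raw))
```

Using a `with` block also closes the file after each batch, so no handle has to be kept open across calls.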

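Summary
From this code you can learn a basic automated crawling approach, and also how to show key-value pairs in a for loop, where the pairs are the displayed attribute names and attribute values. This is done with:
elem_dic = dict(zip(elem_name,elem_value))
But the final output is not in the InfoBox's order. Why? A dict is a hash table and, in Python 2, does not preserve insertion order; keep the zipped pairs in a list of tuples if the order matters.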
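
One way to keep the InfoBox order is to skip the plain dict and iterate over the zipped pairs directly. The sketch below uses made-up strings in place of the `.text` of the `<dt>` and `<dd>` elements, so every name and value in it is hypothetical. It keeps the pairs in a list of tuples, and builds a `collections.OrderedDict` for cases where lookup by key is also needed.

```python
from collections import OrderedDict

# Hypothetical stand-ins for the text of the <dt> and <dd> elements
names = ["Name", "Location", "Ticket", "Area"]
values = ["Example Spot", "Example City", "60 yuan", "72 ha"]

# A list of tuples keeps the pairs in page order
pairs = list(zip(names, values))
for name, value in pairs:
    print(name, value)

# OrderedDict remembers insertion order and still allows lookup by key
ordered = OrderedDict(pairs)
print(list(ordered.keys()))
print(ordered["Location"])
```

Since Python 3.7 a regular `dict` also preserves insertion order, but the list of tuples works on every version, including the Python 2 used in this article.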
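Finally, I hope this article helps you. I also wrote a basic introductory article, but publishing it keeps triggering CSDN's sensitive-content system, which locks it automatically, and I don't know what sets it off. I recommend reading it~
[Python Crawler] An introduction to common Selenium element-locating methods and operations
(By: Eastmount, 2015-9-6, 2:30 a.m. http://blog.csdn.net/eastmount/)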
