I. Preparation.
Tools: Windows 10 + Python 3.6.
Scraping target: the content inside the red boxes in the screenshot (Douban's Top 250 books page).
Principle: any information visible in the page source can be scraped.
Output format: CSV, which can then be opened in Excel.
II. Detailed steps.
Here is the complete code up front:
```python
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

def gethtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "It is failed to get html!"

def getcontent(url):
    html = gethtml(url)
    soup = BeautifulSoup(html, "html.parser")
    # print(soup.prettify())
    div = soup.find("div", class_="indent")
    tables = div.find_all("table")

    price = []
    date = []
    nationality = []
    nation = []    # flattened (standardized) nationality list
    bookname = []
    link = []
    score = []
    comment = []
    people = []
    peo = []       # flattened (standardized) rating-count list
    author = []
    for table in tables:
        bookname.append(table.find_all("a")[1]['title'])               # book title
        link.append(table.find_all("a")[1]['href'])                    # link
        score.append(table.find("span", class_="rating_nums").string)  # score
        comment.append(table.find_all("span")[-1].string)              # one-line comment

        people_info = table.find_all("span")[-2].text
        people.append(re.findall(r'\d+', people_info))  # rating count; note: findall returns a list, so sublists pile up

        navistr = table.find("p").string  # nationality / author / translator / press / date / price
        infos = str(navistr.split("/"))   # the str() of the list produced by splitting
        infostr = str(navistr)            # the original, unsplit string
        s = infostr.split("/")
        if re.findall(r'\[', s[0]):  # if the first field contains "[", a nationality tag precedes the author
            w = re.findall(r'\s\D+', s[0])
            author.append(w[0])
        else:
            author.append(s[0])

        # Extract the remaining fields from infos: price, nationality, date
        price_info = re.findall(r'\d+\.\d+', infos)
        price.append(price_info[0])  # price
        date.append(s[-2])           # publication date
        nationality_info = re.findall(r'[[](\D)[]]', infos)
        nationality.append(nationality_info)  # nationality; note: findall returns a list, so sublists pile up

    for i in nationality:
        if len(i) == 1:
            nation.append(i[0])
        else:
            nation.append("中")  # books without a tag are Chinese

    for i in people:
        if len(i) == 1:
            peo.append(i[0])

    print(bookname)
    print(author)
    print(nation)
    print(score)
    print(peo)
    print(date)
    print(price)
    print(link)

    # The dict keys become the CSV column names
    dataframe = pd.DataFrame({'书名': bookname, '作者': author, '国籍': nation, '评分': score,
                              '评分人数': peo, '出版时间': date, '价格': price, '链接': link})

    # Write the DataFrame to CSV; index controls whether row labels are written (default True)
    dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig', sep=',')


if __name__ == '__main__':
    url = "https://book.douban.com/top250?start=0"  # to crawl subsequent pages the code must be altered (see below)
    getcontent(url)
```
1. Fetching the raw page.
Start from the following skeleton:
```python
import requests
import re
from bs4 import BeautifulSoup

def gethtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "It is failed to get html!"

def getcontent(url):
    html = gethtml(url)
    bsObj = BeautifulSoup(html, "html.parser")


if __name__ == '__main__':
    url = "https://book.douban.com/top250?icn=index-book250-all"
    getcontent(url)
```
From bsObj we can then pull out the information we want.
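As a quick sanity check before going further (a minimal sketch; the exact page title is an assumption about the live page):

```python
# inside getcontent(), after parsing:
print(bsObj.title.string)  # should print something like "豆瓣图书 Top 250" if the fetch worked
```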
2. Extracting the details.
All the information sits in one div, which contains 25 tables; each table is a self-contained information unit, so we only need to build an extractor for a single table (provided we keep it robust). Since one parent div holds 25 child tables, we grab them like this:
```python
div = soup.find("div", class_="indent")
tables = div.find_all("table")
```
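A quick check that the structure is what we expect (assuming soup holds a parsed Top 250 page, as in the skeleton above):

```python
div = soup.find("div", class_="indent")
tables = div.find_all("table")
print(len(tables))  # expect 25, one table per book on the page
```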
The book title can be read directly from the title attribute of the anchor node (yes, the original markup really is this ugly, but it doesn't matter):
```html
<a href="https://book.douban.com/subject/1770782/" onclick=""moreurl(this,{i:'0'})"" title="追风筝的人">
  追风筝的人
</a>
```
It is extracted with:
```python
bookname.append(table.find_all("a")[1]['title'])  # book title
```
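A self-contained sketch of why the index [1] is needed: presumably the first <a> in each table wraps the cover image and the second carries the title (the markup below is a trimmed, hypothetical reconstruction of one entry):

```python
from bs4 import BeautifulSoup

snippet = '''<table>
  <a href="https://book.douban.com/subject/1770782/"><img src="cover.jpg"/></a>
  <a href="https://book.douban.com/subject/1770782/" title="追风筝的人">追风筝的人</a>
</table>'''
table = BeautifulSoup(snippet, "html.parser")
print(table.find_all("a")[1]['title'])  # -> 追风筝的人
print(table.find_all("a")[1]['href'])   # -> https://book.douban.com/subject/1770782/
```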
The other fields work the same way, so I won't repeat them.
The rating count is extracted with a regular expression:
```python
people.append(re.findall(r'\d+', people_info))  # how many people rated the book; note: findall returns a list, so sublists pile up
```
Here people_info is a string like "13456人评价" ("13456 ratings").
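A minimal demonstration of that regex:

```python
import re

people_info = "13456人评价"  # sample rating-count text
print(re.findall(r'\d+', people_info))  # -> ['13456'], a one-element list, hence the flattening step later
```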
Next, look at the remaining information:
```html
<p class="pl">[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元</p>
```
The nationality comes wrapped in bracket characters ("[ ]"). How do we strip them? The first line below gives the answer.
```python
nationality_info = re.findall(r'[[](\D)[]]', infos)
nationality.append(nationality_info)  # nationality; note: findall returns a list, so sublists pile up
```
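A minimal check of that bracket regex; infos is the str() of the split list, exactly as in the full code above:

```python
import re

navistr = "[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元"
infos = str(navistr.split("/"))
print(re.findall(r'[[](\D)[]]', infos))  # -> ['美']: a single non-digit character between "[" and "]"
print(re.findall(r'\d+\.\d+', infos))    # -> ['29.00']: the price, extracted the same way
```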
Entries that carry a nationality tag are now handled; the ones without a tag all turn out to be Chinese books, so we fill the blank nationalities with "中":
```python
for i in nationality:
    if len(i) == 1:
        nation.append(i[0])
    else:
        nation.append("中")
```
The list-within-a-list problem is also easy to solve:
```python
for i in people:
    if len(i) == 1:
        peo.append(i[0])
```
A sublist of length 1 is non-empty, so we append its single element, turning the list of lists into a flat list (see the sketch below).
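A minimal sketch of the flattening with made-up values; note that an empty sublist would be skipped entirely, leaving peo shorter than the other columns, which pd.DataFrame would then reject:

```python
people = [['13456'], ['28500']]  # what re.findall produced: one sublist per book
peo = []
for i in people:
    if len(i) == 1:
        peo.append(i[0])
print(peo)  # -> ['13456', '28500']
```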
The printed output is shown in the screenshot below:
That's basically what we want.
Then write it to CSV:
```python
# The dict keys become the CSV column names
dataframe = pd.DataFrame({'书名': bookname, '作者': author, '国籍': nation, '评分': score,
                          '评分人数': peo, '出版时间': date, '价格': price, '链接': link})

# Write the DataFrame to CSV; index controls whether row labels are written (default True)
dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig', sep=',')
```
Note: without encoding='utf-8-sig' the file comes out garbled when opened in Excel, so it must be included here (other workarounds exist, of course).
The last issue is pagination. Because I haven't made the extractor robust enough, parsing the later pages keeps failing, but here is the method anyway:
```python
for i in range(10):
    url = "https://book.douban.com/top250?start=" + str(i * 25)
    getcontent(url)
```
Note that str() is required to splice the integer offset into the URL.
The result:
The column order in the output doesn't actually match the order I passed in when writing the CSV; I'll look into the reason later (see the sketch below).
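One likely cause (an assumption, not verified against the pandas version used here): older pandas releases sort a plain dict's keys when building a DataFrame, so the columns come out in sorted rather than insertion order. Passing an explicit columns= list pins the order either way:

```python
import pandas as pd

data = {'书名': bookname, '作者': author, '国籍': nation, '评分': score,
        '评分人数': peo, '出版时间': date, '价格': price, '链接': link}
# columns= fixes the column order explicitly instead of relying on dict-key order
dataframe = pd.DataFrame(data, columns=['书名', '作者', '国籍', '评分',
                                        '评分人数', '出版时间', '价格', '链接'])
dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig')
```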
III. Summary.
Be bold and be careful, careful above all: there are many details here, and if you don't dig into them properly, a lot will need fixing later.