I. Preparation.
Tools: Windows 10 + Python 3.6.
Scraping target: the content inside the red boxes in the screenshot (Douban's Top 250 books page).
Principle: any information visible in the page source can be scraped.
Output format: CSV, which can then be opened in Excel.
II. Detailed steps.
Here is the complete code up front:
```python
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

def gethtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "It is failed to get html!"

def getcontent(url):
    html = gethtml(url)
    soup = BeautifulSoup(html, "html.parser")
    # print(soup.prettify())
    div = soup.find("div", class_="indent")
    tables = div.find_all("table")

    price = []
    date = []
    nationality = []
    nation = []    # flattened (standardized) nationality list
    bookname = []
    link = []
    score = []
    comment = []
    people = []
    peo = []       # flattened (standardized) rating-count list
    author = []
    for table in tables:
        bookname.append(table.find_all("a")[1]['title'])               # book title
        link.append(table.find_all("a")[1]['href'])                    # link
        score.append(table.find("span", class_="rating_nums").string)  # score
        comment.append(table.find_all("span")[-1].string)              # one-line comment

        people_info = table.find_all("span")[-2].text
        people.append(re.findall(r'\d+', people_info))  # rating count; note: findall returns a list, so sublists pile up

        navistr = table.find("p").string  # nationality / author / translator / press / date / price
        infos = str(navistr.split("/"))   # the str() of the list produced by splitting
        infostr = str(navistr)            # the original, unsplit string
        s = infostr.split("/")
        if re.findall(r'\[', s[0]):  # if the first field contains "[", a nationality tag precedes the author
            w = re.findall(r'\s\D+', s[0])
            author.append(w[0])
        else:
            author.append(s[0])

        # Extract the remaining fields from infos: price, nationality, date
        price_info = re.findall(r'\d+\.\d+', infos)
        price.append(price_info[0])  # price
        date.append(s[-2])           # publication date
        nationality_info = re.findall(r'[[](\D)[]]', infos)
        nationality.append(nationality_info)  # nationality; note: findall returns a list, so sublists pile up

    for i in nationality:
        if len(i) == 1:
            nation.append(i[0])
        else:
            nation.append("中")  # books without a tag are Chinese

    for i in people:
        if len(i) == 1:
            peo.append(i[0])

    print(bookname)
    print(author)
    print(nation)
    print(score)
    print(peo)
    print(date)
    print(price)
    print(link)

    # The dict keys become the CSV column names
    dataframe = pd.DataFrame({'书名': bookname, '作者': author, '国籍': nation, '评分': score,
                              '评分人数': peo, '出版时间': date, '价格': price, '链接': link})

    # Write the DataFrame to CSV; index controls whether row labels are written (default True)
    dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig', sep=',')


if __name__ == '__main__':
    url = "https://book.douban.com/top250?start=0"  # to crawl subsequent pages the code must be altered (see below)
    getcontent(url)
```
1. Fetching the raw page.
Start from the following skeleton:
```python
import requests
import re
from bs4 import BeautifulSoup

def gethtml(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "It is failed to get html!"

def getcontent(url):
    html = gethtml(url)
    bsObj = BeautifulSoup(html, "html.parser")


if __name__ == '__main__':
    url = "https://book.douban.com/top250?icn=index-book250-all"
    getcontent(url)
```
From bsObj we can then pull out the information we want.
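As a quick sanity check before going further (a minimal sketch; the exact page title is an assumption about the live page):

```python
# inside getcontent(), after parsing:
print(bsObj.title.string)  # should print something like "豆瓣图书 Top 250" if the fetch worked
```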
2. Extracting the details.
All the information sits in one div, which contains 25 tables; each table is a self-contained information unit, so we only need to build an extractor for a single table (provided we keep it robust). Since one parent div holds 25 child tables, we grab them like this:
```python
div = soup.find("div", class_="indent")
tables = div.find_all("table")
```
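A quick check that the structure is what we expect (assuming soup holds a parsed Top 250 page, as in the skeleton above):

```python
div = soup.find("div", class_="indent")
tables = div.find_all("table")
print(len(tables))  # expect 25, one table per book on the page
```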
The book title can be read directly from the title attribute of the anchor node (yes, the original markup really is this ugly, but it doesn't matter):
```html
<a href="https://book.douban.com/subject/1770782/" onclick=""moreurl(this,{i:'0'})"" title="追风筝的人">
  追风筝的人
</a>
```
It is extracted with:
```python
bookname.append(table.find_all("a")[1]['title'])  # book title
```
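A self-contained sketch of why the index [1] is needed: presumably the first <a> in each table wraps the cover image and the second carries the title (the markup below is a trimmed, hypothetical reconstruction of one entry):

```python
from bs4 import BeautifulSoup

snippet = '''<table>
  <a href="https://book.douban.com/subject/1770782/"><img src="cover.jpg"/></a>
  <a href="https://book.douban.com/subject/1770782/" title="追风筝的人">追风筝的人</a>
</table>'''
table = BeautifulSoup(snippet, "html.parser")
print(table.find_all("a")[1]['title'])  # -> 追风筝的人
print(table.find_all("a")[1]['href'])   # -> https://book.douban.com/subject/1770782/
```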
The other fields work the same way, so I won't repeat them.
The rating count is extracted with a regular expression:
```python
people.append(re.findall(r'\d+', people_info))  # how many people rated the book; note: findall returns a list, so sublists pile up
```
Here people_info is a string like "13456人评价" ("13456 ratings").
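A minimal demonstration of that regex:

```python
import re

people_info = "13456人评价"  # sample rating-count text
print(re.findall(r'\d+', people_info))  # -> ['13456'], a one-element list, hence the flattening step later
```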
Next, look at the remaining information:
```html
<p class="pl">[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元</p>
```
The nationality comes wrapped in bracket characters ("[ ]"). How do we strip them? The first line below gives the answer.
```python
nationality_info = re.findall(r'[[](\D)[]]', infos)
nationality.append(nationality_info)  # nationality; note: findall returns a list, so sublists pile up
```
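A minimal check of that bracket regex; infos is the str() of the split list, exactly as in the full code above:

```python
import re

navistr = "[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元"
infos = str(navistr.split("/"))
print(re.findall(r'[[](\D)[]]', infos))  # -> ['美']: a single non-digit character between "[" and "]"
print(re.findall(r'\d+\.\d+', infos))    # -> ['29.00']: the price, extracted the same way
```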
Entries that carry a nationality tag are now handled; the ones without a tag all turn out to be Chinese books, so we fill the blank nationalities with "中":
```python
for i in nationality:
    if len(i) == 1:
        nation.append(i[0])
    else:
        nation.append("中")
```
The list-within-a-list problem is also easy to solve:
```python
for i in people:
    if len(i) == 1:
        peo.append(i[0])
```
A sublist of length 1 is non-empty, so we append its single element, turning the list of lists into a flat list (see the sketch below).
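A minimal sketch of the flattening with made-up values; note that an empty sublist would be skipped entirely, leaving peo shorter than the other columns, which pd.DataFrame would then reject:

```python
people = [['13456'], ['28500']]  # what re.findall produced: one sublist per book
peo = []
for i in people:
    if len(i) == 1:
        peo.append(i[0])
print(peo)  # -> ['13456', '28500']
```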
The printed output is shown in the screenshot below:
That's basically what we want.
Then write it to CSV:
```python
# The dict keys become the CSV column names
dataframe = pd.DataFrame({'书名': bookname, '作者': author, '国籍': nation, '评分': score,
                          '评分人数': peo, '出版时间': date, '价格': price, '链接': link})

# Write the DataFrame to CSV; index controls whether row labels are written (default True)
dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig', sep=',')
```
Note: without encoding='utf-8-sig' the file comes out garbled when opened in Excel, so it must be included here (other workarounds exist, of course).
The last issue is pagination. Because I haven't made the extractor robust enough, parsing the later pages keeps failing, but here is the method anyway:
```python
for i in range(10):
    url = "https://book.douban.com/top250?start=" + str(i * 25)
    getcontent(url)
```
Note that str() is required to splice the integer offset into the URL.
The result:
The column order in the output doesn't actually match the order I passed in when writing the CSV; I'll look into the reason later (see the sketch below).
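One likely cause (an assumption, not verified against the pandas version used here): older pandas releases sort a plain dict's keys when building a DataFrame, so the columns come out in sorted rather than insertion order. Passing an explicit columns= list pins the order either way:

```python
import pandas as pd

data = {'书名': bookname, '作者': author, '国籍': nation, '评分': score,
        '评分人数': peo, '出版时间': date, '价格': price, '链接': link}
# columns= fixes the column order explicitly instead of relying on dict-key order
dataframe = pd.DataFrame(data, columns=['书名', '作者', '国籍', '评分',
                                        '评分人数', '出版时间', '价格', '链接'])
dataframe.to_csv("C:/Users/zhengyong/Desktop/test.csv", index=False, encoding='utf-8-sig')
```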
III. Summary.
Be bold and be careful, careful above all: there are many details here, and if you don't dig into them properly, a lot will need fixing later.