1、Scraping target
Books listed on Douban Books (豆瓣圖書); about 18,000 records were collected in total.
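2、Save the data in a format Excel can open (the script at the end writes a CSV file encoded as UTF-8 with a byte-order mark).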
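3、Data analysis
1. After processing the data in Excel, 20 works with high ratings and many reviews were selected, as shown in the figure below.
The recommended books are therefore:
In your spare time you can look up the details of these works and read any that interest you, which saves the time of filtering through works yourself.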
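The selection above was done by hand in Excel, and the exact thresholds are not stated. As a rough sketch only, the same kind of filter could be applied to the scraped records in plain Python. The key names (`rating`, `reviews`, `title`), the `min_reviews` cut-off, and the sample rows are all hypothetical stand-ins, not values from the original analysis.

```python
def to_float(text):
    """Return text as a float, or None when it is empty or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def top_books(books, n=20, min_reviews=1000,
              rating_key="rating", reviews_key="reviews"):
    """Pick the n best-rated books among those with at least min_reviews reviews.

    The min_reviews threshold is an assumption; the article does not state one.
    """
    candidates = []
    for book in books:
        rating = to_float(book.get(rating_key))
        reviews = to_float(book.get(reviews_key))
        if rating is None or reviews is None:
            continue  # skip rows with a blank or non-numeric field
        if reviews >= min_reviews:
            candidates.append((rating, reviews, book))
    # Sort by rating first, then by review count, both descending.
    candidates.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [book for _, _, book in candidates[:n]]


# Hypothetical sample rows; scraped values arrive as strings.
sample = [
    {"title": "Book A", "rating": "9.1", "reviews": "25000"},
    {"title": "Book B", "rating": "9.4", "reviews": "300"},
    {"title": "Book C", "rating": "8.7", "reviews": "41000"},
    {"title": "Book D", "rating": "", "reviews": "5000"},
]

for book in top_books(sample, n=2):
    print(book["title"], book["rating"], book["reviews"])
# Book A 9.1 25000
# Book C 8.7 41000
```

Filtering on review count before ranking by rating keeps a book with a handful of enthusiastic reviews (Book B here) from outranking widely read ones.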
2. Some authors have many works that are also highly rated, for example:
Readers who are interested can also look for other works by these authors.
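One possible way to find such authors programmatically is to group the records by author, then keep authors whose work count and average rating both clear a threshold. This is only a sketch: the key names (`author`, `rating`), the thresholds `min_works` and `min_avg_rating`, and the sample rows are hypothetical and do not come from the original analysis.

```python
from collections import defaultdict
from statistics import mean


def prolific_authors(books, min_works=3, min_avg_rating=8.5,
                     author_key="author", rating_key="rating"):
    """Return (author, work_count, average_rating) for well-rated, prolific authors."""
    ratings_by_author = defaultdict(list)
    for book in books:
        author = (book.get(author_key) or "").strip()
        try:
            rating = float(book.get(rating_key))
        except (TypeError, ValueError):
            continue  # skip rows without a usable rating
        if author:
            ratings_by_author[author].append(rating)

    result = []
    for author, ratings in ratings_by_author.items():
        average = mean(ratings)
        if len(ratings) >= min_works and average >= min_avg_rating:
            result.append((author, len(ratings), round(average, 2)))
    # Most works first; ties broken by higher average rating.
    result.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return result


# Hypothetical sample rows; scraped values arrive as strings.
sample_books = [
    {"author": "Author X ", "rating": "9.0"},
    {"author": "Author X", "rating": "8.5"},
    {"author": "Author X", "rating": "9.5"},
    {"author": "Author Y", "rating": "9.5"},
    {"author": "Author Z", "rating": "7.0"},
    {"author": "Author Z", "rating": "7.5"},
    {"author": "Author Z", "rating": "8.0"},
]

print(prolific_authors(sample_books))
# [('Author X', 3, 9.0)]
```

Author names are stripped of surrounding whitespace before grouping, since a field cut out of a longer string can carry a trailing space.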
Code to run:
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Pool of User-Agent strings; one is picked at random for each request.
user = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
]


def get_user():
    """Return a random User-Agent string from the pool."""
    return random.choice(user)


def get_soup(url):
    """Fetch one tag listing page and return its books as a list of dicts."""
    headers = {'user-agent': get_user()}
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, "html.parser")
    time.sleep(random.random() * 3)  # random pause of up to 3 seconds between requests

    book_list = []
    for i in range(0, 20):  # the loop expects up to 20 books per page
        book_dict = {}
        try:
            # Column names: 作品 = title, 作者 = author, 評分 = rating,
            # 評論次數 = number of reviews, 價格 = price, 詳情 = description.
            book_dict['作品'] = soup.find_all("h2", {"class": ""})[i].find_all('a')[0].text.strip().replace("\n", "")
            book_dict['作者'] = soup.select('.pub')[i].text.strip().split('/')[0]
            book_dict['評分'] = soup.find_all("span", {"class": "rating_nums"})[i].text.strip()
            # The page text is Simplified Chinese, e.g. "(12345人评价)";
            # lstrip/rstrip treat their argument as a set of characters.
            book_dict['評論次數'] = soup.select('.pl')[i].text.strip().lstrip("(").rstrip(")人评价")
            book_dict['價格'] = soup.select('.pub')[i].text.strip().split('/')[-1]
            book_dict['詳情'] = soup.find_all("div", {"class": "info"})[i].find_all('p')[0].text.strip().replace("\n", "")
        except Exception:
            # A missing field (or a page with fewer than 20 entries) lands here;
            # the entry is skipped, as in the original logic.
            continue
        book_list.append(book_dict)
    return book_list


allbook_list = []
for i in range(0, 980, 20):  # 49 pages of the 武俠 (wuxia) tag, 20 books per page
    url = 'https://book.douban.com/tag/%E6%AD%A6%E4%BE%A0?start={}&type=T'.format(i)
    allbook_list.extend(get_soup(url))

bookdf = pd.DataFrame(allbook_list)
bookdf  # displays the table when run in a notebook
# utf_8_sig writes a byte-order mark so that Excel detects the encoding.
bookdf.to_csv(r'C:\Users\Administrator\Desktop\book.csv', encoding="utf_8_sig")
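The scraper cleans the review-count field with `lstrip` and `rstrip`. Both treat their argument as a set of characters rather than a literal prefix or suffix, so the result depends on the exact characters in the page text. A regular expression that pulls out the digits may be a sturdier alternative. The helper name `parse_review_count` is hypothetical, and the sample strings are assumed to resemble what the listing page shows.

```python
import re


def parse_review_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Sample strings assumed to resemble the listing page's review-count text.
print(parse_review_count("(12345人评价)"))   # 12345
print(parse_review_count("(少于10人评价)"))  # 10
print(parse_review_count("(目前无人评价)"))  # None
```

Returning an `int` (or `None`) also means the column can be sorted numerically without a separate conversion step in Excel.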