Scraping Douban Movies with Python's Scrapy Framework and Visualizing the Results

1. Introduction to the Scrapy framework

The main Scrapy components covered here are the spiders, engine, scheduler, downloader and item pipeline.

Common Scrapy commands are as follows:

 

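In the Scrapy project itself, you add the spider file yourself, and the framework generates the items, pipelines and settings files. That is all there is to it.

The items file lists the names of the attributes to scrape. The pipelines file holds the data-flow operations, such as writing to a file or importing into a database. The main spider file defines the domain and the XPath for each attribute, and fills in the corresponding values for every page.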

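# items.py
import scrapy


class DoubanItem(scrapy.Item):
    # One Field per attribute scraped from the Top 250 list
    movieRank = scrapy.Field()      # position in the ranking
    movieName = scrapy.Field()      # film title
    Director = scrapy.Field()       # director (and cast) line
    movieDesc = scrapy.Field()      # one-line quote, may be empty
    movieRate = scrapy.Field()      # rating score
    peopleCount = scrapy.Field()    # number of people who rated the film
    movieDate = scrapy.Field()      # release year
    movieCountry = scrapy.Field()   # country or region of origin
    movieCategory = scrapy.Field()  # genre
    moviePost = scrapy.Field()      # poster image URL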
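# pipelines.py
import json


class DoubanPipeline(object):
    def __init__(self):
        # Open the output file once, when the pipeline is created
        self.f = open("douban.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line; ensure_ascii=False keeps Chinese text readable
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.f.close()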
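To make the output format of the pipeline concrete, here is a minimal sketch that uses only the standard library: it writes one JSON object per line and then reads the file back. The sample record and the helper names `write_jsonl` and `read_jsonl` are made up for illustration and are not part of the original project.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a one-object-per-line JSON file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample record, shaped like the item fields (not scraped data)
sample = [{"movieRank": "1", "movieName": "肖申克的救赎", "movieRate": "9.7"}]

# Write to a temporary directory so nothing in the project is overwritten
path = os.path.join(tempfile.mkdtemp(), "douban_sample.json")
write_jsonl(path, sample)
loaded = read_jsonl(path)
print(loaded[0]["movieName"])
```

Because every line is a complete JSON object, the file can later be loaded line by line, which is why the analysis step reads it with `lines=True`.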

 

For working out the XPath expressions, I recommend a Chrome extension called XPath Helper.

import scrapy
from douban.items import DoubanItem  # assumed module path; the original does not show this import


class DoubanSpider(scrapy.Spider):  # class name assumed; the original excerpt starts at allowed_domains
    name = 'douban'  # assumed spider name, not given in the original
    allowed_domains = ['douban.com']
    baseURL = "https://movie.douban.com/top250?start="
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Each film on the page sits in a <div class="item">
        node_list = response.xpath("//div[@class='item']")
        for node in node_list:
            item = DoubanItem()
            item['movieName'] = node.xpath("./div[@class='info']/div[1]/a/span/text()").extract()[0]
            item['movieRank'] = node.xpath("./div[@class='pic']/em/text()").extract()[0]
            item['Director'] = node.xpath("./div[@class='info']/div[@class='bd']/p[1]/text()[1]").extract()[0]
            # Not every film has a one-line quote, so fall back to an empty string
            item['movieDesc'] = node.xpath("./div[@class='info']/div[@class='bd']/p[@class='quote']/span[@class='inq']/text()").extract_first(default="")
            item['movieRate'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            item['peopleCount'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[4]/text()").extract()[0]
            # The second text node holds "year / country / category", separated by non-breaking spaces
            info = node.xpath("./div[2]/div[2]/p[1]/text()[2]").extract()[0].lstrip().split('\xa0/\xa0')
            item['movieDate'] = info[0]
            item['movieCountry'] = info[1]
            item['movieCategory'] = info[2]
            item['moviePost'] = node.xpath("./div[@class='pic']/a/img/@src").extract()[0]
            yield item

        # The last page starts at offset 225; the original check (< 250) requested one extra, empty page
        if self.offset < 225:
            self.offset += 25
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
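The two fiddly parts of the spider are the "year / country / category" text, which is separated by non-breaking spaces (`\xa0`), and the paging by offset. Both can be tried without Scrapy. The sketch below is only an illustration: the helper names `parse_info_line` and `page_urls` and the sample string are assumptions, not code from the original spider.

```python
def parse_info_line(raw):
    """Split 'year / country / category' text into a 3-tuple.

    Missing parts come back as empty strings instead of raising IndexError.
    """
    parts = [p.strip() for p in raw.strip().split("\xa0/\xa0")]
    parts += [""] * (3 - len(parts))  # pad short results to three fields
    return parts[0], parts[1], parts[2]


def page_urls(base_url, page_size=25, total=250):
    """Build one URL per page: offsets 0, 25, ..., up to total - page_size."""
    return [base_url + str(offset) for offset in range(0, total, page_size)]


# Hypothetical text node, shaped like what the XPath returns
raw = "\n            1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n        "
year, country, category = parse_info_line(raw)
print(year, country, category)

urls = page_urls("https://movie.douban.com/top250?start=")
print(len(urls), urls[-1])
```

Padding the missing fields is a design choice for this sketch: one malformed entry then yields empty values rather than stopping the whole crawl.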

 

At this point the spider basically works and produces the JSON file we need.

Next comes the visualization.

First, let's take stock of the data we have.

import pandas as pd

# lines=True because the pipeline wrote one JSON object per line
douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
douban.info()

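From this data we can analyze the country of origin, the production year, the category, and the influence of certain directors within the Top 250.

For a first quick look, we can use the value_counts() function.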

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
# Use the first character of each country string as a crude key for the first-listed country
df_Country = douban['movieCountry'].astype(str).str.strip().str[0]
print(df_Country.value_counts())
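For readers who want to see what `value_counts()` computes, here is a rough pure-Python equivalent using `collections.Counter`. As a variation, it splits on whitespace to take the first listed country rather than a single character. The country strings are invented sample values, not the scraped data.

```python
from collections import Counter

# Hypothetical sample values, shaped like the movieCountry field
countries = ["美国", "美国 英国", "日本", " 中国大陆", "美国"]

# Keep only the first listed country of each entry, then tally
first_country = [c.strip().split()[0] for c in countries]
counts = Counter(first_country)

# most_common() sorts by descending count, similar to value_counts()
for country, n in counts.most_common():
    print(country, n)
```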

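American films account for about half of the list (122 of 250), which reflects the strength of the Hollywood film industry. Japanese and Hong Kong films likewise hold an important place in China. Surprisingly, the number of films from mainland China is rather disappointing: Douban's film fans are still very picky about domestic films.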

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
# The third character of the year is the decade digit, e.g. '9' for 1994 and '0' for 2003
df_Date = douban['movieDate'].astype(str).str.strip().str[2]
print(df_Date.value_counts())

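Films released since 2000 make up more than 70% of the list. Considering that only nine years of the 2010s had passed at the time of writing, and that ratings lag behind releases, newer films are on the whole more likely to be favored by audiences. This may be related to how the Douban Top 250 is selected: a film needs more than a certain number of ratings.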
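The share of films since 2000 can also be computed directly instead of being read off the decade digits. The following is a small sketch under the assumption that each year string starts with four digits. The list of years is invented sample data, so the printed share says nothing about the real Top 250.

```python
from collections import Counter

# Hypothetical sample years, not the scraped data
years = ["1994", "1993", "2001", "2010", "2014", "1997", "2008", "2016"]


def decade(year_text):
    """Map a year string such as ' 1994' to its decade, e.g. 1990."""
    year = int(year_text.strip()[:4])  # assumes the text starts with a 4-digit year
    return year // 10 * 10


by_decade = Counter(decade(y) for y in years)

# Share of films whose decade is 2000 or later
since_2000 = sum(n for d, n in by_decade.items() if d >= 2000)
share = since_2000 / len(years)

print(sorted(by_decade.items()))
print(f"{share:.0%}")
```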

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
# Use the first character of each category string as a crude key for the first-listed genre
df_Cate = douban['movieCategory'].astype(str).str.strip().str[0]
print(df_Cate.value_counts())

Drama films, with their rising and falling plots, win audience approval more easily.

Below are a few of the visualization charts.

 

 

 

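I am not very good at producing charts with Python, so these look a bit ugly. I would actually recommend a charting library such as ECharts, or Excel or BI software, for the charts: it is more convenient and the results look better.

This is my first attempt at this kind of scraping and visualization, so there are bound to be shortcomings. Please point them out.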
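As one way to follow the ECharts suggestion, the counts can be turned into an ECharts bar-chart option in Python and dumped as JSON for the page to load. This is only a sketch: the function name `bar_option`, the chart title and the counts are all made up for illustration.

```python
import json


def bar_option(title, counts):
    """Build a minimal ECharts bar-chart option from a {label: count} dict."""
    labels = list(counts.keys())
    values = list(counts.values())
    return {
        "title": {"text": title},
        "tooltip": {},
        "xAxis": {"type": "category", "data": labels},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": values}],
    }


# Hypothetical counts, not the scraped results
sample_counts = {"A": 3, "B": 2, "C": 1}
option = bar_option("Films per country (sample)", sample_counts)

# ensure_ascii=False keeps Chinese labels readable in the generated JSON
option_json = json.dumps(option, ensure_ascii=False, indent=2)
print(option_json)
```

On the page, the resulting object would be passed to the chart instance's `setOption` call in JavaScript.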
