Advanced Scrapy Crawling with CrawlSpider (Part 2)

Date: 2019-11-09

Tags: scrapy, advanced crawling, crawlspider. Category: Python



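In the previous chapter we used Rules to obtain the URL of the next page to crawl. Today we look at how to actually scrape a single page.

Our goal is to scrape the complete list of American TV shows, which spans 36 pages. Of those 36 pages, every page except the first is regular and tidy. Why is the first page different? Because it shows pictures of the top three shows, which needs special handling. So in this part we focus on scraping the remaining 35 pages.

http://www.ttmeiju.me/index.php/summary/index/p/2.html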

0x00_Recap

In the previous article, "Advanced Scrapy Crawling with CrawlSpider (Part 1)", we mainly covered writing a Rule that extracts the URL of the next page:

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]//a[@class="next"]'), callback='parse_item', follow=True),
    )

Today we flesh out the callback named in that Rule: the parse_item() method.

0x01_Analysis

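The page we want to scrape looks roughly like the screenshot above. Every entry is laid out in the same regular way, and when we look at the page source,

we find that each record is a tr tag whose class attribute contains either Scontent or Scontent1.

Looking inside the tr tag, each show has 7 pieces of information: rank, title, category, status, update day, return date and countdown. Our goal is to extract all 7. So we create a scrapy.Item class in items.py: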

    import scrapy


    class TtmjTvPlayItem(scrapy.Item):
        tv_play_name = scrapy.Field()
        tv_play_rank = scrapy.Field()
        tv_play_category = scrapy.Field()
        tv_play_state = scrapy.Field()
        tv_play_update_day = scrapy.Field()
        tv_play_return_date = scrapy.Field()
        tv_play_counting_data = scrapy.Field()
        tv_play_url = scrapy.Field()

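A scrapy.Item is how Scrapy packages the scraped result into a bean-like container, which the user can then process however they like. The fields are the 7 values listed above, plus tv_play_url for the show's URL, which we fill in later.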

0x02_Scraping the page: first step

To make debugging easier, we first remove follow from the Rule, or set it to False. The point is to debug against this single page first:

    http://www.ttmeiju.me/index.php/summary/index/p/2.html

With follow set to False, we run the following code:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class PlayrankingSpider(CrawlSpider):
        name = 'playranking'
        allowed_domains = ['www.ttmeiju.me']
        # start_urls = ['http://www.ttmeiju.me/']
        root_url = "http://www.ttmeiju.me/"
        start_urls = ['http://www.ttmeiju.me/index.php/summary/index/p/1.html']
        rules = (
            # Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]//a[@class="next"]'), callback='parse_item', follow=True),
            Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]//a[@class="next"]'), callback='parse_item'),
        )

        def parse_item(self, response):
            # For now, only print the URL of each page handed to the callback.
            print(response.url)

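The output looks like this:

As we can see, with follow removed only two pages are crawled, but parse_item() prints only the URL of the second page, not the first. Here is a brief explanation of what happens.

CrawlSpider starts crawling from the pages listed in start_urls, and the Rules are applied to the response of that first page and of every page crawled after it. So when Scrapy sends the first request to the start_urls page and receives the response, the Rules extract from it the URLs to crawl next. As a result, the parse_item() callback runs only on the pages the Rules extracted, not on the start_urls page itself.

But we also want to scrape the content of the start_urls page. No rush: we cover that in the next part. In this part we first deal with every page other than the first.

Next, we need to pull the values for the fields of TtmjTvPlayItem out of the response for each of those URLs.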

0x03_Hitting a pitfall

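The page data is very regular, so we again use XPath to locate the elements.

Each show's information is wrapped in a tr tag, and what identifies these tr tags is that their class attribute contains Scontent.

So we can proceed as follows: first select the tr tags of all shows into a list, then iterate over the list and, for each entry, extract the individual values and assign them to a TtmjTvPlayItem.

We use the Scrapy Shell mentioned in the previous part to work out the XPath: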

    $ scrapy shell http://www.ttmeiju.me/index.php/summary/index/p/2.html
    >>> response.xpath('//tr[contains(@class,"Scontent")]')

We can see that the result is indeed the full list of shows:

Now we only need to iterate over these tr tags, so we write parse_item() like this:

    def parse_item(self, response):
        tr_list = response.xpath('//tr[contains(@class, "Scontent")]')
        for content in tr_list:
            name = content.xpath('//td[@align="left"]//a/text()').extract()
            print(name)

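Here we want to print each show's name, but this is what we get:

WTF! Every iteration seems to print all of the names, which is not what we want.

This is the pitfall!!!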

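The cause is that an XPath expression starting with // is evaluated against the whole document, even when it is called on a single Selector taken from a SelectorList, so every tr in the loop returned all the names. The workaround used in this article is to extract() each selector into a string first, wrap that string in a new Selector, and run all further queries on that new Selector, which contains only the one row. (Prefixing the inner expression with a dot, as in .//td[@align="left"]//a/text(), also scopes the query to the current row.)

We change the code to the following: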

    def parse_item(self, response):
        tr_list = response.xpath('//tr[contains(@class, "Scontent")]').extract()
        for content in tr_list:
            selector = Selector(text=content)  # needs: from scrapy.selector import Selector
            name = selector.xpath('//td[@align="left"]//a/text()').extract()
            print(name)

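Note two lines in particular. The line that builds tr_list now calls extract(), so each row is first extracted as a string. Then, inside the loop, each string is wrapped in a Selector for the queries that follow. Now let's look at the printed output again:

Each name is now printed on its own. This step works, so next we just need to handle the rest of the data.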
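As a side note on this pitfall: the loop returned every name because an XPath that starts with // searches the whole document. The sketch below illustrates the alternative of scoping each query to its own row with a leading dot. It is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors so that it runs without a network connection, and the SAMPLE markup, show names and hrefs are made up rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing table (not the real page markup).
SAMPLE = """
<table>
  <tr class="Scontent"><td>1</td><td align="left"><a href="/meiju/A.html">Show A</a></td></tr>
  <tr class="Scontent1"><td>2</td><td align="left"><a href="/meiju/B.html">Show B</a></td></tr>
  <tr class="header"><td>No.</td><td align="left"><a href="/about.html">Name</a></td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# Substring match on the class attribute, mirroring contains(@class, "Scontent"):
# it keeps both the "Scontent" and "Scontent1" rows and drops the header row.
rows = [tr for tr in root.iter("tr") if "Scontent" in tr.get("class", "")]

names_per_row = []
for row in rows:
    # The leading "." scopes the search to this row only.
    links = row.findall(".//td[@align='left']//a")
    names_per_row.append([a.text for a in links])

print(names_per_row)
```

In Scrapy itself the same scoping would be written as content.xpath('.//td[@align="left"]//a/text()') inside the loop, which avoids the extract-and-rewrap step. The approach used in this article works as well.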

0x05_Filling in the data

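For each show, the layout is very regular: the name sits in an a tag inside the td that has align="left", and the other values are the text of the td tags.

Following the same approach as above, we change the code to find all the td tags and see what they contain: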

    def parse_item(self, response):
        tr_list = response.xpath('//tr[contains(@class, "Scontent")]').extract()
        for content in tr_list:
            selector = Selector(text=content)
            td_list = selector.xpath('//td/text()').extract()
            tv_name = selector.xpath('//td[@align="left"]//a/text()').extract_first()
            print(tv_name)
            for td_item in enumerate(td_list):
                print(td_item)

The result is a bit of a surprise:

The values are found, but they are full of line breaks and spaces, so we just need to clean them up a little:

    def parse_item(self, response):
        tr_list = response.xpath('//tr[contains(@class, "Scontent")]').extract()
        for content in tr_list:
            selector = Selector(text=content)
            td_list = selector.xpath('//td/text()').extract()
            tv_name = selector.xpath('//td[@align="left"]//a/text()').extract_first()
            print(tv_name)
            for index, td_item in enumerate(td_list):
                td_item = td_item.replace('\t', '').strip()
                print(td_item)

With that, we get the correct output:
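To make the cleanup step concrete, here is a small sketch of what replace('\t', '').strip() does to the kind of text nodes described above. The raw values are made up for illustration and are not copied from the real page; clean_cell is a hypothetical helper name.

```python
def clean_cell(text):
    """Remove all tab characters, then trim whitespace (spaces, newlines) from both ends."""
    return text.replace("\t", "").strip()


# Made-up raw text nodes with the kind of padding a td/text() query can return.
raw_cells = ["\n\t\t\t1\n\t\t", " Drama\t", "\r\n2019-11-09 "]
cleaned = [clean_cell(cell) for cell in raw_cells]
print(cleaned)  # ['1', 'Drama', '2019-11-09']
```

Note that strip() on its own already removes tabs, newlines and spaces at both ends of the string; the replace() call only matters for tabs in the middle of a value.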

0x06_Finding the URL

One piece of data is still missing: the show's URL. On inspection, the URL is in the link that wraps the show's name:

So we locate that tag with XPath and read the value out:

    # In XPath, read an attribute with @ followed by the attribute name
    tv_play_url = self.root_url + selector.xpath('//td[@align="left"]//a/@href').extract_first()[1:]

As shown here, to read the value of an attribute on a tag with XPath, we simply use @ followed by the attribute name.
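The URL is built by slicing the leading slash off the href and appending it to root_url. The sketch below compares that with the standard library's urllib.parse.urljoin, which may be a more forgiving way to do the same thing. The href values are hypothetical, and build_url is an illustrative helper that is not part of the article's spider.

```python
from urllib.parse import urljoin

ROOT_URL = "http://www.ttmeiju.me/"


def build_url(root_url, href):
    """Join a possibly relative href onto root_url; return None when no href was found."""
    if href is None:  # extract_first() returns None when nothing matches
        return None
    return urljoin(root_url, href)


href = "/meiju/Example.Show.html"  # hypothetical href value
by_slicing = ROOT_URL + href[1:]   # the approach used in the article
by_urljoin = build_url(ROOT_URL, href)
print(by_slicing == by_urljoin)    # True

# Without a leading slash, slicing silently drops the first character.
print(ROOT_URL + "meiju/x.html"[1:])          # http://www.ttmeiju.me/eiju/x.html
print(build_url(ROOT_URL, "meiju/x.html"))    # http://www.ttmeiju.me/meiju/x.html
```

Inside a Scrapy callback, response.urljoin(href) offers the same kind of joining relative to the current page.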

0x07_Putting the item together

Finally, we bring TtmjTvPlayItem into our parse_item():

    def parse_item(self, response):
        tr_list = response.xpath('//tr[contains(@class, "Scontent")]').extract()
        for content in tr_list:
            tv_item = TtmjTvPlayItem()
            selector = Selector(text=content)
            td_list = selector.xpath('//td/text()').extract()
            tv_name = selector.xpath('//td[@align="left"]//a/text()').extract_first()
            tv_item['tv_play_name'] = tv_name
            for index, td_item in enumerate(td_list):
                td_item = td_item.replace('\t', '').strip()
                if index == 0:
                    tv_item['tv_play_rank'] = td_item
                elif index == 1:
                    tv_item['tv_play_category'] = td_item
                elif index == 2:
                    tv_item['tv_play_state'] = td_item
                elif index == 3:
                    tv_item['tv_play_update_day'] = td_item
                elif index == 4:
                    tv_item['tv_play_return_date'] = td_item
                elif index == 5:
                    tv_item['tv_play_counting_data'] = td_item
            # In XPath, read an attribute with @ followed by the attribute name
            tv_item['tv_play_url'] = self.root_url + selector.xpath('//td[@align="left"]//a/@href').extract_first()[1:]
            print(tv_item)
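The if/elif chain above maps a cell's position to a field name. As an optional variation, the same mapping can be sketched with a list of field names and zip(). This is a minimal sketch rather than the article's method: a plain dict stands in for TtmjTvPlayItem so the snippet runs without Scrapy, cells_to_fields is a hypothetical helper name, and the sample cell values are invented.

```python
# Field names in the order the cells appear, as in the if/elif chain.
FIELD_ORDER = [
    "tv_play_rank",
    "tv_play_category",
    "tv_play_state",
    "tv_play_update_day",
    "tv_play_return_date",
    "tv_play_counting_data",
]


def cells_to_fields(td_list):
    """Clean each cell and pair it with its field name by position.

    zip() stops at the shorter input, so cells beyond the sixth are ignored,
    just as the if/elif chain ignores any index above 5.
    """
    cleaned = (td.replace("\t", "").strip() for td in td_list)
    return dict(zip(FIELD_ORDER, cleaned))


# Invented cell values for illustration only.
fields = cells_to_fields(["\t1 ", "Drama", "Airing", "Sunday", "2019-11-09", "3 days"])
print(fields["tv_play_rank"], fields["tv_play_counting_data"])
```

In the spider, each key and value from the returned dict could then be assigned onto the item with the same tv_item[key] = value form used above.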