Python full-stack development, Day 137 (crawler series, chapter 4: the Scrapy framework)

Date: 2019-11-18


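1、Introduction to the Scrapy framework

1. Overview

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping). It extracts the data you need from websites in a fast, simple, and extensible way. Today its uses are much broader: data mining, monitoring, and automated testing, as well as fetching data returned by APIs (such as Amazon Associates Web Services) or general-purpose web crawling. Scrapy is built on Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (asynchronous) code to achieve concurrency.

It is the best-known framework in the crawler world, much as Django is among web frameworks.

Scrapy's asynchrony comes from Twisted.
Twisted maintains an event queue, and whichever event becomes active gets executed.

The overall architecture is roughly as follows:

The blue bars in the architecture diagram represent middleware.

Think of SPIDERS, SCHEDULER, DOWNLOADER, and ITEM PIPELINE as four people.

They never talk to each other directly; all communication goes through the ENGINE.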

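'''
Components:

1. Engine (ENGINE)
The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the data flow section below for details.

2. Scheduler (SCHEDULER)
Receives requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL to crawl next and also filters out duplicate URLs.

3. Downloader (DOWNLOADER)
Downloads page content and returns it to the ENGINE. The downloader is built on Twisted's efficient asynchronous model.

4. Spiders (SPIDERS)
SPIDERS are classes written by the developer to parse responses and extract items, or to issue new requests.

5. Item pipelines (ITEM PIPELINES)
Process items after they have been extracted. Typical jobs are cleaning, validation, and persistence (for example, storing to a database).

6. Downloader middlewares (Downloader Middlewares)
Sit between the Scrapy engine and the downloader. They process requests passing from the ENGINE to the DOWNLOADER and responses passing from the DOWNLOADER back to the ENGINE.
You can use this middleware to:
  (1) process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
  (2) change received response before passing it to a spider;
  (3) send a new Request instead of passing received response to a spider;
  (4) pass response to a spider without fetching a web page;
  (5) silently drop some requests.

7. Spider middlewares (Spider Middlewares)
Sit between the ENGINE and the SPIDERS. They process the spiders' input (responses) and output (items and requests).
'''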

See the official Scrapy documentation for details.

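The scheduler can deduplicate URLs. If a failed request needs to be retried, however, that request must not be filtered out as a duplicate.

So whether to deduplicate depends on the project's requirements.

Suppose you request 5 URLs and one of them stalls. Scrapy switches to the other URLs in the queue instead of waiting. Because the requests are asynchronous, the CPU is not left idle waiting on the network.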
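The effect of non-blocking requests can be sketched with the standard library. This is only an illustration: it uses asyncio rather than Twisted, the URLs are made up, and asyncio.sleep stands in for network latency. One slow URL does not hold up the other four, so the total time is close to the slowest request instead of the sum of all of them.

```python
import asyncio
import time


async def fetch(url: str, delay: float) -> str:
    # asyncio.sleep stands in for waiting on the network (assumption:
    # no real HTTP request is made in this sketch).
    await asyncio.sleep(delay)
    return f"{url} done"


async def crawl_all(delays: dict) -> list:
    # Start every "request" at once and wait for all of them together.
    tasks = [fetch(url, delay) for url, delay in delays.items()]
    return await asyncio.gather(*tasks)


def run_demo():
    # Hypothetical URLs; url3 is the one that "stalls".
    delays = {"url1": 0.1, "url2": 0.1, "url3": 0.3, "url4": 0.1, "url5": 0.1}
    started = time.perf_counter()
    results = asyncio.run(crawl_all(delays))
    elapsed = time.perf_counter() - started
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results)
    print(f"elapsed: {elapsed:.2f}s (sequential would take about 0.7s)")
```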

2. The Scrapy workflow

Recall the basic crawler workflow covered earlier.

So how does Scrapy help us scrape data?

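3. Data flow

The data flow in Scrapy is controlled by the execution engine and goes like this:

1. The engine opens a domain, locates the Spider that handles that site, and asks the spider for the first URL(s) to crawl.

2. The engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler as Requests.

3. The engine asks the Scheduler for the next URL to crawl.

4. The Scheduler returns the next URL to crawl to the engine, and the engine sends it to the Downloader through the downloader middleware (request direction).

5. Once the page finishes downloading, the Downloader generates a Response for that page and sends it to the engine through the downloader middleware (response direction).

6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middleware (input direction).

7. The Spider processes the Response and returns scraped Items and new (follow-up) Requests to the engine.

8. The engine sends the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.

9. The process repeats (from step 2) until there are no more requests in the Scheduler, and the engine closes the domain.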
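The nine steps above can be sketched as a toy, single-threaded loop. Nothing here is Scrapy's real engine: FAKE_SITE, download, parse, and run_engine are hypothetical names, plain strings play the role of Requests, and the "downloader" reads from an in-memory dict instead of the network.

```python
from collections import deque

# Hypothetical in-memory "website": url -> (page title, links on the page)
FAKE_SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider callback: yield one item, then the follow-up requests."""
    yield {"item": response["title"]}
    for link in response["links"]:
        yield link  # a plain string plays the role of a Request


def run_engine(start_urls):
    scheduler = deque()  # pending requests
    seen = set()         # duplicate filter
    stored = []          # what the item pipeline persisted

    def schedule(url):
        if url not in seen:
            seen.add(url)
            scheduler.append(url)

    for url in start_urls:                 # steps 1-2: seed the scheduler
        schedule(url)
    while scheduler:                       # step 9: loop until the queue is empty
        url = scheduler.popleft()          # steps 3-4: next request
        response = download(url)           # step 5: downloader returns a response
        for result in parse(response):     # steps 6-7: spider handles the response
            if isinstance(result, dict):
                stored.append(result["item"])  # step 8: items go to the pipeline
            else:
                schedule(result)               # step 8: requests go to the scheduler
    return stored


if __name__ == "__main__":
    print(run_engine(["/"]))
```

The real engine does the same bookkeeping asynchronously, so many downloads are in flight at once.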

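Reference for the content above:

http://www.javashuo.com/article/p-vitmtisf-hn.html

Every URL that is visited goes through the 9 steps listed above; the process repeats for each one.

If a request fails, it can be sent again.

Spiders cover two of these steps: crawling and parsing.

Why have middleware?

A page request may carry dozens of cookies, and attaching them by hand to every request is tedious.
The same goes for proxies: when one IP gets blocked, you need to switch to another.
Middleware handles both in one place, so every request automatically carries the cookies and goes out through a proxy.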
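A minimal sketch of such a downloader middleware is shown below. The class name CookieProxyMiddleware, the cookie values, and the proxy addresses are all made up for illustration. The sketch relies on two things Scrapy does provide: the process_request(request, spider) hook, and the request.meta['proxy'] key that the built-in proxy support reads. Proxies are rotated round-robin here, which is a simplification of "switch when blocked".

```python
from types import SimpleNamespace


class CookieProxyMiddleware:
    """Downloader-middleware sketch: decorate every outgoing request."""

    def __init__(self, cookies, proxies):
        self.cookies = cookies        # cookies to attach to every request
        self.proxies = list(proxies)  # hypothetical proxy pool
        self._next = 0

    def process_request(self, request, spider):
        # Attach the shared cookies.
        request.cookies.update(self.cookies)
        # Pick the next proxy in round-robin order.
        request.meta["proxy"] = self.proxies[self._next % len(self.proxies)]
        self._next += 1
        return None  # None tells Scrapy to keep processing this request


if __name__ == "__main__":
    mw = CookieProxyMiddleware(
        {"session": "abc"},
        ["http://10.0.0.1:8080", "http://10.0.0.2:8080"],
    )
    # SimpleNamespace stands in for a scrapy.Request in this demo.
    req = SimpleNamespace(cookies={}, meta={})
    mw.process_request(req, spider=None)
    print(req.cookies, req.meta)
```

In a real project the class would live in the project's middlewares module and be enabled through the DOWNLOADER_MIDDLEWARES setting.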

An item is simply the parsed data.

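2. Installation

Windows

Installing on Windows is harder than on Linux, where a single command does the job. Many people learning Scrapy get stuck on installation alone.

To keep you from giving up on Scrapy, here is the detailed procedure.

First, the environment: Windows 10 Home, 64-bit, with Python 3.6 already installed.

Search for cmd and run it as administrator.

Install wheel

C:\Windows\system32>pip3 install wheel

Install lxml

C:\Windows\system32>pip3 install lxml

Install pyopenssl

C:\Windows\system32>pip3 install pyopenssl

Install pywin32

Open this page:

https://sourceforge.net/projects/pywin32/files/pywin32/

Click Build 221

Download pywin32-221.win-amd64-py3.6.exe

Run it directly

Click Next through the installer

and the installation completes.

Install twisted

Open this page:

https://pypi.org/project/Twisted/#files

Do not download from here. Why? Judging by the file names, the Windows build offered there only supports Python 2.7, but I am on Python 3.6.

Open another page instead:

https://github.com/zerodhatech/python-wheels

Download the file Twisted-17.9.0-cp36-cp36m-win_amd64.whl

Click Download

Open the file's properties and copy its path

Install it

C:\Windows\system32>pip3 install C:\Users\vr\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl

Install scrapy

C:\Windows\system32>pip3 install scrapy

At this point Scrapy is installed.

Linux

Installing on Linux is simple; one command is enough:

pip3 install scrapy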

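3. Command-line tool

1. Getting help

scrapy -h
scrapy <command> -h

2. Two kinds of commands

There are two kinds of commands: project-only commands must be run from inside a project directory, while global commands work anywhere.

Global commands

startproject # create a project
genspider    # create a spider
settings     # show settings; inside a project directory it shows that project's settings
runspider    # run a spider from a standalone Python file, no project needed
shell        # scrapy shell <url>: interactive debugging, e.g. checking whether selector rules are correct
fetch        # fetch a single page independently of any project; can also show the HTTP headers
view         # download a page and open it in the browser, which helps identify data loaded by ajax requests
version      # scrapy version shows the Scrapy version; scrapy version -v also shows the versions of its dependencies

Project-only commands

crawl        # run a spider; requires a project (this tutorial sets ROBOTSTXT_OBEY = False in the settings)
check        # run contract checks on the project's spiders
list         # list the names of the spiders in the project
edit         # open a spider in an editor; rarely used
parse        # scrapy parse <url> --callback <callback>: verify that a callback works as expected
bench        # scrapy bench: run a quick benchmark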

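Example:

Create a crawler project named DianShang

C:\Users\xiao>e:
E:\>cd E:\python_script\爬蟲\day3
E:\python_script\爬蟲\day3>scrapy startproject DianShang

Generate a spider. Note: the jd on the left is the spider (module) name, and the jd.com on the right is the domain to crawl.

One project can crawl more than one site.

E:\python_script\爬蟲\day3>cd DianShang
E:\python_script\爬蟲\day3\DianShang>scrapy genspider jd jd.com

Note: the spider name must not be the same as the project name.

Run the spider. Note: the name given here must match the spider name used above.

E:\python_script\爬蟲\day3\DianShang>scrapy crawl jd

A single project can also run several spiders.

The commands you will use most are the three demonstrated above:

scrapy startproject DianShang   # create the crawler project
scrapy genspider jd jd.com      # generate a spider
scrapy crawl jd                 # run the spider

3. Official documentation

https://docs.scrapy.org/en/latest/topics/commands.html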

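4. Directory structure

project_name/
    scrapy.cfg              #  main project configuration, used when deploying Scrapy
    project_name/           #  the project's Python package
       __init__.py 
       items.py            #  data templates for structuring scraped data, similar to a Django Model
       pipelines.py        #  data persistence
       settings.py         #  project settings
       spiders/            #  directory for spiders
           __init__.py
           爬蟲1.py        # spider 1
           爬蟲2.py 
           爬蟲3.py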

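File notes:

scrapy.cfg: the main project configuration, used when deploying Scrapy. Crawler-related settings live in settings.py.
items.py: defines data templates for structuring scraped data, similar to a Django Model.
pipelines: data-processing behaviour, typically persisting the structured data.
settings.py: the settings file, e.g. crawl depth, concurrency, download delay. Important: setting names must be upper case, otherwise they are ignored. Correct form: USER_AGENT='xxxx'
spiders: the spider directory, where you create files and write the crawling rules.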
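As an illustration of the settings mentioned above, a settings.py fragment might look like the sketch below. The setting names are real Scrapy settings; the values and the user-agent string are placeholders chosen for this example, not recommendations.

```python
# settings.py (fragment) -- illustrative values only

# Setting names must be upper case, otherwise Scrapy ignores them.
USER_AGENT = "Mozilla/5.0 (compatible; DianShangBot/0.1)"  # hypothetical UA string

DEPTH_LIMIT = 3           # maximum crawl depth; 0 means no limit
CONCURRENT_REQUESTS = 16  # maximum number of concurrent requests
DOWNLOAD_DELAY = 0.5      # seconds to wait between requests to the same site
```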

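Notes:

1. Spider files are usually named after the site's domain.

2. By default spiders are started from a terminal. For convenience you can start them from a script:

# Create entrypoint.py in the project root
from scrapy.cmdline import execute
# 'xiaohua' is the spider name in this example
execute(['scrapy', 'crawl', 'xiaohua'])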

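Framework basics: the Spider class and selectors

Example:

The DianShang project was created above. Edit settings.py and set the following option to False:

ROBOTSTXT_OBEY = False

This option controls whether the crawler obeys the robots.txt protocol. This tutorial turns it off, because obeying it would block most of the pages we want to crawl.

Never crawl information that involves finance, national security, or social stability; the consequences can be severe.

Typing scrapy crawl jd on the command line every time is tedious.

Create a file named bin.py in the project root. Note that it sits at the same level as scrapy.cfg.

# Create bin.py in the project root
from scrapy.cmdline import execute
# The last argument is the spider name
execute(['scrapy', 'crawl', 'jd'])

Run bin.py directly to start the spider.

The run prints a block of red text. Note: this is log output, not an error.

The log can also be turned off. Edit bin.py:

# bin.py in the project root
from scrapy.cmdline import execute
# The third argument is the spider name
execute(['scrapy', 'crawl', 'jd','--nolog'])

Run it again and the log output is gone.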

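2、The Spider class

Spiders are classes that define how a site (or a group of sites) is scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from the pages (i.e. scrape items). In other words, Spiders are where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

1. Generate the initial Requests to crawl the first URLs, and specify a callback function.
     The first requests are produced by the start_requests() method, which by default generates a Request for each URL in the start_urls list.
     The default callback is the parse method. The callback is triggered automatically when the download completes and a response comes back.

2. In the callback, parse the response and return a value.
     The return value can be one of 4 kinds:
          a dict containing the parsed data
          an Item object
          a new Request object (new Requests also need a callback)
          or an iterable (containing Items or Requests)

3. Parse the page content inside the callback.
   Scrapy's built-in Selectors are the usual choice, but you can just as well use BeautifulSoup, lxml, or whatever you prefer.

4. Finally, the returned Items are persisted:
   to a database through the Item Pipeline component (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
   or exported to files through Feed exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

Example:

Open the DianShang project and edit DianShang-->spiders-->jd.py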

# -*- coding: utf-8 -*-
import scrapy


class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']
    start_urls = ['http://jd.com/']

    def parse(self, response):
        print(response,type(response))

Run bin.py. The output is:

<200 https://www.jd.com/> <class 'scrapy.http.response.html.HtmlResponse'>

start_requests

Ctrl+click on Spider in the following line to open its source code:

class JdSpider(scrapy.Spider):

Find the start_requests method and look at its last 2 lines.

def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

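It runs a for loop over self.start_urls, which is the start_urls list defined on the JdSpider class.

If the list is empty, the yield never executes.

Each iteration yields a Request object, so the method is a generator.

Next, look at the Request source: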

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority

        if callback is not None and not callable(callback):
            raise TypeError('callback must be a callable, got %s' % type(callback).__name__)
        if errback is not None and not callable(errback):
            raise TypeError('errback must be a callable, got %s' % type(errback).__name__)
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self.flags = [] if flags is None else list(flags)

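Parameters:

url (string) - the URL of this request.
callback (callable) - the function that will be called with the response of this request (once it is downloaded) as its first argument. If a request does not specify a callback, the spider's parse() method is used. Note that if an exception is raised during processing, errback is called instead.
method (string) - the HTTP method of this request. Defaults to 'GET'.
meta (dict) - the initial values for the Request.meta attribute. If given, the dict passed in this parameter is shallow-copied.
body (bytes or str) - the request body. If a str is passed, it is encoded to bytes using the given encoding (utf-8 by default). If body is not given, an empty bytes object is stored. Regardless of the argument's type, the value finally stored is bytes (never str or None).
headers (dict) - the headers of this request. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers). If None is passed as a value, that HTTP header is not sent at all.
cookies (dict or list) - the request cookies.

encoding (string) - the encoding of this request (defaults to 'utf-8'). It is used to percent-encode the URL and to convert the body to bytes (if given as a str).
priority (int) - the priority of this request (defaults to 0). The scheduler uses the priority to decide the order in which requests are processed. Requests with a higher priority value execute earlier. Negative values are allowed to indicate relatively low priority.
dont_filter (boolean) - indicates that this request should not be filtered by the scheduler. Use it when you want to perform an identical request multiple times and bypass the duplicates filter. Use it with care, or you will end up in a crawling loop. Defaults to False.
errback (callable) - a function that will be called if any exception is raised while processing the request. This includes pages that fail with 404 HTTP errors and the like. It receives a Twisted Failure instance as its first argument.

flags (list) - the initial values for the Request.flags attribute. If given, the list is shallow-copied.

In practice you often need to define start_requests yourself, because a default GET request cannot always reach the content you want.

For example, to read my profile on https://github.com/, a plain GET request with no credentials gets nothing. You have to log in first.

Example:

Open the DianShang project, edit DianShang-->spiders-->jd.py, and add a start_requests method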

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # import the Request class

class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com']
    # start_urls = ['http://jd.com/']  # not used for now

    def start_requests(self):
        r1 = Request(url="http://jd.com/")
        yield r1  # yield the request rather than returning it

    def parse(self, response):
        print(response,type(response))

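Run bin.py again. PyCharm prints:

<200 https://www.jd.com/> <class 'scrapy.http.response.html.HtmlResponse'>

Note: prefer yield for returning requests, since it saves memory. Scrapy iterates over the result with a for loop either way.

The start_requests method does not pass a callback argument, so the default callback, parse, is used.

Calling start_requests does not send anything. It only produces Request objects, which are placed in a request queue and sent asynchronously by Twisted. When a response arrives, parse is called.

The parse function above does not return or yield anything yet, so steps 7 and 8 of the data flow are never exercised.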

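3、Selectors

To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and an example page hosted on the Scrapy documentation server.

This is its HTML code: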

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

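First, open the shell:

scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html

After the shell loads, the response is available as the response shell variable, with its attached selector in the response.selector attribute.

Let's build an XPath that selects the text inside the title tag: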

>>> response.selector.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]

Querying responses with XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to select nested data quickly:

>>> response.css('img').xpath('@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

To actually extract the text data, call the selector's .extract() method, as follows:

>>> response.xpath('//title/text()').extract()
['Example website']

If you only want the first matched element, call the selector's .extract_first() method:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
'Name: My image 1 '

Now let's get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
['http://example.com/']
>>>
>>> response.css('base::attr(href)').extract()
['http://example.com/']
>>>
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>>
>>> response.css('a[href*=image]::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>>
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>>
>>> response.css('a[href*=image] img::attr(src)').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']
>>>

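4、DupeFilter (deduplication)

Default behaviour:

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
Request(...,dont_filter=False). Setting dont_filter=True tells Scrapy that this request does not take part in deduplication.

Where it happens in the source:

from scrapy.core.scheduler import Scheduler
See the enqueue_request method of Scheduler: self.df.request_seen(request)

Custom deduplication rules:

from scrapy.dupefilters import RFPDupeFilter. Read its source and model your own class on BaseDupeFilter.

# Step 1: create a custom dedup file dup.py in the project directory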
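class UrlFilter(object):
    def __init__(self):
        self.visited = set()  # could also be kept in a database

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        if request.url in self.visited:
            return True   # already seen: the scheduler drops the request
        self.visited.add(request.url)
        return False      # first visit: let the request through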
 
    def open(self):  # can return deferred
        pass
 
    def close(self, reason):  # can return a deferred
        pass
 
    def log(self, request, spider):  # log that a request has been filtered
        pass
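A variation on the filter above, sketched under stated assumptions: instead of storing whole URLs it stores a short hash of the HTTP method plus the URL, so a GET and a POST to the same address are treated as different requests. The class name FingerprintFilter and the fingerprint recipe are inventions for this example and are much simpler than Scrapy's own request fingerprinting. To take effect, a custom filter has to be named in the DUPEFILTER_CLASS setting, for example 'DianShang.dup.FingerprintFilter' if the class lives in dup.py (an assumed module path). Depending on the Scrapy version, the filter may be constructed through from_settings or from_crawler.

```python
import hashlib
from types import SimpleNamespace


class FingerprintFilter:
    """Dedup sketch: remember a SHA-1 of 'METHOD url' per request."""

    def __init__(self):
        self.fingerprints = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def fingerprint(self, request):
        raw = f"{request.method.upper()} {request.url}".encode("utf-8")
        return hashlib.sha1(raw).hexdigest()

    def request_seen(self, request):
        fp = self.fingerprint(request)
        if fp in self.fingerprints:
            return True   # duplicate: the scheduler should drop it
        self.fingerprints.add(fp)
        return False      # new request: let it through

    def open(self):
        pass

    def close(self, reason):
        pass

    def log(self, request, spider):
        pass


if __name__ == "__main__":
    # SimpleNamespace stands in for scrapy.Request in this demo.
    f = FingerprintFilter()
    get = SimpleNamespace(method="GET", url="http://example.com/")
    post = SimpleNamespace(method="POST", url="http://example.com/")
    print(f.request_seen(get), f.request_seen(get), f.request_seen(post))
```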

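5、Item

The main goal of scraping is to extract structured data from unstructured sources (typically web pages). Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

1. Declaring Items

Items are declared using a simple class definition syntax and Field objects. Here is an example:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Note: those familiar with Django will notice that Scrapy Items are declared much like Django Models, except that Scrapy Items are simpler because there is no concept of different field types.

2. Item fields

Field objects are used to specify metadata for each field. An example is the serializer function for the last_updated field shown above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For the same reason, there is no reference list of all available metadata keys.

Each key defined in a Field object may be used by a different component, and only that component knows about it. You can also define and use any other Field key in your project for your own needs.

The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, the components whose behaviour depends on a field use certain field keys to configure that behaviour.

3. Working with Items

Here are some examples of common tasks performed with items, using the Product item declared above. You will notice the API is very similar to the dict API.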

Creating items
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
Getting field values
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
 
>>> product['price']
1000
 
>>> product['last_updated']
Traceback (most recent call last):
...
KeyError: 'last_updated'
 
>>> product.get('last_updated', 'not set')
not set
 
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
 
>>> product.get('lala', 'unknown field')
'unknown field'
 
>>> 'name' in product # is name field populated?
True
 
>>> 'last_updated' in product # is last_updated populated?
False
 
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
 
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
 
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accessing all populated values
To access all populated values, just use the typical dict API:
 
>>> product.keys()
['price', 'name']
 
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Other common tasks
Copying items:
 
>>> product2 = Product(product)
>>> print(product2)
Product(name='Desktop PC', price=1000)
 
>>> product3 = product2.copy()
>>> print(product3)
Product(name='Desktop PC', price=1000)
Creating dicts from items:
 
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
Creating items from dicts:
 
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
 
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'

4. Extending Items

You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

For example:

class DiscountedProduct(Product):
      discount_percent = scrapy.Field(serializer=str)
      discount_expiration_date = scrapy.Field()

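6、Item Pipeline

After an item has been scraped by a spider, it is sent to the item pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes just called an "item pipeline") is a Python class that implements a simple method. It receives an item, performs an action on it, and decides whether the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are: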

cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database

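1. Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method:

process_item(self, item, spider)
This method is called for every item pipeline component.

process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

Additionally, components may implement the following methods:

open_spider(self, spider)
This method is called when the spider is opened.

close_spider(self, spider)
This method is called when the spider is closed.

from_crawler(cls, crawler)
If present, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components,
such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.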
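The order in which these four hooks fire can be seen in a small stand-alone sketch. CountingPipeline, the ITEM_PREFIX setting, and the simulate helper are hypothetical names made up for this example; a plain dict stands in for Scrapy's settings object, and the hooks are called by hand in the order Scrapy would call them.

```python
from types import SimpleNamespace


class CountingPipeline:
    """Pipeline sketch that records each lifecycle call it receives."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.events = []

    @classmethod
    def from_crawler(cls, crawler):
        # ITEM_PREFIX is a made-up setting used only in this sketch.
        return cls(prefix=crawler.settings.get("ITEM_PREFIX", "item"))

    def open_spider(self, spider):
        self.events.append("open")

    def process_item(self, item, spider):
        self.events.append(f"{self.prefix}:{item['name']}")
        return item  # hand the item on to the next pipeline component

    def close_spider(self, spider):
        self.events.append("close")


def simulate(items):
    """Call the hooks by hand, in the order Scrapy would call them."""
    crawler = SimpleNamespace(settings={"ITEM_PREFIX": "saw"})
    pipeline = CountingPipeline.from_crawler(crawler)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline.events


if __name__ == "__main__":
    print(simulate([{"name": "a"}, {"name": "b"}]))
```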

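2. Item pipeline examples

(1) Price validation and dropping items with no price

Take a look at the following hypothetical pipeline. It adjusts the price attribute of items that do not include VAT (the price_excludes_vat attribute) and drops items that have no price: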

from scrapy.exceptions import DropItem
 
class PricePipeline(object):
 
    vat_factor = 1.15
 
    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

(2) Writing items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, with one item per line serialized in JSON format:

import json
 
class JsonWriterPipeline(object):
 
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')
 
    def close_spider(self, spider):
        self.file.close()
 
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

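Note: the purpose of JsonWriterPipeline is only to show how to write an item pipeline. If you really want to store all scraped items in a JSON file, use Feed exports instead.

(3) Writing items to a database

In this example we use pymongo to write items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings; the MongoDB collection is named after the item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly: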

import pymongo
 
class MongoPipeline(object):
 
    collection_name = 'scrapy_items'
 
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
 
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )
 
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
 
    def close_spider(self, spider):
        self.client.close()
 
    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

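(4) Duplicates filter

A filter that looks for duplicate items and drops those that have already been processed. Assume our items have a unique ID, but our spider returns multiple items with the same ID: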

from scrapy.exceptions import DropItem
 
class DuplicatesPipeline(object):
 
    def __init__(self):
        self.ids_seen = set()
 
    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

3. Activating an item pipeline component

To activate an Item Pipeline component, add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

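The integer values you assign to classes in this setting determine the order in which they run: items pass from lower-valued to higher-valued classes. It is customary to define these numbers in the 0-1000 range.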
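The ordering rule can be illustrated without Scrapy. In this sketch the stage names, the STAGES registry, and run_pipelines are hypothetical, and the local DropItem class is only a stand-in for scrapy.exceptions.DropItem. The point is that the stage with the lower number runs first, regardless of the order in which the entries are written.

```python
# Lower numbers run first, even though write_json is listed first here.
ITEM_PIPELINES = {"write_json": 800, "check_price": 300}


class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


def check_price(item):
    if not item.get("price"):
        raise DropItem("missing price")
    return item


def write_json(item):
    item["written"] = True  # pretend the item was written out
    return item


STAGES = {"check_price": check_price, "write_json": write_json}


def run_pipelines(item):
    """Run the stages in ascending order of their configured number."""
    order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
    for name in order:
        try:
            item = STAGES[name](item)
        except DropItem:
            return None, order  # dropped items skip the remaining stages
    return item, order


if __name__ == "__main__":
    print(run_pipelines({"price": 10}))
    print(run_pipelines({"name": "no price"}))
```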

For pymongo operations, see:

https://www.cnblogs.com/yuanchenqi/articles/9602847.html

Reference for this article:

https://www.cnblogs.com/yuanchenqi/articles/9509793.html

7、Crawling Amazon

Create a new spider

scrapy genspider amazon amazon.cn

Edit DianShang-->spiders-->amazon.py to print the response

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']

    def start_requests(self):
        r1 = Request(url="http://amazon.cn/")
        yield r1

    def parse(self, response):
        print(response.text)

Edit bin.py

# bin.py in the project root
from scrapy.cmdline import execute
# The third argument is the spider name
execute(['scrapy', 'crawl', 'amazon'])

Run bin.py and a 503 error shows up:

2018-10-01 14:47:19 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://www.amazon.cn/> (referer: None)

This means Amazon blocked the request. The fix is to add a request header.

Edit DianShang-->spiders-->amazon.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="http://amazon.cn/",headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        print(response.text)

Run bin.py again and the page HTML comes back:

<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">
...
</html>

Start by searching for iPhone X; the site jumps to a results page.

The URL in the address bar is very long. The part that matters is:

https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x

Edit DianShang-->spiders-->amazon.py and change the url

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        print(response.text)

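Edit bin.py to turn off the log

# bin.py in the project root
from scrapy.cmdline import execute
# The third argument is the spider name
execute(['scrapy', 'crawl', 'amazon',"--nolog"])

Run bin.py again; it prints a block of HTML.

Look at the iPhone X search results page first. We want to scrape the name, price, and delivery method of each iPhone X listing.

The results page does not show the delivery method, so click through to a specific product.

After removing the extra parameters from the end of the url, the address is:

https://www.amazon.cn/dp/B075LGPY95

This detail page has all 3 pieces of information we need, so we can fetch them directly from this address.

Getting the product detail links

The first step is to collect the detail links of all products.

On the iPhone X results page, the first product has many clickable areas, such as the image, the title, and the price. All of them link to the product detail page.

Here I use only the title to get the detail link.

First copy the XPath rule

Edit DianShang-->spiders-->amazon.py to parse with the XPath rule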

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # 商品詳細連接
        detail_urls = response.xpath('//*[@id="result_0"]/div/div[3]/div[1]/a/h2').extract()
        print(detail_urls)

Run bin.py. The output is:

['<h2 data-attribute="Apple 蘋果 手機 iPhone X 銀色 64G" data-max-rows="0" class="a-size-base s-inline  s-access-title  a-text-normal">Apple 蘋果 手機 iPhone X 銀色 64G</h2>']

That gives the product name. To get the link instead, select the href attribute of the enclosing a tag with @href.

Edit DianShang-->spiders-->amazon.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # 商品詳細連接
        detail_urls = response.xpath('//*[@id="result_0"]/div/div[3]/div[1]/a/@href').extract()
        print(detail_urls)

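Run bin.py. The output is:

['https://www.amazon.cn/dp/B075LGPY95']

That is only one product. How do we get every product on the page?

Look for the pattern first: each product is an li tag with an id attribute of the form result_<number>.

Every id starts with the prefix result_, so an XPath fuzzy match with contains() does the job.

Edit DianShang-->spiders-->amazon.py to use the fuzzy match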

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # 商品詳細連接
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        print(detail_urls)

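Run bin.py again. It prints a list of links:

['https://www.amazon.cn/dp/B075LGPY95', 'https://www.amazon.cn/dp/B0763KX27G', ...]

Each of them is a product detail link. So how do we make Scrapy visit these links?

Note: whenever the parse function returns (or yields) a Request object, it is placed on the asynchronous request queue and sent by Twisted.

Edit DianShang-->spiders-->amazon.py to yield Request objects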

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # 商品詳細連接
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # 請求頭
                          callback=self.parse_detail,  # 回調函數
                          dont_filter=True  # 不去重
                          )

    def parse_detail(self, response):  # 獲取商品詳細信息
        print(response.text)

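Run bin.py. It prints a large amount of text: the HTML of the product detail pages.

Getting the product name

Start with the product name.

Edit DianShang-->spiders-->amazon.py: change parse_detail and add an XPath rule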

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # 商品詳細連接
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # 請求頭
                          callback=self.parse_detail,  # 回調函數
                          dont_filter=True  # 不去重
                          )

    def parse_detail(self, response):  # 獲取商品詳細信息
        # 商品名
        name = response.xpath('//*[@id="productTitle"]/text()').extract()
        print(name)

Output:

\n                        Apple 蘋果 手機 iPhone X 深空灰色 64G\n

The value is padded with newlines and spaces. How do we remove them? Use strip:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # 商品詳細連接
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # 請求頭
                          callback=self.parse_detail,  # 回調函數
                          dont_filter=True  # 不去重
                          )

    def parse_detail(self, response):  # 獲取商品詳細信息
        # 商品名
        name = response.xpath('//*[@id="productTitle"]/text()').extract()
        if name:
            name = name.strip()
            
        print(name)

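Run bin.py and nothing is printed, because --nolog is also hiding the error.

Edit bin.py and remove the flag that disables logging

# bin.py in the project root
from scrapy.cmdline import execute
# The third argument is the spider name
execute(['scrapy', 'crawl', 'amazon'])

Run it again and an error appears:

AttributeError: 'list' object has no attribute 'strip'

So name is a list. extract() is the wrong call here; switch to extract_first(), which returns only the first match.

Edit DianShang-->spiders-->amazon.py: change parse_detail to use extract_first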

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # 導入模塊

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # 自定義配置，注意：變量名必須是custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # Links to the product detail pages
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # request headers
                          callback=self.parse_detail,  # callback that handles the response
                          dont_filter=True  # skip the duplicate filter
                          )

    def parse_detail(self, response):  # extract the product details
        # Product name: take the first match
        name = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if name:
            name = name.strip()

        print(name)

Run it again; the output is:

...
Apple 蘋果 手機 iPhone X 深空灰色 64G
...

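Now it works as expected.

Getting the product price

The price works the same way as the name: copy the XPath rule straight from Chrome's developer tools.

Edit DianShang-->spiders-->amazon.py and update parse_detail: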

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # import the Request class

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # Per-spider settings; the attribute must be named custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # Links to the product detail pages
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # request headers
                          callback=self.parse_detail,  # callback that handles the response
                          dont_filter=True  # skip the duplicate filter
                          )

    def parse_detail(self, response):  # extract the product details
        # Product name: take the first match
        name = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if name:
            name = name.strip()

        # Product price
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        print(name,price)

Run bin.py; the output is:

...
Apple 蘋果 手機 iPhone X 深空灰色 64G ￥7,288.00
...
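The price comes back as display text such as ￥7,288.00, not as a number. If a numeric value is wanted later (for sorting or comparing, say), it could be converted along these lines. This is only a sketch: parse_price is a hypothetical helper, it is not part of the spider in this article, and it assumes the price always looks like the sample above.

```python
from decimal import Decimal


def parse_price(text):
    """Turn display text such as '￥7,288.00' into a Decimal; return None for empty input."""
    if not text:
        return None
    # Drop the currency sign and the thousands separators before converting.
    cleaned = text.strip().lstrip("￥¥").replace(",", "")
    return Decimal(cleaned)


price_value = parse_price("￥7,288.00")
print(price_value)

# extract_first() returns None when the XPath matches nothing; the helper passes that through.
no_price = parse_price(None)
print(no_price)
```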

Getting the delivery method

Copy the XPath rule from Chrome as before.

Edit DianShang-->spiders-->amazon.py and update parse_detail:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # import the Request class

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # Per-spider settings; the attribute must be named custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # Links to the product detail pages
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # request headers
                          callback=self.parse_detail,  # callback that handles the response
                          dont_filter=True  # skip the duplicate filter
                          )

    def parse_detail(self, response):  # extract the product details
        # Product name: take the first match
        name = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if name:
            name = name.strip()

        # Product price
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
        # Delivery method
        delivery = response.xpath('//*[@id="ddmMerchantMessage"]/a/text()').extract_first()
        print(name,price,delivery)

Run bin.py; the output is:

...
Apple iPhone X 全網通 移動聯通電信4G 熱銷中 (深空灰, 64G) ￥7,288.00 新思惟官方旗艦店
X-Doria 461238 Apple iPhone X，14.73 釐米（5.8 英寸）灰色 ￥173.70 瞭解更多。
...

There is a problem, though. Take a phone case, such as this link:

https://www.amazon.cn/dp/B0787HGXQX

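With this XPath the output is: 瞭解更多 ("Learn more")

Why? Because the rule takes the text of the a tag.

So the result is just the link text, "Learn more". On the page, the seller name (for example 亞馬遜中國, Amazon China) appears right before that link, and that is the value we actually want.

The span tag with id=ddmMerchantMessage contains two child tags: a b tag followed by an a tag.

So taking the text() of the b tag gives the right value.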
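To see the difference between the two rules, here is a small sketch on a made-up fragment shaped like the span described above. It uses the standard library's ElementTree rather than Scrapy, so the calls differ from response.xpath, and the fragment itself is an assumption, not the real Amazon markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a span holding a <b> tag followed by an <a> tag.
fragment = (
    '<div><span id="ddmMerchantMessage">'
    '<b>Amazon China</b><a href="#">Learn more</a>'
    '</span></div>'
)
root = ET.fromstring(fragment)
span = root.find(".//*[@id='ddmMerchantMessage']")

# What the original rule (.../a/text()) picks up: the link text.
link_text = span.findtext("a")

# Rough equivalent of .../*[1]/text(): the text of the first child element, here <b>.
first_child_text = list(span)[0].text

print(link_text)
print(first_child_text)
```

Selecting by position rather than by tag name keeps working if the first child is not always a b tag, but it also depends on the seller name staying first inside the span.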

Edit DianShang-->spiders-->amazon.py and update parse_detail:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # import the Request class

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # Per-spider settings; the attribute must be named custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # Links to the product detail pages
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # request headers
                          callback=self.parse_detail,  # callback that handles the response
                          dont_filter=True  # skip the duplicate filter
                          )

    def parse_detail(self, response):  # extract the product details
        # Product name: take the first match
        name = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if name:
            name = name.strip()

        # Product price
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
        # Delivery method: *[1] selects the first child element, i.e. the b tag
        delivery = response.xpath('//*[@id="ddmMerchantMessage"]/*[1]/text()').extract_first()
        print(name,price,delivery)

Run bin.py; the output is:

...
hayder iphone X 保護套透明 TPU 超薄電鍍防禦防撞手機殼適用於蘋果 iPhone X 10 ￥149.09 亞馬遜美國
Apple iPhone X 移動聯通電信4G手機 (256GB, 銀色) ￥8,448.00 新思惟官方旗艦店
...

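The delivery method for the phone case is now correct as well.

Saving to MongoDB

Item

To save the data to MongoDB, first describe it with an Item.

Edit DianShang-->items.py; the default class name can be changed freely.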

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AmazonItem(scrapy.Item):

    name = scrapy.Field()  # product name
    price = scrapy.Field()  # price
    delivery = scrapy.Field()  # delivery method

Edit DianShang-->spiders-->amazon.py to use the item:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # import the Request class
from DianShang.items import AmazonItem  # import the item class

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.cn']
    # start_urls = ['http://amazon.cn/']
    # Per-spider settings; the attribute must be named custom_settings
    custom_settings = {
        'REQUEST_HEADERS': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
        }
    }

    def start_requests(self):
        r1 = Request(url="https://www.amazon.cn/s/ref=nb_sb_ss_i_3_6?field-keywords=iphone+x",
                     headers=self.settings.get('REQUEST_HEADERS'),)
        yield r1

    def parse(self, response):
        # Links to the product detail pages
        detail_urls = response.xpath('//li[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # print(detail_urls)
        for url in detail_urls:
            yield Request(url=url,
                          headers=self.settings.get('REQUEST_HEADERS'),  # request headers
                          callback=self.parse_detail,  # callback that handles the response
                          dont_filter=True  # skip the duplicate filter
                          )

    def parse_detail(self, response):  # extract the product details
        # Product name: take the first match
        name = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if name:
            name = name.strip()

        # Product price
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
        # Delivery method: *[1] selects the first child element, i.e. the b tag
        delivery = response.xpath('//*[@id="ddmMerchantMessage"]/*[1]/text()').extract_first()
        print(name, price, delivery)

        # Build the structured item
        item = AmazonItem()  # instantiate the item; it starts out as an empty dict-like object
        # Fill in the fields
        item["name"] = name
        item["price"] = price
        item["delivery"] = delivery

        return item  # the item must be returned so that it reaches the pipelines

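Item Pipeline

The spider above only returns the item; nothing has been stored in MongoDB yet. A pipeline is needed to write the data to MongoDB.

Edit DianShang-->pipelines.py. Note: process_item is the only method Scrapy requires of a pipeline. from_crawler, open_spider and close_spider are optional hooks, but this pipeline defines all four because it uses them to read the settings and to open and close the connection.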

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


from pymongo import MongoClient

class MongodbPipeline(object):

    def __init__(self, host, port, db, table):
        self.host = host
        self.port = port
        self.db = db
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy checks whether the pipeline defines from_crawler and, if it does,
        calls it to create the instance
        """
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST, PORT, DB, TABLE)

    def open_spider(self, spider):
        """
        Runs once, when the spider starts
        """
        # self.client = MongoClient('mongodb://%s:%s@%s:%s' %(self.user,self.pwd,self.host,self.port))
        self.client = MongoClient(host=self.host, port=self.port)

    def close_spider(self, spider):
        """
        Runs once, when the spider closes
        """
        self.client.close()

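    def process_item(self, item, spider):
        # Persist the item, but only when every field has a value
        d = dict(item)
        if all(d.values()):
            # insert_one replaces the deprecated Collection.insert
            self.client[self.db][self.table].insert_one(d)
            print("添加成功一條")  # "one record added"
        return item  # hand the item on to any later pipeline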

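open_spider and close_spider each run only once.

process_item is different: it runs once for every item the spider returns, and each call inserts at most one record (items with an empty field are skipped).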
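The calling pattern can be sketched without Scrapy or MongoDB. In the toy code below, RecordingPipeline and run_toy_crawl are both made up for illustration: the driver only imitates the order in which the hooks fire and is not how Scrapy's engine is actually written.

```python
class RecordingPipeline:
    """Stand-in pipeline that records each hook call instead of touching MongoDB."""

    def __init__(self):
        self.calls = []
        self.stored = []

    def open_spider(self, spider):
        self.calls.append("open_spider")

    def close_spider(self, spider):
        self.calls.append("close_spider")

    def process_item(self, item, spider):
        self.calls.append("process_item")
        d = dict(item)
        if all(d.values()):  # same completeness check as MongodbPipeline
            self.stored.append(d)
        return item


def run_toy_crawl(pipeline, items):
    """Hypothetical driver: open once, process every item, close once."""
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)


pipeline = RecordingPipeline()
run_toy_crawl(pipeline, [
    {"name": "phone", "price": "￥7,288.00", "delivery": "shop"},
    {"name": "case", "price": None, "delivery": "shop"},  # missing price: not stored
])
print(pipeline.calls)
print(len(pipeline.stored))
```

Two items go in, process_item is called twice, and only the complete item is kept.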

Edit DianShang\settings.py and add the MongoDB connection details.

Append the following at the end of the file:

# MongoDB connection settings
HOST="127.0.0.1"
PORT=27017
DB="amazon"  # database name
TABLE="goods"  # collection (table) name
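How do these four values reach the pipeline? Through from_crawler, which reads them from crawler.settings and passes them to the constructor. The sketch below imitates that hand-off with a fake crawler object. SettingsDemoPipeline and fake_crawler are stand-ins for illustration, and a plain dict takes the place of Scrapy's settings object, since only its get method is used here.

```python
from types import SimpleNamespace


class SettingsDemoPipeline:
    """Cut-down pipeline that keeps only the settings hand-off."""

    def __init__(self, host, port, db, table):
        self.host = host
        self.port = port
        self.db = db
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the settings and build the instance.
        settings = crawler.settings
        return cls(settings.get("HOST"), settings.get("PORT"),
                   settings.get("DB"), settings.get("TABLE"))


# Stand-in for the real crawler: all that is needed is a .settings object with .get().
fake_crawler = SimpleNamespace(settings={
    "HOST": "127.0.0.1",
    "PORT": 27017,
    "DB": "amazon",
    "TABLE": "goods",
})

pipeline = SettingsDemoPipeline.from_crawler(fake_crawler)
print(pipeline.host, pipeline.port, pipeline.db, pipeline.table)
```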

Still in DianShang\settings.py, enable the MongoDB pipeline.

The 300 here is the priority; pipelines with lower numbers run first.

ITEM_PIPELINES = {
    'DianShang.pipelines.MongodbPipeline': 300,
}
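With a single pipeline the number makes no practical difference. It starts to matter once there are several, because items pass through the pipelines in ascending order of this value. A small sketch of that ordering follows; CleanupPipeline is a hypothetical second pipeline that does not exist in this project and is listed only to show the sort.

```python
# 'DianShang.pipelines.CleanupPipeline' is hypothetical, added only to show the ordering.
ITEM_PIPELINES = {
    'DianShang.pipelines.MongodbPipeline': 300,
    'DianShang.pipelines.CleanupPipeline': 100,
}

# Lower numbers run first, so sorting by value gives the processing order.
run_order = [path for path, priority in
             sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])]
print(run_order)
```

Here the cleanup step would see each item before the MongoDB step does.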

Note: pipelines.py defines the class as MongodbPipeline, so the dotted path in the settings file must match it exactly; otherwise Scrapy raises:

raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
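The error comes from turning the dotted string into a class: the part before the last dot is imported as a module, and the part after it is looked up inside that module. Below is a simplified sketch of the idea, not Scrapy's actual code. resolve_dotted_path is a hypothetical name, and standard-library paths are used so the sketch runs without the DianShang project.

```python
from importlib import import_module


def resolve_dotted_path(path):
    """Import 'package.module.Name' and return the object called Name."""
    module_name, _, attr = path.rpartition(".")
    module = import_module(module_name)
    try:
        return getattr(module, attr)
    except AttributeError:
        raise NameError("Module '%s' doesn't define any object named '%s'"
                        % (module_name, attr))


# A correct path resolves to the class itself.
found = resolve_dotted_path("collections.OrderedDict")
print(found)

# A misspelt class name reproduces the kind of error shown above.
error_message = None
try:
    resolve_dotted_path("collections.OrderedDictt")  # deliberate typo
except NameError as exc:
    error_message = str(exc)
print(error_message)
```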

Start MongoDB first: open a cmd window and run

C:\Users\xiao>mongod

If the output contains:

[thread1] waiting for connections on port 27017

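then MongoDB has started successfully.

The pipeline never creates the database explicitly. MongoDB creates a database and its collection automatically on the first insert, so creating them by hand is optional; the use command below simply switches the shell to the amazon database.

Open the mongo shell and switch to the amazon database: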

C:\Users\xiao>mongo

> use amazon
switched to db amazon
>

Run bin.py again; the output is:

...
Apple 蘋果 手機 iPhone X 深空灰色 64G ￥7,288.00 新思惟官方旗艦店
添加成功一條
...

Opening the goods collection in a MongoDB client shows the stored records.
