Scrapy Framework and Component Overview

Date: 2019-11-10


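Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is used very widely.

Thanks to the framework, a user only needs to customize a few modules to build a crawler that fetches page content and all kinds of images, which is very convenient.

Scrapy uses the Twisted asynchronous networking framework to handle network communication. This speeds up downloads without requiring us to implement an asynchronous framework ourselves, and it exposes a range of middleware interfaces so that varied requirements can be met flexibly.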

I. Scrapy Framework Flow Diagram

(1) Component descriptions

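Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, including passing signals and data.

Scheduler: accepts the Requests sent by the engine, arranges and enqueues them in a defined order, and hands them back when the engine asks for them.

Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.

Spider: processes all Responses, parsing them to extract the data needed for the Item fields, and submits the URLs to follow up to the engine, which feeds them back into the Scheduler.

Item Pipeline: post-processes the Items obtained by the Spider (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: components you can use to customize and extend the download functionality.

Spider Middlewares: components you can use to customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).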
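To make the downloader middleware idea more concrete, here is a minimal sketch of one that fills in a default User-Agent header. The class name, the default header value, and the stand-in request object are all assumptions made for illustration; only the process_request(request, spider) hook name comes from Scrapy itself. A real Scrapy request stores headers in its own Headers object rather than a plain dict.

```python
from types import SimpleNamespace


class CustomUserAgentMiddleware:
    """Hypothetical downloader middleware that sets a default User-Agent."""

    def __init__(self, user_agent="my-crawler/0.1"):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Only fill in the header if the request does not already carry one.
        request.headers.setdefault("User-Agent", self.user_agent)
        # Returning None tells Scrapy to keep processing this request.
        return None


# Stand-in request object (an assumption) so the sketch runs without Scrapy.
fake_request = SimpleNamespace(headers={})
CustomUserAgentMiddleware().process_request(fake_request, spider=None)
print(fake_request.headers)
```

In a real project a class like this would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.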

(2) Data flow

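1. The engine opens a site (open a domain), finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.
2. The engine gets the first URL(s) to crawl from the Spider and schedules them as Requests in the Scheduler.
3. The engine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to the engine, and the engine forwards it to the Downloader through the downloader middleware (request direction).
5. Once the page has finished downloading, the Downloader generates a Response for it and sends it to the engine through the downloader middleware (response direction).
6. The engine receives the Response from the Downloader and sends it to the Spider through the spider middleware (input direction).
7. The Spider processes the Response and returns the scraped Items and new (follow-up) Requests to the engine.
8. The engine sends the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.
9. The process repeats (from step 2) until there are no more requests in the Scheduler, and the engine closes the site.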
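The loop described above can be imitated in a few lines of plain Python. This is only a toy model and not Scrapy code: the in-memory site, the download and parse functions, and the use of plain strings as requests are all assumptions made for illustration. The step numbers in the comments refer to the steps in the order they are listed above.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, links found on the page).
FAKE_SITE = {
    "http://example.test/a": ("page A", ["http://example.test/b"]),
    "http://example.test/b": ("page B", []),
}


def download(url):
    """Stand-in for the Downloader: returns a response-like dict."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Stand-in for Spider.parse: yields one item, then follow-up URLs."""
    yield {"url": response["url"], "title": response["text"]}
    for link in response["links"]:
        yield link  # a plain string plays the role of a new Request


def run_engine(start_urls):
    scheduler = deque(start_urls)   # steps 1-2: schedule the first URLs
    seen = set(start_urls)
    stored = []                     # stand-in for the Item Pipeline
    while scheduler:                # step 9: stop when the queue is empty
        url = scheduler.popleft()   # steps 3-4: next URL from the scheduler
        response = download(url)    # steps 4-5: download the page
        for result in parse(response):   # steps 6-7: spider output
            if isinstance(result, dict):
                stored.append(result)    # step 8: item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)  # step 8: request goes to the scheduler
    return stored


items = run_engine(["http://example.test/a"])
print(items)
```

The real engine does all of this asynchronously on top of Twisted; the sequential loop here only shows who hands what to whom.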

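II. Creating a Project and Its Components

For installing the Scrapy library, creating a project, and basic usage, see the earlier blog post "Python Web Crawling with scrapy (1)". This section describes the components of a project.

(1) Project directory structure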

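D:\scrapy_project>scrapy genspider country example.webscraping.com

items.py: similar to models.py in Django; declares the data fields that will later be saved

middlewares.py: spider middleware, which can process requests and responses

pipelines.py: pipelines, which store each Item object, for example in MySQL or MongoDB

settings.py: configuration for the crawler project

spiders: the directory that holds the project's spiders; the crawling logic of each spider lives in its own file

country.py: the spider created by the genspider command above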

III. Key Scrapy Classes and Partial Source Analysis

1. The Response class

(1) Inspecting the attributes of the Response class

from scrapy.http import Response

for key, value in Response.__dict__.items():
    print("{0}:{1}".format(key, value))

__module__:scrapy.http.response
__init__:<function Response.__init__ at 0x00000257D64B1C80>
meta:<property object at 0x00000257D64B2458>
_get_url:<function Response._get_url at 0x00000257D64B40D0>
_set_url:<function Response._set_url at 0x00000257D64B4158>
url:<property object at 0x00000257D64B24A8>
_get_body:<function Response._get_body at 0x00000257D64B4268>
_set_body:<function Response._set_body at 0x00000257D64B42F0>
body:<property object at 0x00000257D64B2728>
__str__:<function Response.__str__ at 0x00000257D64B4400>
__repr__:<function Response.__str__ at 0x00000257D64B4400>
copy:<function Response.copy at 0x00000257D64B4488>
replace:<function Response.replace at 0x00000257D64B4510>
urljoin:<function Response.urljoin at 0x00000257D64B4598>
text:<property object at 0x00000257D64B2778>
css:<function Response.css at 0x00000257D64B46A8>
xpath:<function Response.xpath at 0x00000257D64B4730>
follow:<function Response.follow at 0x00000257D64B47B8>
__dict__:<attribute '__dict__' of 'Response' objects>
__weakref__:<attribute '__weakref__' of 'Response' objects>
__doc__:None

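From the output above we can see three important attributes (url, body, and text). Looking at the source of the Response class, we find the following code: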

    url = property(_get_url, obsolete_setter(_set_url, 'url'))
    body = property(_get_body, obsolete_setter(_set_body, 'body'))

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

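url, body, and text are the three attributes we need most when parsing pages in a spider, and all of them are available on the Response object. As the source shows, text only works on TextResponse and its subclasses; on the base Response class it raises AttributeError.

Example: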

import scrapy
from bs4 import BeautifulSoup


class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/places/default/view/Afghanistan-1']

    # Do not rename this method: parse is the default callback name in the Scrapy source
    def parse(self, response):
        print(response.url)
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        soup = BeautifulSoup(response.body, 'html.parser')
        # Field labels and field values sit in paired table cells
        names = [i.string for i in soup.select('td.w2p_fl')]
        values = [j.string for j in soup.select('td.w2p_fw')]
        dic = dict(zip(names, values))
        print(dic)

2. The Spider class

(1) In the same way, first look at the attributes the Spider class provides

import scrapy

for key, val in scrapy.Spider.__dict__.items():
    print("{}:{}".format(key, val))

__module__:scrapy.spiders
__doc__:Base class for scrapy spiders. All spiders must inherit from this
    class.
name:None
custom_settings:None
__init__:<function Spider.__init__ at 0x000001E161FFFD90>
logger:<property object at 0x000001E161785D18>
log:<function Spider.log at 0x000001E161FFFEA0>
from_crawler:<classmethod object at 0x000001E16178B208>
set_crawler:<function Spider.set_crawler at 0x000001E161FF8048>
_set_crawler:<function Spider._set_crawler at 0x000001E161FF80D0>
start_requests:<function Spider.start_requests at 0x000001E161FF8158>
make_requests_from_url:<function Spider.make_requests_from_url at 0x000001E161FF81E0>
parse:<function Spider.parse at 0x000001E161FF8268>
update_settings:<classmethod object at 0x000001E16178B240>
handles_request:<classmethod object at 0x000001E16178B278>
close:<staticmethod object at 0x000001E161FF7E80>
__str__:<function Spider.__str__ at 0x000001E161FF8488>
__repr__:<function Spider.__str__ at 0x000001E161FF8488>
__dict__:<attribute '__dict__' of 'Spider' objects>
__weakref__:<attribute '__weakref__' of 'Spider' objects>

(2) Next, a few of the important attributes and methods:

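start_requests()

By default this method reads the URLs defined in the start_urls attribute, generates a Request object for each one, and returns them as an iterable.

make_requests_from_url(url)

This method builds the Request object. As the source below shows, it is deprecated, and start_requests() only calls it when a subclass overrides it.

close(reason)

Called when the Spider is closed.

log(message, level=logging.DEBUG, **kw)

Use this method to write log messages from within a Spider.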

(3) Source code for the functions above

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def make_requests_from_url(self, url):
        """ This method is deprecated. """
        return Request(url, dont_filter=True)

    def log(self, message, level=logging.DEBUG, **kw):
        """Log the given message at the given log level

        This helper wraps a log call to the logger within the spider, but you
        can use it directly (e.g. Spider.logger.info('msg')) or use any other
        Python logger too.
        """
        self.logger.log(level, message, **kw)

(4) Example: overriding the start_requests() method

import scrapy
from bs4 import BeautifulSoup


class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                  'http://example.webscraping.com/places/default/view/Aland-Islands-2']

    # Override start_requests()
    def start_requests(self):
        for url in self.start_urls:
            # Build the Request directly; make_requests_from_url() is deprecated
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # parse is passed as the callback above; it is also Scrapy's default callback name
    def parse(self, response):
        print(response.url)
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        soup = BeautifulSoup(response.body, 'html.parser')
        names = [i.string for i in soup.select('td.w2p_fl')]
        values = [j.string for j in soup.select('td.w2p_fw')]
        dic = dict(zip(names, values))
        print(dic)

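3. Writing pipelines

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred to as just an "Item Pipeline") is a Python class that implements a simple method. It receives an item and performs an action on it, and it also decides whether the item should continue through the pipeline or be dropped and no longer processed.

Put simply, a pipeline processes or saves the contents of an item.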

class CrawlerPipeline(object):
    def process_item(self, item, spider):
        country_name = item["country_name"]
        country_area = item["country_area"]
        # Further processing goes here, for example writing the values to a file
        return item
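As one possible way to do the "write to a file" step, the sketch below appends each item to a JSON Lines file. The class name, the output path, and the sample item values are assumptions for illustration. open_spider and close_spider are the optional hooks Scrapy calls on a pipeline when the spider starts and finishes; here they are called by hand so that the sketch runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item to a JSON Lines file."""

    def __init__(self, path="countries.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to the next pipeline


# Drive the pipeline by hand with a made-up item.
out_path = os.path.join(tempfile.mkdtemp(), "countries.jl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"country_name": "Exampleland", "country_area": "123"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file once per crawl instead of once per item keeps process_item cheap, which matters because it runs for every scraped item.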

A problem every beginner runs into: process_item is never called. The fix:

(1) Register the pipeline in settings.py

ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
# The number is in the range 0-1000 and sets the execution order: lower values run first

(2) Remember to return the item from the spider's callback, def parse(self, response):

　　yield item
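To illustrate how the numbers in ITEM_PIPELINES decide the order, the sketch below sorts a settings dictionary by value. Apart from crawler.pipelines.CrawlerPipeline, the class paths are hypothetical, and the sorting only mimics the ordering rule; it is not how Scrapy itself loads pipelines.

```python
# Hypothetical settings value; the extra class paths are illustrative only.
ITEM_PIPELINES = {
    "crawler.pipelines.StoragePipeline": 800,
    "crawler.pipelines.CleanupPipeline": 100,
    "crawler.pipelines.CrawlerPipeline": 300,
}

# Lower values run first, so sort the entries by their number.
execution_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(execution_order)
```

With these values an item would pass through CleanupPipeline, then CrawlerPipeline, then StoragePipeline.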

For more detailed usage, see this blog post: https://www.jianshu.com/p/b8bd95348ffe
