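Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is very widely used.
That is the strength of a framework: users only need to customize a few modules to build a crawler that fetches page content and all kinds of images.
Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without you having to implement the asynchronous machinery yourself. It also exposes a range of middleware interfaces, so different requirements can be met flexibly.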
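Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, including passing signals and data between them.
Scheduler: accepts the Requests sent by the engine, orders them in a defined way and puts them in a queue, then hands them back to the engine when the engine asks for them.
Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them on to the Spider for processing.
Spider: processes all Responses, parses them to extract the data needed for the Item fields, and submits the URLs that should be followed to the engine, which sends them back into the Scheduler.
Item Pipeline: receives the Items produced by the Spider and performs post-processing on them (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: a component you can use to customize and extend the download functionality.
Spider Middlewares: a component you can use to customize and extend the communication between the engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).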
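The division of labour described above can be sketched as a small pure-Python loop. This is only a toy model, not Scrapy's code: FAKE_WEB, download, parse, pipeline and run_engine are hypothetical names invented for illustration, there is no real networking or asynchrony, and middlewares are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
FAKE_WEB = {
    "http://site.test/": ("home", ["http://site.test/a", "http://site.test/b"]),
    "http://site.test/a": ("page a", ["http://site.test/b"]),
    "http://site.test/b": ("page b", []),
}


def download(url):
    """Downloader stand-in: turn a request (a URL here) into a response."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider stand-in: yield one item, then the URLs to follow."""
    yield {"url": response["url"], "title": response["text"].upper()}
    for link in response["links"]:
        yield link


def pipeline(item, store):
    """Item Pipeline stand-in: post-process and store the item."""
    store.append(item)


def run_engine(start_urls):
    """Engine stand-in: move data between the other components."""
    queue = deque(start_urls)   # Scheduler: pending requests (FIFO here)
    seen = set(start_urls)      # the scheduler also filters duplicates
    store = []
    while queue:
        url = queue.popleft()             # engine asks the scheduler for a request
        response = download(url)          # engine hands it to the downloader
        for result in parse(response):    # engine hands the response to the spider
            if isinstance(result, dict):  # an item goes to the pipeline
                pipeline(result, store)
            elif result not in seen:      # a new URL goes back to the scheduler
                seen.add(result)
                queue.append(result)
    return store


items = run_engine(["http://site.test/"])
print([item["title"] for item in items])
```

Each page is fetched once even though page b is linked twice, because the scheduler stand-in remembers the URLs it has already queued.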
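For installing Scrapy, creating a project, and basic usage, see the earlier post Python網絡爬蟲之scrapy(一). This post mainly describes the individual components of a project.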
D:\scrapy_project>scrapy genspider country example.webscraping.com
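items.py: similar to models.py in Django; declares the data types that will later hold the scraped data
middlewares.py: the crawler's middleware, which can process requests and responses
pipelines.py: the pipelines, whose job is to store each Item object, for example in MySQL or MongoDB
settings.py: configuration for the crawler project
spiders: the directory that holds the project's spiders; the actual crawling logic lives in each spider file
country.py: the spider that was just created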
(1) Inspect the attributes of the Response class
from scrapy.http import Response

for key, value in Response.__dict__.items():
    print("{0}:{1}".format(key, value))
__module__:scrapy.http.response
__init__:<function Response.__init__ at 0x00000257D64B1C80>
meta:<property object at 0x00000257D64B2458>
_get_url:<function Response._get_url at 0x00000257D64B40D0>
_set_url:<function Response._set_url at 0x00000257D64B4158>
url:<property object at 0x00000257D64B24A8>
_get_body:<function Response._get_body at 0x00000257D64B4268>
_set_body:<function Response._set_body at 0x00000257D64B42F0>
body:<property object at 0x00000257D64B2728>
__str__:<function Response.__str__ at 0x00000257D64B4400>
__repr__:<function Response.__str__ at 0x00000257D64B4400>
copy:<function Response.copy at 0x00000257D64B4488>
replace:<function Response.replace at 0x00000257D64B4510>
urljoin:<function Response.urljoin at 0x00000257D64B4598>
text:<property object at 0x00000257D64B2778>
css:<function Response.css at 0x00000257D64B46A8>
xpath:<function Response.xpath at 0x00000257D64B4730>
follow:<function Response.follow at 0x00000257D64B47B8>
__dict__:<attribute '__dict__' of 'Response' objects>
__weakref__:<attribute '__weakref__' of 'Response' objects>
__doc__:None
The output above shows three important attributes (url, body, and text). Looking at the source of the Response class, we find the following code:
url = property(_get_url, obsolete_setter(_set_url, 'url'))
body = property(_get_body, obsolete_setter(_set_body, 'body'))

@property
def text(self):
    """For subclasses of TextResponse, this will return the body
    as text (unicode object in Python 2 and str in Python 3)
    """
    raise AttributeError("Response content isn't text")
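url, body, and text are the three attributes we rely on most when analyzing pages in a spider, and all of them are read from the Response object. Note that text is only usable on TextResponse and its subclasses; the base class raises AttributeError.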
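To make the difference between body and text concrete, here is a minimal sketch using two stand-in classes. BaseResponse and TextLikeResponse are hypothetical names that only mimic the behaviour visible in the source excerpt (a base class whose text raises, and a text subclass that decodes the body); they are not Scrapy's classes, and the utf-8 default is an assumption.

```python
class BaseResponse:
    """Stand-in for the base class: holds raw bytes, has no text."""

    def __init__(self, url, body=b""):
        self.url = url
        self.body = body  # always bytes

    @property
    def text(self):
        raise AttributeError("Response content isn't text")


class TextLikeResponse(BaseResponse):
    """Stand-in for a text subclass: decodes the body on demand."""

    def __init__(self, url, body=b"", encoding="utf-8"):
        super().__init__(url, body)
        self.encoding = encoding

    @property
    def text(self):
        return self.body.decode(self.encoding)


raw = BaseResponse("http://example.webscraping.com/", b"\x89PNG")
page = TextLikeResponse("http://example.webscraping.com/",
                        "<td>Kabul</td>".encode("utf-8"))

# body is bytes, text is the decoded string
print(type(page.body).__name__, type(page.text).__name__)

try:
    raw.text
except AttributeError as exc:
    print("base class:", exc)
```

In practice this means body suits binary content such as images, while text is the convenient choice for HTML pages.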
Example:
import scrapy
from bs4 import BeautifulSoup as bs


class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/places/default/view/Afghanistan-1']

    # This method name must not change: parse is the default callback name in the Scrapy source
    def parse(self, response):
        print(response.url)
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        soup = bs(response.body, 'html.parser')
        # Label cells (w2p_fl) and value cells (w2p_fw) of the country table
        names = [i.string for i in soup.select('td.w2p_fl')]
        values = [j.string for j in soup.select('td.w2p_fw')]
        dic = dict(zip(names, values))
        print(dic)
(1) In the same way, first inspect the attributes provided by the Spider class
import scrapy

for key, val in scrapy.Spider.__dict__.items():
    print("{}:{}".format(key, val))
__module__:scrapy.spiders
__doc__:Base class for scrapy spiders. All spiders must inherit from this class.
name:None
custom_settings:None
__init__:<function Spider.__init__ at 0x000001E161FFFD90>
logger:<property object at 0x000001E161785D18>
log:<function Spider.log at 0x000001E161FFFEA0>
from_crawler:<classmethod object at 0x000001E16178B208>
set_crawler:<function Spider.set_crawler at 0x000001E161FF8048>
_set_crawler:<function Spider._set_crawler at 0x000001E161FF80D0>
start_requests:<function Spider.start_requests at 0x000001E161FF8158>
make_requests_from_url:<function Spider.make_requests_from_url at 0x000001E161FF81E0>
parse:<function Spider.parse at 0x000001E161FF8268>
update_settings:<classmethod object at 0x000001E16178B240>
handles_request:<classmethod object at 0x000001E16178B278>
close:<staticmethod object at 0x000001E161FF7E80>
__str__:<function Spider.__str__ at 0x000001E161FF8488>
__repr__:<function Spider.__str__ at 0x000001E161FF8488>
__dict__:<attribute '__dict__' of 'Spider' objects>
__weakref__:<attribute '__weakref__' of 'Spider' objects>
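(2) Next, a few of the most important attributes and methods:
start_requests()
By default this method reads the URLs defined in the start_urls attribute, creates a Request object for each one, and returns an iterable of them.
make_requests_from_url(url)
Called by start_requests(); it is responsible for building the Request object. The source below marks it as deprecated.
close(reason)
Called when the Spider is closed.
log(message[,level,component])
Use this method to write log messages from within the Spider.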
(3) Source code of the functions above
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

def make_requests_from_url(self, url):
    """ This method is deprecated. """
    return Request(url, dont_filter=True)

def log(self, message, level=logging.DEBUG, **kw):
    """Log the given message at the given log level

    This helper wraps a log call to the logger within the spider, but you
    can use it directly (e.g. Spider.logger.info('msg')) or use any other
    Python logger too.
    """
    self.logger.log(level, message, **kw)
(4) Example: overriding the start_requests() method
import scrapy
from bs4 import BeautifulSoup as bs


class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/places/default/view/Afghanistan-1',
                  'http://example.webscraping.com/places/default/view/Aland-Islands-2']

    # Override start_requests()
    def start_requests(self):
        for url in self.start_urls:
            # Build the Request directly; make_requests_from_url is deprecated
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # This method name must not change: parse is the default callback name in the Scrapy source
    def parse(self, response):
        print(response.url)
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        soup = bs(response.body, 'html.parser')
        names = [i.string for i in soup.select('td.w2p_fl')]
        values = [j.string for j in soup.select('td.w2p_fw')]
        dic = dict(zip(names, values))
        print(dic)
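After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that run in sequence.
Each Item Pipeline component is a Python class that implements a simple method. It receives an item, performs an action on it, and decides whether the item should continue through the pipeline or be dropped and not processed any further.
Put simply, a pipeline processes or saves the contents of an item.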
class CrawlerPipeline(object):
    def process_item(self, item, spider):
        country_name = item["country_name"]
        country_area = item["country_area"]
        # Further processing goes here, for example writing to a file
        return item
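A problem beginners almost always run into: process_item is never called. The fix:
(1) Enable the pipeline in settings.py
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
# The number (conventionally 0-1000) sets the execution order; lower values run first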
(2) Remember to return the item from the spider's callback, def parse(self, response):
yield item
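How several pipeline components cooperate (ordering by number, and dropping an item part-way) can be sketched without Scrapy. Everything here is illustrative: RequireNamePipeline, NormalizeAreaPipeline, PIPELINES and run_pipelines are hypothetical names, and DropItem is a local stand-in for the exception Scrapy provides in scrapy.exceptions. The sample area string is also made up.

```python
class DropItem(Exception):
    """Local stand-in for the exception a pipeline raises to discard an item."""


class RequireNamePipeline:
    def process_item(self, item, spider):
        # Discard items that have no country name
        if not item.get("country_name"):
            raise DropItem("missing country_name")
        return item


class NormalizeAreaPipeline:
    def process_item(self, item, spider):
        # e.g. "647,500 square kilometres" -> 647500
        digits = item["country_area"].split()[0].replace(",", "")
        item["country_area"] = int(digits)
        return item


# Same shape as ITEM_PIPELINES, but keyed by class instead of dotted path.
PIPELINES = {NormalizeAreaPipeline: 400, RequireNamePipeline: 300}


def run_pipelines(item, spider=None):
    """Pass the item through each component, lowest number first."""
    for cls in sorted(PIPELINES, key=PIPELINES.get):
        try:
            item = cls().process_item(item, spider)
        except DropItem as exc:
            print("dropped:", exc)
            return None  # later components never see a dropped item
    return item


kept = run_pipelines({"country_name": "Afghanistan",
                      "country_area": "647,500 square kilometres"})
lost = run_pipelines({"country_name": "",
                      "country_area": "1 square kilometres"})
print(kept, lost)
```

Returning the item from process_item is what lets the next component receive it, which is why the example pipeline above ends with return item.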
For more detailed usage, see this blog post: https://www.jianshu.com/p/b8bd95348ffe