Scrapy Crawler Guide: A Complete Tutorial

Date: 2019-11-21


scrapy note

Commands

Global commands:

startproject: creates a Scrapy project named project_name inside a project_name directory.

scrapy startproject myproject

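settings: when run inside a project, prints the project's setting values; otherwise prints Scrapy's default settings.
runspider: runs a spider written in a self-contained Python file, without having to create a project.
shell: starts the Scrapy shell for the given URL, or empty if no URL is given.
fetch: downloads the given URL using the Scrapy downloader and writes the content to standard output.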

scrapy fetch --nolog --headers http://www.example.com/

view: opens the given URL in a browser, as your Scrapy spider would see it.

scrapy view http://www.example.com/some/page.html

version: prints the Scrapy version.

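Project-only commands:

crawl: starts crawling using a spider.
scrapy crawl myspider
check: runs contract checks.
scrapy check -l
list: lists all available spiders in the current project, one spider per line.
edit: opens the given spider in the editor defined by the EDITOR setting.
parse: fetches the given URL and parses it with the spider that handles it. If you pass the --callback option, that spider method is used; otherwise parse is used.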

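--spider=SPIDER: bypass spider autodetection and force a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback used for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level to which requests should be followed (default: 1)
--verbose or -v: show detailed information for each request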
scrapy parse http://www.example.com/ -c parse_item

genspider: creates a new spider in the current project.

scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic example example.com

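deploy: deploys the project to a Scrapyd service. Newer Scrapy releases no longer ship this command; scrapyd-client's scrapyd-deploy covers the same task.
bench: runs a quick benchmark test.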

Using selectors

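from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Build a selector directly from text
body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()

# Build a selector from a response; encoding is required because body is a str
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').extract()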

Scrapy provides two convenient shortcuts: response.xpath() and response.css()

>>> response.xpath('//base/@href').extract()
>>> response.css('base::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()

Nesting selectors

The selector methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selector methods on those selectors too. Here is an example:

links = response.xpath('//a[contains(@href, "image")]')
for index, link in enumerate(links):
    args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    print('Link number %d points to url %s and image %s' % args)

Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings, so you cannot construct nested .re() calls.

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

Working with relative XPaths

>>> divs = response.xpath('//div')
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())
>>> for p in divs.xpath('p'):  # gets only the <p> that are direct children
...     print(p.extract())

When XPath's starts-with() or contains() cannot express what you need, the test() function, available through the EXSLT re namespace as re:test(), can be very useful.

>>> sel.xpath('//li//@href').extract()
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()

XPATH TIPS

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.
Beware of the difference between //node[1] and (//node)[1]
When selecting by class, be as specific as necessary. When querying by class, consider using CSS.
Learn to use all the different axes
Useful trick to get text content

Item Loaders

populate items

from scrapy.loader import ItemLoader
from myproject.items import Product  # assumes Product is defined in the project's items module

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

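Item Pipeline

Typical uses of an item pipeline:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing scraped items in a database

Writing your own item pipeline

process_item(self, item, spider): this method is called for every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by later pipeline components.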
Parameters:

item (Item object) – the scraped item
spider (Spider object) – the spider that scraped the item

Write items to MongoDB

import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        # insert_one() replaces Collection.insert(), which newer PyMongo releases removed
        self.db[collection_name].insert_one(dict(item))
        return item

To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

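The integer value assigned to each class determines the order in which they run: items pass through the pipelines from the lowest number to the highest. These numbers are conventionally defined in the 0-1000 range.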
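
The ordering rule can be sketched in plain Python. This only illustrates the rule and is not how Scrapy implements it internally; the dict is deliberately written out of order to show that the integer values, not the position in the dict, decide the sequence.

```python
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}

# Sort the class paths by their integer value, lowest first.
run_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

print(run_order)
```

Here PricePipeline (300) sees each item before JsonWriterPipeline (800).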

Practical tips

Running multiple spiders in the same process

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished