Crawlers: Scrapy 17 - Common Practices

Running Scrapy from a script

Besides the commonly used scrapy crawl command, you can also use the API to run Scrapy from a script.

Keep in mind that Scrapy is built on top of the Twisted asynchronous networking library, so it must be run inside the Twisted reactor.

In addition, you must shut down the Twisted reactor yourself after the spider has finished. This can be done by adding a callback to the deferred returned by CrawlerRunner.crawl.

Example:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Running spiders outside projects is not much different. You have to create a generic Settings object and populate it as needed (see the built-in settings reference for the available settings), instead of using the configuration returned by get_project_settings.

Spiders can still be referenced by their name if SPIDER_MODULES is set with the modules where Scrapy should look for spiders. Otherwise, passing the spider class as the first argument to CrawlerRunner.crawl is enough.

from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class MySpider(Spider):
    # Your spider definition
    ...

settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy also supports running multiple spiders per process through the internal API.

Example:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

The same example, but running the spiders sequentially by chaining the deferreds:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

Distributed crawls

Scrapy doesn't provide any built-in facility for distributing crawls across multiple servers. However, there are some ways to distribute the crawl, depending on how you plan to distribute it.

If you have many spiders, the simplest way to distribute the load is to launch multiple Scrapyd instances and distribute the spider runs among them on different machines.

If you instead want to run a single (large) spider across many machines, you can partition the URLs to crawl and send them to each separate spider. For example:

First, prepare the list of URLs to crawl and put them into separate files/URLs:

http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list

Then fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
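
As a rough sketch of what such a spider might look like (this is an illustration, not part of the original example; the use of start_requests, the parse_url_list helper, and the assumption of one URL per line in the .list files are all assumptions), the part argument could be used to fetch the matching URL list and then crawl each URL in it:

import scrapy

class Spider1(scrapy.Spider):
    # Hypothetical partition-aware spider; 'part' is passed via -d part=N above.
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super(Spider1, self).__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        # Fetch the URL list file for this spider's partition first.
        list_url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(list_url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        # Assume one URL per line in the .list file.
        for url in response.text.splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Actual extraction logic for each crawled page goes here.
        pass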

Avoiding getting banned

Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure.

Here are some tips to keep in mind when dealing with these kinds of sites (a minimal configuration sketch follows the list):

  • Rotate your user agent from a pool of well-known user agents from common browsers.
  • Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot crawler behaviour.
  • Use download delays (2 or higher). See the DOWNLOAD_DELAY setting.
  • If possible, use the Google cache to fetch pages, instead of hitting the sites directly.
  • Use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh.
  • Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such a downloader is Crawlera.
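
As a minimal sketch of the first three tips (the project and module names, the middleware name, and the user-agent strings below are placeholders, not Scrapy built-ins), you could disable cookies and add a download delay in settings.py, and rotate user agents with a small downloader middleware:

# settings.py (sketch)
COOKIES_ENABLED = False   # some sites use cookies to spot crawler behaviour
DOWNLOAD_DELAY = 2        # wait 2 seconds between requests

DOWNLOADER_MIDDLEWARES = {
    # hypothetical middleware defined in myproject/middlewares.py below
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}

# myproject/middlewares.py (sketch)
import random

USER_AGENTS = [
    # fill in with real user agents from common browsers
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...',
]

class RotateUserAgentMiddleware(object):
    # Downloader middleware that picks a random user agent for every request.
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)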

Creating Item classes dynamically

For some applications, the structure of the items is controlled by user input or other changing conditions. You can create item classes dynamically:

from scrapy.item import DictItem, Field

def create_item_class(class_name, field_list):
    # Build a fields dict with one Field per requested field name,
    # then create a new DictItem subclass with those fields.
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
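
For example (the class and field names below are hypothetical, and this assumes a Scrapy version that still provides DictItem), the factory above could be used like this:

# Build an item class whose fields are decided at runtime.
ProductItem = create_item_class('ProductItem', ['name', 'price', 'url'])

item = ProductItem(name='Example product', price='9.99', url='http://example.com')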