Besides the commonly used scrapy crawl command to start Scrapy, you can also start Scrapy from a script using its API.
Note that Scrapy is built on top of the Twisted asynchronous networking library, so it must be run inside the Twisted reactor.
In addition, you have to shut down the Twisted reactor yourself after the spider has finished. This can be done by adding callbacks to the deferred returned by CrawlerRunner.crawl.
Example:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

Running spiders outside a project is not much different. Instead of using the configuration returned by get_project_settings, you have to create a generic Settings object and populate it as needed (see the built-in settings reference for the available settings).

Spiders can still be referenced by their name if SPIDER_MODULES is set to the modules where Scrapy should look for spiders. Otherwise, passing the spider class as the first argument to CrawlerRunner.crawl is enough.

from twisted.internet import reactor
from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings

class MySpider(Spider):
    # Your spider definition
    ...

settings = Settings({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
runner = CrawlerRunner(settings)

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy also supports running multiple spiders per process through the internal API.
Example:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
The same example, but running the spiders sequentially by chaining the deferreds:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for domain in ['scrapinghub.com', 'insophia.com']:
        yield runner.crawl('followall', domain=domain)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
Scrapy does not provide a built-in facility for distributed (multi-server) crawls. There are still ways to distribute a crawl, though, depending on how you want to distribute it.
If you have many spiders, the simplest way to distribute the load is to run multiple Scrapyd instances and spread the spiders across those machines.
If you instead want to run a single spider across many machines, you can partition the URLs to crawl and send a partition to each spider. For example:
First, prepare the list of URLs to crawl and split them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
Then fire a spider run on 3 different Scrapyd servers. The spider receives a (spider) argument part, which indicates the partition to crawl (a sketch of such a spider follows the curl commands below):
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
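The snippet above only shows how the jobs are scheduled; here is a minimal sketch of how such a spider might consume the part argument. The spider name, the location of the URL lists, and the one-URL-per-line file format are assumptions carried over from the example URLs above:

import scrapy


class Spider1(scrapy.Spider):
    # Hypothetical spider that crawls only the partition given by the 'part' argument.
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.part = part

    def start_requests(self):
        # Download the partition file first, then crawl every URL it lists.
        yield scrapy.Request(
            'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part,
            callback=self.parse_url_list,
        )

    def parse_url_list(self, response):
        # Assumes the .list file contains one URL per line.
        for line in response.text.splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # The actual parsing logic for the crawled pages goes here.
        pass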
Some websites implement certain measures to prevent crawlers from crawling them. Getting around those measures is not easy; it requires skill and sometimes special infrastructure.
Here are some tips to keep in mind when dealing with these kinds of sites:
- Disable cookies (see the COOKIES_ENABLED setting), as some sites may use cookies to spot crawler behaviour.
- Use download delays; see the DOWNLOAD_DELAY setting (a settings sketch follows below).
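As a concrete illustration of the two tips above, both options can go in the project's settings.py. This is a minimal sketch; the 2-second delay is just an illustrative value:

# settings.py (illustrative values)

# Disable cookies so sites cannot use them to track the crawler.
COOKIES_ENABLED = False

# Wait 2 seconds between requests to the same website.
DOWNLOAD_DELAY = 2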
For some applications, the structure of the items is determined by user input or other changing conditions. In such cases you can create item classes dynamically.
from scrapy.item import DictItem, Field

def create_item_class(class_name, field_list):
    fields = {field_name: Field() for field_name in field_list}
    return type(class_name, (DictItem,), {'fields': fields})
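A usage sketch for the helper above; the class name and field names are made up for illustration:

# Build an item class whose fields are only known at runtime.
ProductItem = create_item_class('ProductItem', ['name', 'price', 'url'])

item = ProductItem()
item['name'] = 'Example product'
item['price'] = '9.99'
item['url'] = 'http://somedomain.com/product/1'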