Scrapy Learning (2): Getting Started
With the groundwork laid in the previous posts, we can start crawling the web for information we are interested in. Since we have not yet covered how to simulate a login, we will begin with a site like Douban, whose content can be scraped without logging in.
My development environment is Win7 + PyCharm + Python 3.5 + MongoDB.
The crawler's target is the basic information of every book under Douban's Japanese literature tag.
First create the project:

```
scrapy startproject douban
```

Then move into the `douban` directory and generate the spider:

```
scrapy genspider book book.douban.com
```
This creates a BookSpider template in the spiders directory.
In items.py, define the data model we need:
```python
import scrapy

class BookItem(scrapy.Item):
    book_name = scrapy.Field()     # title
    book_star = scrapy.Field()     # rating
    book_pl = scrapy.Field()       # number of ratings
    book_author = scrapy.Field()
    book_publish = scrapy.Field()  # publisher
    book_date = scrapy.Field()     # publication date
    book_price = scrapy.Field()
```
Visit Douban's Japanese literature tag page and write its URL into start_urls. Then, with the help of Chrome's developer tools, we can see that each book sits in a `ul#subject-list > li.subject-item` node in the DOM.
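For reference, the top of the spider (elided as `...` in the snippet below) would look roughly like this; the exact tag URL is my assumption based on the Japanese literature tag described above:

```python
class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['book.douban.com']
    # assumed URL of the Japanese literature tag page
    start_urls = ['https://book.douban.com/tag/日本文学']
```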
```python
import scrapy
from scrapy.selector import Selector
from douban.items import BookItem

class BookSpider(scrapy.Spider):
    ...
    def parse(self, response):
        sel = Selector(response)
        book_list = sel.css('#subject_list > ul > li')
        for book in book_list:
            item = BookItem()
            item['book_name'] = book.xpath('div[@class="info"]/h2/a/text()').extract()[0].strip()
            item['book_star'] = book.xpath("div[@class='info']/div[2]/span[@class='rating_nums']/text()").extract()[0].strip()
            item['book_pl'] = book.xpath("div[@class='info']/div[2]/span[@class='pl']/text()").extract()[0].strip()
            # the "pub" line looks like "author / publisher / date / price",
            # so split on '/' and pop fields off from the end
            pub = book.xpath('div[@class="info"]/div[@class="pub"]/text()').extract()[0].strip().split('/')
            item['book_price'] = pub.pop()
            item['book_date'] = pub.pop()
            item['book_publish'] = pub.pop()
            item['book_author'] = '/'.join(pub)
            yield item
```
Test whether the code works:
```
scrapy crawl book -o items.json
```
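As an aside, Scrapy's feed exports infer the format from the file extension, so other formats need no extra configuration; for example:

```
scrapy crawl book -o items.csv
```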
Strangely, items.json turns out to contain no data. Looking back at the DEBUG output in the console:
```
2017-02-04 16:15:38 [scrapy.core.engine] INFO: Spider opened
2017-02-04 16:15:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-04 16:15:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-04 16:15:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/robot... (referer: None)
2017-02-04 16:15:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://book.douban.com/tag/%... (referer: None)
```
The pages come back with a 403 status code: the server has recognized us as a crawler and refused access. We can set USER_AGENT in settings to disguise our requests as an ordinary browser:
```
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
```
Try again and items.json now has data. On closer inspection, though, it only holds the first page. To crawl everything, we need to grab the next page's URL after finishing the current one, and keep going until all the data is fetched. So we rework the spider.
```python
...
def parse(self, response):
    sel = Selector(response)
    book_list = sel.css('#subject_list > ul > li')
    for book in book_list:
        item = BookItem()
        try:
            item['book_name'] = book.xpath('div[@class="info"]/h2/a/text()').extract()[0].strip()
            item['book_star'] = book.xpath("div[@class='info']/div[2]/span[@class='rating_nums']/text()").extract()[0].strip()
            item['book_pl'] = book.xpath("div[@class='info']/div[2]/span[@class='pl']/text()").extract()[0].strip()
            pub = book.xpath('div[@class="info"]/div[@class="pub"]/text()').extract()[0].strip().split('/')
            item['book_price'] = pub.pop()
            item['book_date'] = pub.pop()
            item['book_publish'] = pub.pop()
            item['book_author'] = '/'.join(pub)
            yield item
        except:
            # some entries are missing fields; discard them
            pass
    # extract_first() returns None on the last page instead of raising IndexError
    nextPage = sel.xpath('//div[@id="subject_list"]/div[@class="paginator"]/span[@class="next"]/a/@href').extract_first()
    if nextPage:
        next_url = 'https://book.douban.com' + nextPage.strip()
        yield scrapy.http.Request(next_url, callback=self.parse)
```
Here scrapy.http.Request calls back into parse for each following page. The try...except is there because Douban's book entries are not all formatted consistently; any entry that fails to parse is simply discarded.
通常來講,若是爬蟲速度過快。會致使網站拒絕咱們的訪問,因此咱們須要在settings設置爬蟲的間隔時間,並關掉COOKIES
```
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = False
```
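A fixed delay works, but Scrapy also ships an AutoThrottle extension that adapts the delay to server load. A minimal sketch of the relevant settings (the values here are illustrative, not from the original project):

```python
# settings.py -- illustrative values
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # cap on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server
```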
Alternatively, we can rotate through different browser UAs or IP addresses to avoid being blocked. Below, rotating the UA is used as the example.
In middlewares.py, write a middleware that swaps in a random UA; every request passes through the middleware. When its process_request returns None, Scrapy carries on through the remaining middlewares.
```python
import random

class RandomUserAgent(object):
    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # read the UA pool from the USER_AGENTS setting
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.agents))
```
Then enable it in settings:
```python
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgent': 1,
}
...
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    ...
]
```
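One thing worth knowing: Scrapy's built-in UserAgentMiddleware also sets the User-Agent header. Since our middleware runs first (priority 1) and both call setdefault, the random UA wins anyway, but a common pattern (not part of the original post) is to disable the built-in one explicitly:

```python
DOWNLOADER_MIDDLEWARES = {
    'douban.middlewares.RandomUserAgent': 1,
    # disable the stock middleware so only RandomUserAgent touches the UA
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```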
Running the program again, it is clearly much faster.
Next we want to persist the data to a database (MongoDB here as the example; saving to other databases works the same way). This logic lives in pipelines. Before that, install the database driver:
```
pip install pymongo
```
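To sanity-check that the driver can reach a local MongoDB instance (assuming the default host and port used below), a quick sketch:

```python
from pymongo import MongoClient

# assumes MongoDB is running on the default localhost:27017
client = MongoClient('localhost', 27017)
print(client.server_info()['version'])  # prints the server version if reachable
```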
Then add the configuration to settings:
```python
# MONGODB configure
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = "book"
```
```python
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy import log

class MongoDBPipeline(object):
    def __init__(self):
        connection = MongoClient(
            host=settings['MONGODB_SERVER'],
            port=settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg("Book added to MongoDB database!", level=log.DEBUG, spider=spider)
        return item
```
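The pipeline only runs if it is registered. The original post does not show this step, but in a standard Scrapy project it would look like this in settings (the priority value 300 is an arbitrary choice):

```python
ITEM_PIPELINES = {
    'douban.pipelines.MongoDBPipeline': 300,
}
```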
To save the DEBUG output that the console prints while the project runs into a log file, just set this in settings:
```
LOG_FILE = "logs/book.log"
```
Project source code: Douban book crawler