I've been swamped lately...
The previous post saved to MongoDB directly inside the spider. That works, but it's not great practice; in Scrapy, the proper approach is to use items.
You define the content to save in an item, then process the item in a pipeline, so the crawl flow becomes: the spider yields items, and the pipeline writes them to the database.
First, define the item: declare the fields to scrape in items.py.
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GetquotesItem(scrapy.Item):
    # define the fields for your item here like:
    # the content we want to scrape:
    # 1. quote text
    # 2. author
    # 3. tags
    content = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```
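Incidentally, a scrapy.Item behaves much like a dict, which is what makes the dict(item) conversion in the pipeline below possible. A quick sanity check (a minimal sketch; run it from the project root so getquotes is importable):

```python
from getquotes.items import GetquotesItem

item = GetquotesItem(author='Albert Einstein', tags=['life'])
item['content'] = 'Some quote text.'

# items support dict-style access and convert cleanly to a plain dict
print(item['author'])   # Albert Einstein
print(dict(item))       # {'author': ..., 'tags': [...], 'content': ...}
```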
We keep the database connection info in settings.py so it's easy to look up:
```python
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'store_quotes2'
MONGODB_TABLE = 'quotes2'
```
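Before running the spider, it's worth checking that MongoDB is actually reachable with these values. A throwaway sketch (not part of the project; assumes pymongo is installed):

```python
import pymongo

# connect with the same host/port configured in settings.py
client = pymongo.MongoClient('localhost', 27017)

# server_info() raises if the server is unreachable,
# so this doubles as a connectivity check
print(client.server_info()['version'])
```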
Also, be sure to uncomment the ITEM_PIPELINES setting in settings.py; otherwise the pipeline won't take effect:
```python
#ITEM_PIPELINES = {
#    'getquotes.pipelines.SomePipeline': 300,
#}
```
Change it to:
```python
ITEM_PIPELINES = {
    'getquotes.pipelines.GetquotesPipeline': 300,
}
```
Now define the item-processing method in pipelines.py:
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# import settings so we can use the values defined there
from scrapy.conf import settings
import pymongo


class GetquotesPipeline(object):

    # connect to the database
    def __init__(self):
        # read the connection info from settings
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        db = client[dbname]
        self.table = db[settings['MONGODB_TABLE']]

    # process the item
    def process_item(self, item, spider):
        # convert the item to a dict, then insert it into the database
        quote_info = dict(item)
        self.table.insert(quote_info)
        return item
```
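Note that both scrapy.conf and collection-level insert() have since been deprecated: newer Scrapy versions expose settings through the from_crawler classmethod, and pymongo 3+ uses insert_one. If the code above complains on a newer stack, here is a sketch of the equivalent modern pipeline, assuming the same setting names:

```python
import pymongo


class GetquotesPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy passes the crawler in, giving access to project settings
        return cls(
            host=crawler.settings.get('MONGODB_HOST'),
            port=crawler.settings.get('MONGODB_PORT'),
            dbname=crawler.settings.get('MONGODB_DBNAME'),
            table=crawler.settings.get('MONGODB_TABLE'),
        )

    def __init__(self, host, port, dbname, table):
        client = pymongo.MongoClient(host=host, port=port)
        self.table = client[dbname][table]

    def process_item(self, item, spider):
        # insert_one replaces the deprecated insert() in pymongo 3+
        self.table.insert_one(dict(item))
        return item
```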
Correspondingly, the code in myspider.py changes a little:
```python
import scrapy
import pymongo

# don't forget to import the item we defined
from getquotes.items import GetquotesItem


class myspider(scrapy.Spider):

    # set the spider name
    name = "get_quotes"

    # set the start URL
    start_urls = ['http://quotes.toscrape.com']

    '''
    # configure the client: default host localhost, port 27017
    client = pymongo.MongoClient('localhost', 27017)
    # create a database named store_quotes
    db_name = client['store_quotes']
    # create a collection
    quotes_list = db_name['quotes']
    '''

    def parse(self, response):
        # scrape with CSS selectors; BeautifulSoup and the like work too
        # first locate each quote block, then extract the author, text
        # and tags within it
        for quote in response.css('.quote'):
            '''
            # insert the scraped data into MongoDB
            yield self.quotes_list.insert({
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
                'content': quote.css('span.text::text').extract_first()
            })
            '''
            item = GetquotesItem()
            item['author'] = quote.css('small.author::text').extract_first()
            item['content'] = quote.css('span.text::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item

        # use XPath to grab the href attribute of the "next" button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # check whether a next page exists
        if next_href is not None:
            # if it does, build the absolute URL with urljoin, e.g.
            # http://quotes.toscrape.com/page/2
            next_page = response.urljoin(next_href)
            # call back into parse to handle the next page
            yield scrapy.Request(next_page, callback=self.parse)
```
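With the item, pipeline, and settings wired up, run the spider from the project root as usual:

```
scrapy crawl get_quotes
```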
You can see in Scrapy's log output that the pipeline has been enabled.
Now let's look at what was saved in the database.
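Besides the mongo shell, a quick way to check from Python (a throwaway sketch using the same names from settings.py; count_documents needs pymongo 3.7+, older versions use count()):

```python
import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['store_quotes2']['quotes2']

print(collection.count_documents({}))  # number of quotes saved
print(collection.find_one())           # peek at one document
```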
Saved perfectly.