First, define a few fields in items.py to hold the page data (URL, title, and page source), as shown below:
import scrapy


class MycnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page_title = scrapy.Field()
    page_url = scrapy.Field()
    page_html = scrapy.Field()
The most important piece is our spider. It inherits from CrawlSpider, which makes it easy to define regular expressions that tell the crawler which pages to fetch, for example the "next page" links and the individual articles.

In the spider, we use the parse_item method to parse each target page and extract the article's URL, title, and content.

Note: in parse_item we insert a base tag into the downloaded HTML source, so that when the saved HTML file is opened locally the page does not render broken but still uses cnblogs' CSS styles.

The spider source is as follows:
# -*- coding: utf-8 -*-
from mycnblogs.items import MycnblogsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CnblogsSpider(CrawlSpider):
    name = "cnblogs"
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/hongfei/']

    rules = (
        # Crawl the "next page" links; no callback means follow defaults to True
        Rule(LinkExtractor(allow=('default.html\?page=\d+',))),
        # Crawl all articles and parse them with parse_item to get the URL, title, and content
        Rule(LinkExtractor(allow=('hongfei/p/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=('hongfei/articles/',)), callback='parse_item'),
        Rule(LinkExtractor(allow=('hongfei/archive/\d+/\d+/\d+/\d+.html',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = MycnblogsItem()
        item['page_url'] = response.url
        item['page_title'] = response.xpath("//title/text()").extract_first()
        html = response.body.decode("utf-8")
        # Insert a <base> tag so the saved page resolves relative CSS/JS/image URLs against cnblogs.com
        html = html.replace("<head>", "<head><base href='http://www.cnblogs.com/'>")
        item['page_html'] = html
        yield item
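Before running the full crawl, you can sanity-check that the LinkExtractor patterns match the URLs you expect. Below is a minimal sketch using the Scrapy shell; the start URL and the allow pattern are taken from the spider above, and this check is an extra step not shown in the original post:

# First open the Scrapy shell on the blog index page:
#   scrapy shell 'http://www.cnblogs.com/hongfei/'
# Then, inside the shell, test one of the patterns used in the rules:
from scrapy.linkextractors import LinkExtractor

article_links = LinkExtractor(allow=('hongfei/p/',)).extract_links(response)
for link in article_links:
    print(link.url)  # each article URL matched on the index page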
In pipelines.py, we use the process_item method to handle each returned item:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import codecs


class MycnblogsPipeline(object):
    def process_item(self, item, spider):
        # Save each article as an HTML file named after its title
        file_name = './blogs/' + item['page_title'] + '.html'
        with codecs.open(filename=file_name, mode='wb', encoding='utf-8') as f:
            f.write(item['page_html'])
        return item
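One caveat: process_item writes into a ./blogs/ directory and will fail if that directory does not exist. A small sketch of one way to handle this, creating the directory in the pipeline's open_spider hook (an addition of mine, not part of the original post):

import os


class MycnblogsPipeline(object):
    def open_spider(self, spider):
        # Runs once when the spider starts: make sure the output directory exists
        if not os.path.exists('./blogs'):
            os.makedirs('./blogs')

    # process_item stays the same as shown above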
Typical uses of an item pipeline include cleaning and validating scraped data and storing items, which is what we do here by writing each page to a file.

To enable an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:
ITEM_PIPELINES = {
    'mycnblogs.pipelines.MycnblogsPipeline': 300,
}
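With the item, spider, and pipeline in place, the crawl is started from the project root with the standard Scrapy command (the spider name "cnblogs" comes from the name attribute defined above):

scrapy crawl cnblogs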
After the crawl finishes, all of the articles have been collected and saved locally.
Original post: http://www.cnblogs.com/hongfei/p/6659934.html