Learning anything is more efficient when you have a concrete goal. Here my goal is to scrape the OSChina blog list: http://www.oschina.net/blog
For a crawler like this that needs to follow links, Scrapy's CrawlSpider is very convenient; see the CrawlSpider documentation for details.
Environment: Python 2.7.10, Scrapy 1.1.1
First, create the project:
```
scrapy startproject blogscrapy
```
This generates the standard directory structure (the default layout the Scrapy 1.1 startproject template produces):
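```
blogscrapy/
    scrapy.cfg
    blogscrapy/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```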
Edit items.py and create a BlogScrapyItem to store the blog information:
```python
import scrapy


class BlogScrapyItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
```
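Item objects expose a dict-like API, which is handy for a quick interactive check. A minimal sketch (the values here are purely illustrative):

```python
from blogscrapy.items import BlogScrapyItem

# Field names are validated: assigning a key that was not declared
# as a Field raises KeyError.
item = BlogScrapyItem(title=u'hello', url=u'http://example.com')
print(item['title'])   # -> hello
print(dict(item))      # -> {'title': u'hello', 'url': u'http://example.com'}
```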
Create a spider file blog.py under the spiders directory and write the crawling logic:
```python
# coding=utf-8
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule

from blogscrapy.items import BlogScrapyItem


class WendaSpider(CrawlSpider):
    # unique identifier of the spider
    name = 'oschina'
    # domains the crawl is allowed to touch
    allowed_domains = ['oschina.net']
    # seed URLs
    start_urls = [
        'http://www.oschina.net/blog',
    ]
    rules = (
        # Blog-detail URLs are handed to parse_page; with follow=False,
        # links found on those detail pages are not crawled further.
        Rule(LinkExtractor(allow=(r'my\.oschina\.net/.+/blog/\d+$',)),
             callback='parse_page', follow=False),
    )

    # parse a blog detail page
    def parse_page(self, response):
        loader = ItemLoader(BlogScrapyItem(), response=response)
        loader.add_xpath('title', "//div[@class='title']/text()")
        loader.add_xpath('content', "//div[@class='BlogContent']")
        loader.add_value('url', response.url)
        return loader.load_item()
```
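If you want to sanity-check the Rule's pattern before running the full crawl, the same LinkExtractor can be exercised in scrapy shell. A quick sketch (the five-link slice is arbitrary, and it assumes the listing page layout has not changed):

```python
# Inside `scrapy shell http://www.oschina.net/blog`, where `response`
# is already bound to the fetched listing page:
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=(r'my\.oschina\.net/.+/blog/\d+$',))
for link in le.extract_links(response)[:5]:
    print(link.url)  # should print blog-detail URLs the rule would follow
```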
In the project directory, run `scrapy crawl oschina -o blogs.json` to generate a blogs.json file in the current directory containing the scraped BlogScrapyItem data as JSON.
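Because ItemLoader's default output processor is Identity, every field arrives as a list, so the exported records look roughly like this (the values are illustrative, not real scraped data):

```json
[
    {
        "title": ["\n      Some blog title      "],
        "content": ["<div class=\"BlogContent\">...</div>"],
        "url": ["http://my.oschina.net/someuser/blog/123456"]
    }
]
```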
The data extracted this way is messy: the title keeps its surrounding whitespace and the content keeps its HTML markup. You can add an input_processor to each item field to clean the values as they are loaded. items.py:
```python
import scrapy
from scrapy.loader.processors import MapCompose
from w3lib.html import remove_tags


def filter_title(value):
    return value.strip()


class BlogScrapyItem(scrapy.Item):
    title = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title))
    content = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title))
    url = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title))
```
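The input processors strip tags and whitespace, but each field still comes out of the loader as a list. If single values are wanted in blogs.json, an output_processor such as TakeFirst can be added as well. This is an optional extension beyond the post's code, sketched for the title field only:

```python
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


def filter_title(value):
    return value.strip()


class BlogScrapyItem(scrapy.Item):
    # TakeFirst collapses the list a field collects into its first
    # non-empty value, so the export holds plain strings, not lists.
    title = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_title),
        output_processor=TakeFirst(),
    )
```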
GitHub: https://github.com/chenglp1215/scrapy_demo/tree/master/blogscrapy