Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.
When you’re ready to start a project, you can start with the tutorial.
So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information.
Let's say we want to extract the URL, name, description and size of all torrent files added today on the Mininova site.
The list of all torrents added today can be found on this page:
The first thing is to define the data we want to scrape. In Scrapy, this is done through Scrapy Items (torrent files, in this case).
This would be our Item:
from scrapy.item import Item, Field


class Torrent(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
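Items behave like dictionaries. As a quick illustration (not part of the project code; the field values here are just examples):

# A small sketch of the Item API; the values are made up.
torrent = Torrent(url='http://www.mininova.org/tor/2657665',
                  name='Home[2009][Eng]XviD-ovd')
torrent['size'] = '699.79 megabyte'    # fields can also be set like dict keys
print(torrent['name'])                 # 'Home[2009][Eng]XviD-ovd'
print(dict(torrent))                   # plain dict with the populated fields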
The next thing is to write a Spider which defines the start URL (http://www.mininova.org/today), the rules for following links and the rules for extracting the data from pages.
If we take a look at that page content we'll see that all torrent URLs are like http://www.mininova.org/tor/NUMBER, where NUMBER is an integer. We'll use that to construct the regular expression for the links to follow: /tor/\d+.
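As a quick sanity check (plain Python re, nothing Scrapy-specific; the URLs are just examples), the pattern matches the torrent detail pages but not the listing page itself:

import re

# /tor/\d+ matches torrent detail URLs, not the listing page
link_pattern = re.compile(r'/tor/\d+')

print(bool(link_pattern.search('http://www.mininova.org/tor/2657665')))  # True
print(bool(link_pattern.search('http://www.mininova.org/today')))        # False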
We'll use XPath for selecting the data to extract from the web page HTML source. Let's take one of those torrent pages:
And look at the page HTML source to construct the XPath expressions that select the data we want: the torrent name, description and size.
By looking at the page HTML source we can see that the file name is contained inside an <h1> tag:
<h1>Home[2009][Eng]XviD-ovd</h1>
An XPath expression to extract the name could be:
//h1/text()
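As a minimal sketch (assuming the old HtmlXPathSelector API used throughout this post, which also accepts raw text instead of a response), the expression can be tried against the HTML fragment above:

from scrapy.selector import HtmlXPathSelector

# HtmlXPathSelector can be built from a text fragment as well as a response
hxs = HtmlXPathSelector(text='<html><body><h1>Home[2009][Eng]XviD-ovd</h1></body></html>')
print(hxs.select('//h1/text()').extract())  # [u'Home[2009][Eng]XviD-ovd']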
And the description is contained inside a <div> tag with id="description":

<h2>Description:</h2>

<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate. ...
An XPath expression to select the description could be:

//div[@id='description']
<div id="specifications"> <p> <strong>Category:</strong> <a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a> </p> <p> <strong>Total size:</strong> 699.79 megabyte</p>
An XPath expression to select the file size could be:

//div[@id='specifications']/p[2]/text()[2]
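A convenient way to try these expressions before writing the spider is the interactive Scrapy shell; for example, against the torrent page that appears in the sample output below (in these old Scrapy versions the shell exposes a selector for the fetched page as hxs):

scrapy shell http://www.mininova.org/tor/2657665
>>> hxs.select("//h1/text()").extract()
>>> hxs.select("//div[@id='description']").extract()
>>> hxs.select("//div[@id='specifications']/p[2]/text()[2]").extract()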
Finally, here's the spider code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

# Torrent is the Item defined above; in a real project it would be imported
# from the project's items module.


class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = Torrent()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent
Finally, we'll run the spider to crawl the site and write an output file, scraped_data.json, with the scraped data in JSON format:
scrapy crawl mininova.org -o scraped_data.json -t json
You can also write an item pipeline to store the items in a database very easily.
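For example, a minimal pipeline sketch might look like the following (the module, class name and schema are made up here, and sqlite3 is used only for brevity; adapt it to your own database):

# mininova/pipelines.py -- an illustrative sketch, not the project's actual code
import sqlite3


class TorrentSqlitePipeline(object):

    def open_spider(self, spider):
        self.conn = sqlite3.connect('torrents.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS torrent '
                          '(url TEXT, name TEXT, description TEXT, size TEXT)')

    def process_item(self, item, spider):
        # selector results are lists, so join them into plain strings
        self.conn.execute('INSERT INTO torrent VALUES (?, ?, ?, ?)',
                          (item['url'],
                           ''.join(item['name']),
                           ''.join(item['description']),
                           ''.join(item['size'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

The pipeline is then enabled by adding it to the ITEM_PIPELINES setting in the project's settings.py.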
If you check the scraped_data.json file after the process finishes, you’ll see the scraped items there:
[{"url": "http://www.mininova.org/tor/2657665", "name": ["Home[2009][Eng]XviD-ovd"], "description": ["HOME - a documentary film by ..."], "size": ["699.69 megabyte"]}, # ... other items ... ]
You’ll notice that all field values (except for the url which was assigned directly) are actually lists. This is because the selectors return lists. You may want to store single values, or perform some additional parsing/cleansing to the values. That’s what Item Loaders are for.
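As a minimal sketch of that (using the old scrapy.contrib.loader API, which matches the selector API used in this post; the processor choice is only an illustration), the parse_torrent callback could populate the item through a loader instead:

from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst


class TorrentLoader(XPathItemLoader):
    # collapse the one-element lists returned by the selectors into single values
    default_output_processor = TakeFirst()


# inside MininovaSpider, replacing the parse_torrent shown earlier
def parse_torrent(self, response):
    loader = TorrentLoader(item=Torrent(), response=response)
    loader.add_value('url', response.url)
    loader.add_xpath('name', "//h1/text()")
    loader.add_xpath('description', "//div[@id='description']")
    loader.add_xpath('size', "//div[@id='specifications']/p[2]/text()[2]")
    return loader.load_item()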
You've seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides many more powerful features for making scraping easy and efficient.
The next obvious steps are for you to download Scrapy, read the tutorial and join the community. Thanks for your interest!
T:\mininova\mininova\items.py source:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field


class MininovaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    name = Field()
    description = Field()
    size = Field()
T:\mininova\mininova\spiders\spider_mininova.py source:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mininova.items import MininovaItem


class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    #start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_item')]

    # def parse_item(self, response):
    #     filename = response.url.split("/")[-1] + ".html"
    #     open(filename, 'wb').write(response.body)

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        item = MininovaItem()
        item['url'] = response.url
        #item['name'] = x.select('''//*[@id="content"]/h1''').extract()
        item['name'] = x.select("//h1/text()").extract()
        #item['description'] = x.select("//div[@id='description']").extract()
        item['description'] = x.select('''//*[@id="specifications"]/p[7]/text()''').extract()  #download
        #item['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        item['size'] = x.select('''//*[@id="specifications"]/p[3]/text()''').extract()
        return item