Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.
When you’re ready to start a project, you can start with the tutorial.
So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information.
Let’s say we want to extract the URL, name, description and size of all torrent files added today in the Mininova site.
The list of all torrents added today can be found on this page:
The first thing is to define the data we want to scrape. In Scrapy, this is done through Scrapy Items (Torrent files, in this case).
This would be our Item:
from scrapy.item import Item, Field

# Named TorrentItem so that it matches the class the spider instantiates
class TorrentItem(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
The next thing is to write a Spider which defines the start URL, the rules for following links and the rules for extracting the data from pages.
If we take a look at that page content we’ll see that all torrent URLs end in a path like /tor/NUMBER, where NUMBER is an integer. We’ll use that to construct the regular expression for the links to follow: /tor/\d+.
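To get a feel for what that pattern accepts, here is a minimal sketch using only Python's standard `re` module. The sample paths are invented for illustration, and the sketch assumes the pattern is searched anywhere in the link rather than anchored at its start.

```python
import re

# The pattern from the text: "/tor/" followed by one or more digits.
TORRENT_LINK = re.compile(r"/tor/\d+")

# Hypothetical link paths, made up for this example.
candidates = ["/tor/2657665", "/cat/4", "/sub/35", "/tor/abc", "/today"]

# Keep only the links that contain the pattern.
followed = [path for path in candidates if TORRENT_LINK.search(path)]
print(followed)
```

Only the first path survives the filter, because `/tor/abc` has no digits after `/tor/`.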
We’ll use XPath for selecting the data to extract from the web page HTML source. Let’s take one of those torrent pages:
And look at the page HTML source to construct the XPath to select the data we want, which is: torrent name, description and size.
By looking at the page HTML source we can see that the file name is contained inside a <h1> tag:
An XPath expression to extract the name could be: //h1/text()
The description is contained inside a <div> tag with id="description":

<h2>Description:</h2>

<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate.

...
The file size is contained in the second <p> tag inside the <div> tag with id="specifications":

<div id="specifications">

<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>

<p>
<strong>Total size:</strong>
699.79 megabyte</p>
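The selections described above can be tried out without Scrapy. The following is only a sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The page below is a simplified, well-formed fragment assembled for this example (real HTML is rarely valid XML), and because ElementTree has no `text()` step, the text is read through the `.text` and `.tail` attributes instead.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for a torrent page (an assumption for
# this example; the real page contains much more markup).
page = """
<html><body>
  <h1>Home[2009][Eng]XviD-ovd</h1>
  <div id="description">"HOME" - a documentary film by Yann Arthus-Bertrand</div>
  <div id="specifications">
    <p><strong>Category:</strong> <a href="/cat/4">Movies</a></p>
    <p><strong>Total size:</strong> 699.79 megabyte</p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(page)

# Name: the text of the first <h1> element.
name = root.find(".//h1").text

# Description: the text of the <div> whose id is "description".
description = root.find(".//div[@id='description']").text

# Size: the text that follows <strong> in the second <p> of "specifications".
size_paragraph = root.find(".//div[@id='specifications']/p[2]")
size = size_paragraph.find("strong").tail.strip()

print(name)
print(description)
print(size)
```

Scrapy's own selectors accept full XPath and tolerate broken HTML, so this sketch only illustrates which nodes the expressions are aiming at.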
Finally, here’s the spider code:
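from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class MininovaSpider(CrawlSpider):

    name = ''
    allowed_domains = ['']
    start_urls = ['']
    # Follow every link that matches /tor/NUMBER and hand it to parse_torrent
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        # The size sits in the second <p> of the "specifications" div shown above
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent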
Finally, we’ll run the spider to crawl the site and write an output file scraped_data.json with the scraped data in JSON format:
scrapy crawl -o scraped_data.json -t json
You can also write an item pipeline to store the items in a database very easily.
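As a rough idea of what such a pipeline could look like, here is a minimal sketch that writes items to SQLite with the standard `sqlite3` module. The class name `SQLitePipeline`, the table name and the in-memory database are assumptions made for this example; `process_item(self, item, spider)` is the method Scrapy calls on a pipeline for each item. A plain dict stands in for the item so the sketch runs without Scrapy.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each scraped item in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS torrents "
            "(url TEXT, name TEXT, description TEXT, size TEXT)"
        )

    def process_item(self, item, spider):
        # Selector results are lists, so join each one into a single string.
        row = [item["url"]] + [
            " ".join(item[field]) for field in ("name", "description", "size")
        ]
        self.conn.execute("INSERT INTO torrents VALUES (?, ?, ?, ?)", row)
        self.conn.commit()
        return item  # hand the item on to any later pipeline


# Try it with a dict shaped like the scraped item.
pipeline = SQLitePipeline()
pipeline.process_item(
    {
        "url": "",
        "name": ["Home[2009][Eng]XviD-ovd"],
        "description": ["HOME - a documentary film by ..."],
        "size": ["699.69 megabyte"],
    },
    spider=None,
)
rows = pipeline.conn.execute("SELECT name, size FROM torrents").fetchall()
print(rows)
```

In a real project the class would also have to be enabled through the `ITEM_PIPELINES` setting before Scrapy would call it.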
If you check the scraped_data.json file after the process finishes, you’ll see the scraped items there:
[{"url": "", "name": ["Home[2009][Eng]XviD-ovd"], "description": ["HOME - a documentary film by ..."], "size": ["699.69 megabyte"]},
# ... other items ...
]
You’ll notice that all field values (except for the url which was assigned directly) are actually lists. This is because the selectors return lists. You may want to store single values, or perform some additional parsing/cleansing to the values. That’s what Item Loaders are for.
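The kind of clean-up meant here can be illustrated in plain Python. This is not the Item Loader API, only a sketch of the idea: `take_first` is a hypothetical helper that collapses each list into its first non-empty value and strips surrounding whitespace.

```python
import json

# One item in the shape of the scraped output shown above.
scraped = json.loads("""
[{"url": "",
  "name": ["Home[2009][Eng]XviD-ovd"],
  "description": ["HOME - a documentary film by ..."],
  "size": ["699.69 megabyte"]}]
""")


def take_first(value):
    """Return the first non-empty string of a list, stripped.

    Values that are not lists (such as the url) are passed through unchanged.
    """
    if isinstance(value, list):
        for entry in value:
            if entry.strip():
                return entry.strip()
        return None
    return value


cleaned = [
    {key: take_first(value) for key, value in item.items()}
    for item in scraped
]
print(cleaned[0]["size"])
```

Item Loaders let you attach processing of this kind to each field, so the spider itself stays free of clean-up code.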
You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient.
The next obvious steps are for you to download Scrapy, read the tutorial and join the community. Thanks for your interest!
T:\mininova\mininova\ source code
# Define here the models for your scraped items
#
# See documentation in:
#

from scrapy.item import Item, Field

class MininovaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    name = Field()
    description = Field()
    size = Field()
T:\mininova\mininova\spiders\ source code
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mininova.items import MininovaItem

class MininovaSpider(CrawlSpider):

    name = ''
    allowed_domains = ['']
    start_urls = ['']
    #start_urls = ['']
    # Follow every link that matches /tor/NUMBER and hand it to parse_item
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_item')]

    # def parse_item(self, response):
    #     filename = response.url.split("/")[-1] + ".html"
    #     open(filename, 'wb').write(response.body)

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        item = MininovaItem()
        item['url'] = response.url
        #item['name'] = x.select('''//*[@id="content"]/h1''').extract()
        item['name'] = x.select("//h1/text()").extract()
        #item['description'] = x.select("//div[@id='description']").extract()
        item['description'] = x.select('''//*[@id="specifications"]/p[7]/text()''').extract()  # download
        #item['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        item['size'] = x.select('''//*[@id="specifications"]/p[3]/text()''').extract()
        return item