Scrapy at a glance

Scrapy is a fast, high-level screen scraping and web crawling application framework written in Python, used for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

 

Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.

 

The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.

 

When you’re ready to start a project, you can start with the tutorial.

 

Pick a website

So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information.

Let’s say we want to extract the URL, name, description and size of all torrent files added today in the Mininova site.

The list of all torrents added today can be found on this page: http://www.mininova.org/today

Define the data you want to scrape

The first thing is to define the data we want to scrape. In Scrapy, this is done through Scrapy Items (Torrent files, in this case).

This would be our Item:

from scrapy.item import Item, Field

class Torrent(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()

 

 

Write a Spider to extract the data

The next thing is to write a Spider which defines the start URL (http://www.mininova.org/today), the rules for following links and the rules for extracting the data from pages.

If we take a look at that page content we’ll see that all torrent URLs are like http://www.mininova.org/tor/NUMBER where NUMBER is an integer. We’ll use that to construct the regular expression for the links to follow: /tor/\d+.
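Just to illustrate what that pattern matches (this check is not part of the original example), it can be tried with Python's re module; the first URL is the torrent page used later in this article, the second is the listing page itself:

import re

# The link-following pattern used by the spider's Rule below
link_pattern = re.compile(r'/tor/\d+')

print(bool(link_pattern.search('http://www.mininova.org/tor/2657665')))  # True: a torrent detail page
print(bool(link_pattern.search('http://www.mininova.org/today')))        # False: the listing page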

We’ll use XPath for selecting the data to extract from the web page HTML source. Let’s take one of those torrent pages:

And look at the page HTML source to construct the XPath to select the data we want which is: torrent name, description and size.

By looking at the page HTML source we can see that the file name is contained inside a <h1> tag:

<h1>Home[2009][Eng]XviD-ovd</h1>

 

An XPath expression to extract the name could be:

//h1/text()

 

And the description is contained inside a <div> tag with id="description":
<h2>Description:</h2>

<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate.

...

 

An XPath expression to select the description could be:
//div[@id='description']

 

Finally, the file size is contained in the second <p> tag inside the <div> tag with id="specifications":
<div id="specifications">

<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>

<p>
<strong>Total size:</strong>
699.79&nbsp;megabyte</p>

 

An XPath expression to select the file size could be:
//div[@id='specifications']/p[2]/text()[2]

 

For more information about XPath see the XPath reference.
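These expressions can also be tried outside of a full crawl. The sketch below uses the same Scrapy 0.x-era HtmlXPathSelector API as the spider that follows, against a cut-down HTML fragment that merely mirrors the excerpts above (the fragment is a stand-in for the real page, not its actual markup):

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# A trimmed stand-in for a mininova torrent page, just enough to exercise the XPaths
body = """<html><body>
<h1>Home[2009][Eng]XviD-ovd</h1>
<div id="description">"HOME" - a documentary film by Yann Arthus-Bertrand</div>
<div id="specifications">
<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> &gt; <a href="/sub/35">Documentary</a>
</p>
<p>
<strong>Total size:</strong>
699.79&nbsp;megabyte</p>
</div>
</body></html>"""

response = HtmlResponse(url='http://www.mininova.org/tor/2657665', body=body, encoding='utf-8')
hxs = HtmlXPathSelector(response)

print(hxs.select("//h1/text()").extract())                                  # name
print(hxs.select("//div[@id='description']").extract())                     # description
print(hxs.select("//div[@id='specifications']/p[2]/text()[2]").extract())   # size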

Finally, here’s the spider code:

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)

        torrent = Torrent()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

 

For brevity’s sake, we intentionally left out the import statements. The Torrent item is defined above.
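For reference, the imports the spider relies on are the same ones that appear in the full project source listing at the end of this article, plus the import of the Torrent item from your project's items module:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor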

Run the spider to extract the data

Finally, we’ll run the spider to crawl the site and output a file scraped_data.json with the scraped data in JSON format:

scrapy crawl mininova.org -o scraped_data.json -t json

 

This uses feed exports to generate the JSON file. You can easily change the export format (XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example).
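For example, switching the same crawl to CSV output is just a matter of changing the feed options on the command line (the same -o/-t flags shown above; XML works the same way with -t xml):

scrapy crawl mininova.org -o scraped_data.csv -t csv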

You can also write an item pipeline to store the items in a database very easily.

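A minimal sketch of such a pipeline, purely for illustration (the SQLite schema, the class name and the module path in the comment are all made up here; process_item, open_spider and close_spider are the standard item pipeline hooks):

import sqlite3

class TorrentSqlitePipeline(object):
    """Store every scraped torrent item in a local SQLite database."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and create the table
        self.conn = sqlite3.connect('torrents.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS torrent (url TEXT, name TEXT, description TEXT, size TEXT)')

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the connection
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # The selector-extracted fields are lists (see the next section), hence the joins
        self.conn.execute(
            'INSERT INTO torrent VALUES (?, ?, ?, ?)',
            (item['url'], ''.join(item['name']),
             ''.join(item['description']), ''.join(item['size'])))
        return item

# Enabled in settings.py, e.g.:
# ITEM_PIPELINES = ['mininova.pipelines.TorrentSqlitePipeline']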

Review scraped data

If you check the scraped_data.json file after the process finishes, you’ll see the scraped items there:

[{"url": "http://www.mininova.org/tor/2657665", "name": ["Home[2009][Eng]XviD-ovd"], "description": ["HOME - a documentary film by ..."], "size": ["699.69 megabyte"]},
# ... other items ...
]
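To spot-check the file programmatically rather than by eye, it can simply be loaded with Python's json module (a trivial sketch, assuming the scraped_data.json produced above):

import json

with open('scraped_data.json') as f:
    items = json.load(f)

print(len(items))          # number of torrents scraped
print(items[0]['name'])    # a list with one string -- see the note below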

 

You’ll notice that all field values (except for the url which was assigned directly) are actually lists. This is because the selectors return lists. You may want to store single values, or perform some additional parsing/cleansing to the values. That’s what Item Loaders are for.

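A rough sketch of that idea, assuming the same Scrapy 0.x contrib loader module as the rest of the code in this article and the Torrent item defined earlier (the loader class name is made up):

from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst

class TorrentLoader(XPathItemLoader):
    """Strip whitespace from every extracted value and keep only the first match."""
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

# parse_torrent could then be written roughly as:
#     loader = TorrentLoader(item=Torrent(), response=response)
#     loader.add_value('url', response.url)
#     loader.add_xpath('name', "//h1/text()")
#     loader.add_xpath('description', "//div[@id='description']")
#     loader.add_xpath('size', "//div[@id='specifications']/p[2]/text()[2]")
#     return loader.load_item()
# so that torrent['name'] ends up as a single stripped string instead of a list.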

 

What else?

You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:


  • Built-in support for selecting and extracting data from HTML and XML sources
  • Built-in support for cleaning and sanitizing the scraped data using a collection of reusable filters (called Item Loaders) shared between all the spiders.
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
  • A media pipeline for automatically downloading images (or any other media) associated with the scraped items
  • Support for extending Scrapy by plugging your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
  • Wide range of built-in middlewares and extensions for:
    • cookies and session handling
    • HTTP compression
    • HTTP authentication
    • HTTP cache
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
  • Support for creating spiders based on pre-defined templates, to speed up spider creation and make their code more consistent on large projects. See the genspider command for more details.
  • Extensible stats collection for multiple spider metrics, useful for monitoring the performance of your spiders and detecting when they get broken
  • An Interactive shell console for trying XPaths, very useful for writing and debugging your spiders (a brief example follows this list)
  • A System service designed to ease the deployment and run of your spiders in production.
  • A built-in Web service for monitoring and controlling your bot
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • Logging facility that you can hook on to for catching errors during the scraping process.
  • Support for crawling based on URLs discovered through Sitemaps
  • A caching DNS resolver
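For instance, a quick illustrative session with the interactive console might look like this (the hxs selector object is what the 0.x-era shell exposes; output omitted):

scrapy shell http://www.mininova.org/tor/2657665
>>> hxs.select("//h1/text()").extract()
>>> hxs.select("//div[@id='description']").extract()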

What’s next?

The next obvious steps are for you to download Scrapy, read the tutorial and join the community. Thanks for your interest!

 

T:\mininova\mininova\items.py source code

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class MininovaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    name = Field()
    description = Field()
    size = Field()
        

T:\mininova\mininova\spiders\spider_mininova.py source code

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule   
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mininova.items import MininovaItem

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    #start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_item')]

    # def parse_item(self, response):
        # filename = response.url.split("/")[-1] + ".html"
        # open(filename, 'wb').write(response.body)

    
    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        item = MininovaItem()
        item['url'] = response.url
        #item['name'] = x.select('''//*[@id="content"]/h1''').extract()
        item['name'] = x.select("//h1/text()").extract()
        #item['description'] = x.select("//div[@id='description']").extract()
        item['description'] = x.select('''//*[@id="specifications"]/p[7]/text()''').extract() #download
        #item['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        item['size'] = x.select('''//*[@id="specifications"]/p[3]/text()''').extract()
        return item