Scrapy Tutorial

After installing Scrapy, create a new project:

scrapy startproject tutorial

This creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

  • scrapy.cfg: the project configuration file
  • tutorial/: the project’s python module, you’ll later import your code from here.
  • tutorial/items.py: the project’s items file.
  • tutorial/pipelines.py: the project’s pipelines file.
  • tutorial/settings.py: the project’s settings file.
  • tutorial/spiders/: a directory where you’ll later put your spiders.
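
For reference, scrapy.cfg is a small INI-style file that points Scrapy at the project's settings module. Its exact contents vary slightly between Scrapy versions, so treat this as a sketch of what startproject generates:

[settings]
default = tutorial.settings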

 

Defining our Item

Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.


They are declared by creating a scrapy.item.Item class and defining its attributes as scrapy.item.Field objects, like you would in an ORM (don’t worry if you’re not familiar with ORMs, you will see that this is an easy task).

We begin by modeling the item that we will use to hold the sites data obtained from dmoz.org, as we want to capture the name, url and description of the sites, we define fields for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:



from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.

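As a quick illustration of that field protection, here is a small sketch (assuming the DmozItem declared above and the Python 2 syntax used throughout this tutorial): assigning to a declared field works like a normal dict, while an undeclared field raises a KeyError.

from tutorial.items import DmozItem

item = DmozItem(title='Example title')
item['link'] = 'http://example.com/'   # declared field: behaves like a dict key

try:
    item['author'] = 'someone'         # 'author' was never declared as a Field
except KeyError:
    print 'DmozItem does not support an "author" field'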

Our first Spider

Spiders are user-written classes used to scrape information from a domain (or group of domains).

They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.


To create a Spider, you must subclass scrapy.spider.BaseSpider, and define the three main, mandatory, attributes:


  • name: identifies the Spider. It must be unique, that is, you can’t set the same name for different Spiders.

  • start_urls: is a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

  • parse() is a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument. This method is responsible for parsing the response data and extracting scraped data (as scraped items) and more URLs to follow.


    The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

name  # identifies the spider; this attribute must be unique

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which corresponds to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).


But more interesting, as our parse method instructs, two files have been created: Books and Resources, with the content of both URLs.


(Two new files, Books and Resources, now appear in the top-level directory, each containing the content of one of the two URLs.)

What just happened under the hood?

Scrapy creates scrapy.http.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse()method.


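As a rough sketch of what that default behaviour amounts to (using the same 0.x API as the rest of this tutorial; the spider below is a hypothetical rewrite, not something you need to add), building the initial Requests yourself would look like this:

from scrapy.http import Request
from scrapy.spider import BaseSpider

class ExplicitDmozSpider(BaseSpider):
    name = "dmoz_explicit"
    allowed_domains = ["dmoz.org"]

    def start_requests(self):
        # What Scrapy does for start_urls: one Request per URL, with the
        # spider's parse() method attached as the callback.
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        return [Request(url, callback=self.parse) for url in urls]

    def parse(self, response):
        # Each downloaded Response comes back here, as described above.
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)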

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

  • /html/head/title: selects the <title> element, inside the <head> element of a HTML document
  • /html/head/title/text(): selects the text inside the aforementioned <title> element.
  • //td: selects all the <td> elements
  • //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated to the root node, or the entire document.


 

Selectors have three methods (click on the method to see the complete API documentation).

  • select(): returns a list of selectors, each of them representing the nodes selected by the xpath expression given as argument.

  • extract(): returns a unicode string with the data selected by the XPath selector.

  • re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
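
To make those three methods concrete, here is a small sketch using the same HtmlXPathSelector API, run against a tiny hand-written HTML document (the URL and markup below are made up for illustration):

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

body = ('<html><head><title>Python: Books</title></head>'
        '<body><ul><li><a href="http://example.com/">Example</a> - a demo link</li></ul></body></html>')
response = HtmlResponse(url='http://example.com/', body=body, encoding='utf-8')
hxs = HtmlXPathSelector(response)

# select() returns a list of selectors, one per matching node
print hxs.select('//ul/li/a')

# extract() returns the selected data as unicode strings
print hxs.select('//title/text()').extract()     # [u'Python: Books']

# re() applies a regular expression to the selected text
print hxs.select('//title/text()').re('(\w+):')  # [u'Python']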

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.


To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

 

Output:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
[s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s]   shelp()           Print this help
[s]   fetch(req_or_url) Fetch a new request or URL and update shell objects
[s]   view(response)    View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let’s try them:



Here we are using XPath selectors: with HtmlXPathSelector you can pick out exactly the data you need. Scrapy provides a rather convenient way (Django has something similar) to test XPath expressions: just type the following in a terminal

scrapy shell url   # url is the URL of the page you want to scrape

and you drop into an interactive mode where you can test expressions against the selector objects, for example:

hxs = HtmlXPathSelector(response)
self.title = hxs.select('//title/text()').extract()[0].strip().replace(' ', '_')
sites = hxs.select('//ul/li/div/a/img/@src').extract()
 
In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this an easier task, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web sites information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:



hxs.select('//ul/li')

And from them, the sites descriptions:

hxs.select('//ul/li/text()').extract()

The sites titles:

hxs.select('//ul/li/a/text()').extract()

And the sites links:

hxs.select('//ul/li/a/@href').extract()

As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to dig deeper into a node. We are going to use that property here, so:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc

Let’s add this code to our spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

Now try crawling the dmoz.org domain again and you’ll see sites being printed in your output, run:

scrapy crawl dmoz

Using our item

Item objects are custom python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax like:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

Spiders are expected to return their scraped data inside Item objects. So, in order to return the data we’ve scraped so far, the final code for our Spider would be like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["dmoz.org"]
   start_urls = [
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//ul/li')
       items = []
       for site in sites:
           item = DmozItem()
           item['title'] = site.select('a/text()').extract()
           item['link'] = site.select('a/@href').extract()
           item['desc'] = site.select('text()').extract()
           items.append(item)
       return items

I ran into a problem with the above. My current file layout is as follows:

tutorial
--tutorial
----spiders
------__init__
------dmoz_spider

----__init__
----items
----pipelines
----setting

At first I used

from tutorial.items import DmozItem

and PyDev reported an error: tutorial has no items module. After changing it to from tutorial.tutorial.items import DmozItem, PyDev was satisfied, but running scrapy crawl dmoz from the command line then failed, complaining that there is no tutorial.items module under tutorial. Changing it back to

from tutorial.items import DmozItem

works correctly.
 
  

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}

Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate a items.json file containing all scraped items, serialized in JSON.
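
Once the command finishes, the file can be read back with the standard json module; a quick sanity check (assuming the items.json produced above) might look like this:

import json

with open('items.json') as f:
    items = json.load(f)

print len(items), 'items scraped'
print items[0]['title'], items[0]['link']   # each field is a list, since extract() returns lists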

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. Though you don’t need to implement any item pipeline if you just want to store the scraped items.
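
As a minimal sketch of such a pipeline (assuming the DmozItem fields used above; the class name is made up for illustration), this would live in tutorial/pipelines.py and drop any item that was scraped without a title:

from scrapy.exceptions import DropItem

class RequireTitlePipeline(object):
    """Drop scraped items that have no title."""

    def process_item(self, item, spider):
        # Called once for every item the spider returns.
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        return item

To enable it you would list it in the ITEM_PIPELINES setting in tutorial/settings.py (in recent versions a dict such as {'tutorial.pipelines.RequireTitlePipeline': 300}; older versions used a plain list of class paths).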

Next steps

This tutorial covers only the basics of Scrapy, but there’s a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.

Note: Scrapy 0.24 changed the Item definition. It used to be:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    link = Field()
    desc = Field()

Now it is:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Similar changes were made elsewhere in the API.