```shell
# dependencies are resolved automatically
$ pip install scrapy
```

An overview of the related dependency libraries:
- `lxml`: an XML and HTML parser.
- `parsel`: an HTML/XML data extraction library built on top of lxml.
- `w3lib`: a multi-purpose helper for dealing with URLs and web page encodings.
- `twisted`: an asynchronous networking framework.
- `cryptography` and `pyOpenSSL`: to deal with various network-level security needs.

Create a project and look at its layout:

```shell
$ scrapy startproject tutorial
$ tree tutorial/
scrapy.cfg            # deploy configuration file
tutorial/             # project's Python module, you'll import your code from here
    __init__.py
    items.py          # project items definition file
    pipelines.py      # project pipelines file
    settings.py       # project settings file
    spiders/          # a directory where you'll later put your spiders
        __init__.py
```
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). A spider must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
```shell
$ vim tutorial/spiders/quotes_spider.py
```

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:  # response.body is bytes, so open in binary mode
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
- `name`: identifies the spider. It must be unique within a project.
- `start_requests()`: must return an iterable of Requests (you can return a list of requests or write a generator) which the spider will begin to crawl from.

Instead of implementing `start_requests()`, you can just define a `start_urls` class attribute with a list of URLs, which will use the default implementation of `start_requests()` and use `parse()` as the default callback of each Request, like this:
```python
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
```
- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.
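For instance, a minimal sketch of such a `parse()` method (illustrative, assuming the quotes.toscrape.com markup used throughout this tutorial) could look like:

```python
# Illustrative sketch: yield scraped data as dicts, then follow the
# pagination link by yielding a new Request back to this same callback.
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
        }

    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```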
Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the spider. Upon receiving a response for each one, it instantiates `Response` objects and calls the callback method associated with the request (in this case, the `parse` method), passing the response as argument.
```shell
# go to the project's top level directory
$ cd tutorial
$ scrapy crawl quotes
```
The best way to learn how to extract data with Scrapy is trying selectors in the Scrapy shell.
```shell
$ scrapy shell 'http://quotes.toscrape.com/page/1/'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x3e0add0>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x3e0ae50>
[s]   spider     <DefaultSpider 'default' at 0x41dfcd0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True])  Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                   Fetch a scrapy.Request and update local objects
[s]   shelp()                      Shell help (print this help)
[s]   view(response)               View response in a browser
```
When selecting elements using CSS with the response object, the result is a list-like object called `SelectorList`, which represents a list of `Selector` objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
>>> response.css("title") [<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
Extract the text from the title:
>>> response.css("title::text").extract() [u'Quotes to Scrape'] >>> response.css("title").extract() [u'<title>Quotes to Scrape</title>'] # may cause an IndexError and return None when it doesn't find any element matching the selection. >>> response.css('title::text').extract_first() u'Quotes to Scrape' >>> response.css('title::text')[0].extract() u'Quotes to Scrape'
Besides the `extract()` and `extract_first()` methods, you can also use the `re()` method to extract using regular expressions.
```python
>>> response.css('title::text').re(r'Quotes.*')
[u'Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
[u'Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
[u'Quotes', u'Scrape']
```
`view(response)` will open the response page from the shell in your web browser. You will find that Firebug and Selector Gadget are amazing tools for finding CSS selectors.
XPath expressions are very powerful, and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood.
Scrapy selectors also support using XPath expressions:
```python
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'Quotes to Scrape'>]
>>> response.xpath('//title/text()').extract()
[u'Quotes to Scrape']
```
XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you're able to select things like the link that contains the text 'Next Page'.
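For example, a quick sketch in the Scrapy shell (assuming the "Next" link text used on quotes.toscrape.com):

```python
# Select the pagination link by its text content, something CSS alone cannot do.
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
u'/page/2/'
```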
We recommend this tutorial to learn XPath through examples, and this tutorial to learn "how to think in XPath".
JSON:

Scrapy appends to a given file instead of overwriting its contents, so remove the file before running the command a second time.

```shell
$ scrapy crawl quotes -o quotes.json
```
JSON Lines:

The JSON Lines format is useful because it's stream-like: you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice.

```shell
$ scrapy crawl quotes -o quotes.jl
```
Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
```python
>>> response.css('li.next a').extract_first()
u'<a href="/page/2/">Next <span aria-hidden="true">\u2192</span></a>'
```
Scrapy supports a CSS extension that lets you select the attribute contents.
```python
>>> response.css('li.next a::attr(href)').extract_first()
u'/page/2/'
```
Build a full absolute URL using the `urljoin()` method (since the links can be relative), and yield a new request to the next page, registering itself as the callback to handle the data extraction for the next page.
```python
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
```
`response.follow()`: a shortcut for creating Requests:
```python
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
```
`response.follow()` supports relative URLs directly. Note that `response.follow()` just returns a Request instance; you still have to yield this Request.
You can also pass a selector to `response.follow()` instead of a string; this selector should extract the necessary attributes.
```python
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
```
For `<a>` elements there is a shortcut: `response.follow()` uses their href attribute automatically.
```python
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
```
```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css("li.next a::attr(href)"):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
```
Even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting `DUPEFILTER_CLASS`.
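For reference, a sketch of how that setting appears in settings.py (the default filter is shown; you would point it at your own subclass to change the behaviour):

```python
# settings.py -- the default duplicate-request filter; replace the path
# with your own RFPDupeFilter subclass to customise deduplication.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
```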
You can check out the `CrawlSpider` class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of.
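A minimal sketch of what that can look like (this spider is illustrative, not part of the tutorial; its single rule just follows the pagination links on quotes.toscrape.com):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    start_urls = ['http://quotes.toscrape.com/']

    # the rules engine: extract links matched by the LinkExtractor, send each
    # downloaded response to parse_page, and keep following further links
    rules = (
        Rule(LinkExtractor(restrict_css='li.next'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```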
You can provide command line arguments to your spiders by using the `-a` option when running them:
```shell
$ scrapy crawl quotes_tags -o quotes-humor.json -a tag=humor
```
These arguments are passed to the spider's `__init__` method and become spider attributes by default. In this example, the value provided for the `tag` argument will be available via `self.tag`. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
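A minimal sketch of how such a spider's `start_requests()` might build the URL (the `getattr` lookup with a default is an assumption here, so the spider still works when no `-a tag=...` is given):

```python
def start_requests(self):
    url = 'http://quotes.toscrape.com/'
    tag = getattr(self, 'tag', None)  # set from the -a tag=... argument, if any
    if tag is not None:
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)
```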
A complete spider can also live in a single file and be run without creating a project, using `scrapy runspider`:

```shell
$ cat quotes_spider.py
```

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

```shell
$ scrapy runspider quotes_spider.py -o quotes.json
```
- `start_urls`: where the crawl starts.
- `parse`: the method that parses the responses downloaded for the `start_urls`.
Requests are scheduled and processed asynchronously.
Using feed exports to generate the JSON file, you can easily change the export format (XML, CSV, etc.) or the storage backend (FTP, S3, etc.). You can also write an item pipeline to store the items in a database.
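As a rough sketch of that last idea (not part of the tutorial; the SQLite file and table names are made up for illustration), an item pipeline is just a class with a `process_item()` method:

```python
# tutorial/pipelines.py (sketch)
import sqlite3


class SQLitePipeline(object):
    def open_spider(self, spider):
        # called when the spider is opened: create the database and table
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)')

    def close_spider(self, spider):
        # called when the spider is closed: flush and release the connection
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # called for every item yielded by the spider
        self.conn.execute('INSERT INTO quotes (text, author) VALUES (?, ?)',
                          (item.get('text'), item.get('author')))
        return item
```

It would then be enabled in settings.py with something like `ITEM_PIPELINES = {'tutorial.pipelines.SQLitePipeline': 300}`.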