```shell
# dependencies are resolved automatically
$ pip install scrapy
```

An overview of the related dependency libraries:
- `lxml`: an XML and HTML parser.
- `parsel`: an HTML/XML data extraction library built on top of lxml.
- `w3lib`: a multi-purpose helper for dealing with URLs and web page encodings.
- `twisted`: an asynchronous networking framework.
- `cryptography` and `pyOpenSSL`: to deal with various network-level security needs.

Create a project and look at its layout:

```shell
$ scrapy startproject tutorial
$ tree tutorial/
scrapy.cfg            # deploy configuration file
tutorial/             # project's Python module, you'll import your code from here
    __init__.py
    items.py          # project items definition file
    pipelines.py      # project pipelines file
    settings.py       # project settings file
    spiders/          # a directory where you'll later put your spiders
        __init__.py
```
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). A spider must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
```shell
$ vim tutorial/spiders/quotes_spider.py
```

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:  # response.body is bytes, so open in binary mode
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
- `name`: identifies the spider. It must be unique within a project.
- `start_requests()`: must return an iterable of Requests (you can return a list of requests or write a generator) which the spider will begin to crawl from.

Instead of implementing `start_requests()`, you can just define a `start_urls` class attribute with a list of URLs, which will use the default implementation of `start_requests()` and use `parse()` as the default callback of each Request, like this:
```python
start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
```
- `parse()`: a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of `TextResponse` that holds the page content and has further helpful methods to handle it.

The `parse()` method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (`Request`) from them.
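For instance, a minimal sketch of such a `parse()` method (illustrative, assuming the quotes.toscrape.com markup used throughout this tutorial) could look like:

```python
# Illustrative sketch: yield scraped data as dicts, then follow the
# pagination link by yielding a new Request back to this same callback.
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
        }

    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```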
Scrapy schedules the `scrapy.Request` objects returned by the `start_requests` method of the spider. Upon receiving a response for each one, it instantiates `Response` objects and calls the callback method associated with the request (in this case, the `parse` method), passing the response as argument.
```shell
# go to the project's top level directory
$ cd tutorial
$ scrapy crawl quotes
```
The best way to learn how to extract data with Scrapy is trying selectors in the Scrapy shell.
```shell
$ scrapy shell 'http://quotes.toscrape.com/page/1/'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x3e0add0>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x3e0ae50>
[s]   spider     <DefaultSpider 'default' at 0x41dfcd0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True])  Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                   Fetch a scrapy.Request and update local objects
[s]   shelp()                      Shell help (print this help)
[s]   view(response)               View response in a browser
```
When selecting elements using CSS with the response object, the result is a list-like object called `SelectorList`, which represents a list of `Selector` objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
>>> response.css("title") [<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
Extract the text from the title:
>>> response.css("title::text").extract() [u'Quotes to Scrape'] >>> response.css("title").extract() [u'<title>Quotes to Scrape</title>'] # may cause an IndexError and return None when it doesn't find any element matching the selection. >>> response.css('title::text').extract_first() u'Quotes to Scrape' >>> response.css('title::text')[0].extract() u'Quotes to Scrape'
Besides the `extract()` and `extract_first()` methods, you can also use the `re()` method to extract using regular expressions.
```python
>>> response.css('title::text').re(r'Quotes.*')
[u'Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
[u'Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
[u'Quotes', u'Scrape']
```
`view(response)` will open the response page from the shell in your web browser. You will find that Firebug and Selector Gadget are amazing tools for finding CSS selectors.
XPath expressions are very powerful, and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood.
Scrapy selectors also support using XPath expressions:
```python
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'Quotes to Scrape'>]
>>> response.xpath('//title/text()').extract()
[u'Quotes to Scrape']
```
XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you're able to select things like the link that contains the text 'Next Page'.
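For example, a quick sketch in the Scrapy shell (assuming the "Next" link text used on quotes.toscrape.com):

```python
# Select the pagination link by its text content, something CSS alone cannot do.
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
u'/page/2/'
```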
We recommend this tutorial to learn XPath through examples, and this tutorial to learn "how to think in XPath".
JSON:

Scrapy appends to a given file instead of overwriting its contents, so remove the file before running the command a second time.

```shell
$ scrapy crawl quotes -o quotes.json
```
JSON Lines:

The JSON Lines format is useful because it's stream-like: you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice.

```shell
$ scrapy crawl quotes -o quotes.jl
```
Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
```python
>>> response.css('li.next a').extract_first()
u'<a href="/page/2/">Next <span aria-hidden="true">\u2192</span></a>'
```
Scrapy supports a CSS extension that lets you select the attribute contents.
```python
>>> response.css('li.next a::attr(href)').extract_first()
u'/page/2/'
```
Build a full absolute URL using the `urljoin()` method (since the links can be relative), and yield a new request to the next page, registering itself as the callback to handle the data extraction for the next page.
```python
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
```
`response.follow()`: a shortcut for creating Requests:
```python
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
```
`response.follow()` supports relative URLs directly. Note that `response.follow()` just returns a Request instance; you still have to yield this Request.
You can also pass a selector to `response.follow()` instead of a string; this selector should extract the necessary attributes.
```python
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
```
For `<a>` elements there is a shortcut: `response.follow()` uses their href attribute automatically.
```python
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
```
```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css("li.next a::attr(href)"):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
```
Even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting `DUPEFILTER_CLASS`.
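For reference, a sketch of how that setting appears in settings.py (the default filter is shown; you would point it at your own subclass to change the behaviour):

```python
# settings.py -- the default duplicate-request filter; replace the path
# with your own RFPDupeFilter subclass to customise deduplication.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
```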
You can check out the `CrawlSpider` class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of.
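A minimal sketch of what that can look like (this spider is illustrative, not part of the tutorial; its single rule just follows the pagination links on quotes.toscrape.com):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    start_urls = ['http://quotes.toscrape.com/']

    # the rules engine: extract links matched by the LinkExtractor, send each
    # downloaded response to parse_page, and keep following further links
    rules = (
        Rule(LinkExtractor(restrict_css='li.next'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```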
You can provide command line arguments to your spiders by using the `-a` option when running them:
```shell
$ scrapy crawl quotes_tags -o quotes-humor.json -a tag=humor
```
These arguments are passed to the spider's `__init__` method and become spider attributes by default. In this example, the value provided for the `tag` argument will be available via `self.tag`. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
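A minimal sketch of how such a spider's `start_requests()` might build the URL (the `getattr` lookup with a default is an assumption here, so the spider still works when no `-a tag=...` is given):

```python
def start_requests(self):
    url = 'http://quotes.toscrape.com/'
    tag = getattr(self, 'tag', None)  # set from the -a tag=... argument, if any
    if tag is not None:
        url = url + 'tag/' + tag
    yield scrapy.Request(url, self.parse)
```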
A complete spider can also live in a single file and be run without creating a project, using `scrapy runspider`:

```shell
$ cat quotes_spider.py
```

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

```shell
$ scrapy runspider quotes_spider.py -o quotes.json
```
- `start_urls`: where the crawl starts.
- `parse`: the method that parses the responses downloaded for the `start_urls`.
Requests are scheduled and processed asynchronously.
Using feed exports to generate the JSON file, you can easily change the export format (XML, CSV, etc.) or the storage backend (FTP, S3, etc.). You can also write an item pipeline to store the items in a database.
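As a rough sketch of that last idea (not part of the tutorial; the SQLite file and table names are made up for illustration), an item pipeline is just a class with a `process_item()` method:

```python
# tutorial/pipelines.py (sketch)
import sqlite3


class SQLitePipeline(object):
    def open_spider(self, spider):
        # called when the spider is opened: create the database and table
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)')

    def close_spider(self, spider):
        # called when the spider is closed: flush and release the connection
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # called for every item yielded by the spider
        self.conn.execute('INSERT INTO quotes (text, author) VALUES (?, ?)',
                          (item.get('text'), item.get('author')))
        return item
```

It would then be enabled in settings.py with something like `ITEM_PIPELINES = {'tutorial.pipelines.SQLitePipeline': 300}`.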