Building a Website Scraper with Scrapy

Date: 2020-01-11

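Scrapy is an application framework for crawling websites and extracting structured data, useful in a wide range of applications such as data mining, information processing, and historical archiving. It is widely used in industry.
In this article we will build a spider that crawls Hacker News and stores the data in a database in the shape we want.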

Installation

We will need Scrapy and BeautifulSoup for screen scraping, and SQLAlchemy for storing the data.
If you are on Ubuntu or another Unix-like distribution, you can install Scrapy with pip:

pip install Scrapy

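If you are on Windows, you will need to install some of Scrapy's dependencies by hand.
Windows users need pywin32, pyOpenSSL, Twisted, lxml, and zope.interface. You can download precompiled builds of these packages for an easy install.
See the official documentation for detailed instructions.
Once everything is installed, verify the installation by entering the following at the Python prompt: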

>>> import scrapy
>>>

If the import prints nothing and raises no error, your installation is ready.
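The same check can be scripted. Below is a minimal sketch in Python 3 that reports which top-level modules cannot be found, without importing them. The helper names is_installed and missing_modules are our own, and the module list in the usage comment is only an assumption about what you chose to install.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located on the import path."""
    # find_spec returns None when no importable module of that name exists.
    return importlib.util.find_spec(module_name) is not None


def missing_modules(names):
    """Return the names from `names` that cannot be located."""
    return [name for name in names if not is_installed(name)]


# Example usage (assumed module names for this tutorial):
# print(missing_modules(["scrapy", "sqlalchemy"]))
```

An empty list means every module in the list was found.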

Setting up HNScrapy

To create a new project, enter the following command in a terminal:

$ scrapy startproject hn

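This creates a set of files to help you get started more easily. cd into the hn directory and open it in your favorite text editor.
In items.py, Scrapy needs us to define a container for the data the spider scrapes. If you have worked through the Django tutorial, you will find that items.py resembles models.py in Django.
You will find that the class HnItem already exists; it inherits from Item, a predefined object that Scrapy provides for us.
Let's add the fields we actually want to scrape. We assign Field() to each of them so that we can attach metadata to the field for Scrapy.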

from scrapy.item import Item, Field

class HnItem(Item):
    title = Field()
    link = Field()

Nothing hard here; that's all there is to it. Unlike Django, Scrapy has no other field types, so we are stuck with Field().
Scrapy's Item class behaves much like a Python dictionary: you can get keys and values from it.
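To make the dictionary comparison concrete, here is a toy stand-in written in plain Python 3. SketchItem is a hypothetical name and this is not Scrapy's implementation; it only mimics the two behaviors that matter here: key/value access, and refusing keys that were not declared as fields.

```python
class SketchItem(dict):
    """Toy stand-in for an Item with two declared fields (not Scrapy code)."""

    # Declared field names, playing the role of title = Field(), link = Field().
    fields = ("title", "link")

    def __setitem__(self, key, value):
        # Only declared fields may be assigned.
        if key not in self.fields:
            raise KeyError("SketchItem does not support field: %s" % key)
        super().__setitem__(key, value)


item = SketchItem()
item["title"] = "PostgreSQL Exercises"
item["link"] = "http://pgexercises.com/"

# Reading works exactly as with a dictionary.
declared_keys = sorted(item.keys())
```

Assigning item["author"] in this sketch raises KeyError, because author was never declared.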

Writing the spider

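Create a file named hn_spider.py in the spiders folder. This is where the magic happens: it is where we tell Scrapy how to find the exact data we are looking for. As you might expect, a spider is specific to one particular site and may not work on others.
In hn_spider.py we will define a class, HnSpider, along with some common attributes such as name and start_urls.
First, we set up the HnSpider class and a few attributes (variables defined inside the class, also known as fields). We inherit from Scrapy's BaseSpider: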

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

class HnSpider(BaseSpider):
    name = 'hn'
    allowed_domains = []
    start_urls = ['http://news.ycombinator.com']

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//td[@class="title"]')

        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()

            print title, link

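The first few variables are self-explanatory: name defines the spider's name, allowed_domains lists the base URLs of the domains the spider is allowed to crawl, and start_urls lists the URLs where the spider starts crawling. Subsequent URLs are derived from the data the spider downloads from start_urls.
Next, Scrapy uses XPath selectors to pull data out of the site; a given XPath selects a specific part of the HTML. As the Scrapy documentation puts it, "XPath is a language for selecting nodes in XML documents, which can also be used with HTML". You can read more about XPath selectors in that documentation.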

Note

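When scraping your own sites and trying to work out XPaths, Chrome's developer tools let you inspect HTML elements and copy the XPath of any element you want. They also let you test an XPath with $x in the JavaScript console, for example $x("//img"). Although this tutorial does not go into it, Firefox has an add-on, FirePath, that can likewise edit, inspect, and generate XPaths.
We generally tell Scrapy where to start looking for data by means of a defined XPath. Let's browse to the Hacker News site, right-click, and choose "View page source".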

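You will see that sel.xpath('//td[@class="title"]') resembles the HTML in the page source. The Scrapy documentation explains how to construct XPaths and how to use relative XPaths. In essence, '//td[@class="title"]' selects every <td> element whose class attribute is title; the relative paths a/text() and a/@href then pick out the text and the href of the <a> element directly inside each of those cells.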
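The selection logic can be tried without Scrapy. The sketch below uses Python 3's standard xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so it is an illustration rather than a replacement for Scrapy's Selector. The sample markup and the function name extract_links are made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample that imitates the structure of the listing.
SAMPLE = """
<table>
  <tr>
    <td class="title"><a href="http://pgexercises.com/">PostgreSQL Exercises</a></td>
    <td class="subtext">42 points</td>
  </tr>
  <tr>
    <td class="title"><a href="http://sivers.org/ws">Why was this secret?</a></td>
  </tr>
</table>
"""


def extract_links(markup):
    """Return (title, link) pairs from every <td class="title"> cell."""
    root = ET.fromstring(markup.strip())
    results = []
    # Step 1: every <td> whose class attribute equals "title".
    for cell in root.findall(".//td[@class='title']"):
        # Step 2: relative to that cell, each direct <a> child.
        for anchor in cell.findall("a"):
            results.append((anchor.text, anchor.get("href")))
    return results
```

The cell with class subtext is skipped because the attribute predicate does not match it.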

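The parse() method takes one argument: response. Hey, wait a minute, what is this self doing there? It looks like there are two arguments!
Every instance method (parse() is an instance method here) receives a reference to the instance itself as its first argument. By convention it is called "self".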
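A small sketch shows what happens to self. The class Greeter and its method are invented for this illustration; the point is only that calling a method on an instance and calling it on the class with the instance passed explicitly are the same call.

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self, other):
        # `self` is the instance the method was called on.
        return "%s greets %s" % (self.name, other)


g = Greeter("spider")

# Usual form: Python passes `g` as `self` automatically.
bound_result = g.greet("response")

# Equivalent explicit form: we pass the instance ourselves.
explicit_result = Greeter.greet(g, "response")
```

Both calls return the same string, which is why parse(self, response) is called with a single argument.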
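The response argument is what the spider gets back after making a request to Hacker News. We parse that response with our XPaths.
Now we will use BeautifulSoup to do the parsing. Beautiful Soup will parse anything you give it.
Download BeautifulSoup, create a soup.py file in the spider directory, and copy the code into it.
In hn_spider.py, import BeautifulSoup and HnItem from items.py, and change the parse method as follows.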

from soup import BeautifulSoup as bs
from scrapy.http import Request
from scrapy.spider import BaseSpider
from hn.items import HnItem

class HnSpider(BaseSpider):
    name = 'hn'
    allowed_domains = []
    start_urls = ['http://news.ycombinator.com']

    def parse(self, response):
        if 'news.ycombinator.com' in response.url:
            soup = bs(response.body)
            items = [(x[0].text, x[0].get('href')) for x in
                     filter(None, [
                         x.findChildren() for x in
                         soup.findAll('td', {'class': 'title'})
                     ])]

            for item in items:
                print item
                hn_item = HnItem()
                hn_item['title'] = item[0]
                hn_item['link'] = item[1]
                try:
                    yield Request(item[1], callback=self.parse)
                except ValueError:
                    yield Request('http://news.ycombinator.com/' + item[1], callback=self.parse)

                yield hn_item

We iterate over items and assign the scraped data to title and link.
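The spider above falls back to prefixing the site address when Request raises ValueError for a link without a scheme. As an alternative sketch, not the approach used in this tutorial, relative links can be resolved up front with the standard library. The code is Python 3 (urllib.parse; on Python 2 the same function lives in the urlparse module), and the names BASE and absolute_link are our own.

```python
from urllib.parse import urljoin

# Assumed base address of the listing page.
BASE = "https://news.ycombinator.com/"


def absolute_link(href, base=BASE):
    """Resolve a scraped href against the page it was found on."""
    # Absolute URLs are returned unchanged; relative ones are joined to base.
    return urljoin(base, href)
```

For example, absolute_link("item?id=6893866") gives "https://news.ycombinator.com/item?id=6893866", while an absolute link such as "http://sivers.org/ws" comes back untouched.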

Now try crawling the Hacker News domain; you will see the links and titles printed to your console.

scrapy crawl hn

2013-12-12 16:57:06+0530 [scrapy] INFO: Scrapy 0.20.2 started (bot: hn)
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Optional features available: ssl, http11, django
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'hn.spiders', 'SPIDER_MODULES': ['hn.spiders'], 'BOT_NAME': 'hn'}
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware
, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled item pipelines:
2013-12-12 16:57:06+0530 [hn] INFO: Spider opened
2013-12-12 16:57:06+0530 [hn] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-12-12 16:57:07+0530 [hn] DEBUG: Redirecting (301) to <GET https://news.ycombinator.com/> from <GET http://news.ycombinator.com>
2013-12-12 16:57:08+0530 [hn] DEBUG: Crawled (200) <GET https://news.ycombinator.com/> (referer: None)
(u'Caltech Announces Open Access Policy | Caltech', u'http://www.caltech.edu/content/caltech-announces-open-access-policy')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.caltech.edu/content/caltech-announces-open-access-policy',
         'title': u'Caltech Announces Open Access Policy | Caltech'}
(u'Coinbase Raises $25 Million From Andreessen Horowitz', u'http://blog.coinbase.com/post/69775463031/coinbase-raises-25-million-from-andreessen-horowitz')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://blog.coinbase.com/post/69775463031/coinbase-raises-25-million-from-andreessen-horowitz',
         'title': u'Coinbase Raises $25 Million From Andreessen Horowitz'}
(u'Backpacker stripped of tech gear at Auckland Airport', u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475',
         'title': u'Backpacker stripped of tech gear at Auckland Airport'}
(u'How I introduced a 27-year-old computer to the web', u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/',
         'title': u'How I introduced a 27-year-old computer to the web'}
(u'Show HN: Bitcoin Pulse - Tracking Bitcoin Adoption', u'http://www.bitcoinpulse.com')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.bitcoinpulse.com',
         'title': u'Show HN: Bitcoin Pulse - Tracking Bitcoin Adoption'}
(u'Why was this secret?', u'http://sivers.org/ws')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://sivers.org/ws', 'title': u'Why was this secret?'}
(u'PostgreSQL Exercises', u'http://pgexercises.com/')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://pgexercises.com/', 'title': u'PostgreSQL Exercises'}
(u'What it feels like being an ipad on a stick on wheels', u'http://labs.spotify.com/2013/12/12/what-it-feels-like-being-an-ipad-on-a-stick-on-wheels/')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://labs.spotify.com/2013/12/12/what-it-feels-like-being-an-ipad-on-a-stick-on-wheels/',
         'title': u'What it feels like being an ipad on a stick on wheels'}
(u'Prototype ergonomic mechanical keyboards', u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html',
         'title': u'Prototype ergonomic mechanical keyboards'}
(u'H5N1', u'http://blog.samaltman.com/h5n1')
.............
.............
.............
2013-12-12 16:58:41+0530 [hn] INFO: Closing spider (finished)
2013-12-12 16:58:41+0530 [hn] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 2,
         'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
         'downloader/request_bytes': 22401,
         'downloader/request_count': 71,
         'downloader/request_method_count/GET': 71,
         'downloader/response_bytes': 1482842,
         'downloader/response_count': 69,
         'downloader/response_status_count/200': 61,
         'downloader/response_status_count/301': 4,
         'downloader/response_status_count/302': 3,
         'downloader/response_status_count/404': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 12, 12, 11, 28, 41, 289000),
         'item_scraped_count': 63,
         'log_count/DEBUG': 141,
         'log_count/INFO': 4,
         'request_depth_max': 2,
         'response_received_count': 62,
         'scheduler/dequeued': 71,
         'scheduler/dequeued/memory': 71,
         'scheduler/enqueued': 71,
         'scheduler/enqueued/memory': 71,
         'start_time': datetime.datetime(2013, 12, 12, 11, 27, 6, 843000)}
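2013-12-12 16:58:41+0530 [hn] INFO: Spider closed (finished)

You will see a large amount of output in the terminal, around 400 lines (the output above has been shortened for readability).
You can write the output as JSON with this small command:

$ scrapy crawl hn -o items.json -t json

We have now implemented our spider for the items we are looking for.

### Saving the scraped data

Our first step is to create a database to hold the scraped data. Open `settings.py` and define the database configuration as shown below.

BOT_NAME = 'hn'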

SPIDER_MODULES = ['hn.spiders']
NEWSPIDER_MODULE = 'hn.spiders'

DATABASE = {'drivername': 'xxx',
            'username': 'yyy',
            'password': 'zzz',
            'database': 'vvv'}

Next, create a `models.py` file in the `hn` directory. We will use SQLAlchemy as the ORM framework to build the database model.
First, we need a function that connects to the database. For that, we import SQLAlchemy and our `settings.py` file.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings

DeclarativeBase = declarative_base()

def db_connect():
    return create_engine(URL(**settings.DATABASE))

def create_hn_table(engine):
    DeclarativeBase.metadata.create_all(engine)

class Hn(DeclarativeBase):
    __tablename__ = "hn"

    id = Column(Integer, primary_key=True)
    title = Column('title', String(200))
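    link = Column('link', String(200))

Before moving on, a word about the two asterisks in the `URL()` call: `**settings.DATABASE`. First, we access the database settings through the variable in `settings.py`. The `**` unpacks every key/value pair in the `DATABASE` dictionary into keyword arguments. `URL`, a constructor defined in SQLAlchemy, maps those keys and values to a URL that SQLAlchemy understands and uses to connect to our database.
`URL()` then assembles the pieces into a URL of the following form, which `create_engine()` reads (in this example the placeholders stand for user name, password, host, and database name, in that order):

'postgresql://xxx:yyy@zzz/vvv'

The rest of `models.py` creates the table for our ORM. We import `declarative_base()` from SQLAlchemy to map the class that defines our table structure to Postgres, define a function that creates the tables from the metadata, and declare the table and columns that will store our data.

### Pipeline management

We have built a spider that fetches and parses the HTML, and we have set up a database to store the data. Now we need a pipeline to connect the two.
Open `pipelines.py` and import SQLAlchemy's sessionmaker, which binds to the database (creating the connection), along with our models.

from sqlalchemy.orm import sessionmaker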
from models import Hn, db_connect, create_hn_table

class HnPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_hn_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        hn = Hn(**item)
        session.add(hn)
        session.commit()
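        return item

Here we create a class, `HnPipeline()`. Its constructor, `def __init__(self)`, initializes the class by defining the engine, creating the hn table, and binding a session factory to the database through that engine.
We then define `process_item()`, which takes the arguments `item` and `spider`. We open a session with the database and unpack the item's data into our `Hn()` model. Calling `session.add()` adds the Hn object to the session; at this point it has not been saved to the database yet and still lives at the SQLAlchemy level. Calling `session.commit()` then writes it to the database and commits the transaction.

We are almost done. We still need to add a variable to `settings.py` that tells the spider where to find our pipeline when it processes data.
So add one more variable, `ITEM_PIPELINES`, to `settings.py`:

ITEM_PIPELINES = {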
    'hn.pipelines.HnPipeline':300
}

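This is the path to the directory/module of the pipeline we just defined.
Now we can put all the scraped data into our database. Let's see what we get:
run the crawl command again and wait until all processing has finished.
Hooray! We have now successfully stored our scraped data in the database.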
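The pipeline stores nothing until the commit happens, and that distinction is easy to try out. The sketch below does not use SQLAlchemy; it swaps in Python 3's built-in sqlite3 module with an in-memory database to show the same idea: a pending insert can still be discarded, while a committed one stays. The table layout mirrors the hn table, and the function name demo_commit and the sample row are our own.

```python
import sqlite3


def demo_commit():
    """Return the row counts seen after a rollback and after a commit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hn (id INTEGER PRIMARY KEY, title TEXT, link TEXT)"
    )
    conn.commit()
    row = ("PostgreSQL Exercises", "http://pgexercises.com/")

    # Pending insert, comparable to session.add(): not yet permanent.
    conn.execute("INSERT INTO hn (title, link) VALUES (?, ?)", row)
    conn.rollback()  # discard the pending insert
    after_rollback = conn.execute("SELECT COUNT(*) FROM hn").fetchone()[0]

    # Same insert again, this time made permanent with commit().
    conn.execute("INSERT INTO hn (title, link) VALUES (?, ?)", row)
    conn.commit()
    conn.rollback()  # nothing pending any more, so the row survives
    after_commit = conn.execute("SELECT COUNT(*) FROM hn").fetchone()[0]

    conn.close()
    return after_rollback, after_commit
```

The first count is 0 because the insert was rolled back before any commit; the second is 1 because the commit made the row permanent.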

Scheduled jobs

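It would be annoying to have to run this script by hand at regular intervals, so this is where a cron job comes in.
A cron job runs automatically at whatever time you specify. But! It only runs while your computer is on (not asleep or shut down) and, for this particular script, connected to the internet. To run the cron job regardless of the state of your own computer, host the hn code, the bash script, and the cron job on a separate server that is always "on".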
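As one possible shape for the script that cron calls, here is a minimal Python 3 sketch instead of a bash script. The function names, the file name run_crawl.py, the paths, and the hourly schedule in the comment are all hypothetical; the only thing taken from this tutorial is the command scrapy crawl hn.

```python
import subprocess


def crawl_command(spider_name="hn"):
    """Build the command line that starts one crawl."""
    return ["scrapy", "crawl", spider_name]


def run_crawl(project_dir, spider_name="hn"):
    """Run the crawl from inside the project directory; return the exit code."""
    # scrapy must be started from the directory that holds scrapy.cfg.
    completed = subprocess.run(crawl_command(spider_name), cwd=project_dir)
    return completed.returncode


# Hypothetical usage from a file named run_crawl.py:
#     run_crawl("/path/to/hn")
#
# Hypothetical crontab entry that runs it at the start of every hour:
#     0 * * * * /usr/bin/python3 /path/to/run_crawl.py
```

Keeping the command in its own function makes it easy to check what will be run before handing the script to cron.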