My First Web Scraper

Date: 2019-11-09

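The recently popular TV series 《全職高手》 is adapted from a novel, so let's scrape the novel for practice~~ In this article we scrape the title and the content of the first chapter and save them to a local file.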

Create a Scrapy project

$ scrapy startproject first_scrapy

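Once the project is created, add a new file named novel.py under the spiders directory; we'll write the spider in this file shortly.

Writing the first spider

Fetching the page source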

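import scrapy

class NovelSpider(scrapy.Spider):

    # Name used when launching the spider
    name = "novel"

    # URL of the page whose source will be fetched
    start_urls = ["https://www.biquge5200.cc/0_857/651708.html"]

    def parse(self, response):
        # Get the HTML source of the page
        html_str = response.css('html').extract_first()
        # Save it to a local file, source.html (explicit UTF-8 so Chinese text is written reliably)
        with open('source.html', 'w', encoding='utf-8') as f:
            f.write(html_str)
        self.log('File saved successfully')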

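name: the spider's name, needed when launching the spider;
start_urls: the URLs of the pages whose source should be fetched;
parse: Scrapy requests the URL for you and passes the downloaded page in as the response parameter; you then work on response to extract the data you need.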

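Run the spider

$ scrapy crawl novel    # "novel" is the name value from the code above

When it finishes, a source.html file is created in the project root directory; open it and you'll see that the page source has been scraped.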

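Extracting data

Scrapy provides a debugging environment and CSS selectors to help us quickly and accurately pull the data we need out of the HTML source.

The debugging environment

Scrapy provides a debugging environment (the Scrapy shell) that makes it easy to test whether the extracted data is correct.

Enter the command below 👇 on the command line; scrapy shell is the fixed part, followed by the URL of the page to scrape.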

$ scrapy shell https://www.biquge5200.cc/0_857/651708.html

After you run it, a pile of log output is printed; we only need to care about the last line, the >>> prompt.

2019-08-01 09:22:29 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-08-01 09:22:29 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Sep 18 2018, 18:47:22) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2019-08-01 09:22:29 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2019-08-01 09:22:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2019-08-01 09:22:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-01 09:22:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-01 09:22:29 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-01 09:22:29 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-08-01 09:22:29 [scrapy.core.engine] INFO: Spider opened
2019-08-01 09:22:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.biquge5200.cc/0_857/651708.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10bc049e8>
[s]   item       {}
[s]   request    <GET https://www.biquge5200.cc/0_857/651708.html>
[s]   response   <200 https://www.biquge5200.cc/0_857/651708.html>
[s]   settings   <scrapy.settings.Settings object at 0x10bc04748>
[s]   spider     <DefaultSpider 'default' at 0x10bf0ac88>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

CSS selectors

Say we want to extract the contents of the title tag from https://www.biquge5200.cc/0_857/651708.html. Type response.css('title') after the >>> prompt on the last line and press Enter.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>第一章 被驅逐的高手_全職高手_筆趣閣</title>'>]

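You'll notice that what comes back is a list of Selector objects, not the data we want. Scrapy provides methods for extracting further; extract() converts the Selectors into the familiar HTML tag strings.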

>>> response.css('title').extract()
['<title>第一章 被驅逐的高手_全職高手_筆趣閣</title>']

👆 The HTML tags we got back are still in a list; appending [0] is an easy way to get the first element.

>>> response.css('title').extract()[0]
'<title>第一章 被驅逐的高手_全職高手_筆趣閣</title>'

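Scrapy provides another method, extract_first(), which also returns the first element of the list and is simpler to write. Unlike [0], it returns None instead of raising IndexError when nothing matches. At this point we have successfully extracted the title element, but the surrounding title tags are still included, so let's keep refining.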

>>> response.css('title').extract_first()
'<title>第一章 被驅逐的高手_全職高手_筆趣閣</title>'

Appending ::text to title extracts the content wrapped by the HTML tag. With that, we have successfully extracted the data we need.

>>> response.css('title::text').extract_first()
'第一章 被驅逐的高手_全職高手_筆趣閣'

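::text extracts the content wrapped by an HTML tag. What if we want one of the tag's own attributes instead, such as the href value of the a tag below 👇?

<a href="https://xxx.yyy.ccc">link</a>

::attr(attribute name) extracts an attribute of the HTML tag itself.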

>>> response.css('a::attr(href)').extract_first()

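Extracting the chapter title and content

With the CSS selectors covered above, getting the chapter title and the chapter content should no longer be difficult~~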

import scrapy

class NovelSpider(scrapy.Spider):

    name = "novel"

    start_urls = ["https://www.biquge5200.cc/0_857/651708.html"]

    def parse(self, response):
        # Get the chapter title
        title = response.css('div.bookname h1::text').extract_first()
        # Get the chapter content: one string per <p>, joined with blank lines
        content = '\n\n'.join(response.css('div#content p::text').extract())
        print("title ", title)
        print("content ", content)

Run the spider

$ scrapy crawl novel

You can see that the chapter title and the chapter content are each printed.

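Saving to a local file

In the first step, fetching the page source, we already saved to a local file once; back then we saved the page source as source.html.

Now let's save the chapter title and the chapter content to a novel.txt file.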

import scrapy

class NovelSpider(scrapy.Spider):

    name = "novel"

    start_urls = ["https://www.biquge5200.cc/0_857/651708.html"]

    def parse(self, response):
        # default='' keeps f.write() from failing on None if the selector matches nothing
        title = response.css('div.bookname h1::text').extract_first(default='')
        content = '\n\n'.join(response.css('div#content p::text').extract())

        # Save to a local file, novel.txt (explicit UTF-8 for the Chinese text)
        with open('novel.txt', 'w', encoding='utf-8') as f:
            f.write(title)
            f.write('\n')
            f.write(content)
        self.log('File saved successfully')

Run the spider and you'll see a novel.txt file in the project root directory: the chapter title and the chapter content have been saved to the file.
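The file-writing step can also be pulled out into a small helper, sketched here with the standard library only. The function name save_chapter and the temporary directory are hypothetical choices for the demo. The sketch assumes you want explicit UTF-8 output and that the title may be None when a selector matches nothing.

```python
import tempfile
from pathlib import Path


def save_chapter(path, title, content):
    """Write the title, a newline, then the content to path as UTF-8."""
    # A missing title (None) is written as an empty first line.
    text = f"{title or ''}\n{content}"
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path


# Demo with made-up data, written to a temporary directory instead of the project root.
with tempfile.TemporaryDirectory() as tmp:
    out = save_chapter(Path(tmp) / "novel.txt", "第一章", "First.\n\nSecond.")
    saved = out.read_text(encoding="utf-8")
    out_no_title = save_chapter(Path(tmp) / "untitled.txt", None, "Only content.")
    saved_no_title = out_no_title.read_text(encoding="utf-8")

print(saved)
```

Inside parse, the equivalent call would be save_chapter('novel.txt', title, content).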