Getting started with Python Scrapy: build a crawler in 10 minutes


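Before TensorFlow took off, many people learned Python because they wanted to write web crawlers. Python, with its rich collection of third-party libraries, is indeed well suited to this kind of work.
Scrapy is a crawler framework that is easy to learn and easy to use. The web is varied and complex enough that many crawlers still need a lot of hand-written code, but a reasonably complete, well-balanced base framework saves a great deal of work.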

Installing the framework

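Using someone else's website as the crawl target seemed impolite, so this walkthrough starts from scratch and crawls this site as a simple example.
Out of habit, this article uses Python 2 as the working environment throughout.
Installing Scrapy is very simple and takes a single command, provided you already have the pip package manager: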

pip install scrapy

Creating a crawler project

A crawler project can contain multiple spider modules, so for most people a single project is usually enough.
Creating the project also takes just one command:

#scrapy startproject <project name>, for example:
scrapy startproject formoon

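After the command runs, it creates a formoon folder in the current directory and sets up a crawler project inside it from the basic template.
Running scrapy with no arguments prints its help text, and scrapy <subcommand> --help shows more detailed help.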

Adding a spider to the project

First, change into the project directory:

cd formoon

Then create the first spider in the project:

#scrapy genspider <spider name> <domain the spider applies to>, for example:
scrapy genspider pages formoon.github.io

The command above creates a Python file, pages.py, under <working directory>/formoon (the project directory)/formoon/spiders/. Its default contents are:

# -*- coding: utf-8 -*-
import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'
    allowed_domains = ['formoon.github.io']
    start_urls = ['http://formoon.github.io/']
    
    def parse(self, response):
        pass

Writing the spider

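Suppose the requirement is this: the spider crawls the whole https://formoon.github.io site, finds every article, and lists each article's title, link, and publication date.
As usual, the finished code is pasted below, with detailed explanations in the comments: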

# -*- coding: utf-8 -*-
import scrapy

class PagesSpider(scrapy.Spider):
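    name = 'pages'  # name of the spider; do not change it
    allowed_domains = ['formoon.github.io'] # domain the spider is allowed to crawl
    start_urls = ['https://formoon.github.io/'] # the crawl starts from this URL; the template defaults to http, changed here to https
    # Scrapy does not rewrite links found in a page, so add a class variable used to turn relative URLs into absolute ones.
    baseurl='https://formoon.github.io'

    def parse(self, response):
        # The main difficulty in a Scrapy spider is using XPath and CSS selectors; look up tutorials online if they are unfamiliar.
        # The spider uses these selectors to locate the nodes it needs in the HTML and to pull data out of them.
        for course in response.xpath('//ul/li'):
            # article link
            href = self.baseurl+course.xpath('a/@href').extract()[0]
            # article title
            title = course.css('.card-title').xpath('text()').extract()[0]
            # article publication date
            date = course.css('.card-type.is-notShownIfHover').xpath('text()').extract()[0]
            # print the result
            print title,href,date
        for btn in response.css('.container--call-to-action').xpath('a'):
            href = btn.xpath('@href').extract()[0]
            name = btn.xpath('button/text()').extract()[0]
            # if the page has a "next page" button, recursively visit the next page
            if name == u"下一頁":  # in Python 2, non-ASCII text such as Chinese needs an explicit 'u' prefix to mark it as a unicode string
                yield scrapy.Request(self.baseurl+href,callback=self.parse)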

Running the spider

Run the spider with the following command:

scrapy crawl pages

The output looks like this:

2018-04-16 16:26:14 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: formoon)
2018-04-16 16:26:14 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14 (default, Mar  9 2018, 23:57:12) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-16 16:26:14 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'formoon.spiders', 'SPIDER_MODULES': ['formoon.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'formoon'}
2018-04-16 16:26:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-04-16 16:26:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-16 16:26:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-16 16:26:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-16 16:26:15 [scrapy.core.engine] INFO: Spider opened
2018-04-16 16:26:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-16 16:26:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-16 16:26:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://formoon.github.io/robots.txt> (referer: None)
2018-04-16 16:26:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/> (referer: None)
大恆工業相機多實例使用 https://formoon.github.io/2018/04/04/daheng-camera/ 2018-04-04
圖像識別基本算法之SURF https://formoon.github.io/2018/03/30/surf-feature/ 2018-03-30
macOS的OpenCL高性能計算 https://formoon.github.io/2018/03/23/mac-opencl/ 2018-03-23
量子計算及量子計算的模擬 https://formoon.github.io/2018/03/20/dlib-quantum-computing/ 2018-03-20
iPhone屢次輸入錯誤密碼鎖機後恢復 https://formoon.github.io/2018/03/18/IOS-Password-Recovery/ 2018-03-18
Mac版AppStore沒法下載、升級錯誤處理 https://formoon.github.io/2018/03/18/appstore-item-temporarily-unavailabel/ 2018-03-18
在Mac上使用vs-code快速上手c語言學習 https://formoon.github.io/2018/03/10/vscode-on-mac/ 2018-03-10
在Mac上使用遠程X11應用 https://formoon.github.io/2018/03/09/remote-xwindows/ 2018-03-09
Docker for mac上使用Kubernetes https://formoon.github.io/2018/03/07/docker-for-mac/ 2018-03-07
那些使人驚豔的TensorFlow擴展包和社區貢獻模型 https://formoon.github.io/2018/03/03/TensorFlow-models/ 2018-03-03
swift異步調用和對象間互動 https://formoon.github.io/2018/03/02/macos-thread-and-appdelegate/ 2018-03-02
將dylib庫嵌入macOS應用的方法 https://formoon.github.io/2018/02/27/macos-app-embed-dylib/ 2018-02-27
macOS使用內置驅動加載可讀寫NTFS分區 https://formoon.github.io/2018/02/19/macos-mount-ntfs-as-read-write/ 2018-02-19
mac應用啓動時卡死在「驗證...」 https://formoon.github.io/2018/02/16/macos-stuck-verifying-app/ 2018-02-16
CrossOver和wine https://formoon.github.io/2018/02/16/crossover-wine-copy/ 2018-02-16
2018-04-16 16:26:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/2/> (referer: https://formoon.github.io/)
Mark https://formoon.github.io/2018/02/09/hello-world/ 2018-02-09
GreenPlum沒法遠程訪問解決 https://formoon.github.io/2018/02/08/greenplum-on-centos/ 2018-02-08
rinetd:輕量級Linux端口轉發工具 https://formoon.github.io/2018/02/06/linux-port-forward-tools/ 2018-02-06
Ubuntu16包依賴故障解決 https://formoon.github.io/2018/02/05/ubuntu-apt-error-of-package-depend/ 2018-02-05
iNode環境Windows 10配置固定IP地址 https://formoon.github.io/2018/02/02/win10-inode-2-ipaddress/ 2018-02-02
Ubuntu 16.04.03 LTS 安裝CUDA/CUDNN/TensorFlow+GPU流水帳 https://formoon.github.io/2018/01/31/ubuntu-cuda-cudnn-tensorflow-setting/ 2018-01-31
resource fork, Finder information, or similar detritus not allowed https://formoon.github.io/2018/01/29/xcode-compile-error-1/ 2018-01-29
macOS webview編程 https://formoon.github.io/2018/01/29/mac-webview-program/ 2018-01-29
新麥裝機問題匯 https://formoon.github.io/2018/01/24/new-mac-install/ 2018-01-24
比特幣核心算法ECDSA電子簽名在線演示 https://formoon.github.io/2018/01/22/bitcoin-and-ecdsa/ 2018-01-22
從鍋爐工到AI專家(11)(END) https://formoon.github.io/2018/01/18/tensorFlow-series-11/ 2018-01-18
gem update 升級錯誤解決 https://formoon.github.io/2018/01/18/gem-update-error-solve/ 2018-01-18
比特幣核心概念及算法 https://formoon.github.io/2018/01/18/bitcoin-and-blockchain/ 2018-01-18
從鍋爐工到AI專家(10) https://formoon.github.io/2018/01/17/tensorFlow-series-10/ 2018-01-17
Python2中文處理紀要 https://formoon.github.io/2018/01/17/python2-chn-process/ 2018-01-17
2018-04-16 16:26:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/3/> (referer: https://formoon.github.io/pages/2/)
從鍋爐工到AI專家(9) https://formoon.github.io/2018/01/16/tensorFlow-series-9/ 2018-01-16
從鍋爐工到AI專家(8) https://formoon.github.io/2018/01/15/tensorFlow-series-8/ 2018-01-15
從鍋爐工到AI專家(7) https://formoon.github.io/2018/01/12/tensorFlow-series-7/ 2018-01-12
從鍋爐工到AI專家(6) https://formoon.github.io/2018/01/11/tensorFlow-series-6/ 2018-01-11
從鍋爐工到AI專家(5) https://formoon.github.io/2018/01/11/tensorFlow-series-5/ 2018-01-11
從鍋爐工到AI專家(4) https://formoon.github.io/2018/01/10/tensorFlow-series-4/ 2018-01-10
Octave Fontconfig報錯解決 https://formoon.github.io/2018/01/10/octave-fontconfig-warning/ 2018-01-10
5分鐘搭建一個quic服務器 https://formoon.github.io/2018/01/10/5mins-support-quic/ 2018-01-10
從鍋爐工到AI專家(3) https://formoon.github.io/2018/01/09/tensorFlow-series-3/ 2018-01-09
從鍋爐工到AI專家(2) https://formoon.github.io/2018/01/08/tensorFlow-series-2/ 2018-01-08
從鍋爐工到AI專家(1) https://formoon.github.io/2018/01/08/tensorFlow-series-1/ 2018-01-08
解決本博客在手機瀏覽器拖動卡頓問題 https://formoon.github.io/2018/01/04/solve-mobile-browser-pull-problem/ 2018-01-04
OpenCV中的照片剪裁 https://formoon.github.io/2018/01/04/opencv-photo-crop/ 2018-01-04
OpenCV中的亮度對比度調整及其自動均衡 https://formoon.github.io/2018/01/04/opencv-brightness-and-contrast/ 2018-01-04
Mac電腦C語言開發的入門帖 https://formoon.github.io/2018/01/03/c-hello-world-for-mac/ 2018-01-03
2018-04-16 16:26:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/4/> (referer: https://formoon.github.io/pages/3/)
如何看到微信小程序的源碼 https://formoon.github.io/2018/01/02/wechat-mini-app-rd/ 2018-01-02
使用人工輔助點達成更優白平衡 https://formoon.github.io/2018/01/02/opencv-whitebalance-with-point-confirm/ 2018-01-02
不使用插件創建jekyll網站sitemap https://formoon.github.io/2017/12/29/sitemap_of_jekyll/ 2017-12-29
safari11如何訪問自簽名https網站 https://formoon.github.io/2017/12/29/safari-self-signed-https/ 2017-12-29
趕個時髦,給本身的博客添加一個微信二維碼 https://formoon.github.io/2017/12/29/add-wechat-qrcode-on-your-blog/ 2017-12-29
被Docker/VMWare寵壞的孩子們,還記得QEMU嗎? https://formoon.github.io/2017/12/28/qemu-on-mac/ 2017-12-28
在網頁顯示數學公式 https://formoon.github.io/2017/12/28/mathjax-in-page/ 2017-12-28
使用SDL2顯示一張圖片 https://formoon.github.io/2017/12/28/hello-world-sdl2/ 2017-12-28
如何規範的把進程放到Linux後臺運行 https://formoon.github.io/2017/12/27/selinux-run-app-in-background/ 2017-12-27
兩種方法操做其它mac應用的窗口 https://formoon.github.io/2017/12/27/move-other-app-window-on-mac/ 2017-12-27
本身動手,裝一個液晶電視 https://formoon.github.io/2017/12/25/lcd-tv-diy/ 2017-12-25
半小時完成一個溼度溫度計 https://formoon.github.io/2017/12/25/arduino-hygrothermograph/ 2017-12-25
MacPro4,1升級到MacPro5,1 https://formoon.github.io/2017/12/22/macpro41-upgrade/ 2017-12-22
CameraBox我的講臺客戶端使用說明 https://formoon.github.io/2017/12/22/camerabox-manual/ 2017-12-22
一段使用Educast摳像混屏直播的視頻展現 https://formoon.github.io/2017/12/21/streaming-mix/ 2017-12-21
2018-04-16 16:26:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/5/> (referer: https://formoon.github.io/pages/4/)
七牛對象存儲的使用 https://formoon.github.io/2017/12/21/qiniu-storage/ 2017-12-21
Educast視頻直播控制檯使用說明 https://formoon.github.io/2017/12/21/educast-manual/ 2017-12-21
批量自動重命名音樂文件 https://formoon.github.io/2017/12/20/mp3-m4a-rename/ 2017-12-20
把Markdown文本發佈到微信公衆號文章 https://formoon.github.io/2017/12/20/markdown-to-html-and-wechat/ 2017-12-20
Javascript已加入AppleScript全家桶 https://formoon.github.io/2017/12/19/jxa-appscript/ 2017-12-19
分享一個很通用的Makefile https://formoon.github.io/2017/12/19/Makefile-skill/ 2017-12-19
在Mac電腦編譯c51程序 https://formoon.github.io/2017/12/18/c51-on-mac/ 2017-12-18
golang子進程的啓動和中止 https://formoon.github.io/2017/12/16/ubuntu-golang-stop-child-process/ 2017-12-16
Ubuntu16.04LTS appstreamcli報錯的處理 https://formoon.github.io/2017/12/15/ubuntu-appstreamcli-error/ 2017-12-15
AngularJS2+調用原有的js腳本 https://formoon.github.io/2017/12/14/angular4-ts-and-local-js/ 2017-12-14
在國內使用golang的小技巧 https://formoon.github.io/2017/12/14/use-golang-in-china/ 2017-12-14
Angular2+的兩個小技巧 https://formoon.github.io/2017/12/14/angular4-hotkeys-and-detect-browser/ 2017-12-14
Unix程序員的Win10二三事 https://formoon.github.io/2017/12/14/Unix%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84win10%E4%BA%8C%E4%B8%89%E4%BA%8B/ 2017-12-14
在Ubuntu上搭建kindle gtk開發環境 https://formoon.github.io/2017/12/13/hello-world-for-kindle/ 2017-12-13
蘋果手機上下載的文件在哪裏? https://formoon.github.io/2017/12/13/download-on-ios/ 2017-12-13
2018-04-16 16:26:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://formoon.github.io/pages/6/> (referer: https://formoon.github.io/pages/5/)
K60平臺智能車開發工做隨手記 https://formoon.github.io/2017/12/11/smart-car-k60-develope/ 2017-12-11
使用Jekyll和github搭建本身的我的博客 https://formoon.github.io/2017/12/11/setting-your-own-jekyll-blog/ 2017-12-11
使用ffmpeg作簡單的音視頻剪輯 https://formoon.github.io/2017/12/11/ffmpeg-auido-video-edit/ 2017-12-11
安裝Homebrew https://formoon.github.io/2017/12/08/install-homebrew-on-mac/ 2017-12-08
在Mac上安裝ffmpeg https://formoon.github.io/2017/12/08/install-ffmpeg-on-mac/ 2017-12-08
Hello World https://formoon.github.io/2017/12/08/hello-world/ 2017-12-08
2018-04-16 16:26:19 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-16 16:26:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1779,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 7,
 'downloader/response_bytes': 57926,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 6,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 16, 8, 26, 19, 71963),
 'log_count/DEBUG': 8,
 'log_count/INFO': 7,
 'memusage/max': 50831360,
 'memusage/startup': 50827264,
 'request_depth_max': 5,
 'response_received_count': 7,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2018, 4, 16, 8, 26, 15, 15007)}
2018-04-16 16:26:19 [scrapy.core.engine] INFO: Spider closed (finished)

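The output shows that the spider ran and returned the correct results. If you do not want to see the log output during the run, add the --nolog option, as shown below: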

> scrapy crawl pages --nolog
大恆工業相機多實例使用 https://formoon.github.io/2018/04/04/daheng-camera/ 2018-04-04
圖像識別基本算法之SURF https://formoon.github.io/2018/03/30/surf-feature/ 2018-03-30
macOS的OpenCL高性能計算 https://formoon.github.io/2018/03/23/mac-opencl/ 2018-03-23
量子計算及量子計算的模擬 https://formoon.github.io/2018/03/20/dlib-quantum-computing/ 2018-03-20
iPhone屢次輸入錯誤密碼鎖機後恢復 https://formoon.github.io/2018/03/18/IOS-Password-Recovery/ 2018-03-18
Mac版AppStore沒法下載、升級錯誤處理 https://formoon.github.io/2018/03/18/appstore-item-temporarily-unavailabel/ 2018-03-18
在Mac上使用vs-code快速上手c語言學習 https://formoon.github.io/2018/03/10/vscode-on-mac/ 2018-03-10
在Mac上使用遠程X11應用 https://formoon.github.io/2018/03/09/remote-xwindows/ 2018-03-09
Docker for mac上使用Kubernetes https://formoon.github.io/2018/03/07/docker-for-mac/ 2018-03-07
那些使人驚豔的TensorFlow擴展包和社區貢獻模型 https://formoon.github.io/2018/03/03/TensorFlow-models/ 2018-03-03
swift異步調用和對象間互動 https://formoon.github.io/2018/03/02/macos-thread-and-appdelegate/ 2018-03-02
將dylib庫嵌入macOS應用的方法 https://formoon.github.io/2018/02/27/macos-app-embed-dylib/ 2018-02-27
macOS使用內置驅動加載可讀寫NTFS分區 https://formoon.github.io/2018/02/19/macos-mount-ntfs-as-read-write/ 2018-02-19
mac應用啓動時卡死在「驗證...」 https://formoon.github.io/2018/02/16/macos-stuck-verifying-app/ 2018-02-16
CrossOver和wine https://formoon.github.io/2018/02/16/crossover-wine-copy/ 2018-02-16
Mark https://formoon.github.io/2018/02/09/hello-world/ 2018-02-09
GreenPlum沒法遠程訪問解決 https://formoon.github.io/2018/02/08/greenplum-on-centos/ 2018-02-08
rinetd:輕量級Linux端口轉發工具 https://formoon.github.io/2018/02/06/linux-port-forward-tools/ 2018-02-06
Ubuntu16包依賴故障解決 https://formoon.github.io/2018/02/05/ubuntu-apt-error-of-package-depend/ 2018-02-05
iNode環境Windows 10配置固定IP地址 https://formoon.github.io/2018/02/02/win10-inode-2-ipaddress/ 2018-02-02
Ubuntu 16.04.03 LTS 安裝CUDA/CUDNN/TensorFlow+GPU流水帳 https://formoon.github.io/2018/01/31/ubuntu-cuda-cudnn-tensorflow-setting/ 2018-01-31
resource fork, Finder information, or similar detritus not allowed https://formoon.github.io/2018/01/29/xcode-compile-error-1/ 2018-01-29
macOS webview編程 https://formoon.github.io/2018/01/29/mac-webview-program/ 2018-01-29
新麥裝機問題匯 https://formoon.github.io/2018/01/24/new-mac-install/ 2018-01-24
比特幣核心算法ECDSA電子簽名在線演示 https://formoon.github.io/2018/01/22/bitcoin-and-ecdsa/ 2018-01-22
從鍋爐工到AI專家(11)(END) https://formoon.github.io/2018/01/18/tensorFlow-series-11/ 2018-01-18
gem update 升級錯誤解決 https://formoon.github.io/2018/01/18/gem-update-error-solve/ 2018-01-18
比特幣核心概念及算法 https://formoon.github.io/2018/01/18/bitcoin-and-blockchain/ 2018-01-18
從鍋爐工到AI專家(10) https://formoon.github.io/2018/01/17/tensorFlow-series-10/ 2018-01-17
Python2中文處理紀要 https://formoon.github.io/2018/01/17/python2-chn-process/ 2018-01-17
從鍋爐工到AI專家(9) https://formoon.github.io/2018/01/16/tensorFlow-series-9/ 2018-01-16
從鍋爐工到AI專家(8) https://formoon.github.io/2018/01/15/tensorFlow-series-8/ 2018-01-15
從鍋爐工到AI專家(7) https://formoon.github.io/2018/01/12/tensorFlow-series-7/ 2018-01-12
從鍋爐工到AI專家(6) https://formoon.github.io/2018/01/11/tensorFlow-series-6/ 2018-01-11
從鍋爐工到AI專家(5) https://formoon.github.io/2018/01/11/tensorFlow-series-5/ 2018-01-11
從鍋爐工到AI專家(4) https://formoon.github.io/2018/01/10/tensorFlow-series-4/ 2018-01-10
Octave Fontconfig報錯解決 https://formoon.github.io/2018/01/10/octave-fontconfig-warning/ 2018-01-10
5分鐘搭建一個quic服務器 https://formoon.github.io/2018/01/10/5mins-support-quic/ 2018-01-10
從鍋爐工到AI專家(3) https://formoon.github.io/2018/01/09/tensorFlow-series-3/ 2018-01-09
從鍋爐工到AI專家(2) https://formoon.github.io/2018/01/08/tensorFlow-series-2/ 2018-01-08
從鍋爐工到AI專家(1) https://formoon.github.io/2018/01/08/tensorFlow-series-1/ 2018-01-08
解決本博客在手機瀏覽器拖動卡頓問題 https://formoon.github.io/2018/01/04/solve-mobile-browser-pull-problem/ 2018-01-04
OpenCV中的照片剪裁 https://formoon.github.io/2018/01/04/opencv-photo-crop/ 2018-01-04
OpenCV中的亮度對比度調整及其自動均衡 https://formoon.github.io/2018/01/04/opencv-brightness-and-contrast/ 2018-01-04
Mac電腦C語言開發的入門帖 https://formoon.github.io/2018/01/03/c-hello-world-for-mac/ 2018-01-03
如何看到微信小程序的源碼 https://formoon.github.io/2018/01/02/wechat-mini-app-rd/ 2018-01-02
使用人工輔助點達成更優白平衡 https://formoon.github.io/2018/01/02/opencv-whitebalance-with-point-confirm/ 2018-01-02
不使用插件創建jekyll網站sitemap https://formoon.github.io/2017/12/29/sitemap_of_jekyll/ 2017-12-29
safari11如何訪問自簽名https網站 https://formoon.github.io/2017/12/29/safari-self-signed-https/ 2017-12-29
趕個時髦,給本身的博客添加一個微信二維碼 https://formoon.github.io/2017/12/29/add-wechat-qrcode-on-your-blog/ 2017-12-29
被Docker/VMWare寵壞的孩子們,還記得QEMU嗎? https://formoon.github.io/2017/12/28/qemu-on-mac/ 2017-12-28
在網頁顯示數學公式 https://formoon.github.io/2017/12/28/mathjax-in-page/ 2017-12-28
使用SDL2顯示一張圖片 https://formoon.github.io/2017/12/28/hello-world-sdl2/ 2017-12-28
如何規範的把進程放到Linux後臺運行 https://formoon.github.io/2017/12/27/selinux-run-app-in-background/ 2017-12-27
兩種方法操做其它mac應用的窗口 https://formoon.github.io/2017/12/27/move-other-app-window-on-mac/ 2017-12-27
本身動手,裝一個液晶電視 https://formoon.github.io/2017/12/25/lcd-tv-diy/ 2017-12-25
半小時完成一個溼度溫度計 https://formoon.github.io/2017/12/25/arduino-hygrothermograph/ 2017-12-25
MacPro4,1升級到MacPro5,1 https://formoon.github.io/2017/12/22/macpro41-upgrade/ 2017-12-22
CameraBox我的講臺客戶端使用說明 https://formoon.github.io/2017/12/22/camerabox-manual/ 2017-12-22
一段使用Educast摳像混屏直播的視頻展現 https://formoon.github.io/2017/12/21/streaming-mix/ 2017-12-21
七牛對象存儲的使用 https://formoon.github.io/2017/12/21/qiniu-storage/ 2017-12-21
Educast視頻直播控制檯使用說明 https://formoon.github.io/2017/12/21/educast-manual/ 2017-12-21
批量自動重命名音樂文件 https://formoon.github.io/2017/12/20/mp3-m4a-rename/ 2017-12-20
把Markdown文本發佈到微信公衆號文章 https://formoon.github.io/2017/12/20/markdown-to-html-and-wechat/ 2017-12-20
Javascript已加入AppleScript全家桶 https://formoon.github.io/2017/12/19/jxa-appscript/ 2017-12-19
分享一個很通用的Makefile https://formoon.github.io/2017/12/19/Makefile-skill/ 2017-12-19
在Mac電腦編譯c51程序 https://formoon.github.io/2017/12/18/c51-on-mac/ 2017-12-18
golang子進程的啓動和中止 https://formoon.github.io/2017/12/16/ubuntu-golang-stop-child-process/ 2017-12-16
Ubuntu16.04LTS appstreamcli報錯的處理 https://formoon.github.io/2017/12/15/ubuntu-appstreamcli-error/ 2017-12-15
AngularJS2+調用原有的js腳本 https://formoon.github.io/2017/12/14/angular4-ts-and-local-js/ 2017-12-14
在國內使用golang的小技巧 https://formoon.github.io/2017/12/14/use-golang-in-china/ 2017-12-14
Angular2+的兩個小技巧 https://formoon.github.io/2017/12/14/angular4-hotkeys-and-detect-browser/ 2017-12-14
Unix程序員的Win10二三事 https://formoon.github.io/2017/12/14/Unix%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84win10%E4%BA%8C%E4%B8%89%E4%BA%8B/ 2017-12-14
在Ubuntu上搭建kindle gtk開發環境 https://formoon.github.io/2017/12/13/hello-world-for-kindle/ 2017-12-13
蘋果手機上下載的文件在哪裏? https://formoon.github.io/2017/12/13/download-on-ios/ 2017-12-13
K60平臺智能車開發工做隨手記 https://formoon.github.io/2017/12/11/smart-car-k60-develope/ 2017-12-11
使用Jekyll和github搭建本身的我的博客 https://formoon.github.io/2017/12/11/setting-your-own-jekyll-blog/ 2017-12-11
使用ffmpeg作簡單的音視頻剪輯 https://formoon.github.io/2017/12/11/ffmpeg-auido-video-edit/ 2017-12-11
安裝Homebrew https://formoon.github.io/2017/12/08/install-homebrew-on-mac/ 2017-12-08
在Mac上安裝ffmpeg https://formoon.github.io/2017/12/08/install-ffmpeg-on-mac/ 2017-12-08
Hello World https://formoon.github.io/2017/12/08/hello-world/ 2017-12-08
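
A middle ground between full logging and --nolog is to raise the log threshold in the project settings. The fragment below is a sketch for formoon/settings.py. LOG_LEVEL is a standard Scrapy setting, but choosing 'WARNING' here is only an illustrative assumption, not something the original project does.

```python
# formoon/settings.py (fragment)
# Assumed choice for illustration: hide INFO and DEBUG log lines,
# but keep warnings and errors visible while the spider runs.
LOG_LEVEL = 'WARNING'
```

With this in place, the lines printed by the spider still appear, because print writes to standard output rather than to the log.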

Going further: items and pipelines

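For most users, the steps above already cover the basic needs. Two further mechanisms, however, make a crawler cleaner to structure and more capable.
An item is the basic unit of data in Scrapy. In fact, the spider's parse method is expected to return an item object for each basic unit of data.
To use items, first edit <project directory>/formoon/items.py and define our own data structure: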

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class FormoonItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #everything above comes from the template; below are the three fields we add ourselves
    title = scrapy.Field()
    link = scrapy.Field()
    date = scrapy.Field()

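Handling each unit of data as an item has many benefits. One of the more important is that you can use Scrapy's built-in pipeline mechanism. A pipeline provides three basic hooks: before the spider starts working, for each unit of data, and after all work is finished. This divides the program very clearly and makes it easier to plug in more complex downstream processing.
Edit <project directory>/formoon/pipelines.py: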

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class FormoonPipeline(object):
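    total = 0   # our own variable, used to count the total number of articles
    # open_spider is called before the spider starts working; it is typically used to initialise the environment, open a database, open files, and so on
    def open_spider(self, spider):
        # just print a line of text as an example
        print "open spider ..."
    # The most basic method: it is called every time the spider's parse method returns an item, to process that one unit of data
    def process_item(self, item, spider):
        self.total += 1 # count the articles
        # print the data; this method is typically where you save data to a database, trigger analysis, and so on
        print("%s %s %s"%(item['date'],item['title'],item['link']))
        return item
    # Called when all links have been processed and the spider is finishing; typically used to close the database, close files, and so on.
    def close_spider(self, spider):
        # as an example, just print the summary
        print u"共",self.total,u"篇文章"
        print "close spider ..."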
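
The same three hooks can do real work instead of printing. As a rough sketch, assuming Python 3, the hypothetical pipeline below writes each item to a file as one JSON object per line. The class name JsonLinesPipeline and the file name articles.jl are invented for illustration and are not part of the original project. A pipeline is an ordinary class, so it needs nothing beyond the standard library.

```python
import json


class JsonLinesPipeline(object):
    """Hypothetical pipeline: store every item as one line of JSON."""

    path = 'articles.jl'  # assumed output file name

    def open_spider(self, spider):
        # Called once before crawling starts: open the output file.
        self.file = open(self.path, 'w', encoding='utf-8')
        self.total = 0

    def process_item(self, item, spider):
        # Called for every item the spider yields: write it and pass it on.
        line = json.dumps(dict(item), ensure_ascii=False, sort_keys=True)
        self.file.write(line + '\n')
        self.total += 1
        return item  # returning the item lets later pipelines see it too

    def close_spider(self, spider):
        # Called once after crawling ends: release the file handle.
        self.file.close()
```

Passing ensure_ascii=False keeps Chinese titles readable in the output file instead of turning them into \u escape sequences.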

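With these two definitions in place, the item and the pipeline still have to be connected. This is configured in settings.py, where the setting is commented out by default, meaning that the pipeline mechanism is normally not in use. Remove the comment markers to enable it: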

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'formoon.pipelines.FormoonPipeline': 300,
}
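
The number attached to each pipeline is an ordering value: items pass through pipelines from the lowest value to the highest, and values are conventionally chosen in the 0-1000 range. The sketch below registers a second pipeline to show the ordering. JsonLinesPipeline is a hypothetical class name used only for illustration, and Scrapy would fail to start if the class did not actually exist in formoon/pipelines.py.

```python
# formoon/settings.py (fragment)
ITEM_PIPELINES = {
    'formoon.pipelines.FormoonPipeline': 300,    # lower value, so it runs first
    'formoon.pipelines.JsonLinesPipeline': 800,  # hypothetical second pipeline, runs afterwards
}
```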

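The last step is to modify the spider so that, instead of printing data directly, it returns item objects in the standard way. To make it easy to compare with the original spider, we add a separate new spider to use the new feature: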

scrapy genspider pagesnew formoon.github.io

As described earlier, this creates pagesnew.py under <project directory>/formoon/spiders/ to hold the new spider. Edit that file:

# -*- coding: utf-8 -*-
import scrapy
from formoon.items import FormoonItem   # import our custom item
class PagesnewSpider(scrapy.Spider):
    name = 'pagesnew'
    allowed_domains = ['formoon.github.io']
    start_urls = ['https://formoon.github.io/']

    baseurl='https://formoon.github.io'

    def parse(self, response):
        for course in response.xpath('//ul/li'):
            href = self.baseurl+course.xpath('a/@href').extract()[0]
            title = course.css('.card-title').xpath('text()').extract()[0]
            date = course.css('.card-type.is-notShownIfHover').xpath('text()').extract()[0]
            # the difference starts here: instead of printing the data, create an empty item and fill it in
            item = FormoonItem()
            item['date']=date
            item['link']=href
            item['title']=title
            yield item  # return the data
        for btn in response.css('.container--call-to-action').xpath('a'):
            href = btn.xpath('@href').extract()[0]
            name = btn.xpath('button/text()').extract()[0]
            if name == u"下一頁":
                yield scrapy.Request(self.baseurl+href,callback=self.parse)

As you can see, using this mechanism also makes the structure of the spider simpler and clearer.
Try running it:

> scrapy crawl pagesnew --nolog
open spider ...
2018-04-04 大恆工業相機多實例使用 https://formoon.github.io/2018/04/04/daheng-camera/
2018-03-30 圖像識別基本算法之SURF https://formoon.github.io/2018/03/30/surf-feature/
2018-03-23 macOS的OpenCL高性能計算 https://formoon.github.io/2018/03/23/mac-opencl/
2018-03-20 量子計算及量子計算的模擬 https://formoon.github.io/2018/03/20/dlib-quantum-computing/
2018-03-18 iPhone屢次輸入錯誤密碼鎖機後恢復 https://formoon.github.io/2018/03/18/IOS-Password-Recovery/
2018-03-18 Mac版AppStore沒法下載、升級錯誤處理 https://formoon.github.io/2018/03/18/appstore-item-temporarily-unavailabel/
2018-03-10 在Mac上使用vs-code快速上手c語言學習 https://formoon.github.io/2018/03/10/vscode-on-mac/
2018-03-09 在Mac上使用遠程X11應用 https://formoon.github.io/2018/03/09/remote-xwindows/
2018-03-07 Docker for mac上使用Kubernetes https://formoon.github.io/2018/03/07/docker-for-mac/
2018-03-03 那些使人驚豔的TensorFlow擴展包和社區貢獻模型 https://formoon.github.io/2018/03/03/TensorFlow-models/
2018-03-02 swift異步調用和對象間互動 https://formoon.github.io/2018/03/02/macos-thread-and-appdelegate/
2018-02-27 將dylib庫嵌入macOS應用的方法 https://formoon.github.io/2018/02/27/macos-app-embed-dylib/
2018-02-19 macOS使用內置驅動加載可讀寫NTFS分區 https://formoon.github.io/2018/02/19/macos-mount-ntfs-as-read-write/
2018-02-16 mac應用啓動時卡死在「驗證...」 https://formoon.github.io/2018/02/16/macos-stuck-verifying-app/
2018-02-16 CrossOver和wine https://formoon.github.io/2018/02/16/crossover-wine-copy/
2018-02-09 Mark https://formoon.github.io/2018/02/09/hello-world/
2018-02-08 GreenPlum沒法遠程訪問解決 https://formoon.github.io/2018/02/08/greenplum-on-centos/
2018-02-06 rinetd:輕量級Linux端口轉發工具 https://formoon.github.io/2018/02/06/linux-port-forward-tools/
2018-02-05 Ubuntu16包依賴故障解決 https://formoon.github.io/2018/02/05/ubuntu-apt-error-of-package-depend/
2018-02-02 iNode環境Windows 10配置固定IP地址 https://formoon.github.io/2018/02/02/win10-inode-2-ipaddress/
2018-01-31 Ubuntu 16.04.03 LTS 安裝CUDA/CUDNN/TensorFlow+GPU流水帳 https://formoon.github.io/2018/01/31/ubuntu-cuda-cudnn-tensorflow-setting/
2018-01-29 resource fork, Finder information, or similar detritus not allowed https://formoon.github.io/2018/01/29/xcode-compile-error-1/
2018-01-29 macOS webview編程 https://formoon.github.io/2018/01/29/mac-webview-program/
2018-01-24 新麥裝機問題匯 https://formoon.github.io/2018/01/24/new-mac-install/
2018-01-22 比特幣核心算法ECDSA電子簽名在線演示 https://formoon.github.io/2018/01/22/bitcoin-and-ecdsa/
2018-01-18 從鍋爐工到AI專家(11)(END) https://formoon.github.io/2018/01/18/tensorFlow-series-11/
2018-01-18 gem update 升級錯誤解決 https://formoon.github.io/2018/01/18/gem-update-error-solve/
2018-01-18 比特幣核心概念及算法 https://formoon.github.io/2018/01/18/bitcoin-and-blockchain/
2018-01-17 從鍋爐工到AI專家(10) https://formoon.github.io/2018/01/17/tensorFlow-series-10/
2018-01-17 Python2中文處理紀要 https://formoon.github.io/2018/01/17/python2-chn-process/
2018-01-16 從鍋爐工到AI專家(9) https://formoon.github.io/2018/01/16/tensorFlow-series-9/
2018-01-15 從鍋爐工到AI專家(8) https://formoon.github.io/2018/01/15/tensorFlow-series-8/
2018-01-12 從鍋爐工到AI專家(7) https://formoon.github.io/2018/01/12/tensorFlow-series-7/
2018-01-11 從鍋爐工到AI專家(6) https://formoon.github.io/2018/01/11/tensorFlow-series-6/
2018-01-11 從鍋爐工到AI專家(5) https://formoon.github.io/2018/01/11/tensorFlow-series-5/
2018-01-10 從鍋爐工到AI專家(4) https://formoon.github.io/2018/01/10/tensorFlow-series-4/
2018-01-10 Octave Fontconfig報錯解決 https://formoon.github.io/2018/01/10/octave-fontconfig-warning/
2018-01-10 5分鐘搭建一個quic服務器 https://formoon.github.io/2018/01/10/5mins-support-quic/
2018-01-09 從鍋爐工到AI專家(3) https://formoon.github.io/2018/01/09/tensorFlow-series-3/
2018-01-08 從鍋爐工到AI專家(2) https://formoon.github.io/2018/01/08/tensorFlow-series-2/
2018-01-08 從鍋爐工到AI專家(1) https://formoon.github.io/2018/01/08/tensorFlow-series-1/
2018-01-04 解決本博客在手機瀏覽器拖動卡頓問題 https://formoon.github.io/2018/01/04/solve-mobile-browser-pull-problem/
2018-01-04 OpenCV中的照片剪裁 https://formoon.github.io/2018/01/04/opencv-photo-crop/
2018-01-04 OpenCV中的亮度對比度調整及其自動均衡 https://formoon.github.io/2018/01/04/opencv-brightness-and-contrast/
2018-01-03 Mac電腦C語言開發的入門帖 https://formoon.github.io/2018/01/03/c-hello-world-for-mac/
2018-01-02 如何看到微信小程序的源碼 https://formoon.github.io/2018/01/02/wechat-mini-app-rd/
2018-01-02 使用人工輔助點達成更優白平衡 https://formoon.github.io/2018/01/02/opencv-whitebalance-with-point-confirm/
2017-12-29 不使用插件創建jekyll網站sitemap https://formoon.github.io/2017/12/29/sitemap_of_jekyll/
2017-12-29 safari11如何訪問自簽名https網站 https://formoon.github.io/2017/12/29/safari-self-signed-https/
2017-12-29 趕個時髦,給本身的博客添加一個微信二維碼 https://formoon.github.io/2017/12/29/add-wechat-qrcode-on-your-blog/
2017-12-28 被Docker/VMWare寵壞的孩子們,還記得QEMU嗎? https://formoon.github.io/2017/12/28/qemu-on-mac/
2017-12-28 在網頁顯示數學公式 https://formoon.github.io/2017/12/28/mathjax-in-page/
2017-12-28 使用SDL2顯示一張圖片 https://formoon.github.io/2017/12/28/hello-world-sdl2/
2017-12-27 如何規範的把進程放到Linux後臺運行 https://formoon.github.io/2017/12/27/selinux-run-app-in-background/
2017-12-27 兩種方法操做其它mac應用的窗口 https://formoon.github.io/2017/12/27/move-other-app-window-on-mac/
2017-12-25 本身動手,裝一個液晶電視 https://formoon.github.io/2017/12/25/lcd-tv-diy/
2017-12-25 半小時完成一個溼度溫度計 https://formoon.github.io/2017/12/25/arduino-hygrothermograph/
2017-12-22 MacPro4,1升級到MacPro5,1 https://formoon.github.io/2017/12/22/macpro41-upgrade/
2017-12-22 CameraBox我的講臺客戶端使用說明 https://formoon.github.io/2017/12/22/camerabox-manual/
2017-12-21 一段使用Educast摳像混屏直播的視頻展現 https://formoon.github.io/2017/12/21/streaming-mix/
2017-12-21 七牛對象存儲的使用 https://formoon.github.io/2017/12/21/qiniu-storage/
2017-12-21 Educast視頻直播控制檯使用說明 https://formoon.github.io/2017/12/21/educast-manual/
2017-12-20 批量自動重命名音樂文件 https://formoon.github.io/2017/12/20/mp3-m4a-rename/
2017-12-20 把Markdown文本發佈到微信公衆號文章 https://formoon.github.io/2017/12/20/markdown-to-html-and-wechat/
2017-12-19 Javascript已加入AppleScript全家桶 https://formoon.github.io/2017/12/19/jxa-appscript/
2017-12-19 分享一個很通用的Makefile https://formoon.github.io/2017/12/19/Makefile-skill/
2017-12-18 在Mac電腦編譯c51程序 https://formoon.github.io/2017/12/18/c51-on-mac/
2017-12-16 golang子進程的啓動和中止 https://formoon.github.io/2017/12/16/ubuntu-golang-stop-child-process/
2017-12-15 Ubuntu16.04LTS appstreamcli報錯的處理 https://formoon.github.io/2017/12/15/ubuntu-appstreamcli-error/
2017-12-14 AngularJS2+調用原有的js腳本 https://formoon.github.io/2017/12/14/angular4-ts-and-local-js/
2017-12-14 在國內使用golang的小技巧 https://formoon.github.io/2017/12/14/use-golang-in-china/
2017-12-14 Angular2+的兩個小技巧 https://formoon.github.io/2017/12/14/angular4-hotkeys-and-detect-browser/
2017-12-14 Unix程序員的Win10二三事 https://formoon.github.io/2017/12/14/Unix%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84win10%E4%BA%8C%E4%B8%89%E4%BA%8B/
2017-12-13 在Ubuntu上搭建kindle gtk開發環境 https://formoon.github.io/2017/12/13/hello-world-for-kindle/
2017-12-13 蘋果手機上下載的文件在哪裏? https://formoon.github.io/2017/12/13/download-on-ios/
2017-12-11 K60平臺智能車開發工做隨手記 https://formoon.github.io/2017/12/11/smart-car-k60-develope/
2017-12-11 使用Jekyll和github搭建本身的我的博客 https://formoon.github.io/2017/12/11/setting-your-own-jekyll-blog/
2017-12-11 使用ffmpeg作簡單的音視頻剪輯 https://formoon.github.io/2017/12/11/ffmpeg-auido-video-edit/
2017-12-08 安裝Homebrew https://formoon.github.io/2017/12/08/install-homebrew-on-mac/
2017-12-08 在Mac上安裝ffmpeg https://formoon.github.io/2017/12/08/install-ffmpeg-on-mac/
2017-12-08 Hello World https://formoon.github.io/2017/12/08/hello-world/
共 81 篇文章
close spider ...

Reference links

Scrapy documentation (Chinese)
XPath tutorial
CSS selector reference
