Scrapy is a mature crawler framework that can fetch web pages and extract structured data from them; it is already used in production by quite a few companies. For more details, see the official site: www.scrapy.org.
Following the installation guide on the official site, we will install everything step by step, referring mainly to http://doc.scrapy.org/en/latest/intro/install.html:
- Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyopenssl (for HTTPS support. Optional, but highly recommended)
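As a quick stdlib-only illustration (not part of Scrapy itself, and the helper name is my own), the interpreter requirement above can be written as a small compatibility check:

```python
import sys

# Scrapy 0.14-era requirement: Python 2.5, 2.6, or 2.7 (3.x was not
# yet supported). This helper simply encodes that rule.
def scrapy_014_supports(version_info):
    major, minor = version_info[0], version_info[1]
    return major == 2 and 5 <= minor <= 7

print(scrapy_014_supports((2, 7, 2)))  # the version installed below -> True
print(scrapy_014_supports((2, 4, 0)))  # the RHEL 5 default -> False
print(scrapy_014_supports((3, 0, 0)))  # 3.x -> False
```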
Below is a record of the whole process, from installing Python through installing Scrapy; at the end, we verify the setup by running a command that fetches a page.
Preparation
Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841
Installation and Configuration
1. Install zlib
First, check whether zlib is already installed on your system; it is a data-compression library that Scrapy depends on. On my RHEL 5 system, check with:
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
It is installed by default on my system, in which case this step can be skipped. If it is not installed, download it from
http://www.zlib.net/
and install it. Assuming the download is zlib-1.2.5.tar.gz, the commands are:
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
2. Install Python
Python 2.4 was already installed on my system; following the official requirements and recommendations, I chose Python-2.7.2. Download links:
http://www.python.org/download/ (may require a proxy)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source, recompiled it, and installed it as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
By default, Python is installed under /usr/local/lib/python2.7.
If your system had no Python installed before, run it from the command line:
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
This shows the newly installed Python is ready to use.
If other Python versions exist on your system (mine had 2.4), make a symbolic link:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
這樣操做之後,在執行python,就生效了。
3. Install setuptools
This step installs a tool for managing Python modules; skip it if setuptools is already installed. If you need to install it, see:
http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
Note also that after installing Python-2.7.2, the unpacked Python source tree contains a setup.py script that can build and install some of Python's bundled modules; run:
- [root@localhost Python-2.7.2]# python setup.py install
Once it finishes, the relevant Python modules are installed under /usr/local/lib/python2.7/site-packages.
4. Install zope.interface
Download links:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
Install as follows:
- [root@localhost scrapy]$ tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]$ cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]$ python setup.py build
- [root@localhost zope.interface-3.8.0]$ python setup.py install
After installation, zope and zope.interface-3.8.0-py2.7.egg-info appear under /usr/local/lib/python2.7/site-packages.
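Each of the following packages is supposed to land under site-packages in the same way; a small helper (illustrative only, the function name is my own) can confirm which directory a module is actually imported from:

```python
import os

def module_dir(name):
    """Return the directory a module imports from, or None if missing."""
    try:
        mod = __import__(name)
    except ImportError:
        return None
    # Built-in modules have no __file__; fall back to an empty path.
    return os.path.dirname(getattr(mod, "__file__", ""))

print(module_dir("os") is not None)      # stdlib module: found
print(module_dir("no_such_module_xyz"))  # missing module: None
```

Running `module_dir("zope")` after this step should report the site-packages path shown above.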
5. Install Twisted
Download links:
http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
Install as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
After installation, twisted and Twisted-11.1.0-py2.7.egg-info appear under /usr/local/lib/python2.7/site-packages.
6. Install w3lib
Download links:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Install as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
After installation, w3lib and w3lib-1.0-py2.7.egg-info appear under /usr/local/lib/python2.7/site-packages.
7. Install libxml2
Download links:
http://download.chinaunix.net/download.php?id=28497&ResourceID=6095
http://download.chinaunix.net/down.php?id=28497&ResourceID=6095&site=1
Alternatively, the matching release can be found at http://xmlsoft.org.
Install as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
8. Install pyOpenSSL
This step is optional; the package can be downloaded from:
https://launchpad.net/pyopenssl
Choose the version you need. I skipped this step here.
9. Install Scrapy
Download links:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
Install as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
Verifying the Installation
With the installation and configuration above complete, Scrapy is ready; verify it from the command line:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
-
- Usage:
- scrapy <command> [options] [args]
-
- Available commands:
- fetch Fetch a URL using the Scrapy downloader
- runspider Run a self-contained spider (without creating a project)
- settings Get settings values
- shell Interactive scraping console
- startproject Create new project
- version Print Scrapy version
- view Open URL in browser, as seen by Scrapy
-
- Use "scrapy <command> -h" to see more info about a command
The help output lists a fetch command, which downloads a specified page. First, look at the fetch command's own help:
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
- scrapy fetch [options] <url>
-
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
-
- Options
- =======
- --help, -h show this help message and exit
- --spider=SPIDER use this spider
- --headers print response HTTP headers instead of body
-
- Global Options
- --------------
- --logfile=FILE log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
- log level (default: DEBUG)
- --nolog disable logging completely
- --profile=FILE write python cProfile stats to FILE
- --lsprof=FILE write lsprof profiling stats to FILE
- --pidfile=FILE write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
- set/override setting (may be repeated)
Following the usage line, give it a URL to fetch a page:
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 2011-12-05 23:40:05+0800 [default] INFO: Spider opened
- 2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
- 2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
- 'downloader/request_count': 1,
- 'downloader/request_method_count/GET': 1,
- 'downloader/response_bytes': 22676,
- 'downloader/response_count': 1,
- 'downloader/response_status_count/200': 1,
- 'finish_reason': 'finished',
- 'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
- 'scheduler/memory_enqueued': 1,
- 'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
- 2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
As we can see, a page was fetched successfully.
Next, you can follow the official Scrapy tutorial to apply the framework further: http://doc.scrapy.org/en/latest/intro/tutorial.html.
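Scrapy's real value is extracting structured data from pages like the one just fetched. As a stdlib-only sketch of that idea (this is not Scrapy's API; Scrapy uses its own selectors), here is a parser that pulls the <title> out of saved HTML such as install.html:

```python
try:
    from HTMLParser import HTMLParser   # Python 2, as installed above
except ImportError:
    from html.parser import HTMLParser  # Python 3 fallback

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>Installation guide</title></head></html>")
print(parser.title)  # -> Installation guide
```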