These past few days, at a friend's request, I set out to scrape post data from a forum. I searched around for open-source crawlers, read a number of comparisons, and concluded that Scrapy was the best fit. But I had always worked in Java and PHP and didn't know Python, so I spent a day skimming the basics of the language. Then I got started, and unexpectedly spent a whole day just getting a working environment configured. Below is a record of the pitfalls I hit while installing and configuring Scrapy.
Environment: CentOS 6.0 virtual machine
The first step was a Python runtime. I ran the python command and found one already installed, to my (misplaced) delight. A quick search suggested installing with pip install Scrapy, which failed: pip was missing, so I installed pip. Running pip install Scrapy again revealed python-devel was missing too, and I went back and forth like this for an entire morning. Eventually I downloaded the Scrapy source to install it directly, and it abruptly demanded Python 2.7. python --version showed 2.6 staring back at me, and a thousand curses galloped through my mind.
Checking the official documentation (http://doc.scrapy.org/en/master/intro/install.html) confirmed it: Python 2.7 is required. Nothing for it but to upgrade Python.
1. Upgrade Python
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -zxvf Python-2.7.10.tgz
cd Python-2.7.10
./configure
make all
make install
make clean
make distclean
python --version
Still 2.6. The new interpreter went into /usr/local/bin, so point /usr/bin/python at it:
mv /usr/bin/python /usr/bin/python2.6.6_bak
ln -s /usr/local/bin/python2.7 /usr/bin/python
# python --version
Python 2.7.10
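The same version requirement can be enforced from inside Python itself. This is a minimal sketch (standard library only, written in Python 3 syntax for brevity even though the post targets 2.7) of the kind of guard you could put at the top of a script to fail fast instead of dying later with a cryptic import error:

```python
import sys

def check_python_version(required=(2, 7)):
    """Return True if the running interpreter is at least `required`."""
    return tuple(sys.version_info[:2]) >= required

if not check_python_version():
    # Exit with a clear message rather than a confusing traceback later on.
    sys.exit("Python 2.7+ is required, found %s" % sys.version.split()[0])
```

The `check_python_version` helper name is my own invention for illustration; Scrapy itself simply refuses to install on older interpreters.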
At this point the Python upgrade was done, so I went back to installing Scrapy. pip install scrapy errored again:
-bash: pip: command not found
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
Then pip install scrapy again; another error:
Collecting Twisted>=10.0.0 (from scrapy)
  Could not find a version that satisfies the requirement Twisted>=10.0.0 (from scrapy) (from versions: )
No matching distribution found for Twisted>=10.0.0 (from scrapy)
Twisted was missing, so install Twisted.
2. Install Twisted
wget https://pypi.python.org/packages/source/T/Twisted/Twisted-15.2.1.tar.bz2
tar -xjvf Twisted-15.2.1.tar.bz2
cd Twisted-15.2.1
python setup.py install
python
Python 2.7.10 (default, Jun  5 2015, 17:56:24)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import twisted
>>>
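The interactive `import twisted` check above can be wrapped into a small reusable helper when you have several dependencies to verify at once. A minimal sketch (standard library only; the helper name is mine, not part of any library):

```python
import importlib

def is_importable(module_name):
    """Try to import a module; return True on success, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False

# e.g. is_importable('twisted') mirrors the interactive `import twisted` test,
# and you could loop over ['twisted', 'lxml', 'cryptography'] before installing.
```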
That shows Twisted was installed correctly. So, continuing with pip install scrapy: still an error.
3. Install libxslt, libxml2 and xslt-config
Collecting libxlst
  Could not find a version that satisfies the requirement libxlst (from versions: )
No matching distribution found for libxlst
Collecting libxml2
  Could not find a version that satisfies the requirement libxml2 (from versions: )
No matching distribution found for libxml2
wget http://xmlsoft.org/sources/libxslt-1.1.28.tar.gz
tar -zxvf libxslt-1.1.28.tar.gz
cd libxslt-1.1.28/
./configure
make
make install
wget ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz
tar -zxvf libxml2-git-snapshot.tar.gz
cd libxml2-2.9.2/
./configure
make
make install
With those installed, pip install scrapy again. Still no luck:
4. Install cryptography
Failed building wheel for cryptography
Download cryptography (https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz) and install it:
wget https://pypi.python.org/packages/source/c/cryptography/cryptography-0.4.tar.gz
tar -zxvf cryptography-0.4.tar.gz
cd cryptography-0.4
python setup.py build
python setup.py install
The install failed with:
No package 'libffi' found
So download and install libffi:
wget ftp://sourceware.org/pub/libffi/libffi-3.2.1.tar.gz
tar -zxvf libffi-3.2.1.tar.gz
cd libffi-3.2.1
./configure
make
make install
After installing, the error persisted:
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
So set PKG_CONFIG_PATH:
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
Install Scrapy again:
pip install scrapy
Where had the goddess of luck gone?
ImportError: libffi.so.6: cannot open shared object file: No such file or directory
So:
whereis libffi
libffi: /usr/local/lib/libffi.a /usr/local/lib/libffi.la /usr/local/lib/libffi.so
The library itself was installed fine. Some searching showed the problem was an unset LD_LIBRARY_PATH, so:
export LD_LIBRARY_PATH=/usr/local/lib
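The shell export above only affects the current session and its children. The same prepend-to-path logic can be sketched in Python, which makes the semantics of `export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH` explicit (the helper name is my own; note that changing `os.environ` in a running Python process affects subprocesses it launches, not the already-loaded dynamic linker):

```python
import os

def prepend_library_path(directory, env=None):
    """Prepend `directory` to LD_LIBRARY_PATH in `env` (defaults to
    os.environ), mimicking the shell's colon-separated prepend."""
    env = os.environ if env is None else env
    current = env.get("LD_LIBRARY_PATH", "")
    parts = [directory] + ([current] if current else [])
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env["LD_LIBRARY_PATH"]
```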
Then back to installing cryptography-0.4:
python setup.py build
python setup.py install
This time it installed cleanly, with no errors.
5. Continue installing Scrapy
pip install scrapy
Watching the progress output:
Building wheels for collected packages: cryptography
  Running setup.py bdist_wheel for cryptography
It sat here for a long while, and I wondered whether the goddess of luck had finally arrived. After a bit of waiting:
Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /usr/local/lib/python2.7/site-packages/zope.interface-4.1.2-py2.7-linux-i686.egg (from Twisted>=10.0.0->scrapy)
Collecting cryptography>=0.7 (from pyOpenSSL->scrapy)
  Using cached cryptography-0.9.tar.gz
Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/local/lib/python2.7/site-packages (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): idna in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ipaddress in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=0.8 in /usr/local/lib/python2.7/site-packages (from cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): ordereddict in /usr/local/lib/python2.7/site-packages (from enum34->cryptography>=0.7->pyOpenSSL->scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in /usr/local/lib/python2.7/site-packages (from cffi>=0.8->cryptography>=0.7->pyOpenSSL->scrapy)
Building wheels for collected packages: cryptography
  Running setup.py bdist_wheel for cryptography
  Stored in directory: /root/.cache/pip/wheels/d7/64/02/7258f08eae0b9c930c04209959c9a0794b9729c2b64258117e
Successfully built cryptography
Installing collected packages: cryptography
  Found existing installation: cryptography 0.4
    Uninstalling cryptography-0.4:
      Successfully uninstalled cryptography-0.4
Successfully installed cryptography-0.9
Seeing that output, I nearly wept with relief. It had finally installed successfully.
6. Test Scrapy
Create a test script:
cat > myspider.py <<EOF
from scrapy import Spider, Item, Field

class Post(Item):
    title = Field()

class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://www.cnblogs.com/rwxwsblog/']

    def parse(self, response):
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]
EOF
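The spider's `response.css("h2 a::text")` pulls the text of every link nested inside an h2 heading. As an illustration of what that selector matches, here is a standard-library-only sketch of the same extraction on a tiny HTML string (Python 3 syntax for brevity; this is not how Scrapy implements its selectors, which use lxml under the hood):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text of <a> tags that sit inside an <h2> heading,
    roughly what the CSS selector "h2 a::text" matches."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_h2 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears inside an <a> within an <h2>.
        if self.in_link:
            self.titles.append(data)

parser = TitleExtractor()
parser.feed('<h2><a href="/post/1">First post</a></h2><p><a>not a title</a></p>')
# parser.titles == ['First post']
```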
Check that the script runs:
scrapy runspider myspider.py
2015-06-06 20:25:16 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-06 20:25:16 [scrapy] INFO: Optional features available: ssl, http11
2015-06-06 20:25:16 [scrapy] INFO: Overridden settings: {}
2015-06-06 20:25:16 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-06-06 20:25:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-06 20:25:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-06 20:25:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-06 20:25:16 [scrapy] INFO: Enabled item pipelines:
2015-06-06 20:25:16 [scrapy] INFO: Spider opened
2015-06-06 20:25:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-06 20:25:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-06 20:25:17 [scrapy] DEBUG: Crawled (200) <GET http://www.cnblogs.com/rwxwsblog/> (referer: None)
2015-06-06 20:25:17 [scrapy] INFO: Closing spider (finished)
2015-06-06 20:25:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 5383,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 6, 12, 25, 17, 310084),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 6, 6, 12, 25, 16, 863599)}
2015-06-06 20:25:17 [scrapy] INFO: Spider closed (finished)
It ran fine (quiet rejoicing at this point, ^_^...).
7. Create your own Scrapy project (in a new shell session this time)
scrapy startproject tutorial
It printed the following:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 552, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2672, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2345, in load
    return self.resolve()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2351, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/__init__.py", line 48, in <module>
    from scrapy.spiders import Spider
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/spiders/__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/http/request/form.py", line 9, in <module>
    import lxml.html
  File "/usr/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 42, in <module>
    from lxml import etree
ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so)
Another stampede of curses. Why had it stopped working? Steadying myself, I read the error: ImportError: /usr/lib/libxml2.so.2: version `LIBXML2_2.9.0' not found (required by /usr/local/lib/python2.7/site-packages/lxml/etree.so). That looked familiar, very much like the earlier ImportError: libffi.so.6: cannot open shared object file: No such file or directory. So:
8. Add the environment variable
export LD_LIBRARY_PATH=/usr/local/lib
Run again:
scrapy startproject tutorial
It printed:
[root@bogon scrapy]# scrapy startproject tutorial
2015-06-06 20:35:43 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-06 20:35:43 [scrapy] INFO: Optional features available: ssl, http11
2015-06-06 20:35:43 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'tutorial' created in:
    /root/scrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
Success at last. Evidently Scrapy needs the LD_LIBRARY_PATH environment variable at runtime, so it's worth making it permanent:
vi /etc/profile
Add the line export LD_LIBRARY_PATH=/usr/local/lib (the earlier PKG_CONFIG_PATH is also worth adding here: export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH).
Note: pay attention to where libraries actually get installed. Taking libffi as an example:
libtool: install: /usr/bin/install -c .libs/libffi.so.6.0.4 /usr/local/lib/../lib64/libffi.so.6.0.4
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0.4 libffi.so.6 || { rm -f libffi.so.6 && ln -s libffi.so.6.0.4 libffi.so.6; }; })
libtool: install: (cd /usr/local/lib/../lib64 && { ln -s -f libffi.so.6.0.4 libffi.so || { rm -f libffi.so && ln -s libffi.so.6.0.4 libffi.so; }; })
libtool: install: /usr/bin/install -c .libs/libffi.lai /usr/local/lib/../lib64/libffi.la
libtool: install: /usr/bin/install -c .libs/libffi.a /usr/local/lib/../lib64/libffi.a
libtool: install: chmod 644 /usr/local/lib/../lib64/libffi.a
libtool: install: ranlib /usr/local/lib/../lib64/libffi.a
libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/www/wdlinux/mysql/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib/../lib64
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib/../lib64

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
/bin/mkdir -p '/usr/local/share/info'
/usr/bin/install -c -m 644 ../doc/libffi.info '/usr/local/share/info'
install-info --info-dir='/usr/local/share/info' '/usr/local/share/info/libffi.info'
/bin/mkdir -p '/usr/local/lib/pkgconfig'
/usr/bin/install -c -m 644 libffi.pc '/usr/local/lib/pkgconfig'
make[3]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[2]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
make[1]: Leaving directory `/root/python/libffi-3.2.1/x86_64-unknown-linux-gnu'
From this we can see libffi was installed to /usr/local/lib/../lib64, so when setting LD_LIBRARY_PATH it should really be: export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64:$LD_LIBRARY_PATH. This detail is easy to miss.
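Since the install prefix can resolve to lib64 rather than lib, a quick way to confirm where a shared library actually landed (what the `whereis libffi` check above did, limited to known prefixes) is to scan the candidate directories. A minimal sketch, with a helper name of my own invention:

```python
import glob
import os

def locate_shared_library(name, search_dirs):
    """Return paths matching lib<name>.so* in each of `search_dirs`,
    e.g. locate_shared_library('ffi', ['/usr/local/lib', '/usr/local/lib64'])."""
    hits = []
    for d in search_dirs:
        hits.extend(sorted(glob.glob(os.path.join(d, "lib%s.so*" % name))))
    return hits
```

Any directory that turns up hits is a candidate to add to LD_LIBRARY_PATH.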
After saving, reload the profile and check for errors:
source /etc/profile
Open a new session and run:
scrapy runspider myspider.py
It ran normally, confirming that LD_LIBRARY_PATH takes effect. With that, Scrapy was properly installed.
To check the Scrapy version, run scrapy version; it reported "Scrapy 1.0.0rc2".
9. Thoughts beyond coding (thanks for reading this far; I'm a little dizzy myself.)
10. References
http://scrapy.org/
http://doc.scrapy.org/en/master/
http://blog.csdn.net/slvher/article/details/42346887
http://blog.csdn.net/niying/article/details/27103081
http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html