How to run a simple Scrapy spider

1. Create a Scrapy project

scrapy startproject python123demo
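
startproject generates a project skeleton. In Scrapy 1.4 it looks roughly like this (a sketch; details vary slightly between versions):

python123demo/
    scrapy.cfg            # deployment configuration
    python123demo/        # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py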

2. Write a spider inside the project

cd python123demo

scrapy genspider demo python123.io
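
genspider creates python123demo/spiders/demo.py from the basic template. The generated file looks roughly like this (a sketch based on Scrapy 1.4's default template):

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/']

    def parse(self, response):
        # genspider leaves the parsing logic to be filled in
        pass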

3. Configure the spider (edit demo.py)
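
For example, a minimal demo.py that saves the fetched page to disk (the URL matches the request seen in the log below; fname is an illustrative variable name):

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # Derive a file name from the URL and save the response body
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s.' % fname)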

4. Run the spider

scrapy crawl demo

 

A few problems came up while running. They were caused by dependency packages that did not get installed along with Scrapy.

ModuleNotFoundError: No module named 'win32api'

The error above means the following package is needed:

pywin32-221-cp36-cp36m-win_amd64.whl
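
The downloaded wheel can be installed directly with pip:

pip install pywin32-221-cp36-cp36m-win_amd64.whl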

ImportError: DLL load failed: The specified module could not be found.

This error occurs because the pywin32-221-cp36-cp36m-win_amd64.whl package was not installed successfully.

Re-running the generated pywin32_postinstall.py script fixes it:

python.exe Scripts\pywin32_postinstall.py -install

 

However, an error may still appear:

F:\Python36>python.exe Scripts\pywin32_postinstall.py -install
Copied pythoncom36.dll to F:\Python36\pythoncom36.dll
Copied pywintypes36.dll to F:\Python36\pywintypes36.dll
You do not have the permissions to install COM objects.
The sample COM objects were not registered.
-> Software\Python\PythonCore\3.6\Help[None]=None
-> Software\Python\PythonCore\3.6\Help\Pythonwin Reference[None]='F:\\Python36\\Lib\\site-packages\\PyWin32.chm'
Pythonwin has been registered in context menu
Creating directory F:\Python36\Lib\site-packages\win32com\gen_py
Can't install shortcuts - 'C:\\Users\\asus\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.6' is not a folder
The pywin32 extensions were successfully installed.

 

Clearly, the command above must be run with administrator privileges.


On success, it prints the following:


PS C:\WINDOWS\system32> f:
PS F:\> cd .\Python36\
PS F:\Python36> python.exe Scripts\pywin32_postinstal
Copied pythoncom36.dll to C:\WINDOWS\system32\pythonc
Copied pywintypes36.dll to C:\WINDOWS\system32\pywint
Registered: Python.Interpreter
Registered: Python.Dictionary
Registered: Python
-> Software\Python\PythonCore\3.6\Help[None]=None
-> Software\Python\PythonCore\3.6\Help\Pythonwin Refe
Pythonwin has been registered in context menu
Shortcut for Pythonwin created
Shortcut to documentation created
The pywin32 extensions were successfully installed.

 

Running the spider again, it finally works:

F:\pyProject\python123demo>scrapy crawl demo
2017-10-29 09:16:43 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: python123demo)
2017-10-29 09:16:43 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'python123demo', 'NEWSPIDER_MODULE': 'python123demo.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['python123demo.spiders']}
2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-29 09:16:43 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-29 09:16:43 [scrapy.core.engine] INFO: Spider opened
2017-10-29 09:16:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-29 09:16:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-29 09:16:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://python123.io/robots.txt> from <GET http://python123.io/robots.txt>
2017-10-29 09:16:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://python123.io/robots.txt> (referer: None)
2017-10-29 09:16:44 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://python123.io/ws/demo.html> from <GET http://python123.io/ws/demo.html>
2017-10-29 09:16:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://python123.io/ws/demo.html> (referer: None)
2017-10-29 09:16:44 [scrapy.core.scraper] ERROR: Spider error processing <GET https://python123.io/ws/demo.html> (referer: None)
Traceback (most recent call last):
File "f:\python36\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "F:\pyProject\python123demo\python123demo\spiders\demo.py", line 14, in parse
self.log('Save file %s.' % name)
NameError: name 'name' is not defined
2017-10-29 09:16:44 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-29 09:16:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 884,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 1595,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 10, 29, 1, 16, 44, 929393),
'log_count/DEBUG': 5,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/NameError': 1,
'start_time': datetime.datetime(2017, 10, 29, 1, 16, 44, 121136)}
2017-10-29 09:16:44 [scrapy.core.engine] INFO: Spider closed (finished)
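
Note that the remaining ERROR in the log is not an environment problem: in demo.py's parse(), the variable name is used before it is defined (line 14 of demo.py). Defining the file name before logging it, as in the demo.py sketch in step 3, removes the NameError:

fname = response.url.split('/')[-1]
self.log('Save file %s.' % fname)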


Summary: to avoid failed installs of Scrapy's dependencies, you can download each dependency yourself from:

https://www.lfd.uci.edu/~gohlke/pythonlibs/

lxml

pywin32

Twisted

OpenSSL

Place the downloaded wheels under the Scripts directory,

then install each corresponding .whl file with pip install.

Finally, import each module to check that the installation succeeded.
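
For example (the wheel file names below are illustrative; use whichever versions you downloaded):

pip install lxml-4.1.0-cp36-cp36m-win_amd64.whl
pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip install pyOpenSSL-17.3.0-py2.py3-none-any.whl
pip install pywin32-221-cp36-cp36m-win_amd64.whl
python -c "import lxml, twisted, OpenSSL, win32api"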

 

Finally, Scrapy itself can be upgraded from the command line with pip:
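
pip install --upgrade scrapy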
