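Installing pyspider: pip3 install pyspider
Starting the service and using the web UI
1. Open cmd and run pyspider --help to see the help text; run pyspider all to start the service.
2. Once the startup output shows 0.0.0.0:5000, the service is ready. Open 127.0.0.1:5000 or http://localhost:5000/ in a browser to reach the pyspider web UI.
3. First click Create to create a project; any name will do.
4. The code shown in the right-hand pane of the web page is as follows: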
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-22 23:16:23
# Project: TripAdvisor

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('__START_URL__', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
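Replace __START_URL__ with the address of the site you want to crawl and save. Then click the run button on the left, click follow in the left pane, and use the < and > arrows.
My first attempt with pyspider died before it got anywhere: a 599 error. I immediately searched Baidu for a fix for "PySpider HTTP 599: SSL certificate problem" and found someone in the same boat. Their write-up is here: https://blog.csdn.net/asmcvc/article/details/51016485
The full error output (paths will differ depending on where Python is installed):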
[E 180822 23:51:45 base_handler:203] HTTP 599: SSL certificate problem: self signed certificate in certificate chain
    Traceback (most recent call last):
      File "e:\programs\python\python36\lib\site-packages\pyspider\libs\base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "e:\programs\python\python36\lib\site-packages\pyspider\libs\base_handler.py", line 175, in _run_task
        response.raise_for_status()
      File "e:\programs\python\python36\lib\site-packages\pyspider\libs\response.py", line 172, in raise_for_status
        six.reraise(Exception, Exception(self.error), Traceback.from_string(self.traceback).as_traceback())
      File "e:\programs\python\python36\lib\site-packages\six.py", line 692, in reraise
        raise value.with_traceback(tb)
      File "e:\programs\python\python36\lib\site-packages\pyspider\fetcher\tornado_fetcher.py", line 378, in http_fetch
        response = yield gen.maybe_future(self.http_client.fetch(request))
      File "e:\programs\python\python36\lib\site-packages\tornado\httpclient.py", line 102, in fetch
        self._async_client.fetch, request, **kwargs))
      File "e:\programs\python\python36\lib\site-packages\tornado\ioloop.py", line 458, in run_sync
        return future_cell[0].result()
      File "e:\programs\python\python36\lib\site-packages\tornado\concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
    Exception: HTTP 599: SSL certificate problem: self signed certificate in certificate chain
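Cause of the error:
This error occurs when requesting URLs that start with https: SSL verification fails because the certificate does not validate.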
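To make the cause concrete outside pyspider, here is a minimal sketch using Python's standard ssl module. It is only an analogy for what certificate validation means, not a description of how pyspider handles it internally; the two function names are made up for illustration.

```python
import ssl


def verifying_context():
    """Default client context: the server certificate chain must validate."""
    return ssl.create_default_context()


def non_verifying_context():
    """Client context that accepts any certificate (insecure, debugging only)."""
    context = ssl.create_default_context()
    # check_hostname has to be switched off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context
```

A client built on the first context would normally refuse a chain containing a self-signed certificate, which is the situation the 599 message describes. The second context skips that check, so it should only be used when the risk is understood.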
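Solution:
Use self.crawl(url, callback=self.index_page, validate_cert=False). Note that validate_cert=False has to be passed in every self.crawl call that fetches a page; otherwise opening the child pages still fails with 599, which drove me up the wall.
The code is as follows: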
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-23 23:06:13
# Project: v2ex

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.v2ex.com/?tab=tech', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
            self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

    @config(priority=2)
    def tab_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(priority=2)
    def board_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
            url = each.attr.href
            if url.find('#reply') > 0:
                url = url[0:url.find('#')]
            self.crawl(url, callback=self.detail_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('h1').text()
        content = response.doc('div.topic_content')
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
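The board_page callback trims the #reply... part of a topic link with url.find. As a small sketch of the same idea, the standard library's urllib.parse.urldefrag can do the trimming. The helper name strip_fragment and the topic id in the usage example are made up for illustration. Unlike the original check, this version removes any fragment, not only ones starting with #reply.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return url without its '#...' fragment, if it has one."""
    return urldefrag(url).url


if __name__ == "__main__":
    # Hypothetical topic link, used only to show the effect.
    print(strip_fragment("https://www.v2ex.com/t/123#reply5"))
```

Inside board_page this would replace the two url.find lines with url = strip_fragment(each.attr.href).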
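This approach basically solves the problem. (The browser needs a manual refresh. 360 Secure Browser seems to have this small issue, though it may be down to my settings; I switched to Chrome and Firefox and did not see it there.)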
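Because validate_cert=False has to be repeated in every self.crawl call, it is easy to miss one. A possible way to keep the option in a single place, sketched here in plain Python, is to build the keyword arguments with a small helper and unpack them at each call. COMMON_CRAWL_OPTIONS and crawl_options are hypothetical names, not part of pyspider.

```python
# Options shared by every self.crawl call (hypothetical name, not a pyspider API).
COMMON_CRAWL_OPTIONS = {"validate_cert": False}


def crawl_options(**overrides):
    """Return a fresh dict of the shared options merged with per-call overrides."""
    options = dict(COMMON_CRAWL_OPTIONS)
    options.update(overrides)
    return options


# Inside a handler method it would be used like this:
#   self.crawl(url, callback=self.detail_page, **crawl_options())
```

Each call gets its own copy of the dict, so an override passed for one request does not leak into the shared defaults.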
For Linux and macOS, see the following link:
https://blog.csdn.net/WebStudy8/article/details/51610953