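I. Background
At its core, a crawler is a socket client talking to a server. If we have several URLs to crawl and fetch them serially, each fetch has to finish before the next one can start, which is very inefficient.
To be clear, serial execution is not inherently inefficient: if the serial tasks are purely computational, CPU utilization stays high. A serial crawler is slow because crawling is a distinctly IO-bound workload.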
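The cost of serial IO-bound work can be seen in a minimal sketch. It assumes a hypothetical `fake_fetch` helper that stands in for a network request by sleeping for a fixed delay; no real URL is contacted. Run serially, the total time is roughly the sum of the individual waits, even though the CPU is idle almost the whole time.

```python
import time

def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a network request: it only sleeps for `delay` seconds."""
    time.sleep(delay)  # the CPU is idle while we "wait for the server"
    return 'response from %s' % url

def crawl_serially(urls, delay=0.1):
    """Fetch every URL one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_fetch(url, delay) for url in urls]
    return results, time.perf_counter() - start

results, elapsed = crawl_serially(['url-a', 'url-b', 'url-c'])
print(len(results), 'responses in %.2f seconds' % elapsed)
```

With three simulated requests of 0.1 seconds each, the elapsed time comes out at about 0.3 seconds: the waits add up because nothing else runs while one request is pending.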
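II. Synchronous calls, asynchronous calls, and callbacks
When writing a crawler, most of the time goes to IO requests. In a single-process, single-threaded program, every URL request blocks while waiting for the response, which slows down the whole run.
1. Synchronous: after submitting a task, the program waits in place until the task finishes, and only moves on to the next line once it has the result. This is inefficient.

import requests

def fetch_async(url):
    # Despite the name, this call is synchronous: requests.get blocks until the response arrives
    response = requests.get(url)
    return response  # the body is available as response.text, and so on

url_list = ['http://www.jiemian.com', 'http://www.bing.com']
for url in url_list:
    print(fetch_async(url))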
2. A simple solution: multithreading or multiprocessing
#On the server side, use multiple threads (or processes). The goal is to give every connection its own thread (or process), so that one blocked connection does not hold up the others.

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)

from multiprocessing import Process
from threading import Thread
import requests

def get_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

if __name__ == '__main__':
    urls = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']
    for url in urls:
        p = Process(target=get_page, args=(url,))
        p.start()
        # t = Thread(target=get_page, args=(url,))
        # t.start()

Limitations of this approach:
#We cannot spawn processes or threads without limit. When hundreds or thousands of connections must be handled at once, both threads and processes consume a large share of system resources and reduce the system's responsiveness, and the threads and processes themselves become more likely to hang.
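3. Improved approach: thread pool or process pool + asynchronous calls. After submitting a task, the program does not wait for it to finish and moves straight on to the next line.
#Many programmers will reach for a "thread pool" or a "connection pool". A thread pool reduces how often threads are created and destroyed: it keeps a reasonable number of threads alive and lets idle threads pick up new tasks. A connection pool keeps a cache of connections, reusing existing ones wherever possible and reducing how often connections are opened and closed. Both techniques lower system overhead and are widely used in large systems such as WebSphere, Tomcat, and various databases.

from concurrent.futures import ProcessPoolExecutor
import requests
import os

def fetch_async(url):
    print('%s GET : %s' % (os.getpid(), url))
    response = requests.get(url)
    return response

def callback(future):
    # Invoked in the parent process once the submitted task has finished
    future.result()
    print('%s parsing' % os.getpid())

if __name__ == '__main__':
    url_list = ['https://www.baidu.com/', 'http://www.sina.com.cn/', 'https://www.python.org']
    pool = ProcessPoolExecutor(2)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)

The improved approach still has problems:
#"Thread pools" and "connection pools" only ease, to some extent, the resource usage caused by frequent IO calls. Moreover, a pool always has an upper limit; when the number of requests far exceeds that limit, a pooled system responds to the outside world hardly any better than one without a pool. So when using a pool you must consider the scale of load it will face and size the pool accordingly.
For the thousands or even tens of thousands of simultaneous client requests that the example above might face, a thread pool or connection pool may relieve some of the pressure, but it cannot solve every problem. In short, the multithreaded model handles small-scale request loads conveniently and efficiently, but it hits a bottleneck at large scale; non-blocking interfaces are one way to try to address this.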
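The upper limit of a pool can be illustrated with a small sketch. It assumes a hypothetical `fake_fetch` that sleeps instead of performing a real request, and compares a pool of 2 workers with a pool of 6 workers on the same 6 tasks. The numbers are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a blocking network request."""
    time.sleep(delay)
    return url

def crawl_with_pool(urls, max_workers, delay=0.1):
    """Run fake_fetch over urls in a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the results in the same order as the input URLs
        results = list(pool.map(lambda u: fake_fetch(u, delay), urls))
    return results, time.perf_counter() - start

urls = ['url-%d' % i for i in range(6)]
_, small_pool_time = crawl_with_pool(urls, max_workers=2)
_, large_pool_time = crawl_with_pool(urls, max_workers=6)
print('2 workers: %.2fs, 6 workers: %.2fs' % (small_pool_time, large_pool_time))
```

With only 2 workers the 6 tasks run in three rounds, so requests beyond the pool size simply queue up. Enlarging the pool helps here, but every extra worker is another thread the system has to carry.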
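III. High performance
None of the solutions above actually solves one performance problem: blocking IO. Whether we use processes or threads, whenever one blocks on IO the operating system takes the CPU away from it, and the program's efficiency drops.
The key is to detect blocking IO at the application level ourselves and switch to another task within our own program. This keeps the time the program spends blocked on IO to a minimum, so it stays in the ready state more of the time. To the operating system, the program then looks like one that does little IO, so it is given the CPU as much as possible, which improves execution efficiency.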
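Application-level switching can be sketched with plain generators. This is a simplified illustration, not how any particular library is implemented: the hypothetical `task` generator yields at every point where a real task would block on IO, and the hypothetical `run_all` scheduler resumes tasks in round-robin order. A real event loop would resume a task only once its IO is ready.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: each `yield` marks a point where it would block on IO."""
    for i in range(steps):
        log.append('%s step %d' % (name, i))
        yield  # hand control back to the scheduler instead of blocking

def run_all(tasks):
    """Round-robin scheduler: resume each task in turn until all have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # run the task up to its next yield
            queue.append(current)  # not finished yet: schedule it again
        except StopIteration:
            pass                   # task finished: drop it

log = []
run_all([task('A', 2, log), task('B', 2, log)])
print(log)
```

The log shows the two tasks interleaving within a single thread: the program itself decides when to switch, rather than the operating system.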
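1. The asyncio module, added to the standard library in Python 3.4, can detect IO for us (network IO in particular) and switch between tasks at the application level.

import asyncio

# async def / await replaces the older @asyncio.coroutine / yield from style,
# which was removed in Python 3.11
async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')

async def main():
    await asyncio.gather(func1(), func1())

asyncio.run(main())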
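To see that the coroutines really overlap, the following sketch times three simulated IO waits run through `asyncio.gather`. The `fake_io` coroutine is a hypothetical stand-in that only sleeps; the timing is illustrative.

```python
import asyncio
import time

async def fake_io(name, delay):
    """Hypothetical IO-bound coroutine: awaiting the sleep lets other tasks run."""
    await asyncio.sleep(delay)
    return name

async def main(delay=0.2):
    start = time.perf_counter()
    # gather() runs the three coroutines concurrently on one event loop
    results = await asyncio.gather(
        fake_io('a', delay), fake_io('b', delay), fake_io('c', delay))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, 'in %.2f seconds' % elapsed)
```

The total time is close to a single delay rather than three times the delay, because the event loop switches to another coroutine whenever one is waiting.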
2. However, asyncio only works at the TCP level and does not speak HTTP. So when we need to send an HTTP request, we have to build the HTTP request headers ourselves.

# Fetching a page, taking https://www.python.org/doc/ as the example, involves these key steps:
# Step 1: perform the TCP three-way handshake with the host www.python.org (blocking IO)
# Step 2: build the HTTP request headers
# Step 3: send the HTTP request (blocking IO)
# Step 4: receive the HTTP response (blocking IO)
import asyncio

async def get_page(host, port=80, url='/'):
    # Step 1 (blocking IO): opening the TCP connection blocks, so it must be awaited
    recv, send = await asyncio.open_connection(host, port)

    # Step 2: build the HTTP request; asyncio only sends TCP data, so we assemble the HTTP message ourselves
    request_headers = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    # request_headers = """POST %s HTTP/1.0\r\nHost: %s\r\n\r\nname=egon&password=123""" % (url, host,)
    request_headers = request_headers.encode('utf-8')

    # Step 3 (blocking IO): send the HTTP request
    send.write(request_headers)
    await send.drain()

    # Step 4 (blocking IO): receive the HTTP response
    text = await recv.read()

    # Other processing
    print(host, url, text)
    send.close()
    print('-===>')
    return 1

async def main():
    return await asyncio.gather(
        get_page(host='www.python.org', url='/doc'),
        get_page(host='www.cnblogs.com', url='/linhaifeng'),
        get_page(host='www.openstack.org'),
    )

results = asyncio.run(main())
print('=====>', results)  # [1, 1, 1]

import asyncio

async def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = await asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    await writer.drain()
    text = await reader.read()
    print(host, url, text)
    writer.close()

async def main():
    await asyncio.gather(
        fetch_async('www.cnblogs.com', '/wupeiqi/'),
        fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091'),
    )

asyncio.run(main())
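The hand-built HTTP message used above can be factored into two small helpers. This is a sketch under stated assumptions: `build_request` and `split_response` are hypothetical names, only HTTP/1.0 with a `Host` header (plus `Content-Length` when there is a body) is covered, and the response parsing handles just the status line and the header/body split.

```python
def build_request(method, host, url='/', body=''):
    """Assemble a minimal HTTP/1.0 request as bytes."""
    body_bytes = body.encode('utf-8')
    lines = ['%s %s HTTP/1.0' % (method.upper(), url), 'Host: %s' % host]
    if body_bytes:
        # A request with a body should tell the server how long the body is
        lines.append('Content-Length: %d' % len(body_bytes))
    head = '\r\n'.join(lines) + '\r\n\r\n'  # a blank line ends the headers
    return head.encode('utf-8') + body_bytes

def split_response(raw):
    """Split a raw HTTP response into (status_code, header_bytes, body_bytes)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    status_line = head.split(b'\r\n', 1)[0]   # e.g. b'HTTP/1.0 200 OK'
    status_code = int(status_line.split()[1])
    return status_code, head, body

print(build_request('get', 'www.python.org', '/doc'))
print(split_response(b'HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>'))
```

The bytes returned by `build_request` could be passed to `writer.write`, and the bytes returned by `reader.read()` to `split_response`.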
3. Building HTTP headers by hand is tedious, which is why the aiohttp module exists: it handles the HTTP layer for us, while asyncio still detects IO and does the switching.

import aiohttp
import asyncio

async def fetch_async(session, url):
    print(url)
    async with session.get(url) as response:
        # data = await response.read()
        # print(url, data)
        print(url, response)

async def main():
    # One ClientSession is shared by all requests and closed when the block exits
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            fetch_async(session, 'http://www.google.com/'),
            fetch_async(session, 'http://www.chouti.com/'),
        )

asyncio.run(main())
4. We can also hand the requests.get function to asyncio, which runs it in a thread pool so that the blocking call no longer stalls the event loop.

import asyncio
import requests

async def fetch_async(func, *args):
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) runs the blocking function in the default thread pool
    future = loop.run_in_executor(None, func, *args)
    response = await future
    print(response.url, response.content)

async def main():
    await asyncio.gather(
        fetch_async(requests.get, 'http://www.jiemian.com'),
        fetch_async(requests.get, 'http://www.biying.com'),
    )

asyncio.run(main())
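The same pattern can be tried without network access. In this sketch, `blocking_call` is a hypothetical stand-in for `requests.get` that only sleeps; two such calls are handed to the default executor and awaited together.

```python
import asyncio
import time

def blocking_call(name, delay):
    """Hypothetical blocking function (stands in for requests.get)."""
    time.sleep(delay)
    return name

async def main(delay=0.3):
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # Each blocking call runs in its own worker thread of the default executor
    futures = [loop.run_in_executor(None, blocking_call, name, delay)
               for name in ('a', 'b')]
    results = await asyncio.gather(*futures)
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, 'in %.2f seconds' % elapsed)
```

The two calls overlap because they run in separate threads; the event loop itself stays free while they block. This is still thread-based concurrency underneath, so the pool-size limits discussed earlier apply.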
5. There is also the gevent module, introduced earlier in the discussion of coroutines.

from gevent import monkey
monkey.patch_all()  # patch the standard library before requests is imported

import gevent
import requests

def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)

# ##### Send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### Send the requests (a pool caps the number of coroutines) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])

6. The grequests module, which wraps gevent + requests

import grequests

request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]

# ##### Run the requests and get the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)

# ##### Run the requests and get the list of responses (with exception handling) #####
# def exception_handler(request, exception):
#     print(request, exception)
#     print("Request failed")
# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)
7. Twisted: a networking framework. One of its features is sending asynchronous requests, detecting IO and switching automatically.

# Problem 1: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
#   https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
#   pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
#   pip3 install twisted
# Problem 2: ModuleNotFoundError: No module named 'win32api'
#   https://sourceforge.net/projects/pywin32/files/pywin32/
# Problem 3: openssl
#   pip3 install pyopenssl

# Basic Twisted usage
# Note: getPage has been deprecated since Twisted 16.7; these examples assume a
# Twisted version that still provides it (the install notes above use 17.9.0).
from twisted.web.client import getPage
from twisted.internet import defer, reactor

def all_done(arg):
    # print(arg)
    reactor.stop()

def callback(res):
    print(res)
    return 1

defer_list = []
urls = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'https://www.python.org',
]
for url in urls:
    obj = getPage(url.encode('utf-8'))
    obj.addCallback(callback)
    defer_list.append(obj)

defer.DeferredList(defer_list).addBoth(all_done)
reactor.run()

# Detailed getPage usage (a separate script: the reactor cannot be restarted once stopped)
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse

def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)
reactor.run()
from twisted.web.client import getPage
from twisted.internet import defer, reactor

def all_done(arg):
    reactor.stop()

def callback(contents):
    print(contents)

deferred_list = []
url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
    deferred_list.append(deferred)

dlist = defer.DeferredList(deferred_list)
dlist.addBoth(all_done)
reactor.run()

dlist.addBoth(all_done)      # runs the callback whether the result is a success or a failure
dlist.addCallback(all_done)  # runs only on success
dlist.addErrback(all_done)   # runs only on failure

from twisted.web.client import getPage
from twisted.internet import defer, reactor

def callback(contents):
    print(contents)

@defer.inlineCallbacks
def task(url):
    # getPage returns a Deferred object
    d = getPage(bytes(url, encoding='utf8'))
    d.addCallback(callback)
    yield d

url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
_active = []
for url in url_list:
    d = task(url)
    _active.append(d)

def all_done(arg):
    reactor.stop()

xx = defer.DeferredList(_active)
xx.addBoth(all_done)
reactor.run()
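8. Tornado

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import gen, ioloop

def handle_response(response):
    """
    Handle the content of a response.
    :param response:
    :return:
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)

async def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    http_client = AsyncHTTPClient()
    futures = []
    for url in url_list:
        print(url)
        # Tornado 6 removed the callback argument of fetch(); it returns a Future instead.
        # raise_error=False delivers HTTP errors through response.error rather than raising.
        futures.append(http_client.fetch(HTTPRequest(url), raise_error=False))
    for response in await gen.multi(futures):
        handle_response(response)

# run_sync stops the IO loop once func() has finished, so no manual counter is needed
ioloop.IOLoop.current().run_sync(func)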
The modules above, both those built into Python and those from third parties, provide asynchronous IO requests; they are simple to use and greatly improve efficiency. Under the hood, an asynchronous IO request comes down to [non-blocking socket] + [IO multiplexing]:

import select
import socket
import time


class AsyncTimeoutException(TimeoutError):
    """Raised when a request times out"""

    def __init__(self, msg):
        self.msg = msg
        super(AsyncTimeoutException, self).__init__(msg)


class HttpContext(object):
    """Holds the basic data for a request and its response"""

    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        """
        sock: client socket object for the request
        host: host name to request
        port: port to request
        method: request method
        url: URL to request
        data: data for the request body
        callback: callback invoked when the request completes
        timeout: request timeout in seconds
        """
        self.sock = sock
        self.callback = callback
        self.host = host
        self.port = port
        self.method = method
        self.url = url
        self.data = data

        self.timeout = timeout

        self.__start_time = time.time()
        self.__buffer = []

    def is_timeout(self):
        """Whether the request has timed out"""
        return (self.__start_time + self.timeout) < time.time()

    def fileno(self):
        """File descriptor of the request socket, used by select"""
        return self.sock.fileno()

    def write(self, data):
        """Append response data to the buffer"""
        self.__buffer.append(data)

    def finish(self, exc=None):
        """The response is complete (or failed): invoke the request callback"""
        if not exc:
            response = b''.join(self.__buffer)
            self.callback(self, response, exc)
        else:
            self.callback(self, None, exc)

    def send_request_data(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)

        return content.encode(encoding='utf8')


class AsyncRequest(object):
    def __init__(self):
        self.fds = []          # requests still waiting for a response
        self.connections = []  # requests whose connection is not yet established

    def add_request(self, host, port, method, url, data, callback, timeout):
        """Create a new request"""
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect((host, port))
        except BlockingIOError as e:
            pass
            # print('connection request sent to the remote host')
        req = HttpContext(client, host, port, method, url, data, callback, timeout)
        self.connections.append(req)
        self.fds.append(req)

    def check_conn_timeout(self):
        """Check every pending connection and abort those that have timed out"""
        timeout_list = []
        for context in self.connections:
            if context.is_timeout():
                timeout_list.append(context)
        for context in timeout_list:
            context.finish(AsyncTimeoutException('request timed out'))
            self.fds.remove(context)
            self.connections.remove(context)

    def running(self):
        """Event loop: watch the request sockets and act on them when they become ready"""
        while True:
            # Check first: select() with empty lists fails on some platforms
            if not self.fds:
                return
            r, w, e = select.select(self.fds, self.connections, self.fds, 0.05)

            for context in r:
                sock = context.sock
                while True:
                    try:
                        data = sock.recv(8096)
                        if not data:
                            # The server closed the connection: the response is complete
                            self.fds.remove(context)
                            context.finish()
                            break
                        else:
                            context.write(data)
                    except BlockingIOError as e:
                        # No more data for now; go back to select()
                        break
                    except OSError as e:
                        # Timeouts, refused or reset connections, and similar errors
                        self.fds.remove(context)
                        if context in self.connections:
                            self.connections.remove(context)
                        context.finish(e)
                        break

            for context in w:
                # Connected to the remote server: start sending the request data
                if context in self.fds:
                    data = context.send_request_data()
                    context.sock.sendall(data)
                    self.connections.remove(context)

            self.check_conn_timeout()


if __name__ == '__main__':
    def callback_func(context, response, ex):
        """
        :param context: HttpContext object holding the request information
        :param response: content of the response
        :param ex: the exception object if an error occurred, otherwise None
        :return:
        """
        print(context, response, ex)

    obj = AsyncRequest()
    url_list = [
        {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        print(item)
        obj.add_request(**item)

    obj.running()
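The combination of non-blocking sockets and IO multiplexing can also be tried without any network access. The following sketch uses the standard `selectors` module (a higher-level wrapper around `select` and its relatives) and `socket.socketpair()` to create connected sockets locally. The function name `echo_over_socketpair` is hypothetical, and the messages are assumed to be short and non-empty.

```python
import selectors
import socket

def echo_over_socketpair(messages):
    """Send each message through its own socket pair; read them all with one selector."""
    selector = selectors.DefaultSelector()
    writers = []
    for index, message in enumerate(messages):
        reader, writer = socket.socketpair()
        reader.setblocking(False)  # non-blocking socket: recv never stalls the loop
        selector.register(reader, selectors.EVENT_READ, data=index)
        writer.sendall(message)
        writers.append(writer)

    received = {}
    while len(received) < len(messages):
        # IO multiplexing: one call waits on every registered socket at once
        for key, _ in selector.select(timeout=1):
            received[key.data] = key.fileobj.recv(1024)
            selector.unregister(key.fileobj)
            key.fileobj.close()

    for writer in writers:
        writer.close()
    selector.close()
    return [received[i] for i in range(len(messages))]

print(echo_over_socketpair([b'one', b'two', b'three']))
```

A single thread services all the sockets: it asks the selector which ones are ready and reads only from those, which is the same idea the `AsyncRequest` event loop implements by hand with `select.select`.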