To the operating system, a task is a process: open a browser and you start a browser process; open Notepad and you start a Notepad process; open two Notepad windows and you start two Notepad processes; open Word and you start a Word process.
Some processes do more than one thing at a time. Word, for example, can handle typing, spell checking, and printing simultaneously. For a process to do several things at once internally, it must run multiple "subtasks" at the same time; these subtasks inside a process are called threads.
Differences between processes, threads, and coroutines
The biggest advantage of the multi-process model is stability: if a child process crashes, the master process and the other children are unaffected. (Of course, if the master process dies, everything dies, but the master only hands out tasks, so its odds of crashing are low.) The famous Apache server originally used the multi-process model.
The drawback of multiple processes is that creating a process is expensive. On Unix/Linux a fork() call is acceptable, but on Windows process creation carries a large overhead. Moreover, the operating system can only run so many processes at once: given memory and CPU limits, if several thousand processes run simultaneously, the OS struggles even to schedule them.
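On Unix/Linux that fork() call is literally one line. Below is a minimal sketch of my own (an illustration, not code from the original) showing it; it runs only on Unix-like systems, since Windows has no fork:

import os

# illustrative sketch, not from the original text; Unix/Linux only
# fork() returns twice: 0 in the child process, the child's PID in the parent
pid = os.fork()
if pid == 0:
    print('child: pid=%s' % os.getpid())
    os._exit(0)  # end the child immediately
else:
    print('parent: pid=%s, child=%s' % (os.getpid(), pid))
    os.waitpid(pid, 0)  # wait for the child so it does not become a zombie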
The multi-threaded model is usually a bit faster than the multi-process model, though not by much. Its fatal flaw is that any one thread crashing can bring down the whole process, because all threads share the process's memory.
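To see why shared memory is the weak point, here is a minimal sketch of my own (an illustration, not code from the original): two threads update one global variable without a lock, and because "balance = balance + n" is a load-modify-store sequence, the two threads' steps can interleave and the final value can be wrong (you may need many iterations to observe it):

import threading

balance = 0  # shared state: every thread sees the same variable

def change_it(n):
    global balance
    for _ in range(1000000):
        # not atomic: the load/add/store steps of the two threads can interleave
        balance = balance + n
        balance = balance - n

t1 = threading.Thread(target=change_it, args=(5,))
t2 = threading.Thread(target=change_it, args=(8,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)  # should be 0, but without a lock it may not be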
Advantages of coroutines:
The biggest advantage is the coroutine's extremely high execution efficiency. Switching between subroutines is not thread switching; it is controlled by the program itself, so there is no thread-switching overhead. Compared with multithreading, the larger the number of threads, the more pronounced the coroutine's performance advantage becomes.
The second big advantage is that coroutines need none of multithreading's locking machinery. Everything runs in a single thread, so there are no simultaneous writes to the same variable; to coordinate shared resources in coroutines you only check state, no locking required, which is why execution is much more efficient than with multiple threads.
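To make both advantages concrete, here is a minimal sketch of my own (an illustration, not from the original): two coroutines update the same counter with no lock, because a switch can only happen at an explicit await, never in the middle of a statement:

import asyncio

counter = 0  # shared state, yet safe: only one coroutine runs at any moment

async def work(name):
    global counter
    for _ in range(3):
        counter += 1            # no lock needed: nothing preempts us mid-statement
        await asyncio.sleep(0)  # the only point where a switch can happen
        print(name, counter)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(work('a'), work('b')))
print('final:', counter)  # always 6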
Case 01
# Multiprocessing with a Pool
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    p = Pool(5)
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # don't shadow the built-in name `list`
    print(p.map(f, numbers))  # map applies f to every item

Output: [1, 4, 9, 16, 25, 36, 49, 64, 81]

Case 01-1

# Multiprocessing with a Pool
import time
import requests
from multiprocessing import Pool

task_list = [
    'http://bj.maitian.cn/zfall/PG1',
    'http://bj.maitian.cn/zfall/PG2',
    'http://bj.maitian.cn/zfall/PG3',
    'http://bj.maitian.cn/zfall/PG4',
    'http://bj.maitian.cn/zfall/PG5',
    'http://bj.maitian.cn/zfall/PG6',
    'http://bj.maitian.cn/zfall/PG7',
    'http://bj.maitian.cn/zfall/PG8',
    'http://bj.maitian.cn/zfall/PG9',
    'http://bj.maitian.cn/zfall/PG10',
]

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def download(url):
    response = requests.get(url, headers=header, timeout=30)
    return response.status_code

if __name__ == '__main__':
    p = Pool(10)
    time_old = time.time()
    for item in p.map(download, task_list):
        print(item)
    time_new = time.time()
    time_cost = time_new - time_old
    print(time_cost)
Case 02
# Multiprocessing with Process objects
from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p_1 = Process(target=f, args=('bob',))  # note: args is a one-element tuple
    p_1.start()
    p_1.join()
    p_2 = Process(target=f, args=('alice',))
    p_2.start()
    p_2.join()

Output:
hello bob
hello alice

Case 02-1

# Multiprocessing with Process objects
import time
import requests
from multiprocessing import Process

task_list = [
    'http://bj.maitian.cn/zfall/PG1',
    'http://bj.maitian.cn/zfall/PG2',
    'http://bj.maitian.cn/zfall/PG3',
    'http://bj.maitian.cn/zfall/PG4',
    'http://bj.maitian.cn/zfall/PG5',
]

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def download(url):
    response = requests.get(url, headers=header, timeout=30)
    print(response.status_code)

if __name__ == '__main__':
    processes = []
    for item in task_list:
        p = Process(target=download, args=(item,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # join after starting them all, so the downloads overlap
Case 01
import threading
import time

class myThread(threading.Thread):
    def __init__(self, threadID, name, counter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter

    def run(self):
        print("Starting " + self.name)
        # acquire the lock; returns True once the lock is obtained
        # if the optional timeout argument is omitted, this blocks until the lock is acquired
        # otherwise it returns False after the timeout
        threadLock.acquire()
        print_time(self.name, self.counter, 3)
        # release the lock
        threadLock.release()

def print_time(threadName, delay, counter):
    while counter:
        time.sleep(delay)
        print("%s: %s" % (threadName, time.ctime(time.time())))
        counter -= 1

threadLock = threading.Lock()
threads = []

# create new threads
thread1 = myThread(1, "Thread-1", 1)
thread2 = myThread(2, "Thread-2", 2)

# start the threads
thread1.start()
thread2.start()

# add the threads to the thread list
threads.append(thread1)
threads.append(thread2)

# wait for all threads to finish
for t in threads:
    t.join()
print("Exiting Main Thread")
Case 02

import threadpool
import time

def sayhello(a):
    print("hello: " + a)
    time.sleep(2)

def main():
    seed = ["a", "b", "c"]

    start = time.time()
    task_pool = threadpool.ThreadPool(5)
    requests = threadpool.makeRequests(sayhello, seed)
    for req in requests:
        task_pool.putRequest(req)
    task_pool.wait()
    end = time.time()
    time_m = end - start
    print("time: " + str(time_m))

    start1 = time.time()
    for each in seed:
        sayhello(each)
    end1 = time.time()
    print("time1: " + str(end1 - start1))

if __name__ == '__main__':
    main()
Case 03

from concurrent.futures import ThreadPoolExecutor
import time

def sayhello(a):
    print("hello: " + a)
    time.sleep(2)

def main():
    seed = ["a", "b", "c"]

    start1 = time.time()
    for each in seed:
        sayhello(each)
    end1 = time.time()
    print("time1: " + str(end1 - start1))

    start2 = time.time()
    with ThreadPoolExecutor(3) as executor:
        for each in seed:
            executor.submit(sayhello, each)
    end2 = time.time()
    print("time2: " + str(end2 - start2))

    start3 = time.time()
    with ThreadPoolExecutor(3) as executor1:
        executor1.map(sayhello, seed)
    end3 = time.time()
    print("time3: " + str(end3 - start3))

if __name__ == '__main__':
    main()
In a multithreaded crawler, if one thread runs into a problem, it can take everything down with it (all threads share the process's memory), so multithreading is not a good fit for crawlers.
Case 01

A client example: await means "wait until this finishes before executing the next step".

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()  # note the parentheses on text()

async def main():
    # create a session object with aiohttp's ClientSession()
    async with aiohttp.ClientSession() as session:
        # to use the async function fetch's return value, you must await it,
        # i.e. wait for it to finish before taking the result
        html = await fetch(session, 'http://www.baidu.com')
        print(html)

loop = asyncio.get_event_loop()  # get the EventLoop
loop.run_until_complete(main())  # run the coroutine

Case 02

Concurrency via gather; sleep pauses the current coroutine and hands the CPU to other tasks. Besides running multiple tasks, gather can also group tasks; prefer gather when you can. "gather" means "collect": it collects the coroutines' results, and note that it stores each result in the same order the coroutines were passed in.

# coding:utf-8
import asyncio

async def a(t):
    print('-->', t)
    await asyncio.sleep(0.5)  # pause 0.5s, handing the CPU to other coroutines so they can run
    print('<--', t)
    return t * 10

def main():
    futs = [a(t) for t in range(6)]  # list comprehension
    print(futs)                      # coroutine objects
    ret = asyncio.gather(*futs)      # remember the *
    print(ret)                       # <_GatheringFuture pending>, a future that collects results
    loop = asyncio.get_event_loop()
    ret1 = loop.run_until_complete(ret)
    print(ret1)

main()

Case 03

loop.create_task is used even more widely than gather; loop.create_task schedules a task to start running.

# coding:utf-8
import asyncio

async def a(t):
    print('-->', t)
    await asyncio.sleep(0.5)  # sleep 0.5s here
    print('<--', t)
    return t * 10

async def b():
    # loop = asyncio.get_event_loop()
    cnt = 0  # short for "counter"
    while 1:  # infinite loop, runs forever
        cnt += 1
        cor = a(cnt)  # coroutine
        resp = loop.create_task(cor)
        # while b sleeps, a gets to run: a(1) starts, sleep 0.1s; a(2) starts,
        # sleep 0.1s; ... by the time a(5) starts, 0.5s have passed
        await asyncio.sleep(0.1)
        print(resp)

loop = asyncio.get_event_loop()
loop.run_until_complete(b())