Requesting Baidu with Multithreading, Multiprocessing, Coroutines, and IO Multiplexing

I have recently been studying multithreading, multiprocessing, coroutines, and IO multiplexing. For scraping data, which of these approaches is the fastest? Let's run a small test.

Plain blocking version: request Baidu 5 times

import socket
import time
import socks

socks.set_default_proxy(socks.HTTP, addr='192.168.105.71', port=80)  # configure the proxy (via PySocks)
socket.socket = socks.socksocket  # route every socket through the proxy

def blocking(wd):
    sock = socket.socket()
    sock.connect(('www.baidu.com', 80))  # connect to Baidu
    request = 'GET {} HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n'.format('/s?wd={}'.format(wd))  # build the HTTP request
    response = b''  # buffer for the response
    sock.send(request.encode())  # send the HTTP request
    chunk = sock.recv(1024)  # read up to 1024 bytes at a time
    while chunk:  # keep reading until the server closes the connection
        response += chunk  # append to the buffer
        chunk = sock.recv(1024)
    # print(response.decode())
    return response

def blocking_way():
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    for item in search_list:
        blocking(item)

if __name__ == '__main__':
    start_time = time.time()
    blocking_way()
    print('Total time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))

Result over several runs:

Total time for 5 requests to Baidu: 4.24 seconds

Multithreaded version

import socket
import time
import socks
from multiprocessing.pool import ThreadPool

socks.set_default_proxy(socks.HTTP, addr='192.168.105.71', port=80)  # configure the proxy (via PySocks)
socket.socket = socks.socksocket  # route every socket through the proxy

def blocking(wd):
    sock = socket.socket()
    sock.connect(('www.baidu.com', 80))  # connect to Baidu
    request = 'GET {} HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n'.format('/s?wd={}'.format(wd))  # build the HTTP request
    response = b''  # buffer for the response
    sock.send(request.encode())  # send the HTTP request
    chunk = sock.recv(1024)  # read up to 1024 bytes at a time
    while chunk:  # keep reading until the server closes the connection
        response += chunk
        chunk = sock.recv(1024)
    # print(response.decode())
    return response

def blocking_way():  # multithreaded
    pool = ThreadPool(5)  # thread pool with 5 worker threads
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    for i in search_list:
        pool.apply_async(blocking, args=(i,))  # submit a task to the pool
    pool.close()  # stop accepting new tasks
    pool.join()   # wait for all tasks to finish

if __name__ == '__main__':
    start_time = time.time()
    blocking_way()
    print('Total time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))

Result over several runs:

Total time for 5 requests to Baidu: 1.0 seconds
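
The same idea can also be written with the standard library's concurrent.futures module. The following is only a minimal sketch of an alternative, not part of the measurements above; it assumes the blocking() function defined earlier and additionally collects the responses:

import time
from concurrent.futures import ThreadPoolExecutor

def threaded_way():
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    with ThreadPoolExecutor(max_workers=5) as executor:
        # executor.map runs blocking() in worker threads and yields the
        # responses in the order the keywords were submitted
        responses = list(executor.map(blocking, search_list))
    return responses

if __name__ == '__main__':
    start_time = time.time()
    threaded_way()
    print('Total time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))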

Multiprocessing version

import socket
import time
import socks
from multiprocessing import Pool

socks.set_default_proxy(socks.HTTP, addr='192.168.105.71', port=80)  # configure the proxy (via PySocks)
socket.socket = socks.socksocket  # route every socket through the proxy

def blocking(wd):
    sock = socket.socket()
    sock.connect(('www.baidu.com', 80))  # connect to Baidu
    request = 'GET {} HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n'.format('/s?wd={}'.format(wd))  # build the HTTP request
    response = b''  # buffer for the response
    sock.send(request.encode())  # send the HTTP request
    chunk = sock.recv(1024)  # read up to 1024 bytes at a time
    while chunk:  # keep reading until the server closes the connection
        response += chunk
        chunk = sock.recv(1024)
    # print(response.decode())
    return response

def blocking_way():  # multiprocessing
    pool = Pool(5)  # process pool with 5 worker processes
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    for i in search_list:
        pool.apply_async(blocking, args=(i,))  # submit a task to the pool
    pool.close()  # stop accepting new tasks
    pool.join()   # wait for all tasks to finish

if __name__ == '__main__':
    start_time = time.time()
    blocking_way()
    print('Total time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))

Result over several runs:

Total time for 5 requests to Baidu: 1.52 seconds

Slightly slower than the threaded version: spawning processes costs more than spawning threads, and since the work is IO-bound the extra processes don't help.
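
One thing to note: pool.apply_async above fires the requests off and discards the return values. A minimal sketch of a variant that also collects the responses (again assuming the blocking() function defined earlier) could use Pool.map instead:

from multiprocessing import Pool

def blocking_way_with_results():
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    with Pool(5) as pool:
        # map blocks until every worker process has returned its response
        responses = pool.map(blocking, search_list)
    return responses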

Coroutine version (gevent)

from gevent import monkey; monkey.patch_socket()  # patch the socket module before the other imports
import socket
import time
import socks
import gevent

socks.set_default_proxy(socks.HTTP, addr='192.168.105.71', port=80)  # configure the proxy (via PySocks)
socket.socket = socks.socksocket  # route every socket through the proxy

def blocking(wd):
    sock = socket.socket()
    sock.connect(('www.baidu.com', 80))  # connect to Baidu
    request = 'GET {} HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n'.format('/s?wd={}'.format(wd))  # build the HTTP request
    response = b''  # buffer for the response
    sock.send(request.encode())  # send the HTTP request
    chunk = sock.recv(1024)  # read up to 1024 bytes at a time
    while chunk:  # keep reading until the server closes the connection
        response += chunk
        chunk = sock.recv(1024)
    # print(response.decode())
    return response

def blocking_way():
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    tasks = [gevent.spawn(blocking, i) for i in search_list]  # one greenlet per keyword
    gevent.joinall(tasks)  # wait for all greenlets to finish

if __name__ == '__main__':
    start_time = time.time()
    blocking_way()
    print('Total time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))

Result over several runs:

Total time for 5 requests to Baidu: 1.02 seconds
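
gevent is not the only way to write the coroutine version. Purely as an illustrative sketch (not part of the measurements above, and note it connects to Baidu directly rather than through the proxy), the standard library's asyncio streams API can issue the same five requests concurrently:

import asyncio
import time

async def fetch(wd):
    reader, writer = await asyncio.open_connection('www.baidu.com', 80)  # connect to Baidu
    request = 'GET /s?wd={} HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n'.format(wd)
    writer.write(request.encode())
    await writer.drain()
    response = await reader.read()  # read until the server closes the connection
    writer.close()
    return response

async def main():
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    await asyncio.gather(*[fetch(wd) for wd in search_list])  # run all five requests concurrently

if __name__ == '__main__':
    start_time = time.time()
    asyncio.run(main())
    print('Total time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))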

IO multiplexing version (selectors)

import socks
import time
import socket
import selectors

socks.set_default_proxy(socks.HTTP, addr='192.168.105.71', port=80)  # configure the proxy (via PySocks)
socket.socket = socks.socksocket  # route every socket through the proxy

selector = selectors.DefaultSelector()  # event selector
flag = True  # keeps the event loop running
times = 5    # counts down by one per completed request; the loop stops when it reaches 0

class Crawler:
    def __init__(self, wd):
        self.response = b''  # buffer for the response
        self.wd = wd         # search keyword

    def fetch(self):
        '''Create a non-blocking client socket, start connecting to Baidu,
        and register which callback to run once the connection is ready.'''
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect(('www.baidu.com', 80))  # non-blocking connect; the event watch is registered below
        except BlockingIOError:
            pass
        selector.register(client, selectors.EVENT_WRITE, self.send_request)

    def send_request(self, client):
        '''The connection is ready: send the request to Baidu and register what to do when Baidu answers.'''
        selector.unregister(client)  # stop watching the write event so another event can be registered on this socket
        request = 'GET {} HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n'.format('/s?wd={}'.format(self.wd))  # build the HTTP request
        client.send(request.encode())
        selector.register(client, selectors.EVENT_READ, self.get_response)  # call get_response when Baidu replies

    def get_response(self, client):
        '''Called every time data arrives, so no while loop is needed here.'''
        global flag
        global times
        data = client.recv(1024)  # at most 1024 bytes per call; larger responses arrive over several events
        if data:
            self.response += data  # append to the buffer
        else:  # an empty read means the response is complete
            # print(self.response.decode())
            client.close()
            selector.unregister(client)
            times -= 1  # one more request finished
            if times == 0:  # all 5 requests done, stop the event loop
                flag = False

def loop():
    '''Event loop: wait for ready events and dispatch their callbacks.'''
    while flag:
        events = selector.select()
        for key, mask in events:
            callback = key.data
            callback(key.fileobj)

if __name__ == '__main__':
    start_time = time.time()
    search_list = ['python', 'java', 'C++', 'Ruby', 'Go']
    for item in search_list:
        crawler = Crawler(item)
        crawler.fetch()
    loop()
    print('Time for 5 requests to Baidu: {}'.format(round(time.time() - start_time, 2)))

Result over several runs:

Time for 5 requests to Baidu: 1.17
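
A side note on selectors.DefaultSelector: it wraps the best IO-multiplexing primitive the platform offers (epoll on Linux, kqueue on BSD/macOS, plain select elsewhere), so the same code runs everywhere. You can check which one you actually get like this:

import selectors
print(selectors.DefaultSelector().__class__.__name__)  # e.g. EpollSelector on Linux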

Feel free to increase the number of requests and run the test a few more times!

Overall, coroutines and multithreading take the least time and are the best fit for crawlers.
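
If you do scale the test up to hundreds of keywords, it is worth capping the concurrency so you don't open too many connections at once. A minimal sketch (assuming the gevent version above, with blocking() already defined) using gevent.pool.Pool:

from gevent.pool import Pool

def blocking_way_pooled(keywords):
    pool = Pool(100)  # run at most 100 greenlets (connections) at the same time
    for wd in keywords:
        pool.spawn(blocking, wd)
    pool.join()  # wait for every greenlet to finish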
