04: Crawler Performance

1.1 Common ways to implement concurrency

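  I. Introduction

      1. When writing a crawler, most of the time is spent on I/O requests. In single-process, single-thread mode every URL request blocks until the response arrives, so the run as a whole is slow.

      2. Processes: starting processes is very expensive in resources.

      3. Threads: many threads are needed, and a thread that is blocked on I/O cannot do any other work.

      4. Coroutines: gevent starts only one thread. Once a request has been sent, gevent does not wait on it; there is only ever one thread working, and whichever response comes back first is handled first.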
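The cost of blocking described in the list above can be seen without touching the network. The following is a minimal sketch, assuming that time.sleep is an acceptable stand-in for a slow HTTP request; fake_request, run_sequential and run_threaded are hypothetical helper names, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(url, delay=0.2):
    """Stand-in for a blocking HTTP request: sleeps instead of using the network."""
    time.sleep(delay)
    return url


def run_sequential(urls):
    # One request at a time: total time is roughly len(urls) * delay.
    start = time.perf_counter()
    results = [fake_request(u) for u in urls]
    return results, time.perf_counter() - start


def run_threaded(urls, workers=5):
    # The waits overlap, so total time is roughly one delay per batch of workers.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_request, urls))
    return results, time.perf_counter() - start


if __name__ == '__main__':
    urls = ['http://example.com/page/%d' % i for i in range(5)]
    _, seq = run_sequential(urls)
    _, thr = run_threaded(urls)
    print('sequential: %.2fs, thread pool: %.2fs' % (seq, thr))
```

The exact timings depend on the machine, but the sequential run should take about as long as all the delays added together, while the threaded run should take about as long as a single delay.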

  II. Comparing several ways to implement concurrency

    1) Concurrency with a thread pool

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_request(url):
    result = requests.get(url)
    print(result.content)

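pool = ThreadPoolExecutor(10)       # create a thread pool with at most 10 threads
url_list = [
    'http://www.google.com',        # requests needs the scheme; a bare host raises MissingSchema
    'http://www.baidu.com',
]

for url in url_list:
    # take a thread from the pool
    # and have it run fetch_request
    pool.submit(fetch_request, url)

pool.shutdown(True)     # block until every submitted task has finished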
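
An exception raised inside a task submitted with pool.submit is stored on the returned Future rather than printed, so a failing request (for example a URL with no scheme) can pass unnoticed. Here is a minimal sketch of collecting both results and errors with as_completed. It assumes a hypothetical fetch function standing in for requests.get so that it runs without network access; fetch_all is also a made-up name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    """Hypothetical stand-in for requests.get: rejects URLs without a scheme."""
    if not url.startswith(('http://', 'https://')):
        raise ValueError('missing scheme: %r' % url)
    return 'body of ' + url


def fetch_all(urls, workers=10):
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                ok[url] = future.result()   # re-raises any exception from the worker
            except Exception as exc:
                failed[url] = exc
    return ok, failed


if __name__ == '__main__':
    ok, failed = fetch_all(['www.google.com', 'http://www.baidu.com'])
    print('ok:', ok)
    print('failed:', failed)
```

Calling future.result() is what brings the error back to the main thread; without it the failure stays hidden inside the Future.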

    2) Concurrency with a process pool

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ProcessPoolExecutor

def fetch_request(url):
    result = requests.get(url)
    print(result.text)

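url_list = [
    'http://www.google.com',        # requests needs the scheme; a bare host raises MissingSchema
    'http://www.bing.com',
]

if __name__ == '__main__':
    pool = ProcessPoolExecutor(10)  # process pool with at most 10 worker processes
    # Drawback: starting processes is expensive in resources
    for url in url_list:
        # take a process from the pool
        # and have it run fetch_request
        pool.submit(fetch_request, url)
    pool.shutdown(True)             # block until every submitted task has finished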

    3) Multithreading with a callback function

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor
import requests

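def fetch_async(url):
    # runs in a worker thread; the return value is stored on the future
    response = requests.get(url)
    return response

def callback(future):
    # called once the task has finished; future.result() is the response
    print(future.result().content)

if __name__ == '__main__':
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ThreadPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)   # register the callback on the future
    pool.shutdown(wait=True)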
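
A done-callback receives only the future, so it does not know which URL the result belongs to. One way to pass that context along is functools.partial. This is a minimal sketch, assuming a hypothetical fake_fetch in place of requests.get so that it runs offline; on_done and crawl are also made-up names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url); returns a fake body length."""
    return len(url)


def on_done(url, results, future):
    # partial() pre-fills url and results; the executor supplies the future.
    results[url] = future.result()


def crawl(urls):
    results = {}
    with ThreadPoolExecutor(max_workers=5) as pool:
        for url in urls:
            future = pool.submit(fake_fetch, url)
            future.add_done_callback(partial(on_done, url, results))
    # Leaving the with-block waits for all tasks, and so for their callbacks.
    return results


if __name__ == '__main__':
    print(crawl(['http://www.github.com', 'http://www.bing.com']))
```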

    4) Coroutines: asynchronous requests with micro-threads

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from gevent import monkey
monkey.patch_all()      # patch the standard library before requests is imported

import gevent
import requests

# whichever request comes back first is handled first
def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


if __name__ == '__main__':
    ##### send the requests #####
    gevent.joinall([
        gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
    ])
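
The "whoever returns first is handled first" behaviour can also be shown without gevent or a network. Note that this sketch swaps in asyncio from the standard library rather than gevent, and uses asyncio.sleep as a stand-in for a request; fake_fetch and crawl are hypothetical names. Everything runs in a single thread.

```python
import asyncio


async def fake_fetch(name, delay):
    """Hypothetical stand-in for a non-blocking request."""
    await asyncio.sleep(delay)   # yields control to the event loop while "waiting"
    return name


async def crawl(jobs):
    finished = []
    # Start all the coroutines at once; none of them blocks the others.
    tasks = [asyncio.create_task(fake_fetch(name, delay)) for name, delay in jobs]
    # as_completed yields results in completion order, not submission order.
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


if __name__ == '__main__':
    jobs = [('slow', 0.3), ('fast', 0.1), ('medium', 0.2)]
    print(asyncio.run(crawl(jobs)))   # ['fast', 'medium', 'slow']
```

The jobs are submitted as slow, fast, medium, but the results arrive ordered by how long each one waited.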