High-Performance Crawling


How can we run multiple tasks at the same time, and still do it efficiently?

Serial implementation

This is the least efficient approach and the least advisable.

import requests

urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

for url in urls:
    response = requests.get(url)
    print(response)
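To see why serial fetching is slow: the waiting adds up, because each request blocks until the previous one finishes. A small sketch (not from the original post) that simulates the network wait with time.sleep; the 0.1 s delay is made up for illustration:

```python
import time

def fake_fetch(delay):
    # Stand-in for requests.get(): the caller just blocks for `delay` seconds.
    time.sleep(delay)

start = time.perf_counter()
for _ in range(3):          # three "requests", run one after another
    fake_fetch(0.1)
elapsed = time.perf_counter() - start

print(round(elapsed, 1))    # total is the sum of the waits: about 0.3 s
```

With real requests the delays are network round-trips, but the arithmetic is the same: serial time is the sum of every request's latency.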

Multithreading

Multithreading suffers from low thread utilization: each thread spends most of its time blocked waiting on network I/O.

import requests
import threading


urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

def task(url):
    response = requests.get(url)
    print(response)

threads = []
for url in urls:
    t = threading.Thread(target=task, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()    # wait for every thread to finish
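Spawning one thread per URL does not scale to large URL lists. A common alternative (a sketch, not from the original post) is a bounded pool from concurrent.futures, which reuses a fixed number of worker threads; urllib is used here so the example is self-contained, and network failures are mapped to None:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://cn.bing.com/',
]

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except OSError:
        return None  # treat timeouts / connection errors as None for this sketch

# At most 3 worker threads, no matter how long the URL list grows.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))

print(results)
```

pool.map preserves input order, so results[i] corresponds to urls[i].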

Coroutines + I/O switching

gevent calls greenlet internally (which implements coroutines).

Coroutines consume far fewer resources than threads.

from gevent import monkey; monkey.patch_all()
import gevent
import requests


def func(url):
    response = requests.get(url)
    print(response)

urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]
spawn_list = []
for url in urls:
    spawn_list.append(gevent.spawn(func, url))    # create a coroutine for each URL

gevent.joinall(spawn_list)    # wait for all coroutines to finish
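The same coroutine idea is available in the standard library via asyncio. A minimal sketch (not from the original post) that simulates the network wait with asyncio.sleep; the URLs and the 0.1 s delay are made up for illustration:

```python
import asyncio
import time

async def fake_fetch(url, delay):
    # Simulate network I/O; while this coroutine sleeps, the event loop
    # switches to the other coroutines instead of blocking.
    await asyncio.sleep(delay)
    return url

async def main():
    urls = ['http://a.example/', 'http://b.example/', 'http://c.example/']
    tasks = [fake_fetch(u, 0.1) for u in urls]
    return await asyncio.gather(*tasks)   # run all three concurrently

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)
print(round(elapsed, 1))    # the waits overlap: about 0.1 s, not 0.3 s
```

As with gevent, the win comes from switching away during I/O waits, not from running Python code in parallel.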

Event loop

Twisted: an asynchronous, non-blocking framework built on an event loop. (Note: getPage is deprecated in recent Twisted releases; twisted.web.client.Agent is the suggested replacement.)

from twisted.web.client import getPage, defer
from twisted.internet import reactor

def stop_loop(arg):
    reactor.stop()


def get_response(contents):
    print(contents)

deferred_list = []

url_list = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8')) # build the task; the request has not actually run yet
    deferred.addCallback(get_response)    # callback to invoke with the response body
    deferred_list.append(deferred) # collect every task into one list


dlist = defer.DeferredList(deferred_list)    # fires once every task in the list has completed
dlist.addBoth(stop_loop)    # when all tasks are done, stop the event loop

reactor.run()
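Twisted aside, the event-loop idea itself fits in a few lines: a scheduler keeps a queue of tasks and resumes each one when its pending I/O "completes". A toy sketch (not from the original post) where generators stand in for Deferreds and no real I/O happens:

```python
from collections import deque

def task(name, steps):
    # Each yield represents "waiting on I/O"; the scheduler regains control.
    for i in range(steps):
        yield f'{name}: step {i}'

def run(tasks):
    queue = deque(tasks)
    log = []
    while queue:
        t = queue.popleft()
        try:
            log.append(next(t))   # resume the task until its next "wait"
            queue.append(t)       # re-schedule it for its next turn
        except StopIteration:
            pass                  # task finished; drop it
    return log

log = run([task('a', 2), task('b', 2)])
print(log)  # tasks interleave: ['a: step 0', 'b: step 0', 'a: step 1', 'b: step 1']
```

Twisted's reactor is this loop grown up: instead of round-robin turns, it uses OS readiness notifications to decide which task to resume, and callbacks (Deferreds) instead of generators.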