Here's the story: I was building a Douban lottery mini-program, and I needed to scrape the user info of everyone who had reshared a Douban broadcast, then draw the lucky winners from that pool.

A textbook IO-bound task.

First, the crawler has to open the broadcast and work out how many pages of reshares there are. Roughly like this:
```python
from bs4 import BeautifulSoup

def get_pages(url):
    r = fetch(url)
    soup = BeautifulSoup(r, 'lxml')
    page_num = soup.find("div", class_="paginator").find_all("a")[-2].string
    return page_num
```
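One detail worth noting: .string returns the page count as text, which is why the later snippets wrap page_num in int(). A minimal usage sketch, where broadcast_url is a placeholder of mine rather than a URL from the original post:

```python
# broadcast_url is a hypothetical placeholder, not a URL from the post.
broadcast_url = "https://www.douban.com/people/someone/status/123456789/"
page_num = get_pages(broadcast_url)   # e.g. the string "7"
total_pages = int(page_num)           # converted before building the URL list
```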
Then it parses the user info on each page in turn:
```python
def parse_one(r):
    human = list()
    soup = BeautifulSoup(r, 'lxml')
    try:
        content = soup.find_all("li", class_="list-item")
    except:
        logging.error("there is no info.")
        raise
    # .....
    return human
```
Pretty simple, right?

At first I used the requests library to fetch each page one after another. I finished the code, hit run, waited and waited, and the crawl finally finished. The console printed:

Time taken: 106.168...

My heart sank: if a user had to wait this long to run a draw, it would be far too slow to be usable.
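For reference, the sequential version looked roughly like this. This is a sketch of my own rather than the original code: it assumes a plain requests-based get_html helper in place of the async fetch shown later, and wraps the loop with time.time() for the timing above.

```python
import time
import requests

def get_html(page_url):
    # Plain synchronous fetch; headers omitted here for brevity.
    return requests.get(page_url).text

start = time.time()
url_list = [url + '?start={}&tab=reshare#reshare'.format(i * 20)
            for i in range(0, int(page_num))]
human = []
for page_url in url_list:
    human.extend(parse_one(get_html(page_url)))
print("Time taken:", time.time() - start)
```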
So I decided to spawn a few more processes with multiprocessing.Pool() and see how that fared:
```python
import multiprocessing

pool = multiprocessing.Pool(5)
url_list = [url + '?start={}&tab=reshare#reshare'.format(i * 20)
            for i in range(0, int(page_num))]
with pool as p:
    res = p.map(parse_one, url_list)
```
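One caveat about this snippet: Pool.map hands each worker a URL, while parse_one as defined above expects HTML text, so the function being mapped really needs to fetch the page first. A minimal sketch of such a worker, where fetch_and_parse is my own name and not from the original post:

```python
import requests

def fetch_and_parse(page_url):
    # Hypothetical worker: fetch one reshare page, then reuse parse_one on the HTML.
    html = requests.get(page_url).text
    return parse_one(html)

# res = p.map(fetch_and_parse, url_list) then returns one list per page,
# which can be flattened into a single list of users.
```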
Time taken: 26.168740749359131
Not bad at all. Quite a bit faster than the sequential crawl, but still a little slow.

Then I decided to bring in coroutines.

First, the page content obviously has to be fetched asynchronously:
```python
import aiohttp

async def fetch(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        # ......
    }
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as res:
            result = await res.text()
            return result
```
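The per-page coroutine that gets wrapped into tasks below is called reshare, but its body isn't shown in the post. A minimal sketch, assuming it simply awaits fetch and hands the HTML to parse_one:

```python
async def reshare(url):
    # Hypothetical reconstruction of the missing coroutine:
    # fetch one reshare page asynchronously, then parse the users out of it.
    html = await fetch(url)
    return parse_one(html)
```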
Then wrap every request in a task:
```python
import asyncio

url_list = [redict_url + '?start={}&tab=reshare#reshare'.format(i * 20)
            for i in range(0, int(page_num))]
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(reshare(url)) for url in url_list]
loop.run_until_complete(asyncio.wait(tasks))
for task in tasks:
    human.extend(task.result())
```
Now for the moment of truth!

(Except it wasn't: Douban banned me. Life is hard.)

Fine, proxy IPs it is. I wrote another crawler to scrape a few hundred proxy IP addresses, and picked one at random for each request:
```python
import random

def random_proxy():
    # Pick a random proxy host from the scraped list.
    host = random.choice(hosts)
    proxy = "http://{}".format(host)
    return proxy
```
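The post doesn't show how the proxy gets plugged into the requests; with aiohttp, each request can take a proxy argument, so the wiring might look like this (my own sketch, not from the original):

```python
import aiohttp

async def fetch(url):
    headers = {"User-Agent": "..."}  # same headers as in the earlier fetch
    async with aiohttp.ClientSession(headers=headers) as session:
        # aiohttp accepts an "http://host:port" proxy URL per request.
        async with session.get(url, proxy=random_proxy()) as res:
            return await res.text()
```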
Now let's see the result:

Time taken: 2.2529048919677734
Amazing!!!
For more on threads, processes, and coroutines, see my earlier article:

Also, my Douban lottery mini-program (豆醬) is now live. You're welcome to try it out and report bugs.