N Ways to Write a Python Crawler

Where the Problem Came From

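A few days ago, a reader of my WeChat official account (Python爬蟲及算法) asked me how to use a crawler to meet the following requirement. The page to be crawled is shown below (URL: https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0):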

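The requirement is to crawl the names and descriptions of the people listed in the red box (there are 500 records; the screenshot shows only some of them). The description of each person appears on the page you reach by clicking that person's name, as shown below: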

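This means we need to crawl 500 such pages, that is, send 500 HTTP requests (let us assume so for now), and then extract the name and description from each page. Some entries are not people and some have no description; we can skip those. The URL of each page can be found next to the corresponding name on the first page. For example, the page for George Washington has the suffix Q23.
That is roughly what the crawler has to do.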
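The step of turning list entries into page URLs can be sketched offline. The following is a minimal sketch using only the standard library; `SAMPLE_HTML` is a hand-written stand-in for the real list (its exact markup is an assumption), and `FirstLinkPerItem` and `entity_urls` are hypothetical helper names, not part of the original crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "https://www.wikidata.org"

# Hypothetical, hand-written stand-in for the list on the first page.
SAMPLE_HTML = """
<ul id="mw-whatlinkshere-list">
  <li><a href="/wiki/Q23" title="Q23">George Washington</a> (Q23)</li>
  <li><a href="/wiki/Q42" title="Q42">Douglas Adams</a> (Q42)</li>
</ul>
"""


class FirstLinkPerItem(HTMLParser):
    """Collect the href of the first <a> inside each <li>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._need_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new list item starts: its first link is the one we want.
            self._need_link = True
        elif tag == "a" and self._need_link:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
                self._need_link = False


def entity_urls(html):
    """Return absolute URLs for the first link of every list item."""
    parser = FirstLinkPerItem()
    parser.feed(html)
    return [urljoin(BASE, href) for href in parser.hrefs]


urls = entity_urls(SAMPLE_HTML)
print(urls)
```

Using `urljoin` instead of string concatenation keeps the result correct whether the href is a relative path or already an absolute URL.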

N Ways to Write the Crawler

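First, the overall approach: get the URLs of the 500 people from the first page (https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0), then crawl each of those 500 pages for the name and description, skipping any page that has no description.
Next, we present four ways to implement this crawler and analyze the strengths and weaknesses of each, in the hope of giving readers a better feel for crawling. The four methods are: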

General method (synchronous, requests + BeautifulSoup)
Concurrency (the concurrent.futures module plus requests + BeautifulSoup)
Asynchronous (aiohttp + asyncio + requests + BeautifulSoup)
The Scrapy framework

General Method

The general method is the synchronous one: it uses requests + BeautifulSoup and processes the pages one after another. The complete Python code is as follows:

    import requests
    from bs4 import BeautifulSoup
    import time

    # Start time
    t1 = time.time()
    print('#' * 50)

    url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
    # Request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    # Send the HTTP request
    req = requests.get(url, headers=headers)
    # Parse the page
    soup = BeautifulSoup(req.text, "lxml")
    # Find the list items that link to the name and description pages
    human_list = soup.find(id='mw-whatlinkshere-list')('li')

    urls = []
    # Collect the URLs
    for human in human_list:
        url = human.find('a')['href']
        urls.append('https://www.wikidata.org'+url)

    # Get the name and description from one page
    def parser(url):
        req = requests.get(url)
        # Parse the response text as HTML with BeautifulSoup
        soup = BeautifulSoup(req.text, "lxml")
        # Extract the name and description
        name = soup.find('span', class_="wikibase-title-label")
        desc = soup.find('span', class_="wikibase-descriptionview-text")
        # Skip pages that lack either field
        if name is not None and desc is not None:
            print('%-40s,\t%s'%(name.text, desc.text))

    for url in urls:
        parser(url)

    t2 = time.time() # End time
    print('General method, total time: %s' % (t2 - t1))
    print('#' * 50)

The output is as follows (the middle part is omitted and replaced by ......):

    ##################################################
    George Washington , first President of the United States
    Douglas Adams , British author and humorist (1952–2001)
    ......
    Willoughby Newton , Politician from Virginia, USA
    Mack Wilberg , American conductor
    General method, total time: 724.9654655456543
    ##################################################

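With the synchronous method, the total time is about 725 seconds, a little over 12 minutes.
The general method is simple to reason about and easy to implement, but it is inefficient and slow. So let us try concurrency.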

Concurrent Method

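The concurrent method speeds up the general method with multiple threads. We use the concurrent.futures module and set the number of threads to 20 (the machine may not actually reach that number). The complete Python code is as follows: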

    import requests
    from bs4 import BeautifulSoup
    import time
    from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

    # Start time
    t1 = time.time()
    print('#' * 50)

    url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
    # Request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    # Send the HTTP request
    req = requests.get(url, headers=headers)
    # Parse the page
    soup = BeautifulSoup(req.text, "lxml")
    # Find the list items that link to the name and description pages
    human_list = soup.find(id='mw-whatlinkshere-list')('li')

    urls = []
    # Collect the URLs
    for human in human_list:
        url = human.find('a')['href']
        urls.append('https://www.wikidata.org'+url)

    # Get the name and description from one page
    def parser(url):
        req = requests.get(url)
        # Parse the response text as HTML with BeautifulSoup
        soup = BeautifulSoup(req.text, "lxml")
        # Extract the name and description
        name = soup.find('span', class_="wikibase-title-label")
        desc = soup.find('span', class_="wikibase-descriptionview-text")
        # Skip pages that lack either field
        if name is not None and desc is not None:
            print('%-40s,\t%s'%(name.text, desc.text))

    # Speed up the crawl with a thread pool
    executor = ThreadPoolExecutor(max_workers=20)
    # submit() takes the function first, followed by its arguments (any number of them)
    future_tasks = [executor.submit(parser, url) for url in urls]
    # Wait for all threads to finish before continuing
    wait(future_tasks, return_when=ALL_COMPLETED)

    t2 = time.time() # End time
    print('Concurrent method, total time: %s' % (t2 - t1))
    print('#' * 50)

The output is as follows (the middle part is omitted and replaced by ......):

    ##################################################
    Larry Sanger , American former professor, co-founder of Wikipedia, founder of Citizendium and other projects
    Ken Jennings , American game show contestant and writer
    ......
    Antoine de Saint-Exupery , French writer and aviator
    Michael Jackson , American singer, songwriter and dancer
    Concurrent method, total time: 226.7499692440033
    ##################################################

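With multithreaded concurrency, the crawler runs in about 227 seconds, roughly one third of the time of the general method, which is a clear improvement. The trade-offs are that pages are no longer processed in order and that switching between threads has a cost: the more threads, the greater the overhead.
For another speed comparison between multithreading and the general method, see the article "Python爬蟲之多線程下載豆瓣Top250電影圖片" (downloading the Douban Top 250 movie posters with a multithreaded Python crawler).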
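The ordering behaviour described above can be tried without any network access. The following is a minimal sketch in which `fake_fetch` and the `jobs` list are hypothetical stand-ins that simulate request latency with `time.sleep`. It contrasts `submit()` with `as_completed()`, which hands back results as they finish, against `executor.map()`, which returns results in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(item_id, delay):
    """Stand-in for an HTTP request: sleep, then return a label."""
    time.sleep(delay)
    return "Q%d" % item_id


# Hypothetical workload: earlier items take longer than later ones.
jobs = [(1, 0.3), (2, 0.2), (3, 0.1)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() + as_completed(): results arrive in completion order.
    futures = [executor.submit(fake_fetch, i, d) for i, d in jobs]
    completion_order = [f.result() for f in as_completed(futures)]

    # map(): results come back in the order of the inputs.
    input_order = list(executor.map(lambda job: fake_fetch(*job), jobs))
elapsed = time.perf_counter() - start

print(completion_order)
print(input_order)
print("elapsed: %.2fs" % elapsed)
```

If the output order matters, `executor.map()` is one way to keep it without giving up the parallel waiting. The elapsed time should be well below the 1.2 seconds that running both rounds sequentially would take.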

Asynchronous Method

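Asynchronous programming is an effective way to speed up a crawler. aiohttp handles HTTP requests asynchronously and asyncio provides asynchronous I/O. Note that aiohttp only supports Python 3.5.3 and later. The complete Python code for the asynchronous version of the crawler is as follows: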

    import requests
    from bs4 import BeautifulSoup
    import time
    import aiohttp
    import asyncio

    # Start time
    t1 = time.time()
    print('#' * 50)

    url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
    # Request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    # Send the HTTP request (the list page is still fetched synchronously)
    req = requests.get(url, headers=headers)
    # Parse the page
    soup = BeautifulSoup(req.text, "lxml")
    # Find the list items that link to the name and description pages
    human_list = soup.find(id='mw-whatlinkshere-list')('li')

    urls = []
    # Collect the URLs
    for human in human_list:
        url = human.find('a')['href']
        urls.append('https://www.wikidata.org'+url)

    # Asynchronous HTTP request
    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.text()

    # Parse one page
    async def parser(html):
        # Parse the response text as HTML with BeautifulSoup
        soup = BeautifulSoup(html, "lxml")
        # Extract the name and description
        name = soup.find('span', class_="wikibase-title-label")
        desc = soup.find('span', class_="wikibase-descriptionview-text")
        # Skip pages that lack either field
        if name is not None and desc is not None:
            print('%-40s,\t%s'%(name.text, desc.text))

    # Download one page and extract its name and description
    async def download(url):
        async with aiohttp.ClientSession() as session:
            try:
                html = await fetch(session, url)
                await parser(html)
            except Exception as err:
                print(err)

    # Run all downloads concurrently on the asyncio event loop
    async def main():
        await asyncio.gather(*(download(url) for url in urls))

    # asyncio.run() requires Python 3.7+
    asyncio.run(main())

    t2 = time.time() # End time
    print('Asynchronous method, total time: %s' % (t2 - t1))
    print('#' * 50)

The output is as follows (the middle part is omitted and replaced by ......):

    ##################################################
    Frédéric Taddeï , French journalist and TV host
    Gabriel Gonzáles Videla , Chilean politician
    ......
    Denmark , sovereign state and Scandinavian country in northern Europe
    Usain Bolt , Jamaican sprinter and soccer player
    Asynchronous method, total time: 126.9002583026886
    ##################################################

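The asynchronous method combines asynchronous I/O with concurrency, so it is naturally much faster: it takes about one sixth of the time of the general method. It is efficient, but it requires knowing asynchronous programming, which takes some time to learn.
For another speed comparison between the asynchronous method and the general method, see the article "利用aiohttp實現異步爬蟲" (implementing an asynchronous crawler with aiohttp).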
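The reason asynchronous I/O helps is that many requests can wait at the same time on one thread. The following is a minimal offline sketch of that idea: `fake_fetch` is a hypothetical stand-in that replaces a real aiohttp request with `asyncio.sleep`, and the `asyncio.Semaphore` cap on in-flight requests is an addition for illustration that the crawler in this article does not use.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Stand-in for an aiohttp request: yields to the event loop while waiting."""
    await asyncio.sleep(delay)
    return "<html>%s</html>" % url


async def bounded_fetch(semaphore, url):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        return await fake_fetch(url)


async def crawl(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded_fetch(semaphore, u) for u in urls))


urls = ["Q%d" % i for i in range(10)]  # hypothetical placeholders, not real requests
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(len(pages), "pages in %.2fs" % elapsed)
```

Ten simulated requests of 0.2 seconds each would take about 2 seconds in sequence; with at most five in flight they complete in roughly two waves. A cap like this is also a simple way to avoid overloading the target site.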
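If 127 seconds still feels slow, try the following asynchronous code. The only difference from the previous version is that it uses regular expressions instead of BeautifulSoup to extract the content from each page: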

    import requests
    from bs4 import BeautifulSoup
    import time
    import aiohttp
    import asyncio
    import re

    # Start time
    t1 = time.time()
    print('#' * 50)

    url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    # Send the HTTP request (the list page is still fetched synchronously)
    req = requests.get(url, headers=headers)
    # Parse the page
    soup = BeautifulSoup(req.text, "lxml")
    # Find the list items that link to the name and description pages
    human_list = soup.find(id='mw-whatlinkshere-list')('li')

    urls = []
    # Collect the URLs
    for human in human_list:
        url = human.find('a')['href']
        urls.append('https://www.wikidata.org' + url)

    # Asynchronous HTTP request
    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.text()

    # Parse one page
    async def parser(html):
        # Extract the fields with regular expressions
        try:
            name = re.findall(r'<span class="wikibase-title-label">(.+?)</span>', html)[0]
            desc = re.findall(r'<span class="wikibase-descriptionview-text">(.+?)</span>', html)[0]
            print('%-40s,\t%s' % (name, desc))
        except IndexError:
            # No match for name or description: skip this page
            pass

    # Download one page and extract its name and description
    async def download(url):
        async with aiohttp.ClientSession() as session:
            try:
                html = await fetch(session, url)
                await parser(html)
            except Exception as err:
                print(err)

    # Run all downloads concurrently on the asyncio event loop
    async def main():
        await asyncio.gather(*(download(url) for url in urls))

    # asyncio.run() requires Python 3.7+
    asyncio.run(main())

    t2 = time.time() # End time
    print('Asynchronous method (regular expressions), total time: %s' % (t2 - t1))
    print('#' * 50)

The output is as follows (the middle part is omitted and replaced by ......):

    ##################################################
    Dejen Gebremeskel , Ethiopian long-distance runner
    Erik Kynard , American high jumper
    ......
    Buzz Aldrin , American astronaut
    Egon Krenz , former General Secretary of the Socialist Unity Party of East Germany
    Asynchronous method (regular expressions), total time: 16.521944999694824
    ##################################################

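16.5 seconds is only about 1/44 of the time of the general method, which is astonishingly fast (thanks to the person who contributed this attempt). My own asynchronous implementation parsed the pages with BeautifulSoup and took 127 seconds; I did not expect regular expressions to make such a dramatic difference. So although BeautifulSoup parses pages quickly, in the asynchronous method it still limits the speed. The drawback of this approach is that when the content you need to extract is more complex, simple regular expressions are no longer up to the job and you have to find another way.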
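The limitation of regular expressions can be seen on a tiny example. The following is a minimal sketch in which both HTML snippets are hypothetical (the second merely adds an attribute to the span) and `TitleLabel`, `by_regex` and `by_parser` are illustrative helper names. The regular expression is the one used in the code above; the parser is the standard library's `html.parser`, used here as a stand-in for BeautifulSoup.

```python
import re
from html.parser import HTMLParser

NAME_RE = re.compile(r'<span class="wikibase-title-label">(.+?)</span>')

# Two hypothetical snippets: the second adds an attribute before the closing '>'.
plain = '<span class="wikibase-title-label">Buzz Aldrin</span>'
extra_attr = '<span class="wikibase-title-label" lang="en">Buzz Aldrin</span>'


class TitleLabel(HTMLParser):
    """Grab the text of the first span whose class list contains the target class."""

    def __init__(self):
        super().__init__()
        self.text = None
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "wikibase-title-label" in classes and self.text is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.text = data
            self._inside = False


def by_regex(html):
    """Return the name matched by the fixed pattern, or None."""
    match = NAME_RE.search(html)
    return match.group(1) if match else None


def by_parser(html):
    """Return the name found by walking the tags, or None."""
    parser = TitleLabel()
    parser.feed(html)
    return parser.text


for snippet in (plain, extra_attr):
    print(by_regex(snippet), "|", by_parser(snippet))
```

The fixed pattern finds the name only when the tag is written exactly as expected, while the tag-aware parser tolerates the extra attribute. That robustness is what the slower parsing buys.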

The Scrapy Framework

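Finally, we solve the same problem with Scrapy, the well-known Python crawling framework. The project we create is called wikiDataScrapy, with the following structure: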

In settings.py, set "ROBOTSTXT_OBEY = False". Then edit items.py as follows:

    # -*- coding: utf-8 -*-
    import scrapy

    class WikidatascrapyItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        desc = scrapy.Field()

Then create wikiSpider.py in the spiders folder, with the following code:

    import scrapy.cmdline
    from wikiDataScrapy.items import WikidatascrapyItem
    import requests
    from bs4 import BeautifulSoup

    # Get the 500 URLs to request, using requests + BeautifulSoup
    def get_urls():
        url = "http://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Q5&limit=500&from=0"
        # Request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
        # Send the HTTP request
        req = requests.get(url, headers=headers)
        # Parse the page
        soup = BeautifulSoup(req.text, "lxml")
        # Find the list items that link to the name and description pages
        human_list = soup.find(id='mw-whatlinkshere-list')('li')

        urls = []
        # Collect the URLs
        for human in human_list:
            url = human.find('a')['href']
            urls.append('https://www.wikidata.org' + url)

        # print(urls)
        return urls

    # Crawl with the Scrapy framework
    class bookSpider(scrapy.Spider):
        name = 'wikiScrapy' # Spider name
        start_urls = get_urls() # The 500 URLs to crawl

        def parse(self, response):
            item = WikidatascrapyItem()
            # name and description
            item['name'] = response.css('span.wikibase-title-label').xpath('text()').extract_first()
            item['desc'] = response.css('span.wikibase-descriptionview-text').xpath('text()').extract_first()
            yield item

    # Run the spider and export the items to a CSV file.
    # The guard keeps this from running again when Scrapy imports the module;
    # the output format is inferred from the .csv extension.
    if __name__ == '__main__':
        scrapy.cmdline.execute(['scrapy', 'crawl', 'wikiScrapy', '-o', 'wiki.csv'])

The output is as follows (only the final Scrapy statistics summary is shown):

    {'downloader/request_bytes': 166187,
     'downloader/request_count': 500,
     'downloader/request_method_count/GET': 500,
     'downloader/response_bytes': 18988798,
     'downloader/response_count': 500,
     'downloader/response_status_count/200': 500,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 10, 16, 9, 49, 15, 761487),
     'item_scraped_count': 500,
     'log_count/DEBUG': 1001,
     'log_count/INFO': 8,
     'response_received_count': 500,
     'scheduler/dequeued': 500,
     'scheduler/dequeued/memory': 500,
     'scheduler/enqueued': 500,
     'scheduler/enqueued/memory': 500,
     'start_time': datetime.datetime(2018, 10, 16, 9, 48, 44, 58673)}

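As the statistics show, all 500 pages were crawled successfully in 31 seconds, which is also quite a respectable speed. Now let us look at the generated wiki.csv file, which contains all the extracted names and descriptions, as shown below: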

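As you can see, the columns of the output CSV file are not in a fixed order. For how to fix the blank-row problem in Scrapy's CSV output, see this answer on Stack Overflow: https://stackoverflow.com/questions/39477662/scrapy-csv-file-has-uniform-empty-rows/43394566#43394566 .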
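Both issues can also be handled when writing the CSV by hand. The following is a minimal sketch with the standard `csv` module: the `items` list is hypothetical data and `write_items` is an illustrative helper. It assumes the blank rows come from newline translation, which `newline=""` avoids. Within Scrapy itself, the `FEED_EXPORT_FIELDS` setting is the usual way to fix the column order.

```python
import csv
import io

# Hypothetical scraped items; the second one has no description.
items = [
    {"desc": "first President of the United States", "name": "George Washington"},
    {"desc": None, "name": "Q-item without description"},
    {"desc": "American conductor", "name": "Mack Wilberg"},
]


def write_items(items, stream):
    """Write name/desc rows in a fixed column order, skipping incomplete items."""
    writer = csv.DictWriter(stream, fieldnames=["name", "desc"])
    writer.writeheader()
    for item in items:
        if item["name"] and item["desc"]:
            writer.writerow(item)


# newline="" leaves line endings to the csv module; with a real file this would be
# open("wiki.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
write_items(items, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `fieldnames` explicitly is what pins the column order, regardless of the key order in each item.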

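The advantage of building a crawler with Scrapy is that it is a mature framework that supports asynchronous and concurrent crawling and is fairly fault-tolerant (for example, the code here does nothing to handle pages where name or description cannot be found). However, if you need to modify middleware frequently, you are better off writing your own crawler, and in terms of speed Scrapy did not beat our hand-written asynchronous crawler. Its ability to export a CSV file automatically is genuinely useful, though.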

Summary

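This article covers a lot of ground and compares four ways of writing a crawler. Each has its own strengths and weaknesses, as discussed above. In real problems, a more advanced tool or method is not automatically the better choice; each problem has to be judged on its own terms.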
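The reported timings can be put side by side with a few lines of arithmetic. The following is a minimal sketch in which the numbers are the rounded timings from the runs above; the Scrapy figure is taken from the difference between start_time and finish_time in its statistics, which presumably does not include fetching the list page.

```python
# Rounded timings (seconds) as reported in this article.
timings = {
    "general (requests + BeautifulSoup)": 724.97,
    "threads (concurrent.futures)": 226.75,
    "async (aiohttp + BeautifulSoup)": 126.90,
    "async (aiohttp + regex)": 16.52,
    "Scrapy": 31.70,
}

baseline = timings["general (requests + BeautifulSoup)"]
# Speedup factor of each method relative to the general method.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print("%-36s %7.1fs  %5.1fx" % (name, timings[name], factor))
```

These are single runs against a live site, so the ratios are indicative rather than a benchmark.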
That is all for this article. Thanks for reading!

Note: I now have a WeChat official account, Python爬蟲與算法 (WeChat ID: easy_web_scrape). You are welcome to follow it.

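[Repost] Efficiency comparison of the general method, asynchronous, concurrent and Scrapy crawlers

This text is not original; it is reposted from jclian91. Link: https://www.cnblogs.com/jclian91/p/9799697.html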