<div class="markdown_views"> <p> Below we introduce three ways to scrape data from a web page: first <strong>regular expressions</strong>, then the popular <strong>BeautifulSoup</strong> module, and finally the powerful <strong>lxml</strong> module.</p>
<p><strong>1. Regular Expressions</strong></p>
<p> If you are not yet familiar with regular expressions, or need a refresher, see the <a href="https://docs.python.org/2/howto/regex.html" rel="nofollow" target="_blank">Regular Expression HOWTO</a> for a complete introduction.</p>
<p> When we use regular expressions to scrape the country's area data, the first step is to try matching the contents of the element, as follows:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; import re
&gt;&gt;&gt; import urllib2
&gt;&gt;&gt; url = 'http://example.webscraping.com/view/United-Kingdom-239'
&gt;&gt;&gt; html = urllib2.urlopen(url).read()
&gt;&gt;&gt; re.findall('&lt;td class="w2p_fw"&gt;(.*?)&lt;/td&gt;', html)
['&lt;img src="/places/static/images/flags/gb.png" /&gt;', '244,820 square kilometres', '62,348,447', 'GB', 'United Kingdom', 'London', '&lt;a href="/continent/EU"&gt;EU&lt;/a&gt;', '.uk', 'GBP', 'Pound', '44', '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA', '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$', 'en-GB,cy-GB,gd', '&lt;div&gt;&lt;a href="/iso/IE"&gt;IE &lt;/a&gt;&lt;/div&gt;']
&gt;&gt;&gt;</code></pre>
<p> The results above show that several country attributes use the <code>&lt;td class="w2p_fw"&gt;</code> tag. To isolate the area attribute, we can select just the second of these elements, as follows:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; re.findall('&lt;td class="w2p_fw"&gt;(.*?)&lt;/td&gt;', html)[1]
'244,820 square kilometres'</code></pre>
<p> Although this solution works for now, it is likely to fail if the page changes, for example if the table layout changes and the area data is no longer in the second row. If we only need the data now, we can ignore such possible future changes. But if we want to scrape this data again in the future, we need a more robust solution that is insulated as much as possible from layout changes. To make the regular expression more robust, we can include the parent element <code>&lt;tr&gt;</code> as well. Because that element has an ID attribute, it should be unique.</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; re.findall('&lt;tr id="places_area__row"&gt;&lt;td class="w2p_fl"&gt;&lt;label for="places_area" id="places_area__label"&gt;Area: &lt;/label&gt;&lt;/td&gt;&lt;td class="w2p_fw"&gt;(.*?)&lt;/td&gt;', html)
['244,820 square kilometres']</code></pre>
<p> This iteration looks better, but there are many other ways the page could be updated that would still defeat the regular expression: double quotes changed to single quotes, extra whitespace added between the <code>&lt;td&gt;</code> tags, or a change to the area label. Below is an improved version that tries to support these possibilities.</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; re.findall('&lt;tr id="places_area__row"&gt;.*?&lt;td\s*class=["\']w2p_fw["\']&gt;(.*?)&lt;/td&gt;', html)
['244,820 square kilometres']</code></pre>
<p> Although this regular expression adapts better to future changes, it is hard to construct and poorly readable. There are also small layout changes that would still break it, such as adding a title attribute to the <code>&lt;td&gt;</code> tag. <br> As this example shows, regular expressions give us a quick way to scrape data, but the approach is too brittle and tends to break when the page is updated. Fortunately there are better solutions, which we introduce below.</p>
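<p> One way to keep a complex pattern like the one above readable is to compile it with <code>re.VERBOSE</code>, which allows whitespace and comments inside the pattern. This is a minimal sketch of my own, tested against a small inline snippet standing in for the downloaded page:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import re

# Inline fixture standing in for the downloaded page (assumption: we
# cannot fetch the live site here, so we match against a small sample).
html = ('&lt;tr id="places_area__row"&gt;&lt;td class="w2p_fl"&gt;Area: &lt;/td&gt;'
        "&lt;td  class='w2p_fw'&gt;244,820 square kilometres&lt;/td&gt;&lt;/tr&gt;")

# re.VERBOSE lets us split the fragile pattern across lines and comment it.
AREA_RE = re.compile(r"""
    &lt;tr\s+id="places_area__row"&gt;   # anchor on the unique row ID
    .*?                            # skip the label cell
    &lt;td\s*class=["']w2p_fw["']&gt;    # value cell, single or double quotes
    (.*?)                          # the area text we want
    &lt;/td&gt;
    """, re.VERBOSE | re.DOTALL)

print(AREA_RE.search(html).group(1))  # 244,820 square kilometres</code></pre>
<p> The commented pattern is exactly as brittle as the compact one, but much easier to maintain when the page layout does change.</p>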
<p><strong>2. Beautiful Soup</strong></p>
<p> <strong>Beautiful Soup</strong> is a very popular <strong>Python</strong> module. It parses a web page and provides a convenient interface for locating its content. If you have not installed it yet, you can install the latest version with the following command (this requires <strong>pip</strong>; install that first if needed):</p>
<pre class="prettyprint" name="code"><code class="hljs cmake has-numbering">pip install beautifulsoup4</code></pre>
<p> The first step when using <strong>Beautiful Soup</strong> is to parse the downloaded <strong>HTML</strong> into a <strong>soup</strong> document. Since most web pages are not well-formed <strong>HTML</strong>, <strong>Beautiful Soup</strong> has to determine their actual structure. For example, the list in the simple page below has missing quotes around an attribute value and unclosed tags.</p>
<pre class="prettyprint" name="code"><code class="hljs xml has-numbering">&lt;ul class=country&gt;
  &lt;li&gt;Area
  &lt;li&gt;Population
&lt;/ul&gt;</code></pre>
<p> If the Population item is parsed as a child of the Area item, rather than as two parallel list items, we will get the wrong result when scraping. Let's see how <strong>Beautiful Soup</strong> handles this.</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; from bs4 import BeautifulSoup
&gt;&gt;&gt; broken_html = '&lt;ul class=country&gt;&lt;li&gt;Area&lt;li&gt;Population&lt;/ul&gt;'
&gt;&gt;&gt; # parse the HTML
&gt;&gt;&gt; soup = BeautifulSoup(broken_html, 'html.parser')
&gt;&gt;&gt; fixed_html = soup.prettify()
&gt;&gt;&gt; print fixed_html
&lt;ul class="country"&gt;
 &lt;li&gt;
  Area
  &lt;li&gt;
   Population
  &lt;/li&gt;
 &lt;/li&gt;
&lt;/ul&gt;</code></pre>
<p> The output above shows that <strong>Beautiful Soup</strong> fixes the missing quotes and closes the tags. We can now use the <strong>find()</strong> and <strong>find_all()</strong> methods to locate the elements we need.</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; ul = soup.find('ul', attrs={'class':'country'})
&gt;&gt;&gt; ul.find('li') # return just the first match
&lt;li&gt;Area&lt;li&gt;Population&lt;/li&gt;&lt;/li&gt;
&gt;&gt;&gt; ul.find_all('li') # return all matches
[&lt;li&gt;Area&lt;li&gt;Population&lt;/li&gt;&lt;/li&gt;, &lt;li&gt;Population&lt;/li&gt;]</code></pre>
<p><strong>Note: the error tolerance of Python's built-in parser differs between versions, so your results may differ from the above; see <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser" rel="nofollow" target="_blank">https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser</a> for details. For the full set of methods and parameters, consult the Beautiful Soup <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow" target="_blank">official documentation</a>.</strong></p>
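<p> The tolerant behaviour can also be observed directly with the standard library's <code>html.parser</code>, the same engine Beautiful Soup uses as its <code>'html.parser'</code> backend. This sketch (my own illustration, not from the original text) logs the raw parse events; note that the bare parser only reports what it sees, and it is Beautiful Soup that decides where the implied closing tags belong:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">from html.parser import HTMLParser

# Log start/end tag events from the tolerant stdlib parser.
class TagLogger(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

parser = TagLogger()
parser.feed('&lt;ul class=country&gt;&lt;li&gt;Area&lt;li&gt;Population&lt;/ul&gt;')
# The unquoted attribute is parsed fine, but the parser reports two
# opened &lt;li&gt; tags and no matching ends -- the tree builder must decide.
print(parser.events)</code></pre>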
<p> Below is the complete code that uses this approach to extract the area from our example country page.</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; from bs4 import BeautifulSoup
&gt;&gt;&gt; import urllib2
&gt;&gt;&gt; url = 'http://example.webscraping.com/view/United-Kingdom-239'
&gt;&gt;&gt; html = urllib2.urlopen(url).read()
&gt;&gt;&gt; soup = BeautifulSoup(html, 'html.parser')
&gt;&gt;&gt; # locate the area row
&gt;&gt;&gt; tr = soup.find(attrs={'id':'places_area__row'})
&gt;&gt;&gt; # locate the area tag
&gt;&gt;&gt; td = tr.find(attrs={'class':'w2p_fw'})
&gt;&gt;&gt; area = td.text # extract the text from this tag
&gt;&gt;&gt; print area
244,820 square kilometres</code></pre>
<p> This code is more complex than the regular-expression version, but easier to construct and understand. And we no longer need to worry about small layout changes such as extra whitespace or tag attributes.</p>
<p><strong>3. Lxml</strong></p>
<p> <strong>Lxml</strong> is a <strong>Python</strong> wrapper around <strong>libxml2</strong>, an <strong>XML</strong> parsing library written in C. It parses faster than <strong>Beautiful Soup</strong>, but installation is also more involved. The latest installation instructions are at <a href="http://lxml.de/installation.html" rel="nofollow" target="_blank">http://lxml.de/installation.html</a>.</p>
<p> As with <strong>Beautiful Soup</strong>, the first step when using the <strong>lxml</strong> module is to parse potentially invalid <strong>HTML</strong> into a consistent format. Here is an example of parsing the same incomplete <strong>HTML</strong>:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; import lxml.html
&gt;&gt;&gt; broken_html = '&lt;ul class=country&gt;&lt;li&gt;Area&lt;li&gt;Population&lt;/ul&gt;'
&gt;&gt;&gt; # parse the HTML
&gt;&gt;&gt; tree = lxml.html.fromstring(broken_html)
&gt;&gt;&gt; fixed_html = lxml.html.tostring(tree, pretty_print=True)
&gt;&gt;&gt; print fixed_html
&lt;ul class="country"&gt;
&lt;li&gt;Area&lt;/li&gt;
&lt;li&gt;Population&lt;/li&gt;
&lt;/ul&gt;</code></pre>
<p> Likewise, <strong>lxml</strong> correctly parses the missing attribute quotes and closes the tags, although it does not add the extra <code>&lt;html&gt;</code> and <code>&lt;body&gt;</code> tags.</p>
<p> After parsing the input, we move on to selecting elements. Here <strong>lxml</strong> offers several approaches, such as <strong>XPath</strong> selectors and a <strong>find()</strong> method similar to <strong>Beautiful Soup</strong>'s. In what follows we will use <strong>CSS</strong> selectors, because they are more concise and can be reused later when parsing dynamic content. Readers with experience using <strong>jQuery</strong> selectors will also find them familiar.</p>
<p> Here is sample code that uses <strong>lxml</strong>'s <strong>CSS</strong> selectors to extract the area data:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">&gt;&gt;&gt; import urllib2
&gt;&gt;&gt; import lxml.html
&gt;&gt;&gt; url = 'http://example.webscraping.com/view/United-Kingdom-239'
&gt;&gt;&gt; html = urllib2.urlopen(url).read()
&gt;&gt;&gt; tree = lxml.html.fromstring(html)
&gt;&gt;&gt; td = tree.cssselect('tr#places_area__row &gt; td.w2p_fw')[0] # * key line
&gt;&gt;&gt; area = td.text_content()
&gt;&gt;&gt; print area
244,820 square kilometres</code></pre>
<p> The line marked with <strong>*</strong> first finds the table row element with ID <strong>places_area__row</strong>, then selects its table data child whose <strong>class</strong> is <strong>w2p_fw</strong>.</p>
<p> <strong>CSS</strong> selectors are patterns used to select elements. Here are some examples of common selectors:</p>
<pre class="prettyprint" name="code"><code class="hljs has-numbering">Select all tags: *
Select &lt;a&gt; tags: a
Select all elements with class="link": .link
Select &lt;a&gt; tags with class="link": a.link
Select &lt;a&gt; tags with id="home": a#home
Select all &lt;span&gt; tags whose parent is an &lt;a&gt; tag: a &gt; span
Select all &lt;span&gt; tags inside an &lt;a&gt; tag: a span
Select all &lt;a&gt; tags whose title attribute is "Home": a[title=Home]</code></pre>
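<p> Several of these patterns have rough equivalents in the limited XPath subset supported by the standard library's <code>xml.etree.ElementTree</code>. This is a sketch of my own, over a hand-made well-formed fixture, since ElementTree, unlike lxml, cannot repair broken markup:</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import xml.etree.ElementTree as ET

# Well-formed fixture (assumption: illustrative document of my own).
doc = ET.fromstring(
    '&lt;body&gt;'
    '&lt;a id="home" class="link" title="Home"&gt;&lt;span&gt;Home&lt;/span&gt;&lt;/a&gt;'
    '&lt;a class="link"&gt;&lt;span&gt;About&lt;/span&gt;&lt;/a&gt;'
    '&lt;/body&gt;'
)

# .link          -&gt; .//*[@class='link']   (exact string match only, unlike
#                                          CSS, which matches class tokens)
links = doc.findall(".//*[@class='link']")

# a#home         -&gt; .//a[@id='home']
home = doc.find(".//a[@id='home']")

# a &gt; span       -&gt; .//a/span
spans = doc.findall('.//a/span')

# a[title=Home]  -&gt; .//a[@title='Home']
titled = doc.findall(".//a[@title='Home']")

print(len(links), home.get('title'), len(spans), len(titled))</code></pre>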
<p> The <strong>W3C</strong> has published the <strong>CSS3</strong> selectors specification at <a href="https://www.w3.org/TR/2011/REC-css3-selectors-20110929/" rel="nofollow" target="_blank">https://www.w3.org/TR/2011/REC-css3-selectors-20110929/</a></p>
<p> <strong>Lxml</strong> implements most <strong>CSS3</strong> selectors; the features it does not support are listed at <a href="https://cssselect.readthedocs.io/en/latest/" rel="nofollow" target="_blank">https://cssselect.readthedocs.io/en/latest/</a>.</p>
<p><strong>Note: internally, lxml actually translates CSS selectors into equivalent XPath selectors.</strong></p>
<p><strong>4. Performance Comparison</strong></p>
<p> In the code below, each scraper is run 1000 times; each run checks that the scraped result is correct, and the total elapsed time is printed.</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering"># -*- coding: utf-8 -*-

import csv
import time
import urllib2
import re
from bs4 import BeautifulSoup
import lxml.html

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent', 'tld', 'currency_code', 'currency_name', 'phone', 'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

def regex_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('&lt;tr id="places_{}__row"&gt;.*?&lt;td class="w2p_fw"&gt;(.*?)&lt;/td&gt;'.format(field), html).groups()[0]
    return results

def beautiful_soup_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_{}__row'.format(field)).find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table &gt; tr#places_{}__row &gt; td.w2p_fw'.format(field))[0].text_content()
    return results

def main():
    times = {}
    html = urllib2.urlopen('http://example.webscraping.com/view/United-Kingdom-239').read()
    NUM_ITERATIONS = 1000 # number of times to test each scraper
    for name, scraper in (('Regular expressions', regex_scraper), ('Beautiful Soup', beautiful_soup_scraper), ('Lxml', lxml_scraper)):
        times[name] = []
        # record start time of scrape
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == regex_scraper:
                # the regular expression module will cache results
                # so need to purge this cache for meaningful timings
                re.purge() # * key line
            result = scraper(html)
            # check scraped result is as expected
            assert(result['area'] == '244,820 square kilometres')
            times[name].append(time.time() - start)
        # record end time of scrape and output the total
        end = time.time()
        print '{}: {:.2f} seconds'.format(name, end - start)

    writer = csv.writer(open('times.csv', 'w'))
    header = sorted(times.keys())
    writer.writerow(header)
    for row in zip(*[times[scraper] for scraper in header]):
        writer.writerow(row)

if __name__ == '__main__':
    main()</code></pre>
<p><br> Note that we call <strong>re.purge()</strong> on the line marked with <strong>*</strong>. By default the regular expression module caches compiled patterns; for a fair comparison, we need to clear that cache with this method.</p>
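<p> The shape of the timing loop, including the <code>re.purge()</code> call, can be sketched against an inline fixture in Python 3 (my own illustration; the fixture and the measured time are assumptions, not the article's results):</p>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import re
import time

# Inline fixture standing in for the downloaded page (assumption).
HTML = ('&lt;table&gt;&lt;tr id="places_area__row"&gt;&lt;td class="w2p_fw"&gt;'
        '244,820 square kilometres&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')

def regex_scraper(html):
    return re.search(
        '&lt;tr id="places_area__row"&gt;.*?&lt;td class="w2p_fw"&gt;(.*?)&lt;/td&gt;',
        html).group(1)

NUM_ITERATIONS = 1000
start = time.time()
for _ in range(NUM_ITERATIONS):
    # Clear the compiled-pattern cache so every iteration pays the
    # compilation cost, as in the benchmark above.
    re.purge()
    result = regex_scraper(HTML)
    assert result == '244,820 square kilometres'
print('Regular expressions: {:.2f} seconds'.format(time.time() - start))</code></pre>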
<p>Here are the results of running this script on my computer:</p>
<p><img src="https://img-blog.csdn.net/20170419131832120?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvT3NjZXIyMDE2/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast" alt="benchmark results" title=""> </p>
<p><br> Because of hardware differences, the numbers will vary from machine to machine, but the relative differences between methods should be comparable. The results show that <strong>Beautiful Soup</strong> was more than 7 times slower than the other two methods when scraping our example page. This is to be expected, since <strong>lxml</strong> and the regular expression module are written in <strong>C</strong>, while <strong>Beautiful Soup</strong> is pure <strong>Python</strong>. An interesting observation is that <strong>lxml</strong> performed almost as well as the regular expressions. Since <strong>lxml</strong> must parse the input into its internal format before it can search for elements, it incurs extra overhead; but when scraping multiple features from the same page, that one-off parsing cost is amortised and <strong>lxml</strong> becomes more competitive. All in all, <strong>lxml</strong> is a powerful module.<br></p>
<p><strong>5. Summary</strong></p>
<p>Pros and cons of the three scraping approaches:<br><br></p>
<table> <thead> <tr> <th align="center"> Scraping method</th> <th align="center"> Performance</th> <th align="center"> Ease of use</th> <th align="center"> Ease of installation</th> </tr> </thead> <tbody><tr> <td align="center">Regular expressions</td> <td align="center">Fast</td> <td align="center">Hard</td> <td align="center">Easy (built-in module)</td> </tr> <tr> <td align="center">Beautiful Soup</td> <td align="center">Slow</td> <td align="center">Easy</td> <td align="center">Easy (pure Python)</td> </tr> <tr> <td align="center">Lxml</td> <td align="center">Fast</td> <td align="center">Easy</td> <td align="center">Relatively hard</td> </tr> </tbody></table>
<p><br><br> If the bottleneck in your crawler is downloading pages rather than extracting data, then using a slower method such as <strong>Beautiful Soup</strong> is not a problem. Regular expressions are very useful for one-off extraction, and they avoid the overhead of parsing the whole page; if you only need to scrape a small amount of data and want to avoid extra dependencies, they may be the better fit. In general, however, <strong>lxml</strong> is the best choice for scraping, because it is fast and full-featured, while regular expressions and <strong>Beautiful Soup</strong> are only useful in certain scenarios.</p> </div>