Web scraping is well suited to collecting and processing large amounts of data. It goes beyond what search engines can do: for example, it can find the cheapest flight tickets.
APIs provide nicely formatted data, but many sites offer no API, and there is no unified API across sites. Even when an API exists, the data types and formats may not fully match your needs, and it may also be too slow.
Application scenarios include market forecasting, machine translation, and medical diagnosis. It can even be art, e.g. http://wefeelfine.org/.
This article is based on Python 3 and assumes basic Python knowledge.
Code download: http://pythonscraping.com/code/.
```python
from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
print(html.read())
```
Output:

```
$ python3 1-basicExample.py
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
```
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.h1)
```
Output:

```
$ python3 2-beautifulSoup.py
<h1>An Interesting Title</h1>
```
The HTML hierarchy is as follows:

```
html → <html><head>...</head><body>...</body></html>
  head → <head><title>A Useful Page</title></head>
    title → <title>A Useful Page</title>
  body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
    h1 → <h1>An Interesting Title</h1>
    div → <div>Lorem Ipsum dolor...</div>
```
Note that bsObj.h1 here has the same effect as any of the following:

```python
bsObj.html.body.h1
bsObj.body.h1
bsObj.html.h1
```
The errors urlopen is most likely to raise are:

• the page cannot be found on the server (or there was an error retrieving it), such as a 404 or 500
• the server itself cannot be found

The first case raises an HTTPError; if the server cannot be found, urlopen raises a URLError instead. The HTTPError case can be handled like this:
```python
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return null, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass
```
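The "server not found" case surfaces as a URLError rather than an HTTPError. Below is a minimal sketch of handling both, assuming Python 3's urllib; the helper name safe_open is made up for illustration:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def safe_open(url):
    # Hypothetical helper: return the response, or None instead of raising.
    try:
        return urlopen(url)
    except HTTPError as e:
        print("The server returned an HTTP error:", e.code)
    except URLError as e:
        print("The server could not be reached:", e.reason)
    return None
```

HTTPError is a subclass of URLError, so it must be caught first, as above.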
The following content is summarized from Web Scraping with Python (2015).
Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads
Source code: https://bitbucket.org/wswp/code
Demo site: http://example.webscraping.com/
Demo site code: http://bitbucket.org/wswp/places
Recommended introductory Python tutorial: http://www.diveintopython.net
HTML and JavaScript basics:
Why do web scraping?
When shopping online, you may want to compare prices across sites, which is essentially what the Huihui shopping assistant does. An API would make this easy, but usually there is no API, and that is when web scraping is needed.
Is web scraping legal?
Scraping data for personal use is generally fine, but commercial use or republication requires thinking about licensing, and you should also observe crawling etiquette. Based on cases already decided abroad, factual data such as locations and phone numbers can generally be republished, whereas original (creative) content cannot.
Further references:
http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf
http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html
http://caselaw.findlaw.com/us-supreme-court/499/340.html
Background research
robots.txt and the Sitemap can help you understand a site's scale and structure; tools such as Google search and WHOIS are also useful.
For example: http://example.webscraping.com/robots.txt
```
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
```
For more about web robots, see http://www.robotstxt.org.
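Before crawling, it is also worth checking robots.txt programmatically. A minimal sketch, assuming Python 3's standard urllib.robotparser module:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

# Section 1 above blocks 'BadCrawler'; other user agents are allowed.
print(rp.can_fetch('BadCrawler', 'http://example.webscraping.com/view/1'))   # False
print(rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/view/1'))  # True
```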
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. For example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>
```
Sitemaps are often incomplete.
Estimating the site's size:
Use Google's site: query, for example: site:automationtesting.sinaapp.com
Identifying the technology the site uses:
# pip install builtwith
# ipython

```
In [1]: import builtwith
In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
Out[2]:
{u'issue-trackers': [u'Trac'],
 u'javascript-frameworks': [u'jQuery'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}
```
Finding the site's owner:
# pip install python-whois
# ipython

```
In [1]: import whois
In [2]: print whois.whois('http://automationtesting.sinaapp.com')
{
  "updated_date": "2016-01-07 00:00:00",
  "status": [
    "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
  ],
  "name": null,
  "dnssec": null,
  "city": null,
  "expiration_date": "2021-06-29 00:00:00",
  "zipcode": null,
  "domain_name": "SINAAPP.COM",
  "country": null,
  "whois_server": "whois.paycenter.com.cn",
  "state": null,
  "registrar": "XIN NET TECHNOLOGY CORPORATION",
  "referral_url": "http://www.xinnet.com",
  "address": null,
  "name_servers": [
    "NS1.SINAAPP.COM",
    "NS2.SINAAPP.COM",
    "NS3.SINAAPP.COM",
    "NS4.SINAAPP.COM"
  ],
  "org": null,
  "creation_date": "2009-06-29 00:00:00",
  "emails": null
}
```
Scraping your first site
The code for a simple crawler is as follows (the book's examples use Python 2):
```python
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
```
We can retry based on the status code. HTTP status codes are listed at https://tools.ietf.org/html/rfc7231#section-6. 4xx errors are not worth retrying, while 5xx errors can be retried.
```python
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
```
http://httpstat.us/500 always returns a 500 response, so we can use it to test:
```
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
```
Setting the user agent:
urllib2's default user agent is "Python-urllib/2.7", which many sites block. It is better to use an agent string close to a real browser's, for example:
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
So we add a user_agent parameter to download:
```python
import urllib2

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
```
Crawling the sitemap:
```python
import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
```
Crawling by iterating over IDs:
• http://example.webscraping.com/view/Afghanistan-1
• http://example.webscraping.com/view/Australia-2
• http://example.webscraping.com/view/Brazil-3
The URLs above differ only in their final segment. Programmers usually just use the database id, e.g. http://example.webscraping.com/view/1, so we can fetch pages by iterating over database ids.
```python
import itertools

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        break
    else:
        # success - can scrape the result
        pass
```
Of course, some records may have been deleted from the database, so we improve the loop as follows:
```python
import itertools

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # reached maximum number of
            # consecutive errors so exit
            break
    else:
        # success - can scrape the result
        # ...
        num_errors = 0
```
Some sites return 404 when a page does not exist, and some sites' IDs are not this regular; Amazon, for example, uses ISBNs.
Most browsers have a "view page source" feature; in Firefox, Firebug is especially convenient. All of these tools can be opened by right-clicking on the page.
There are three main ways to extract data from a page: regular expressions, BeautifulSoup, and lxml.
A regular expression example:
```
In [1]: import re
In [2]: import common
In [3]: url = 'http://example.webscraping.com/view/UnitedKingdom-239'
In [4]: html = common.download(url)
Downloading: http://example.webscraping.com/view/UnitedKingdom-239
In [5]: re.findall('<td class="w2p_fw">(.*?)</td>', html)
Out[5]:
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom',
 'London',
 '<a href="/continent/EU">EU</a>',
 '.uk',
 'GBP',
 'Pound',
 '44',
 '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
 '^(([A-Z]\\d{2}[A-Z]{2})|([A-Z]\\d{3}[A-Z]{2})|([A-Z]{2}\\d{2}[A-Z]{2})|([A-Z]{2}\\d{3}[A-Z]{2})|([A-Z]\\d[A-Z]\\d[A-Z]{2})|([A-Z]{2}\\d[A-Z]\\d[A-Z]{2})|(GIR0AA))$',
 'en-GB,cy-GB,gd',
 '<div><a href="/iso/IE">IE </a></div>']
In [6]: re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
Out[6]: '244,820 square kilometres'
```
Regular expressions break easily when the page layout changes, so their maintenance cost is relatively high.
Beautiful Soup:
```
In [7]: from bs4 import BeautifulSoup
In [8]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
In [9]: # parse the HTML
In [10]: soup = BeautifulSoup(broken_html, 'html.parser')
In [11]: fixed_html = soup.prettify()
In [12]: print fixed_html
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>
In [13]: ul = soup.find('ul', attrs={'class':'country'})
In [14]: ul.find('li') # returns just the first match
Out[14]: <li>Area<li>Population</li></li>
In [15]: ul.find_all('li') # returns all matches
Out[15]: [<li>Area<li>Population</li></li>, <li>Population</li>]
```
A complete example:
```
In [1]: from bs4 import BeautifulSoup
In [2]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
In [3]: import common
In [5]: html = common.download(url)
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
In [6]: soup = BeautifulSoup(html)
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was
explicitly specified, so I'm using the best available HTML parser for this system
("lxml"). This usually isn't a problem, but if you run this code on another system,
or in a different virtual environment, it may use a different parser and behave
differently.
To get rid of this warning, change this:
 BeautifulSoup([your markup])
to this:
 BeautifulSoup([your markup], "lxml")
  markup_type=markup_type))
In [7]: # locate the area row
In [8]: tr = soup.find(attrs={'id':'places_area__row'})
In [9]: td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
In [10]: area = td.text # extract the text from this tag
In [11]: print area
244,820 square kilometres
```
lxml is based on libxml2 (written in C), so it is faster, but it can be harder to install. See http://lxml.de/installation.html.
```
In [1]: import lxml.html
In [2]: broken_html = '<ul class=country><li>Area<li>Population</ul>'
In [3]: tree = lxml.html.fromstring(broken_html) # parse the HTML
In [4]: fixed_html = lxml.html.tostring(tree, pretty_print=True)
In [5]: print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>
```
lxml is also quite fault-tolerant; a missing closing tag is usually not a problem.
The following uses CSS selectors; note that the cssselect package must be installed.
```
In [1]: import common
In [2]: import lxml.html
In [3]: url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
In [4]: html = common.download(url)
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
In [5]: tree = lxml.html.fromstring(html)
In [6]: td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
In [7]: area = td.text_content()
In [8]: print area
244,820 square kilometres
```
In CSS, a selector is a pattern used to select the elements you want to style.
The "CSS" column indicates in which CSS version the selector was defined (CSS1, CSS2, or CSS3).
Selector | Example | Example description | CSS |
---|---|---|---|
.class | .intro | Selects all elements with class="intro". | 1 |
#id | #firstname | Selects all elements with id="firstname". | 1 |
* | * | Selects all elements. | 2 |
element | p | Selects all <p> elements. | 1 |
element,element | div,p | Selects all <div> elements and all <p> elements. | 1 |
element element | div p | Selects all <p> elements inside <div> elements. | 1 |
element>element | div>p | Selects all <p> elements whose parent is a <div> element. | 2 |
element+element | div+p | Selects all <p> elements placed immediately after a <div> element. | 2 |
[attribute] | [target] | Selects all elements with a target attribute. | 2 |
[attribute=value] | [target=_blank] | Selects all elements with target="_blank". | 2 |
[attribute~=value] | [title~=flower] | Selects all elements whose title attribute contains the word "flower". | 2 |
[attribute|=value] | [lang|=en] | Selects all elements whose lang attribute value starts with "en". | 2 |
:link | a:link | Selects all unvisited links. | 1 |
:visited | a:visited | Selects all visited links. | 1 |
:active | a:active | Selects the active link. | 1 |
:hover | a:hover | Selects the link the mouse pointer is over. | 1 |
:focus | input:focus | Selects the input element that has focus. | 2 |
:first-letter | p:first-letter | Selects the first letter of every <p> element. | 1 |
:first-line | p:first-line | Selects the first line of every <p> element. | 1 |
:first-child | p:first-child | Selects every <p> element that is the first child of its parent. | 2 |
:before | p:before | Inserts content before the content of every <p> element. | 2 |
:after | p:after | Inserts content after the content of every <p> element. | 2 |
:lang(language) | p:lang(it) | Selects every <p> element with a lang attribute value starting with "it". | 2 |
element1~element2 | p~ul | Selects every <ul> element that is preceded by a <p> element. | 3 |
[attribute^=value] | a[src^="https"] | Selects every <a> element whose src attribute value begins with "https". | 3 |
[attribute$=value] | a[src$=".pdf"] | Selects every <a> element whose src attribute value ends with ".pdf". | 3 |
[attribute*=value] | a[src*="abc"] | Selects every <a> element whose src attribute value contains the substring "abc". | 3 |
:first-of-type | p:first-of-type | Selects every <p> element that is the first <p> element of its parent. | 3 |
:last-of-type | p:last-of-type | Selects every <p> element that is the last <p> element of its parent. | 3 |
:only-of-type | p:only-of-type | Selects every <p> element that is the only <p> element of its parent. | 3 |
:only-child | p:only-child | Selects every <p> element that is the only child of its parent. | 3 |
:nth-child(n) | p:nth-child(2) | Selects every <p> element that is the second child of its parent. | 3 |
:nth-last-child(n) | p:nth-last-child(2) | Same as above, but counting from the last child. | 3 |
:nth-of-type(n) | p:nth-of-type(2) | Selects every <p> element that is the second <p> element of its parent. | 3 |
:nth-last-of-type(n) | p:nth-last-of-type(2) | Same as above, but counting from the last child. | 3 |
:last-child | p:last-child | Selects every <p> element that is the last child of its parent. | 3 |
:root | :root | Selects the document's root element. | 3 |
:empty | p:empty | Selects every <p> element that has no children (including text nodes). | 3 |
:target | #news:target | Selects the currently active #news element. | 3 |
:enabled | input:enabled | Selects every enabled <input> element. | 3 |
:disabled | input:disabled | Selects every disabled <input> element. | 3 |
:checked | input:checked | Selects every checked <input> element. | 3 |
:not(selector) | :not(p) | Selects every element that is not a <p> element. | 3 |
::selection | ::selection | Selects the portion of an element that is selected by the user. | 3 |
For CSS selector references, see http://www.w3school.com.cn/cssref/css_selectors.ASP and https://pythonhosted.org/cssselect/#supported-selectors.
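To illustrate a few selectors from the table above, here is a small sketch using lxml with cssselect installed; the HTML snippet and variable names are made up for the example:

```python
import lxml.html

# A small made-up document to try some selectors from the table on.
html = '''
<div id="places">
  <p class="intro">United Kingdom</p>
  <p>Area: 244,820 square kilometres</p>
  <a href="report.pdf">Report</a>
</div>
'''
tree = lxml.html.fromstring(html)

print(tree.cssselect('p.intro')[0].text_content())          # .class
print(tree.cssselect('div#places > p')[1].text_content())   # parent > child
print(tree.cssselect('a[href$=".pdf"]')[0].get('href'))     # attribute ends with ".pdf"
```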
Next we compare the performance of the three approaches by extracting the country data from the United Kingdom page used above.
The comparison code:
```python
import urllib2
import itertools
import re
from bs4 import BeautifulSoup
import lxml.html
import time

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html

def re_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search(r'places_%s__row.*?w2p_fw">(.*?)</td>' % field,
                                   html.replace('\n', '')).groups()[0]
    return results

def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
    return results

def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

NUM_ITERATIONS = 1000  # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', bs_scraper),
                      ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)
```
Results on Windows:

```
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 11.63 seconds
BeautifulSoup: 92.80 seconds
Lxml: 7.25 seconds
```
Results on Linux:

```
Downloading: http://example.webscraping.com/places/view/United-Kingdom-239
Regular expressions: 3.09 seconds
BeautifulSoup: 29.40 seconds
Lxml: 4.25 seconds
```
Here re.purge() is used to clear the regular-expression cache.
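Without re.purge(), the re module would keep the compiled pattern cached between iterations, making the regex timings look artificially fast. In real code you would normally precompile the patterns once instead of relying on the cache. A minimal sketch of a precompiled variant (the function name and shortened field list are made up for illustration):

```python
import re

# Hypothetical precompiled variant of re_scraper: compile once, reuse per page.
FIELDS = ('area', 'population')  # shortened for the example
PATTERNS = {field: re.compile(r'places_%s__row.*?w2p_fw">(.*?)</td>' % field)
            for field in FIELDS}

def re_scraper_precompiled(html):
    html = html.replace('\n', '')
    return {field: pattern.search(html).group(1)
            for field, pattern in PATTERNS.items()}
```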
Overall, lxml is recommended, especially on Linux; its advantage is even clearer when the same page has to be parsed multiple times.