This page consolidates the data-parsing techniques used after scraping, for easy reference and to avoid confusing them with one another.
The techniques covered are XPath, BeautifulSoup, PyQuery, and re (regular expressions).
Two example HTML snippets are given first and used in the examples that follow.
Before parsing, the HTML string must be converted into the appropriate object; each library does it as follows:
XPath:
In [7]: from lxml import etree
In [8]: text = etree.HTML(html)
BeautifulSoup:
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(html, 'lxml')
PyQuery:
In [10]: from pyquery import PyQuery as pq
In [11]: doc = pq(html)
re: no object is needed; regular expressions match directly against the raw string.
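A two-line sketch of that point: the pattern runs straight against the raw string, with no tree construction involved.

```python
import re

# No parsing step: the regex is applied directly to the raw HTML string.
html = "<title>The Dormouse's story</title>"
print(re.search(r'<title>(.*?)</title>', html).group(1))  # The Dormouse's story
```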
Example 1
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
'''
Next, let's parse the example HTML with each of the different methods. First, extract the title text:
XPath:
In [16]: text.xpath('//title/text()')[0]
Out[16]: "The Dormouse's story"
BeautifulSoup:
In [18]: soup.title.string
Out[18]: "The Dormouse's story"
PyQuery:
In [20]: doc('title').text()
Out[20]: "The Dormouse's story"
re:
In [11]: re.findall(r'<title>(.*?)</title></head>', html)[0]
Out[11]: "The Dormouse's story"
Next, extract the href attribute of the link whose id is "link3".

XPath: # recommended
In [36]: text.xpath('//a[@id="link3"]/@href')[0]
Out[36]: 'http://example.com/tillie'
BeautifulSoup:
In [27]: soup.find_all(attrs={'id':'link3'})
Out[27]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [33]: soup.find_all(attrs={'id':'link3'})[0].attrs['href']
Out[33]: 'http://example.com/tillie'
PyQuery: # recommended
In [45]: doc("#link3").attr.href
Out[45]: 'http://example.com/tillie'
re:
In [46]: re.findall(r'<a href="(.*?)" class="sister" id="link3">Tillie</a>;', html)[0]
Out[46]: 'http://example.com/tillie'
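The pattern above breaks the moment the attribute order changes (for example, if href comes after class). A slightly more defensive sketch, still pure stdlib: locate the tag by its id first, in any attribute order, then pull href out of that tag.

```python
import re

# Note the attributes are in a different order here than in the pattern above.
html = '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
# Step 1: find the whole <a ...> tag that carries id="link3".
tag = re.search(r'<a\b[^>]*\bid="link3"[^>]*>', html).group(0)
# Step 2: extract href from that tag.
print(re.search(r'\bhref="([^"]*)"', tag).group(1))  # http://example.com/tillie
```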
Finally, extract the full text of the story paragraph.

XPath:
In [48]: text.xpath('string(//p[@class="story"])').strip()
Out[48]: 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
In [51]: ' '.join(text.xpath('string(//p[@class="story"])').split('\n'))
Out[51]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'
BeautifulSoup:
In [89]: ' '.join(list(soup.body.stripped_strings)).replace('\n', '')
Out[89]: "The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie,Lacie and Tillie; and they lived at the bottom of a well. ..."
PyQuery:
In [99]: doc('.story').text()
Out[99]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...'
re: not recommended; far too cumbersome.
In [101]: re.findall(r'<p class="story">(.*?)<a href="http://example.com/elsie" class="sister" id="link1">(.*?)</a>(.*?)<a href="http://example.com/lacie" class="sister" id="link2">(.*?)</a>(.*?)<a href="http://example.com/tillie" class="sister" id="link3">(.*?)</a>;(.*?)</p>', html, re.S)[0]
Out[101]: ('Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', '\nand they lived at the bottom of a well.')
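When all you want is the visible text of a fragment, capturing every piece by hand as above is overkill. A stdlib-only sketch: strip the tags with a regex, then collapse whitespace. This works for a simple, well-formed fragment like this one, but is not robust against arbitrary HTML (comments, scripts, `>` inside attribute values).

```python
import re

html = '''<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>'''

text = re.sub(r'<[^>]+>', '', html)  # drop every tag
text = ' '.join(text.split())        # normalize whitespace
print(text)
```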
Example 2
html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
Extract the text of the second list item.

XPath:
In [14]: text.xpath('//li[2]/a/text()')[0]
Out[14]: 'second item'
BeautifulSoup:
In [23]: soup.find_all(attrs={'class': 'item-1'})[0].string
Out[23]: 'second item'
PyQuery:
In [34]: doc('.item-1>a')[0].text
Out[34]: 'second item'
re:
In [35]: re.findall(r'<li class="item-1"><a href="link2.html">(.*?)</a></li>', html)[0]
Out[35]: 'second item'
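A detail worth calling out in the XPath answer: positional predicates are 1-based, so `//li[2]` really is the second `li`, not the third. A minimal sketch, assuming lxml is installed:

```python
from lxml import etree

html = '''<div><ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div>'''
text = etree.HTML(html)
# XPath positions start at 1: li[1] is the first li.
print(text.xpath('//li[1]/text()')[0])    # first item
print(text.xpath('//li[2]/a/text()')[0])  # second item
```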
Next, extract the href of the fifth list item.

XPath:
In [36]: text.xpath('//li[@class="item-0"]/a/@href')[0]
Out[36]: 'link5.html'
BeautifulSoup:
In [52]: soup.find_all(attrs={'class': 'item-0'})
Out[52]: [<li class="item-0">first item</li>, <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>, <li class="item-0"><a href="link5.html">fifth item</a></li>]
In [53]: soup.find_all(attrs={'class': 'item-0'})[-1].a.attrs['href']
Out[53]: 'link5.html'
PyQuery:
In [75]: [i.attr.href for i in doc('.item-0 a').items()][1]
Out[75]: 'link5.html'
re:
In [95]: re.findall(r'<li class="item-0"><a href="(.*?)">fifth item</a></li>', html)[0]
Out[95]: 'link5.html'
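Note from the BeautifulSoup answer above that class matching works per class name: `'item-0'` also matches `<li class="item-0 active">`. When an exact class attribute is needed, one option (a sketch, assuming beautifulsoup4 is installed) is to filter the results afterwards:

```python
from bs4 import BeautifulSoup

html = '''<ul>
<li class="item-0">first item</li>
<li class="item-0 active"><a href="link3.html">third item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')

# class_='item-0' matches any tag whose class list contains "item-0",
# including the multi-class "item-0 active" tag; keep only exact matches.
exact = [li for li in soup.find_all('li', class_='item-0')
         if li.get('class') == ['item-0']]
print(exact[-1].a['href'])  # link5.html
```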
Example 3
<li><span class="label">房屋用途</span>普通住宅</li>
Extract 房屋用途 and 普通住宅 separately.
XPath:
In [47]: text.xpath('//li/span/text()')[0]
Out[47]: '房屋用途'
In [49]: text.xpath('//li/text()')[0]
Out[49]: '普通住宅'
BeautifulSoup:
In [65]: soup.span.string
Out[65]: '房屋用途'
In [69]: soup.li.contents[1]  # contents returns the direct children
Out[69]: '普通住宅'
PyQuery:
In [70]: doc('li span').text()
Out[70]: '房屋用途'
In [75]: doc('li .label')[0].tail
Out[75]: '普通住宅'
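The `.tail` in the PyQuery answer is an lxml concept: indexing a PyQuery result with `[0]` yields the underlying lxml element, and in lxml the text that follows an element's closing tag (up to the next sibling) is stored on that element's `tail`, not on the parent's `text`. A minimal sketch with lxml directly:

```python
from lxml import etree

li = etree.HTML('<li><span class="label">房屋用途</span>普通住宅</li>').xpath('//li')[0]
span = li[0]      # the <span> child
print(span.text)  # 房屋用途 (text inside <span>)
print(span.tail)  # 普通住宅 (text after </span>, stored on the span's tail)
```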
re: omitted.
Example 4
<div class="unitPrice">
  <span class="unitPriceValue">26667<i>元/平米</i></span>
</div>
Extract 26667 and 元/平米 separately.
XPath:
In [81]: text.xpath('//div[@class="unitPrice"]/span/text()')[0]
Out[81]: '26667'
In [82]: text.xpath('//div[@class="unitPrice"]/span/i/text()')[0]
Out[82]: '元/平米'
BeautifulSoup:
In [97]: [i for i in soup.find('div', class_="unitPrice").strings]
Out[97]: ['\n', '26667', '元/平米', '\n']
In [98]: [i for i in soup.find('div', class_="unitPrice").strings][1]
Out[98]: '26667'
In [99]: [i for i in soup.find('div', class_="unitPrice").strings][2]
Out[99]: '元/平米'
PyQuery:
In [109]: doc('.unitPrice .unitPriceValue')[0].text
Out[109]: '26667'
In [110]: doc('.unitPrice .unitPriceValue i')[0].text
Out[110]: '元/平米'
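For completeness, the re version this example omits; a sketch that assumes the markup has exactly this shape:

```python
import re

html = '<div class="unitPrice"> <span class="unitPriceValue">26667<i>元/平米</i></span> </div>'
price, unit = re.search(
    r'<span class="unitPriceValue">(\d+)<i>(.*?)</i></span>', html).groups()
print(price)  # 26667
print(unit)   # 元/平米
```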