Library tested: lxml. Test link: http://www.sxchxx.com/index-13-1075-1.html
I like using XPath to parse web pages, but the result I get is often an empty list.
```python
from lxml import etree
import requests

url = 'http://www.sxchxx.com/index-13-1075-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}
response = requests.get(url, headers=headers)
parser = etree.HTMLParser(encoding="utf-8")
html = etree.HTML(response.text, parser=parser)
result = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
print(result)
```
When I parse the page with the code above, I can extract the body text just fine.
Notice that we did not put a tbody tag anywhere in the rule. On the contrary, adding a tbody tag turns the result into an empty list:
```python
html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')  # this returns an empty list
```
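The effect is easy to reproduce with a minimal table. The snippet below is a made-up example (not the page above); it shows that lxml's HTML parser leaves the markup as-is, so a path that assumes a tbody element finds nothing when the source omits it:

```python
from lxml import etree

# Minimal table without <tbody>, like the raw HTML a server typically sends.
snippet = '<html><body><table id="t"><tr><td>cell</td></tr></table></body></html>'
root = etree.HTML(snippet)

# lxml does not insert <tbody>, so the tbody-free path matches...
print(root.xpath('//table[@id="t"]/tr/td/text()'))
# ...while a path that assumes <tbody> comes back empty.
print(root.xpath('//table[@id="t"]/tbody/tr/td/text()'))
```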
With etree.parse it is exactly the other way around from etree.HTML:
```python
from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")
html = etree.parse('test.html', parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
print(content)
```
After saving the page as test.html and loading it with etree.parse, I found that the rule must include the tbody tag to get the expected result; leaving tbody out yields an empty list.
```python
from lxml import etree
import requests

# Parse the locally saved copy -- here the tbody tag is required:
parser = etree.HTMLParser(encoding="utf-8")
html = etree.parse('test.html', parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
print(content)

print('----------------divider-------------------')

# Parse the online page -- here the rule must NOT contain tbody:
url = 'http://www.sxchxx.com/index-13-1075-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}
response = requests.get(url, headers=headers)
parser = etree.HTMLParser(encoding="utf-8")
html = etree.HTML(response.text, parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
print(content)
```
In short: when parsing the online page, do not add the tbody tag; when parsing the locally saved (offline) copy, do add it.
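A practical workaround is to write `//` between the table and the tr step, so one rule matches whether or not a tbody element is present. A small sketch with made-up minimal tables:

```python
from lxml import etree

with_tbody = '<table><tbody><tr><td>a</td></tr><tr><td>b</td></tr></tbody></table>'
without_tbody = '<table><tr><td>a</td></tr><tr><td>b</td></tr></table>'

# '//' matches tr at any depth below table, so the same rule works
# for both the raw and the browser-saved variants of the markup.
rule = '//table//tr[2]/td/text()'
results = [etree.HTML(doc).xpath(rule) for doc in (with_tbody, without_tbody)]
print(results)
```

Note that `tr[2]` counts position among tr siblings, so it selects the second row in both cases regardless of whether the parent is table or tbody.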
Here is my analysis of the cause.
Comparing the two approaches, the difference lies in these two lines:

html = etree.parse('test.html', parser=parser)
html = etree.HTML(response.text)
while the parser is the same in both cases: parser = etree.HTMLParser(encoding="utf-8").
So my first guess was that parse or HTML "normalizes" the markup in some way. A more likely explanation: the raw HTML the server sends omits tbody, but browsers insert a tbody element into the DOM when they render a table, so a page saved from the browser (e.g. via "Save Page As") contains tbody while response.text does not.
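To check that etree.parse itself is not the one inserting tbody, we can write the same tbody-less markup to a file and load it the same way the offline test did (a small sketch; the file name and markup are made up):

```python
import os
import tempfile
from lxml import etree

# Write tbody-less HTML to a temporary file, mimicking a raw server response
# saved to disk (as opposed to a browser's "Save Page As" dump).
html_src = '<html><body><table><tr><td>x</td></tr></table></body></html>'
path = os.path.join(tempfile.mkdtemp(), 'no_tbody.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(html_src)

parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(path, parser=parser)
# The tbody-free rule matches here too -- parse() did not add <tbody>,
# which supports the idea that the tbody in test.html came from the browser.
print(tree.xpath('//table/tr/td/text()'))
```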
As for lxml itself, it is implemented mostly in Cython on top of the C libraries libxml2 and libxslt, so the parsing logic is not plain Python and is hard to inspect from the source directly.