[Web scraping] lxml: getting the HTML of the current node and displaying Chinese correctly

Getting the HTML of the current node: etree.tostring

Displaying Chinese correctly
Method 1: html.unescape

from lxml import etree
import html

with open('list.html', 'r', encoding='utf-8') as f:
    text = f.read()

tree = etree.HTML(text)


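# Serialize the first matching node; by default tostring() returns ASCII bytes
# in which Chinese characters are escaped as numeric character references
node = tree.xpath('//*[@id="scroll_marquee"]')[0]
# html.unescape() turns those references back into readable characters
r = html.unescape(etree.tostring(node).decode('utf-8'))
print(r)
print(type(r))  # <class 'str'>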
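
To see what html.unescape does in isolation, here is a minimal sketch that uses only the standard library. The escaped string is a hand-written stand-in for the kind of output tostring() produces, not something taken from a real page. It also shows one caveat: every entity is decoded, including ones such as &lt; that were escaped on purpose.

```python
import html

# Hand-written sample: "中文" written as numeric character references
escaped = '<p>&#20013;&#25991;</p>'

# Decode the references back into the characters they stand for
decoded = html.unescape(escaped)
print(decoded)

# Caveat: markup-significant entities are decoded as well
print(html.unescape('a &lt; b'))
```

If the node's text can contain escaped markup, passing an encoding to tostring() (Method 2) avoids this side effect.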

Reference: "Fixing garbled Chinese ("&#number;") when calling tostring() while scraping web pages"
Method 2: pass an encoding to etree.tostring

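from lxml import etree
import requests

response = requests.get('https://www.baidu.com/').text
tree = etree.HTML(response)
body = tree.xpath("//body")[0]

# Default call: ASCII bytes, Chinese is shown as numeric character references
escaped = etree.tostring(body)
# With an explicit encoding: UTF-8 bytes that keep the Chinese characters
strs = etree.tostring(body, encoding="utf-8", pretty_print=True, method="html")
# Decode the bytes so print() shows text rather than a b'...' literal
print(strs.decode("utf-8"))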

Reference: "Extracting HTML tag content with lxml: fixing tostring() not displaying Chinese"
