While scraping a page on a foreign website I ran into an encoding problem. The data returned in the front-end page looked like this:
亞洲私人珍藏
;賣,令仝好分享他為此
所傾注的心血與熱愛。
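The crawler source is:

    import requests

    url = 'http://www.bonhams.com/auctions/24026/lot/120/?category=list&length=100&page=1'
    try:
        result = requests.get(url=url).text
    except requests.RequestException:
        # Retry once if the first request fails.
        result = requests.get(url=url).text
    # Request the page again if the response contains the setTimeout script.
    if 'javascript">setTimeout' in result:
        result = requests.get(url=url).text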
How can this be handled?
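    import requests
    from HTMLParser import HTMLParser  # Python 2 module

    url = 'http://www.bonhams.com/auctions/24026/lot/120/?category=list&length=100&page=1'
    try:
        result = requests.get(url=url).text
    except requests.RequestException:
        # Retry once if the first request fails.
        result = requests.get(url=url).text
    # Request the page again if the response contains the setTimeout script.
    if 'javascript">setTimeout' in result:
        result = requests.get(url=url).text

    # Convert HTML character references back into ordinary characters.
    result_HTMLParser = HTMLParser().unescape(result)
    print result_HTMLParser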
Printing the page source after this step shows that the text is now in a normal, readable encoding.
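For readers on Python 3, the same flow might look like the sketch below. It is an illustration under stated assumptions, not the original script: the helper name fetch_decoded, the max_attempts parameter, and the idea of passing the download step in as a function are all hypothetical, chosen so the logic can be tried without a network connection. Network errors are left for the caller to handle.

```python
import html
from typing import Callable

# The original script requests the page again when this marker is present.
INTERSTITIAL_MARKER = 'javascript">setTimeout'


def fetch_decoded(fetch: Callable[[], str], max_attempts: int = 3) -> str:
    """Call fetch() until the marker is absent (or attempts run out),
    then convert HTML character references into ordinary characters.

    fetch_decoded and max_attempts are hypothetical names for this sketch.
    """
    text = ""
    for _ in range(max_attempts):
        text = fetch()
        if INTERSTITIAL_MARKER not in text:
            break
    # html.unescape handles named (&lt;) and numeric (&#20126;) references.
    return html.unescape(text)
```

With requests installed, a call could be written as fetch_decoded(lambda: requests.get(url, timeout=10).text). Bounding the number of attempts avoids looping forever if the site keeps returning the script page.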
As a smaller example, given html = '&lt;abc&gt;', Python can handle it like this:

    import HTMLParser
    html_parser = HTMLParser.HTMLParser()
    txt = html_parser.unescape(html)  # txt is now '<abc>'

To convert it back again:

    import cgi
    html = cgi.escape(txt)  # html is '&lt;abc&gt;' again
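The HTMLParser and cgi calls above are Python 2 idioms and are no longer available in current Python 3 releases. A minimal sketch of the same round trip with the standard html module follows; the variable names are only illustrative.

```python
import html

escaped = "&lt;abc&gt;"

# html.unescape converts character references back into characters.
txt = html.unescape(escaped)

# html.escape goes the other way. quote=False leaves quotation marks
# untouched, which matches the default behaviour of the older cgi.escape.
round_tripped = html.escape(txt, quote=False)

print(txt)
print(round_tripped)
```

Here txt holds the literal tag text and round_tripped holds the escaped form again, so the two calls are inverses for this input.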