###1. Installation
Run `pip install requests` or `easy_install requests`.
###2. Basic usage
In IPython, tab completion shows some of the attributes of the response object that a requests call returns:
    In [1]: import requests

    In [2]: r = requests.get('https://api.github.com')

    In [3]: r.
    r.apparent_encoding  r.history            r.raw
    r.close              r.is_redirect        r.reason
    r.connection         r.iter_content       r.request
    r.content            r.iter_lines         r.status_code
    r.cookies            r.json               r.text
    r.elapsed            r.links              r.url
    r.encoding           r.ok
    r.headers            r.raise_for_status
Quick start: http://requests-docs-cn.readthedocs.io/zh_CN/latest/user/quickstart.html

Advanced usage: http://requests-docs-cn.readthedocs.io/zh_CN/latest/user/advanced.html
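To make a few of the attributes listed above concrete, here is a minimal sketch that collects some of them into a dictionary. The helper names `summarize_response` and `fetch_summary` and the 10-second timeout are my own choices for illustration, not part of requests.

```python
import requests


def summarize_response(r):
    """Collect a few commonly used attributes of a response object."""
    return {
        "status_code": r.status_code,   # numeric HTTP status, e.g. 200
        "ok": r.ok,                     # True when the status is below 400
        "url": r.url,                   # final URL after any redirects
        "encoding": r.encoding,         # encoding used to decode r.text
        "content_type": r.headers.get("Content-Type"),
    }


def fetch_summary(url, timeout=10):
    """Hypothetical helper: GET a URL and summarize the response."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    return summarize_response(r)
```

Calling `fetch_summary('https://api.github.com')` should return a dictionary with those five keys, provided the network is reachable.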
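For installation instructions (quite a few people ran into problems while installing), see my previous blog post.

Once the requests library has fetched a page, the content can be parsed with lxml, or with other tools such as BeautifulSoup.

lxml is a Python binding for the C libraries libxml2 and libxslt. It has powerful XML (and HTML) processing capabilities, is largely compatible with Python's ElementTree API, supports XPath and BeautifulSoup parsing, and is very convenient to use.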
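As a rough illustration of the two query styles mentioned above, the sketch below reads the same small document once with ElementTree-style calls and once with XPath. The HTML snippet and the variable names are made up for this example.

```python
from lxml import etree

# Hypothetical snippet, used only for illustration.
snippet = '<div><a href="/one">One</a><p>text</p><a href="/two">Two</a></div>'

# etree.HTML parses the string with the HTML parser and returns the root element.
root = etree.HTML(snippet)

# ElementTree-style API: iterate over elements by tag name.
texts_et = [a.text for a in root.iter("a")]

# ElementTree-style findall() accepts a limited path syntax.
hrefs_et = [a.get("href") for a in root.findall(".//a")]

# XPath: select the attribute values directly.
hrefs_xpath = root.xpath("//a/@href")

print(texts_et)
print(hrefs_et)
print(hrefs_xpath)
```

The ElementTree-style calls return elements, so the attribute is read in a second step, while the XPath expression can return the attribute strings themselves.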
Official tutorial: http://lxml.de/
Below is an example of parsing HTML with lxml, run with Python 3.5 on Windows. lxml extracts the data through XPath expressions
(for details see: http://www.cnblogs.com/descusr/archive/2012/06/20/2557075.html):
    from lxml import etree

    html = '''
    <html>
    <head>
    <meta name="content-type" content="text/html; charset=utf-8" />
    <title>友情連接查詢 - 站長工具</title>
    <!-- uRj0Ak8VLEPhjWhg3m9z4EjXJwc -->
    <meta name="Keywords" content="友情連接查詢" />
    <meta name="Description" content="友情連接查詢" />
    </head>
    <body>
    <h1 class="heading">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here's some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
    <a href="http://www.cydf.org.cn/" rel="nofollow" target="_blank">青少年發展基金會</a>
    <a href="http://www.4399.com/flash/32979.htm" target="_blank">洛克王國</a>
    <a href="http://www.4399.com/flash/35538.htm" target="_blank">奧拉星</a>
    <a href="http://game.3533.com/game/" target="_blank">手機遊戲</a>
    <a href="http://game.3533.com/tupian/" target="_blank">手機壁紙</a>
    <a href="http://www.4399.com/" target="_blank">4399小遊戲</a>
    <a href="http://www.91wan.com/" target="_blank">91wan遊戲</a>
    </body>
    </html>
    '''

    page = etree.HTML(html.lower())
    hrefs = page.xpath(u"//a")
    for href in hrefs:
        # print(href.attrib)
        print(href.text)
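Putting the two libraries together, a minimal sketch of the fetch-then-parse flow might look like the following. The function names `extract_links` and `fetch_links` and the timeout value are assumptions of mine. Unlike the example above, this sketch does not lowercase the input, since that would also alter the link URLs.

```python
import requests
from lxml import etree


def extract_links(html_text):
    """Return (text, href) pairs for every <a> element that has an href."""
    page = etree.HTML(html_text)
    # The predicate [@href] skips anchors without an href attribute.
    return [(a.text, a.get("href")) for a in page.xpath("//a[@href]")]


def fetch_links(url, timeout=10):
    """Hypothetical helper: download a page with requests, then parse it with lxml."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # stop early on 4xx/5xx responses
    return extract_links(r.text)
```

Keeping the parsing step in its own function means it can be tried on a literal HTML string without any network access.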