Python Web Crawling: The requests Library

Date: 2019-12-20

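Requests is an HTTP client library written in Python. It is more convenient than urlopen: it removes much of the intermediate processing, so page data can be fetched directly. Here is a concrete example: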

import requests

def request_function_try():
    # Send a browser-like User-Agent with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    r = requests.get(url="http://www.baidu.com", headers=headers)
    print "status code:%s" % r.status_code
    print "headers:%s" % r.headers
    print "encoding:%s" % r.encoding
    print "cookies:%s" % r.cookies
    print "url:%s" % r.url
    # Python 2 on Windows: decode the UTF-8 body, then re-encode it for the console.
    print r.content.decode('utf-8').encode('mbcs')

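Call requests.get() to make the HTTP connection, passing the url and headers arguments. The return value is the page's response object, from which you can read the status code, the headers, the encoding, the cookies, the final URL and the page source.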

E:\python2.7.11\python.exe E:/py_prj/test3.py

status code:200

headers:{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:24 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Sun, 17 Sep 2017 02:53:11 GMT', 'Content-Type': 'text/html'}

encoding:ISO-8859-1

cookies:{'.baidu.com': {'/': {'BDORZ': Cookie(version=0, name='BDORZ', value='27315', port=None, port_specified=False, domain='.baidu.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1505702637, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}

url:http://www.baidu.com/
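
The encoding reported above (ISO-8859-1) follows from the Content-Type header: this response declares only text/html, with no charset, whereas the search response later in the article declares charset=utf-8. The sketch below is a simplified stand-in for that rule, written for Python 3 with only the standard library. guess_encoding is a hypothetical helper for illustration, not part of requests.

```python
from email.message import Message

def guess_encoding(content_type):
    """Return the charset implied by a Content-Type header value (hypothetical helper)."""
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset()  # None when no charset parameter is present
    if charset:
        return charset
    # Fallback for text/* bodies that declare no charset (the old HTTP default).
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'
    return None

print(guess_encoding('text/html'))
print(guess_encoding('text/html; charset=utf-8'))
```

This is why the body in the example is decoded explicitly as UTF-8 from r.content instead of relying on the guessed encoding.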

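Note that the page source contains Chinese text, and printing it directly in Python 2 causes problems. The body therefore has to be decoded first and then re-encoded. Here the target encoding is mbcs. The encoding to use can be obtained as follows.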

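import sys
reload(sys)  # Python 2 only: restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()  # 'mbcs' on Windows under Python 2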
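
The decode-then-encode step can be sketched in Python 3 as follows. The function name transcode and the choice of gbk as the target are assumptions for illustration; mbcs exists only on Windows, so the sketch uses a codec that is available everywhere.

```python
def transcode(raw, source='utf-8', target='gbk'):
    """Decode bytes from `source`, then encode to `target`.

    Characters the target codec cannot represent are replaced with '?'.
    """
    text = raw.decode(source)
    return text.encode(target, errors='replace')

body = '中文 python'.encode('utf-8')  # stand-in for r.content
converted = transcode(body)
print(converted)
```

In Python 3 this re-encoding is rarely needed for printing, because print() accepts str directly; decoding the body once is usually enough.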

requests also has a built-in JSON decoder that parses JSON response bodies:

r = requests.get('https://github.com/timeline.json')
print r.json()

E:\python2.7.11\python.exe E:/py_prj/test3.py

{u'documentation_url': u'https://developer.github.com/v3/activity/events/#list-public-events', u'message': u'Hello there, wayfaring stranger. If you\u2019re reading this then you probably didn\u2019t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.'}
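
r.json() amounts to parsing the response body as JSON. A minimal Python 3 sketch with the standard json module follows, using an abbreviated copy of the body shown above. sample_body and parse_body are names chosen for this illustration.

```python
import json

# Abbreviated copy of the response body; the real message text is longer.
sample_body = ('{"documentation_url": '
               '"https://developer.github.com/v3/activity/events/#list-public-events", '
               '"message": "Hello there, wayfaring stranger."}')

def parse_body(body):
    """Parse a JSON body; return None when the body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

data = parse_body(sample_body)
print(data['message'])
print(parse_body('<html>not json</html>'))
```

Unlike this helper, r.json() raises an exception when the body is not valid JSON, so it is worth wrapping the call in try/except when the content type is uncertain.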

What if you want to pass data with the request? Take a Baidu search as an example: type python into the search box and fetch the returned results.

import sys
import requests

def request_function_try1():
    reload(sys)  # Python 2 only: makes setdefaultencoding available again
    sys.setdefaultencoding('utf-8')
    fs_encoding = sys.getfilesystemencoding()
    print fs_encoding
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    payload = {'wd': 'python'}  # sent as the query string ?wd=python
    r = requests.get(url="http://www.baidu.com/s", params=payload, headers=headers)
    print r.status_code
    print r.content.decode('utf-8').encode(fs_encoding)
    # Write the raw bytes in one call; iterating over r.content would write one byte at a time.
    with open('search2.html', 'wb') as fp:
        fp.write(r.content)

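Why is the URL http://www.baidu.com/s? Look at what the browser does: after python is typed into the search box, the page jumps to https://www.baidu.com/s, and wd=python and the other parameters that follow are the submitted data.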

The result is as follows:

status code:200

headers:{'Strict-Transport-Security': 'max-age=172800', 'Bdqid': '0xeb453e0b0000947a', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDSVRTM=0; path=/, BD_HOME=0; path=/, H_PS_PSSID=1421_21078_17001_24394; path=/; domain=.baidu.com', 'Expires': 'Sun, 17 Sep 2017 02:56:13 GMT', 'Bduserid': '0', 'X-Powered-By': 'HPHP', 'Server': 'BWS/1.1', 'Connection': 'Keep-Alive', 'Cxy_all': 'baidu+2455763ad13223918d1e7f7431d4d18e', 'Cache-Control': 'private', 'Date': 'Sun, 17 Sep 2017 02:56:43 GMT', 'Vary': 'Accept-Encoding', 'Content-Type': 'text/html; charset=utf-8', 'Bdpagetype': '1', 'X-Ua-Compatible': 'IE=Edge,chrome=1'}

encoding:utf-8

cookies:<RequestsCookieJar[<Cookie H_PS_PSSID=1421_21078_17001_24394 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>

url:https://www.baidu.com/
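
The params argument ends up as the query string of the request URL. The sketch below shows the equivalent encoding in Python 3 using only the standard library. build_url is a hypothetical helper, not a requests function.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, params):
    """Append params to base as a URL-encoded query string."""
    return base + '?' + urlencode(params)

url = build_url('http://www.baidu.com/s', {'wd': 'python'})
print(url)
# Reading the query back shows the same key/value pair that was passed in.
print(parse_qs(urlsplit(url).query))
```

Non-ASCII search terms are percent-encoded as UTF-8 bytes, which is why a Chinese keyword appears as a run of %XX escapes in the address bar.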

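What if the site returns a status code other than 200? For this, requests provides raise_for_status(), which raises an exception when the response carries an error status (4xx or 5xx).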

import requests

url = 'http://www.baidubaidu.com/'
try:
    r = requests.get(url)
    r.raise_for_status()  # raises for 4xx and 5xx responses
except requests.RequestException as e:
    print e

The result is as follows; the exception carries the specific error code.

E:\python2.7.11\python.exe E:/py_prj/test3.py

409 Client Error: Conflict for url: http://www.baidubaidu.com/
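
raise_for_status() raises only for error statuses: a redirect (3xx) or another success code (2xx) passes silently. The Python 3 sketch below is a simplified stand-in for that rule. check_status and StatusError are hypothetical names, not the requests implementation.

```python
class StatusError(Exception):
    """Raised by check_status for 4xx and 5xx codes (hypothetical)."""

def check_status(status_code, url):
    """Raise StatusError for client (4xx) and server (5xx) error codes."""
    if 400 <= status_code < 500:
        raise StatusError('%d Client Error for url: %s' % (status_code, url))
    if 500 <= status_code < 600:
        raise StatusError('%d Server Error for url: %s' % (status_code, url))

for code in (200, 304, 409, 503):
    try:
        check_status(code, 'http://www.baidubaidu.com/')
        print(code, 'ok')
    except StatusError as exc:
        print(exc)
```

Catching requests.RequestException, as the example above does, also covers failures that never produce a status code at all, such as a connection error.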

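Next, let's see how to simulate access to an HTTPS site, using CSDN as the example. To simulate a login, first capture the page traffic for analysis; here Fiddler is used for the capture.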

 
 
 
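(1) Analyse the page redirects. First comes the login page at https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn, followed by an automatic redirect to my.csdn.net.

(2) Analyse the data the page submits. The pane on the right shows the data actually sent. The upper box holds the request headers and the lower box holds the headers of the server's response. The request headers in the code are built from the data shown there.

(3) The capture shows that the data is submitted with POST, so the next thing to check is which fields are posted. Clicking WebForms shows the uploaded data, which includes the fields username, password, lt, execution and _eventId. Record these fields so they can be constructed in the code.

(4) The last step is to inspect the request that jumps to the mycsdn page. It uses GET and sends only headers, so only the headers need to be constructed.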
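
Fields such as lt and execution are often hidden inputs in the login form, so they can be read from the login page before posting instead of being copied by hand from a capture. A Python 3 sketch using the standard html.parser module follows. The HTML snippet and its values are made up for illustration and are not CSDN's actual markup; HiddenInputParser is a hypothetical class.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        if attr.get('type') == 'hidden' and 'name' in attr:
            self.fields[attr['name']] = attr.get('value') or ''

# Made-up login form, standing in for the HTML of a real login page.
sample_html = '''
<form method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="lt" value="LT-example">
  <input type="hidden" name="execution" value="e1s1">
  <input type="hidden" name="_eventId" value="submit">
</form>
'''

parser = HiddenInputParser()
parser.feed(sample_html)
print(parser.fields)
```

The collected dict can then be merged with the username and password to form the POST data.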

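With the data flow analysed, the code can now be written:

First build the headers. The most important one is User-Agent; if it is not set, the site will block the request.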

headers = {'host': 'passport.csdn.net', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
headers1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}

Next, build the cookie values that go into the request headers:

# JSESSIONID was listed twice in the original dict; Python keeps only the last value, so only that one is written here.
cookie = {'uuid_tt_dd': '-411111111111119_20170926', 'JSESSIONID': '2222222222222220265C40D8A33CB.tomcat2',
          'UN': 'XXXXX', 'UE': 'xxxxx@163.com', 'BT': '334343481',
          'LSSC': 'LSSC-145514-7aaaaaaaaaaazgGmhFvHfO9taaaaaaaR-passport.csdn.net',
          'Hm_lvt_6bcd52f51bbbbbb2bec4a3997715ac': '15044213,150656493,15064444445,1534488843',
          'Hm_lpvt_6bcd52f51bbbbbbbe32bec4a3997715ac': '1506388843',
          'dc_tos': 'oabckz', 'dc_session_id': '15063aaaa027_0.7098840409889817',
          '__message_sys_msg_id': '0', '__message_gu_msg_id': '0', '__message_cnel_msg_id': '0',
          '__message_district_code': '000000', '__message_in_school': '0'}
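
On the wire, a cookie dict like this is sent as a single Cookie request header made of name=value pairs separated by "; ". A Python 3 sketch of that serialisation follows, using a shortened dict. cookie_header is a hypothetical helper for illustration.

```python
def cookie_header(cookies):
    """Join a dict of cookies into the value of a Cookie request header."""
    return '; '.join('%s=%s' % (name, value) for name, value in cookies.items())

# A few of the placeholder cookies from the dict above.
sample = {'UN': 'XXXXX', 'UE': 'xxxxx@163.com', 'dc_tos': 'oabckz'}
print(cookie_header(sample))
```

Comparing this string with the Cookie header in the Fiddler capture is a quick way to check that the dict was copied correctly.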

Then set the URL and the POST data:

url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
data = {'username': 'xxxx', 'password': 'xxxxx', 'lt': 'LT-1522220-BSnH9fN6ycbbbbbqgsSP2waaa1jvq', 'execution': 'e4ab', '_eventId': 'submit'}

Now prepare the connection. A Session is used so that the later requests all share the same session state, such as the cookie values.

r=requests.Session()

r.post(url=url,headers=headers,cookies=cookie,data=data)

This step fails with the following error, reporting certificate verify failed:

File "E:\python2.7.11\lib\site-packages\requests\adapters.py", line 506, in send

    raise SSLError(e, request=request)

requests.exceptions.SSLError: HTTPSConnectionPool(host='passport.csdn.net', port=443): Max retries exceeded with url: /account/login?from=http://my.csdn.net/my/mycsdn (Caused by SSLError(SSLError(1, u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)'),))

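The cause is a feature introduced in Python 2.7.9: calling urllib.urlopen on an https URL now verifies the SSL certificate. When the target uses a self-signed certificate, this produces the error urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>. requests likewise verifies certificates by default, which is why the post above fails.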

PEP 476 says the following about working around this:

For users who wish to opt out of certificate verification on a single connection, they can achieve this by providing the context argument to urllib.urlopen

In other words, the certificate check can be disabled. For urllib there are two ways; one is to use the context parameter of urllib.urlopen() and set it to ssl._create_unverified_context():

import ssl
import urllib

# Python 2.7.9+: a context that skips certificate and hostname checks
context = ssl._create_unverified_context()
urllib.urlopen("https://no-valid-cert", context=context)
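
The difference between the default context and the unverified one can be inspected without opening a connection. A Python 3 sketch follows; note that ssl._create_unverified_context is a private helper of the ssl module, and describe is a name chosen for this illustration.

```python
import ssl

def describe(context):
    """Return the two settings that control certificate checking."""
    return {
        'verifies_certificate': context.verify_mode == ssl.CERT_REQUIRED,
        'checks_hostname': context.check_hostname,
    }

print(describe(ssl.create_default_context()))
print(describe(ssl._create_unverified_context()))
```

With both checks off, the connection is still encrypted but the server's identity is not confirmed, so this is best kept to testing against hosts you already trust.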

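In requests, however, there is a verify parameter; setting it to False is enough. Note that this turns certificate checking off for that request.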

r.post(url=url,headers=headers,cookies=cookie,data=data,verify=False)

Next, request the mycsdn address. The login to CSDN has now succeeded.

s = r.get('http://my.csdn.net/my/mycsdn', headers=headers1)
print s.status_code
print s.content.decode('utf-8').encode(type)
