Python's standard library provides modules such as urllib, urllib2, and httplib for HTTP requests, but their APIs are clunky. They were built for another era, another internet, and they demand an enormous amount of work, including various method overrides, to accomplish even the simplest tasks.
import urllib2
import json
import cookielib

def urllib2_request(url, method="GET", cookie="", headers={}, data=None):
    """
    :param url: the URL to request
    :param method: request method: GET, POST, DELETE, PUT...
    :param cookie: the cookie string to send, e.g. cookie='k1=v1;k1=v2'
    :param headers: request headers to send with the data,
                    e.g. headers={'ContentType': 'application/json; charset=UTF-8'}
    :param data: the data to send, e.g. data={'d1': 'v1'}
    :return: a tuple of (response body string, CookieJar object).
             The CookieJar object can be iterated with a for loop:
                 for item in cookiejar:
                     print item.name, item.value
    """
    if data:
        data = json.dumps(data)

    cookie_jar = cookielib.CookieJar()
    handler = urllib2.HTTPCookieProcessor(cookie_jar)
    opener = urllib2.build_opener(handler)
    opener.addheaders.append(['Cookie', 'k1=v1;k1=v2'])
    request = urllib2.Request(url=url, data=data, headers=headers)
    request.get_method = lambda: method
    response = opener.open(request)
    origin = response.read()
    return origin, cookie_jar

# GET
result = urllib2_request('http://127.0.0.1:8001/index/', method="GET")
# POST
result = urllib2_request('http://127.0.0.1:8001/index/', method="POST", data={'k1': 'v1'})
# PUT
result = urllib2_request('http://127.0.0.1:8001/index/', method="PUT", data={'k1': 'v1'})
(1) What is the requests module
The requests module is a third-party Python library for network requests, used mainly to simulate a browser sending requests. It is powerful, concise, and efficient, and it dominates the web-scraping field.
requests is an Apache2-licensed HTTP library written in Python. It provides a high-level wrapper over Python's built-in modules, making network requests far more pleasant for Python developers; with requests you can easily perform virtually anything a browser can do.
(2) Why use the requests module
1) Because the urllib module has many inconveniences:
1. URL encoding must be handled manually
2. POST request parameters must be handled manually
3. Cookie handling is tedious: create a cookiejar object, create a handler object, create an opener object
4. Proxy handling is tedious: create a handler object with the proxy IP and port packed into it, then create an opener object
2) By comparison, the conveniences of the requests module:
1. URL encoding is handled automatically
2. POST request parameters are handled automatically
3. Cookie and proxy handling are greatly simplified
...
The requests API is much more convenient (under the hood it simply wraps urllib3).
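As a quick illustration of the automatic URL encoding, a request can be built and prepared locally, with no network traffic, to see what requests does to a non-ASCII query parameter before sending (the Sogou URL is just the example used throughout this article):

```python
import requests

# Build and prepare a request locally; .prepare() applies the same
# URL-encoding that requests.get() would perform before sending.
req = requests.Request("GET", "https://www.sogou.com/web",
                       params={"query": "周杰倫", "ie": "utf-8"})
prepared = req.prepare()
# The Chinese characters come out percent-encoded automatically
print(prepared.url)
```

With urllib you would have to call quote()/urlencode() yourself; here the dict passed to `params` is enough.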
Note: requests downloads the page content but does not execute JavaScript; for data loaded by JS we must analyze the target site ourselves and issue new requests.
$ pip3 install requests
Usage workflow:
1. Specify the URL
2. Send the request with the requests module
3. Read the response data
4. Persist the data
The most commonly used methods are requests.get() and requests.post().
requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the methods above are built on this one:
requests.request(method, url, **kwargs)
Note: requests.get(url) is equivalent to requests.request(method="get", url=url).
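One way to see this equivalence without hitting the network is to inspect the wrapper's source: each verb helper defined in requests/api.py is a thin wrapper that forwards to requests.request (a quick check, assuming requests is importable):

```python
import inspect
import requests.api

# get(), post(), etc. live in requests/api.py and simply delegate
# to request(method, url, **kwargs) with the method name filled in.
print(inspect.getsource(requests.api.get))
```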
Official documentation: http://docs.python-requests.org/en/master/
GET is HTTP's default request method:

Common GET operations:
import requests

# Specify the URL
url = 'https://www.sogou.com/'
# Send a GET request; get() returns a response object on success
response = requests.get(url=url)
# Read the response data: .text returns the page data as a string
page_data = response.text
print(page_data)
# Persist to disk
with open('./sougou.html', "w", encoding="utf-8") as fp:
    fp.write(page_data)
Note: the .text attribute returns the page data of the response object as a string; other common response attributes are covered in a later section.
Goal: given a search term, fetch the Sogou results page for it.
import requests

# Only the essential parameters are kept; the Chinese characters need no
# manual re-encoding, requests handles that automatically
url = "https://www.sogou.com/web?query=周杰倫&ie=utf-8"
response = requests.get(url=url)
page_text = response.text  # page data as a string
with open('./zhou.html', 'w', encoding="utf-8") as fp:
    fp.write(page_text)
Opening zhou.html shows:
A closer look at requests.get():
Expanding the "+" in the signature shows the detailed usage of params:
The parameters carried in the URL can be extracted and packed into a dict (or bytes), and the packed data is then assigned to params.
import requests url = 'https://www.sogou.com/web' # 將參數封裝到字典中 params = { 'query': "周杰倫", 'ie': 'utf-8' } response = requests.get(url=url, params=params) response.status_code # 響應狀態碼:200 # 數據持久化 page_text = response.text # 字符串形式頁面數據 with open('./zhou.html', 'w', encoding="utf-8") as fp: fp.write(page_text)
import requests url = 'https://www.sogou.com/web' # 將參數封裝到字典中 params = { 'query': "周杰倫", 'ie': 'utf-8' } # 自定義請求頭信息 headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36', } # 將自定義請求頭信息放入get方法第三個參數 response = requests.get(url=url, params=params, headers=headers) response.status_code # 200
(1) The data does not appear in the address bar
(2) There is no upper limit on the size of the data
(3) The request has a body
(4) Chinese characters in the request body are URL-encoded
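The URL encoding mentioned in (4) is the standard application/x-www-form-urlencoded scheme; a quick stdlib sketch (the field names here are made up for illustration) shows what a Chinese value becomes in the request body:

```python
from urllib.parse import urlencode

# The same form encoding requests applies to a data dict:
# each non-ASCII character is UTF-8 encoded, then percent-escaped.
body = urlencode({"form_email": "test", "nickname": "周杰倫"})
print(body)  # form_email=test&nickname=%E5%91%A8%E6%9D%B0%E5%80%AB
```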
requests.post() is used exactly like requests.get(); the difference is that requests.post() has a data parameter, which holds the request-body data.
import requests

# 1. Specify the URL of the POST request
url = "https://accounts.douban.com/login"
# Pack the POST parameters
data = {
    "source": "movie",
    "redir": "https://www.douban.com/",
    "form_email": "18907281232",
    "form_password": "uashudh282",
    "login": "登陸",
}
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
}
# 2. Send the POST request; returns a response object
response = requests.post(url=url, data=data, headers=headers)
# 3. Read the page data from the response object
page_text = response.text
# 4. Persist to disk
with open("./douban.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
Because Douban has since added a captcha to its login, the code above can no longer log in; it is refined later on.
Notes:

1) Check the General section of the Headers panel and confirm that the Request Method is POST.

2) The parameters carried include the username and password values.
Typically, you want to send some form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the data argument. Your dictionary of data will automatically be form-encoded when the request is made:
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}
You can also pass a list of tuples to the data argument. This is especially useful when a form has multiple elements that use the same key:
>>> payload = (('key1', 'value1'), ('key1', 'value2'))
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{
  ...
  "form": {
    "key1": [
      "value1",
      "value2"
    ]
  },
  ...
}
Example: scrape movie details from the Douban movie charts at https://movie.douban.com/.
import requests

url = 'https://movie.douban.com/j/chart/top_list?'
# Parameters carried by the AJAX request
params = {
    "type": "13",
    "interval_id": "90:80",
    "action": "",
    "start": "0",
    "limit": "1"
}
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
}
# Get the response object
response = requests.get(url=url, params=params, headers=headers)
print(response.text)
"""
[{"rating":["8.2","45"],"rank":305,"cover_url":"https://img1.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p1910831868.jpg",
"is_playable":false,"id":"1291564","types":["劇情","歌舞","愛情","科幻"],"regions":["英國","德國","法國"],"title":"十分鐘年華老去:大提琴篇",
"url":"https:\/\/movie.douban.com\/subject\/1291564\/","release_date":"2002-09-03","actor_count":12,"vote_count":12708,"score":"8.2",
"actors":["瓦萊麗亞·布魯尼·泰德斯基","Amit Arroz","Mark Long","Alexandra Staden","多米尼克·威斯特","畢碧安娜·貝格","伊爾姆·赫爾曼","魯道夫·霍辛斯基",
"Jean-Luc Nancy","Ana Samardzija","阿萊克斯·德斯卡","丹尼爾·克雷格"],"is_watched":false}]
"""
Open the Douban movie charts → Love (愛情), select a rating range on the page, and capture the AJAX GET request.
This gives the GET request's URL and parameters: https://movie.douban.com/j/chart/top_list?type=13&interval_id=95%3A85&action=&start=0&limit=1
Click a percentage button on the page and inspect the parameters carried by the GET request:
Goal: scrape KFC restaurant locations by city.
import requests

# 1. Specify the URL
# URL of the AJAX-based POST request
post_url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
# Parameters of the POST request
data = {
    "cname": "",
    "pid": "",
    "keyword": "上海",
    "pageIndex": "1",
    "pageSize": "10"
}
# Custom request headers (spoofed UA)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
}
# 2. Send the AJAX-based POST request
response = requests.post(url=post_url, headers=headers, data=data)
response.text
Open the KFC homepage and click the restaurant-search button to reach the restaurant-search page:
Clicking "search" does not change the URL in the address bar, so the POST URL cannot be read from there; we need a packet-capture tool to obtain the URL of the asynchronous request.
Inspect the parameters carried by the form:
'{"Table":[{"rowcount":28}],"Table1":[{"rownum":1,"storeName":"開發區上海路",
"addressDetail":"開發區上海路80號利羣時代超市一樓","pro":"Wi-Fi,店內參觀,禮品卡,生日餐會",
"provinceName":"江蘇省","cityName":"南通市"},......,{"rownum":10,"storeName":"上海南路",
"addressDetail":"上海南路3號699生活空間3號樓","pro":"Wi-Fi,店內參觀,禮品卡,生日餐會","provinceName":"江西省",
"cityName":"南昌市"}]}'
Goal: given a search term, scrape a range of result pages from Sogou's Zhihu vertical search.
import requests
import os

# Create a folder for the pages (guard so it is not created twice)
if not os.path.exists("./pages"):
    os.mkdir("./pages")

word = input("enter a word:")  # search term supplied by the user

# Page range supplied at run time
start_pageNum = int(input("enter a start pageNum:"))
end_pageNum = int(input("enter a end pageNum:"))

# Custom request headers (spoofed UA)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
}

# 1. Specify the URL: designed as one generic, reusable URL
url = "http://zhihu.sogou.com/zhihu"

for page in range(start_pageNum, end_pageNum + 1):  # note the interval bounds
    params = {
        "query": word,
        "page": page,  # 1, 2, 3...
        "ie": "utf-8"
    }
    # One request per distinct URL, yielding the matching response object
    response = requests.get(url=url, params=params, headers=headers)
    # Read the page data for this page number from the response
    page_text = response.text
    # Persist to disk
    filename = word + str(page) + ".html"  # build each file name
    filePath = "pages/" + filename
    with open(filePath, "w", encoding="utf-8") as fp:
        fp.write(page_text)
    print("page %d written successfully" % page)
Key points:
word = input("enter a word:")  # search term supplied by the user

# Page range supplied at run time
start_pageNum = int(input("enter a start pageNum:"))
end_pageNum = int(input("enter a end pageNum:"))
# Page range supplied at run time
start_pageNum = int(input("enter a start pageNum:"))
end_pageNum = int(input("enter a end pageNum:"))

# 1. Specify the URL: designed as one generic, reusable URL
url = "http://zhihu.sogou.com/zhihu"

for page in range(start_pageNum, end_pageNum + 1):  # note the interval bounds
Starting from `import requests`, follow the source: requests/__init__.py → `from .api import request` → requests/api.py:
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload. ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')`` or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data before giving up, as a float, or a :ref:`(connect timeout, read timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify the server's TLS certificate, or a string, in which case it must be a path to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

        >>> import requests
        >>> req = requests.request('GET', 'http://httpbin.org/get')
        <Response [200]>
    """
def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # params may be:
    # - a dict
    # - a string
    # - bytes (ASCII only)
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params={'k1': 'v1', 'k2': '水電費'})

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params="k1=v1&k2=水電費&k3=v3&k3=vv3")

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # Wrong: bytes containing non-ASCII characters
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水電費&k3=v3&k3=vv3", encoding='utf8'))
    pass


def param_data():
    # data may be a dict, a string, bytes, or a file object
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data={'k1': 'v1', 'k2': '水電費'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1; k2=v2; k3=v3; k3=v4")

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1;k2=v2;k3=v3;k3=v4",
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file contains: k1=v1;k2=v2;k3=v3;k3=v4
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'})
    pass


def param_json():
    # Serializes the data into a string with json.dumps(...), then sends it
    # in the request body with Content-Type: application/json
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水電費'})


def param_headers():
    # Send custom request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水電費'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # Send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )

    # A CookieJar can also be used (the dict form is a wrapper around it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/',
                          secure=False, expires=None, discard=True, comment=None, comment_url=None,
                          rest={'HttpOnly': None}, rfc2109=False, port_specified=False,
                          domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # Upload a file
    # file_dict = {
    #     'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload a file with a custom filename
    # file_dict = {
    #     'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload with a custom filename and literal content
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload with a filename, content, content type, and extra headers
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)
    pass


def param_auth():
    # Authentication: used when a site pops up a dialog asking for a
    # username and password before the HTML can be fetched.
    # Under the hood the credentials are packed into a request header.
    # Most sites do not use the default scheme but roll their own,
    # so we then need to mimic the site's own scheme ourselves.
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user',
                       auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    #                    auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass',
    #                    auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)


def param_timeout():
    # Set a timeout
    # ret = requests.get('http://google.com/', timeout=1)  # connect within 1s or give up
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))  # (connect timeout, read timeout)
    # print(ret)
    pass


def param_allow_redirects():
    # Disable redirects
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # Proxy: the request goes to the proxy first, which forwards it on our
    # behalf (getting an IP banned is a common occurrence).
    # The proxies argument configures a proxy for any single request.
    # proxies = {
    #     "http": "61.172.249.96:80",
    #     "https": "http://61.185.219.126:3128",
    # }
    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}  # targeted: requests to the first address go through the proxy
    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)

    # Proxy with authentication
    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    #     'http': '77.75.105.165',
    #     'https': '77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)
    pass


def param_stream():
    # By default the response body is downloaded immediately after the request.
    # Response-body workflow: if stream is set to True and the body is never
    # consumed, requests cannot release the connection back to the pool,
    # so close it explicitly.
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # process the response here
    #     for i in r.iter_content():
    #         print(i)


def requests_session():
    import requests

    session = requests.Session()

    # 1. Visit any page first to obtain a cookie
    i1 = session.get(url="http://dig.chouti.com/help/service")

    # 2. Log in carrying the previous cookie; the backend authorizes the
    #    gpsd value in the cookie
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "8615131255089",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)
When a request needs to carry a certificate, pass this parameter; the certificate is usually a .pem file.

You can specify a local certificate to use as the client-side certificate, either a single file (containing the private key and the certificate) or a tuple of both files' paths:
requests.get('https://kennethreitz.org', cert=('/path/client.cert', '/path/client.key'))
Requests verifies SSL certificates for HTTPS requests, just like a web browser. SSL verification is enabled by default; if certificate verification fails, Requests raises an SSLError:
>>> requests.get('https://github.com', verify=True) <Response [200]>
You can pass verify the path to a CA_BUNDLE file, or to a directory containing trusted CA certificate files:
>>> requests.get('https://github.com', verify='/path/to/certfile')
Note: if verify is set to a directory path, the directory must have been processed with the c_rehash utility supplied with OpenSSL.

By default, verify is set to True. The verify option applies only to host certificates.
A session object can be used to manage cookies and headers automatically. (It is less flexible, so it is not always recommended.)
Example with Chouti (抽屜新熱榜):
import requests

session = requests.Session()

i1 = session.get(url="http://dig.chouti.com/help/service")

i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxooxxoo",
        'oneMonth': ""
    }
)

i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589523"
)
print(i3.text)
import requests
# Browse requests.exceptions to see the available exception types
from requests.exceptions import *

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:
    print('===:')
# except ConnectionError:  # network unreachable
#     print('-----')
# except Timeout:
#     print('aaaaa')
except RequestException:
    print('Error')
http://cn.python-requests.org/zh_CN/latest/
import requests

# Specify the URL and send a GET request
respone = requests.get('http://www.jianshu.com')

# Page data of the response object as a string
print(respone.text)

# Page data of the response object as raw bytes
print(respone.content)

# Response status code
print(respone.status_code)
"""
403
"""

# Response headers (shown as a dict)
print(respone.headers)
"""
{'Date': 'Fri, 02 Nov 2018 03:04:23 GMT', 'Content-Type': 'text/html',...}
"""

# Cookies from the response headers
print(respone.cookies)
print(respone.cookies.get_dict())
print(respone.cookies.items())

# The URL that was requested
print(respone.url)
"""
https://www.jianshu.com/
"""

# Redirect history
print(respone.history)
"""
[<Response [301]>]
"""

# Response encoding
print(respone.encoding)
"""
ISO-8859-1
"""

# Closing: response.close()
from contextlib import closing
with closing(requests.get('xxx', stream=True)) as response:
    for line in response.iter_content():
        pass
import requests

response = requests.get('http://www.autohome.com/news')
# The Autohome site returns pages encoded as gb2312, while requests
# defaults to ISO-8859-1; without setting gbk the Chinese text is garbled
# response.encoding = 'gbk'
print(response.text)
import requests

response = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')

with open('a.jpg', 'wb') as f:
    f.write(response.content)
The stream parameter: fetch the body piece by piece. When downloading, say, a 100 GB video, reading response.content and writing it out in one go is unreasonable.
import requests

response = requests.get(
    'https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
    stream=True)

with open('b.mp4', 'wb') as f:
    for line in response.iter_content():
        f.write(line)
import requests
import json

response = requests.get('http://httpbin.org/get')

res1 = json.loads(response.text)  # too tedious
res2 = response.json()            # parse the JSON directly
print(res1 == res2)  # True
By default Requests will perform location redirection for all verbs except HEAD.

We can use the history property of the Response object to track redirection. The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

For example, GitHub redirects all HTTP requests to HTTPS:

>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]

If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]

If you're using HEAD, you can enable redirection as well:

>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]
import requests
import re

# First request: fetch the login page
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # the initial (not yet authorized) cookie
# Extract the CSRF token from the page
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]

# Second request: POST to the login page with the initial cookie,
# the token, and the credentials
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714'
}

# Test 1: without allow_redirects=False, a Location response header makes
# requests follow the redirect; r2 is the response of the new page
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie
                   )
print(r2.status_code)      # 200
print(r2.url)              # the page after the redirect
print(r2.history)          # the response before the redirect
print(r2.history[0].text)  # the response body before the redirect

# Test 2: with allow_redirects=False, the Location header is ignored even
# if present; r2 is still the response of the original page
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie,
                   allow_redirects=False
                   )
print(r2.status_code)  # 302
print(r2.url)          # the page before the redirect: https://github.com/session
print(r2.history)      # []
User-specific data that requires a logged-in user.
Goal: scrape the Douban personal-homepage data of a given user (張三).
import requests

session = requests.session()  # obtain a session object

# 1. Send the login request: the cookie is fetched and stored in the session object
login_url = "https://accounts.douban.com/login"
data = {
    "source": "index_nav",
    "redir": "https://www.douban.com/",
    "form_email": "1xxxxxx0",
    "form_password": "xxxxxxxx",
    "login": "登陸"
}
# Custom request headers (spoofed UA)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
}
# Send the POST request through the session; once this completes,
# the session object already carries the cookie
login_response = session.post(url=login_url, data=data, headers=headers)

# 2. Request the personal homepage (session carries the cookie) and read the page data
url = "https://www.douban.com/people/186757832/"  # URL from the address bar
response = session.get(url=url, headers=headers)
# Read the page data from the response object
page_text = response.text

with open('./douban110.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
The server uses cookies to track the client's state.

1. Perform the login (obtaining the cookie)
2. When requesting the personal homepage, the cookie must be carried along with the request
Note: the implementation uses a session object, which can send requests and stores cookie objects automatically. After one request stores its cookie in the session object, the next request made through the same session carries that cookie to the server.

When the login request succeeds, the server's response sets a cookie on the client; that cookie is stored automatically in the session object, and the session attaches it to the next request.
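The cookie storage described above lives on the session's cookie jar; a minimal offline sketch (the cookie name and value here are made up) shows where an automatically stored cookie would sit:

```python
import requests

session = requests.Session()
# Simulate the cookie a server's login response would set; a real
# Set-Cookie response header lands in this same jar automatically.
session.cookies.set("sessionid", "fake-session-id")
# Every later session.get()/session.post() attaches the jar's cookies.
print(session.cookies.get_dict())
```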
Proxy: a third party performs the task on your behalf. Everyday analogues: purchasing agents, resellers, brokers.
1) The related anti-scraping measure: some websites count the requests from a single IP within a time window. If the frequency is too fast to be a normal visitor (the typical abnormal visitors being crawlers and malicious programs), the site bans that visitor's IP, blocking further access to the site.
2) The counter-measure: with proxy IPs, when one address gets banned you can switch to another IP and keep going.
Forward proxy: fetches data on behalf of the client.

Reverse proxy: serves data on behalf of the server.

Crawlers all use forward proxies, fetching data on behalf of the client.
www.goubanjia.com (recommended), Kuaidaili (快代理), and Xici (西祠代理).
1) Check your local IP:

2) Look up and obtain a proxy IP:
import requests

url = "https://www.baidu.com/s?ie=utf-8&wd=ip"
# Pack the proxy IP into a dict
proxy = {
    "https": "89.179.119.229:55205"  # the scheme must match the proxy
}
# Custom request headers (spoofed UA)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
}
# Send the request through the substitute IP
response = requests.get(url=url, proxies=proxy, headers=headers)

with open('./detail.html', "w", encoding="utf-8") as fp:
    fp.write(response.text)
The scraped detail.html page shows:

The address has changed to the specified proxy IP; the machine's reported location is now Russia.
Many websites require a captcha at login.

(1) Recognize the captcha manually

(2) Recognize it automatically via the Yundama (雲打碼) captcha-solving platform
import requests
from lxml import etree

# 1. Fetch the page that carries the captcha
url = 'https://www.douban.com/accounts/login?source=movie'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text
import re
from lxml import etree

# 2. Parse the captcha out of the page data and download the image locally
tree = etree.HTML(page_text)
# Use XPath to select the src attribute of the img tag
codeImg_url = tree.xpath('//*[@id="captcha_image"]/@src')[0]
if codeImg_url:
    # Fetch the binary data of the captcha image
    code_img = requests.get(url=codeImg_url, headers=headers).content
    # Extract the captcha-id parameter
    c_id = re.findall('<img id="captcha_image".*?id=(.*?)&.*?>', page_text, re.S)[0]
    with open('./code.png', 'wb') as fp:
        fp.write(code_img)
1) Register on the official site

Both a regular-user account and a developer account need to be registered.

2) Log in with the developer account

3) Download the sample code

Developer docs → "sample calls and the latest DLL"

Unzipping the package yields the following files:

4) Create a "software" entry

On the developer page, click "My Software" → "Add new software":

Submit after entering a software name:

The communication key obtained here will also be used shortly.
By modifying the source in the sample code downloaded earlier, it can recognize the characters in a captcha image for us.

Extract the code of the YDMHttp class from the YDMHTTPDemo3.x.py file:
import http.client, mimetypes, urllib, json, time, requests

class YDMHttp:
    apiurl = 'http://api.yundama.com/api.php'
    username = ''
    password = ''
    appid = ''
    appkey = ''

    def __init__(self, username, password, appid, appkey):
        self.username = username
        self.password = password
        self.appid = str(appid)
        self.appkey = appkey

    def request(self, fields, files=[]):
        response = self.post_url(self.apiurl, fields, files)
        response = json.loads(response)
        return response

    def balance(self):
        data = {'method': 'balance', 'username': self.username, 'password': self.password,
                'appid': self.appid, 'appkey': self.appkey}
        response = self.request(data)
        if (response):
            if (response['ret'] and response['ret'] < 0):
                return response['ret']
            else:
                return response['balance']
        else:
            return -9001

    def login(self):
        data = {'method': 'login', 'username': self.username, 'password': self.password,
                'appid': self.appid, 'appkey': self.appkey}
        response = self.request(data)
        if (response):
            if (response['ret'] and response['ret'] < 0):
                return response['ret']
            else:
                return response['uid']
        else:
            return -9001

    def upload(self, filename, codetype, timeout):
        data = {'method': 'upload', 'username': self.username, 'password': self.password,
                'appid': self.appid, 'appkey': self.appkey,
                'codetype': str(codetype), 'timeout': str(timeout)}
        file = {'file': filename}
        response = self.request(data, file)
        if (response):
            if (response['ret'] and response['ret'] < 0):
                return response['ret']
            else:
                return response['cid']
        else:
            return -9001

    def result(self, cid):
        data = {'method': 'result', 'username': self.username, 'password': self.password,
                'appid': self.appid, 'appkey': self.appkey, 'cid': str(cid)}
        response = self.request(data)
        return response and response['text'] or ''

    def decode(self, filename, codetype, timeout):
        cid = self.upload(filename, codetype, timeout)
        if (cid > 0):
            for i in range(0, timeout):
                result = self.result(cid)
                if (result != ''):
                    return cid, result
                else:
                    time.sleep(1)
            return -3003, ''
        else:
            return cid, ''

    def report(self, cid):
        data = {'method': 'report', 'username': self.username, 'password': self.password,
                'appid': self.appid, 'appkey': self.appkey, 'cid': str(cid), 'flag': '0'}
        response = self.request(data)
        if (response):
            return response['ret']
        else:
            return -9001

    def post_url(self, url, fields, files=[]):
        for key in files:
            files[key] = open(files[key], 'rb')
        res = requests.post(url, files=files, data=fields)
        return res.text
If you are in IPython, paste the class straight into a cell and run it, which loads the class into memory.

Use the remaining code in YDMHTTPDemo3.x.py to define a function that recognizes the captcha:
def getCode(codeImg):
    """
    Calls the captcha platform's API to recognize the given captcha image
    and returns the characters it contains.
    """
    # Regular-user account registered on Yundama
    username = 'xxxxxxxx'
    # Password
    password = 'xxxxxxxx'
    # Software ID, required for developer revenue sharing; found under
    # "My Software" in the developer backend
    appid = 6206
    # Software key, also found under "My Software" in the developer backend
    appkey = 'afde6628xxxxxxxxx0b8eeaxxxxx899f'
    # The downloaded captcha image file
    filename = codeImg
    # Captcha type, e.g. 1004 = 4 alphanumeric characters. Pricing differs
    # per type; fill it in accurately or the recognition rate suffers.
    # All types are listed at http://www.yundama.com/price.html
    codetype = 3000  # Douban's captcha is variable-length English text
    # Timeout in seconds (10s or more)
    timeout = 20

    # Check
    if (username == 'username'):
        print('please configure the parameters before testing')
    else:
        # Initialize
        yundama = YDMHttp(username, password, appid, appkey)
        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)
        # Query the balance
        balance = yundama.balance()
        print('balance: %s' % balance)
        # Start recognition: image path, captcha type ID, timeout (s), result
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result
import re
import requests
from lxml import etree

# 1. Fetch the page that carries the captcha
url = 'https://www.douban.com/accounts/login?source=movie'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
page_text = requests.get(url=url, headers=headers).text

# 2. Parse the captcha out of the page data and download the image locally
tree = etree.HTML(page_text)
# Use XPath to select the src attribute of the img tag
codeImg_url = tree.xpath('//*[@id="captcha_image"]/@src')[0]
if codeImg_url:
    # Fetch the binary data of the captcha image
    code_img = requests.get(url=codeImg_url, headers=headers).content
    # Extract the captcha-id parameter
    c_id = re.findall('<img id="captcha_image".*?id=(.*?)&.*?>', page_text, re.S)[0]
    with open('./code.png', 'wb') as fp:
        fp.write(code_img)

    # Recognize the characters on the captcha image
    codeText = getCode('./code.png')

    # Perform the login
    post = "https://accounts.douban.com/login"
    data = {
        'source': 'movie',
        'redir': 'https://movie.douban.com/',
        'form_email': 'xxxxxx',
        'form_password': 'xxxxxxx',
        'captcha-solution': codeText,  # the captcha text
        'captcha-id': c_id,            # id that changes with each captcha
        'login': '登陸'
    }
    print(c_id)
    # Data returned after logging in
    login_text = requests.post(url=post, data=data, headers=headers).text
    with open('./login.html', 'w', encoding='utf-8') as fp:
        fp.write(login_text)
else:
    # No captcha: log in directly and read the data (untested)
    post = "https://accounts.douban.com/login"
    data = {
        'source': 'movie',
        'redir': 'https://movie.douban.com/',
        'form_email': 'xxxxxxxx',
        'form_password': 'xxxxxxx',
        'login': '登陸'
    }
    # Data returned after logging in
    login_text = requests.post(url=post, data=data, headers=headers).text
    with open('./login_無驗證.html', 'w', encoding='utf-8') as fp:
        fp.write(login_text)
The code distinguishes between logins that require a captcha and logins that do not.

Console output:

The captcha image that was scraped:

The pages that were scraped: