Contents
Request method: GET and POST are the most common; HEAD, PUT, DELETE, OPTIONS and others also exist.
Request URL: the Uniform Resource Locator; any web page, image, or document can be uniquely identified by its URL.
Request headers: the header information sent with the request, such as User-Agent, Host, and Cookies.
Request body: extra data carried with the request, such as the form data in a form submission.
Response status: there are many status codes, e.g. 200 success, 301 redirect, 404 page not found, 502 server error.
Response headers: e.g. content type, content length, server information, Set-Cookie.
Response body: the main part, containing the content of the requested resource, such as the page HTML, an image, or binary data.
import requests

# fetch a page
uheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
response = requests.get('http://www.baidu.com', headers=uheaders)
print(response.text)
print(response.headers)
print(response.status_code)

# fetch an image and save its binary content
response = requests.get('https://www.baidu.com/img/baidu_jgylogo3.gif')
res = response.content
with open('1.gif', 'wb') as f:
    f.write(res)
Ways to process a response:
Direct processing
JSON parsing
Regular expressions
BeautifulSoup
pyquery
XPath
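As a minimal illustration of the regular-expression approach, here is a sketch that pulls headlines out of a small HTML snippet (the snippet and pattern are made up for demonstration; real pages usually warrant BeautifulSoup or XPath instead):

```python
import re

# a tiny hypothetical fragment of a news listing page
html = '<li><h3>First title</h3></li><li><h3>Second title</h3></li>'

# non-greedy capture of everything between <h3> and </h3>
titles = re.findall(r'<h3>(.*?)</h3>', html)
print(titles)  # ['First title', 'Second title']
```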
Various request methods
import requests

requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
import requests

response = requests.get('http://httpbin.org/get')
print(response.text)

Output:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "origin": "115.214.23.142",
  "url": "http://httpbin.org/get"
}
import requests

response = requests.get('http://httpbin.org/get?name=germey&age=22')
print(response.text)

# equivalently, pass the query string as a dict via params
data = {'name': 'germey', 'age': 22}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)

Output:

{
  "args": {
    "age": "22",
    "name": "germey"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.19.1"
  },
  "origin": "115.214.23.142",
  "url": "http://httpbin.org/get?name=germey&age=22"
}
import requests

response = requests.get('http://httpbin.org/get')
print(type(response.text))
print(response.json())
print(type(response.json()))

Output:

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 'origin': '115.214.23.142', 'url': 'http://httpbin.org/get'}
<class 'dict'>
import requests

response = requests.get('http://github.com/favicon.ico')
# save the binary content to a file
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
import requests

headers = {'User-Agent': ''}
response = requests.get('http://www.zhihu.com/explore', headers=headers)
print(response.text)
import requests

data = {'name': 'germey', 'age': 22}
headers = {'User-Agent': ''}
response = requests.post('http://httpbin.org/post', data=data, headers=headers)
print(response.json())

Output:

{'args': {}, 'data': '', 'files': {}, 'form': {'age': '22', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '18', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': ''}, 'json': None, 'origin': '115.214.23.142', 'url': 'http://httpbin.org/post'}
Response attributes
import requests

response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)

Output:

<class 'int'> 403
<class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Wed, 31 Oct 2018 06:25:29 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Tengine', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'X-Via': '1.1 dianxinxiazai180:5 (Cdn Cache Server V2.0), 1.1 PSzjjxdx10wx178:11 (Cdn Cache Server V2.0)'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://www.jianshu.com/
<class 'list'> [<Response [301]>]
Advanced usage
File upload
import requests

files = {'file': open('1.jpg', 'rb')}
response = requests.post('http://httpbin.org/post', files=files)
print(response.text)
Getting cookies
import requests

response = requests.get('http://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)

Output:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315
Session persistence
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)

Output:

{"cookies": {"number": "123456789"}}
Certificate verification
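The original gives no example here, so the following is a minimal sketch of the two common options: disabling verification for a host with a self-signed or expired certificate, or pointing requests at a specific CA bundle (the bundle path below is hypothetical):

```python
import requests

session = requests.Session()

# 1. Skip certificate verification entirely; requests will then emit an
#    InsecureRequestWarning unless it is silenced via urllib3.disable_warnings().
session.verify = False

# 2. Or verify against a specific CA bundle instead of the system store
#    ('ca_bundle.pem' is a hypothetical path):
# session.verify = 'ca_bundle.pem'

# With verification disabled, a request to a badly-certified host succeeds:
# response = session.get('https://self-signed.example.com')
print(session.verify)
```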
Proxy settings
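No example is given here either; a sketch of the usual pattern follows. The proxy addresses are hypothetical placeholders, and the request itself is left commented out since it needs a live proxy:

```python
import requests

# hypothetical proxy addresses -- substitute your own
proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
    # SOCKS proxies are also supported, via the extra 'requests[socks]' dependency:
    # 'http': 'socks5://127.0.0.1:9742',
}

# the mapping is passed per request:
# response = requests.get('https://www.taobao.com', proxies=proxies)
# print(response.status_code)
print(proxies['http'])
```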
Timeout settings
import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get('https://www.taobao.com', timeout=1)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
Authentication
import requests

# the original leaves the URL blank; httpbin's basic-auth endpoint is used
# here for illustration -- it returns 200 when the credentials match
r = requests.get('http://httpbin.org/basic-auth/user/123', auth=('user', '123'))
print(r.status_code)
Exception handling
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('connect error')
except RequestException:
    print('Error')
import requests  # issue HTTP requests, impersonating a browser
from bs4 import BeautifulSoup  # parse an HTML string into a tree; then use find/find_all

response = requests.get('https://www.autohome.com.cn/news/')
response.encoding = 'gbk'  # the site is GBK-encoded
soup = BeautifulSoup(response.text, 'html.parser')
div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
li_list = div.find_all(name='li')
for li in li_list:
    title = li.find(name='h3')
    if not title:
        continue
    p = li.find(name='p')
    a = li.find(name='a')
    print(title.text)           # headline
    print(a.attrs.get('href'))  # headline link; attributes are read like a dict
    print(p.text)               # summary
    img = li.find(name='img')   # thumbnail image
    src = img.get('src')
    src = 'https:' + src
    print(src)
    file_name = src.rsplit('/', maxsplit=1)[1]
    ret = requests.get(src)
    with open(file_name, 'wb') as f:
        f.write(ret.content)    # binary content