This week I plan to re-organize what I've learned into a summary, so it's easier to look things up later.
The urllib library is built into Python, so there is no need to install it separately. It is divided into four main modules:
1. urllib.request: the request module
2. urllib.error: the exception-handling module
3. urllib.parse: the URL-parsing module
4. urllib.robotparser: reads a site's robots.txt file to see which content may be crawled (rarely used; a short sketch follows this list)
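Since robotparser never comes up again below, here is a minimal sketch of how it reads a robots.txt file; the python.org URLs are just example targets:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.python.org/robots.txt')
rp.read()
# can_fetch(useragent, url): may this user agent crawl that URL?
print(rp.can_fetch('*', 'https://www.python.org/about/'))
```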
1. urlopen
```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
Reading with a timeout:

```python
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # the request timed out before the server responded
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
```
Inspecting the response:

```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
```
```
<class 'http.client.HTTPResponse'>
200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '50069'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 26 Nov 2018 02:44:49 GMT'), ('Via', '1.1 varnish'), ('Age', '3121'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2150-IAD, cache-sjc3143-SJC'), ('X-Cache', 'MISS, HIT'), ('X-Cache-Hits', '0, 254'), ('X-Timer', 'S1543200290.644687,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx
```
2. Request
Request is used to pass more request parameters: url, headers, data, and method.

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# renamed from `dict` to avoid shadowing the built-in
params = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(params), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
{ "args": {}, "data": "", "files": {}, "form": { "name": "Germey" }, "headers": { "Accept-Encoding": "identity", "Connection": "close", "Content-Length": "11", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)" }, "json": null, "origin": "117.136.66.101", "url": "http://httpbin.org/post" }
Another way to add headers:

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
params = {
    'name': 'asd'
}
data = bytes(parse.urlencode(params), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
# if there are many headers, they can be added in a loop (see the sketch below)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
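As the comment says, many headers can be added in a loop. A sketch of that idea, rebuilt as a self-contained example (the headers dict here is illustrative):

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
data = bytes(parse.urlencode({'name': 'asd'}), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')

# add each header from a dict instead of one add_header call per header
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org',
}
for key, value in headers.items():
    req.add_header(key, value)

response = request.urlopen(req)
print(response.read().decode('utf-8'))
```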
3. Handler
Proxies:

```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('https://httpbin.org/get')
print(response.read())
```
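If every request should go through the proxy, not just calls made via opener.open, urllib.request.install_opener registers the opener globally; a minimal sketch reusing the proxy address above:

```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})
opener = urllib.request.build_opener(proxy_handler)
# after install_opener, plain urlopen() calls use this opener too
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.read().decode('utf-8'))
```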
4. Cookie
Getting cookies:
```python
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
```
Saving cookies:

```python
import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
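MozillaCookieJar writes the Mozilla cookies.txt format; http.cookiejar also ships LWPCookieJar, which saves in the libwww-perl format. A sketch of the same save with that class (cookie_lwp.txt is just an illustrative filename):

```python
import http.cookiejar, urllib.request

filename = 'cookie_lwp.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```

When loading, use the same jar class that saved the file, since the two on-disk formats are not interchangeable.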
Loading saved cookies:

```python
import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
5. URLError and HTTPError
```python
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it must be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
```
As for urllib.parse, I don't use it much, so I won't write it up for now. I'll just post a link to a document in case I need it someday; a minimal sketch of its most common helpers follows the link.
https://files.cnblogs.com/files/zhangguoxv/urllib%E8%AE%B2%E8%A7%A3.zip
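For quick reference in the meantime, a short sketch of the urllib.parse helpers I expect to reach for first:

```python
from urllib.parse import urlparse, urljoin, urlencode, quote

# split a URL into scheme, netloc, path, params, query, fragment
print(urlparse('http://www.baidu.com/index.html;user?id=5#comment'))
# resolve a relative link against a base URL
print(urljoin('http://www.baidu.com', 'FAQ.html'))
# serialize a dict into a query string
print(urlencode({'name': 'Germey', 'age': 22}))
# percent-encode non-ASCII characters
print(quote('你好'))
```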