The urllib library is a basic networking library in Python. It is used to imitate browser behavior: sending a request to a specified server and receiving the data it returns.
In Python 3, all of the request-related functions are gathered in the urllib.request module.
urlopen sends a request to the server.
Parameters of the urlopen function:
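For reference, this is the signature documented for Python 3 (later releases deprecate the cafile/capath/cadefault parameters in favor of context):

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url is the address to request; passing data turns the request into a POST; timeout limits how long the call may block, in seconds. The returned object behaves like a file handle: it supports read(), readline(), readlines(), and getcode().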
from urllib import request

res = request.urlopen("http://www.baidu.com")
print(res.read())
urlretrieve makes it easy to save a page or file from the web straight to the local disk.
Parameters of the urlretrieve function:
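For reference, the signature documented for Python 3 (urlretrieve belongs to the legacy interface and may be deprecated in the future):

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

url is the remote address, filename is the local path to save to (a temporary file is used if it is omitted), reporthook is an optional progress callback, and data is request body data as with urlopen.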
from urllib import request

# Download the Baidu homepage to index.html
request.urlretrieve("http://www.baidu.com", "index.html")
These functions handle the encoding and decoding of Chinese and other special characters in a URL. urlencode does the encoding.
Basic usage:
from urllib import parse

params = {
    "name": "张三",
    "age": 14,
    "地址": "上海市海河大道1544弄3号楼302"
}
res = parse.urlencode(params)
print(res)
Output:
age=14&name=%E5%BC%A0%E4%B8%89&%E5%9C%B0%E5%9D%80=%E4%B8%8A%E6%B5%B7%E5%B8%82%E6%B5%B7%E6%B2%B3%E5%A4%A7%E9%81%931544%E5%BC%843%E5%8F%B7%E6%A5%BC302
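urlencode expects a dict of key-value pairs. If you only need to escape a single string, urllib.parse also offers quote and unquote; a minimal sketch:

from urllib import parse

s = parse.quote("劉德華")   # escape a bare string, not key-value pairs
print(s)                    # %E5%8A%89%E5%BE%B7%E8%8F%AF
print(parse.unquote(s))     # back to the original string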
Searching Baidu for 劉德華 (Andy Lau):
from urllib import request
from urllib import parse

# request.urlopen("http://www.baidu.com/s/?wd=劉德華")  # requesting like this directly raises an error
url = "http://www.baidu.com/s/?"
# Define the parameter dict
params = {
    "wd": "劉德華"
}
# Encode the parameters
qs = parse.urlencode(params)
# Append them to the URL
url += qs
# Send the request
res = request.urlopen(url)
print(res.read())
parse_qs decodes an already encoded URL query string.
Basic usage:
from urllib import parse

qs = "age=14&name=%E5%BC%A0%E4%B8%89&%E5%9C%B0%E5%9D%80=%E4%B8%8A%E6%B5%B7%E5%B8%82%E6%B5%B7%E6%B2%B3%E5%A4%A7%E9%81%931544%E5%BC%843%E5%8F%B7%E6%A5%BC302"
res = parse.parse_qs(qs)
print(res)
Output:
{'name': ['张三'], 'age': ['14'], '地址': ['上海市海河大道1544弄3号楼302']}
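Note that parse_qs wraps every value in a list, because a key may appear more than once in a query string. If a flat list of (key, value) tuples is more convenient, parse_qsl does the same decoding:

from urllib import parse

qs = "age=14&name=%E5%BC%A0%E4%B8%89"
print(parse.parse_qsl(qs))  # [('age', '14'), ('name', '张三')]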
urlsplit and urlparse split a URL into its component parts.
Basic usage:
from urllib import parse

url = "http://www.baidu.com/s/?wd=python"
res = parse.urlsplit(url)
print(res)
res = parse.urlparse(url)
print(res)
Output:
SplitResult(scheme='http', netloc='www.baidu.com', path='/s/', query='wd=python', fragment='')
ParseResult(scheme='http', netloc='www.baidu.com', path='/s/', params='', query='wd=python', fragment='')
As you can see, the two results are nearly identical; the only difference is that the result returned by urlsplit() has no params attribute.
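The params attribute only matters for the rarely used path-parameter syntax, where a path segment carries extra data after a semicolon. A quick illustration, with a made-up URL, of where the two functions diverge:

from urllib import parse

url = "http://www.example.com/path;type=a?wd=python"
print(parse.urlparse(url).params)  # 'type=a'  -- split out of the path
print(parse.urlsplit(url).path)    # '/path;type=a'  -- left inside the path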
If you need to add header information to a request, you must use the request.Request class.
Basic usage:
# Fetch Lagou job listings by constructing the request headers
from urllib import request
from urllib import parse

url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    "Cookie": "_ga=GA1.2.620765502.1560083999; _gid=GA1.2.758158058.1560083999; user_trace_token=20190609203959-b18d608c-8ab3-11e9-a228-5254005c3644; LGUID=20190609203959-b18d64d3-8ab3-11e9-a228-5254005c3644; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAIAACBI2C1935D6770E19BC5BE4390354414026; X_HTTP_TOKEN=b6c2ab256a325419948821065120ec66a55a5e4b49; _gat=1; LGSID=20190610090729-1e5547bf-8b1c-11e9-a22c-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; LGRID=20190610090729-1e5549e6-8b1c-11e9-a22c-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1560084000,1560090525,1560128850; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1560128850; TG-TRACK-CODE=index_search; SEARCH_ID=60cd24c737344a6f98c48dd4fc94c39c"
}
data = {
    "first": "true",
    "pn": 1,
    "kd": "python"
}
req = request.Request(url, headers=headers,
                      data=parse.urlencode(data).encode("utf-8"), method="POST")
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))
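The example above is a POST request loaded with site-specific headers. As a minimal sketch of the same pattern, here is a GET request that only sets a custom User-Agent (httpbin.org/user-agent simply echoes back the header it receives, which makes it handy for verification):

from urllib import request

req = request.Request(
    "http://httpbin.org/user-agent",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
)
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))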
Because a crawler fetches pages at a much higher frequency than a human, the server's detection mechanisms can easily classify its IP address as a malicious visitor. Rotating proxy IPs is an effective way to guard against this.
Basic usage:
from urllib import request

# Without a proxy
req = request.Request("http://httpbin.org/ip")
resp = request.urlopen(req)
print(resp.read())

# With a proxy
# 1. Build the handler
handler = request.ProxyHandler({"http": "175.23.43.193:8080"})
# 2. Build an opener from the handler
opener = request.build_opener(handler)
# 3. Send the request through the opener
resp = opener.open("http://httpbin.org/ip")
print(resp.read())
Output:
b'{\n "origin": "101.88.45.142, 101.88.45.142"\n}\n'
b'{\n "origin": "175.23.43.193, 175.23.43.193"\n}\n'
Code (logging in to Renren with a CookieJar, then visiting a profile page):
from urllib import request
from urllib import parse
from http.cookiejar import CookieJar

# 1. Log in to Renren
# Create a CookieJar object
cookiejar = CookieJar()
# Create an HTTPCookieProcessor object
handler = request.HTTPCookieProcessor(cookiejar=cookiejar)
# Create an opener
opener = request.build_opener(handler)
# Use the opener to send the login request; the username and password must be passed along
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
data = {
    "email": "970138074@qq.com",
    "password": "pythonspider"
}
data = parse.urlencode(data)
url = "http://www.renren.com/PLogin.do"
req = request.Request(url, data=data.encode("utf-8"), headers=headers)
opener.open(req)

# 2. Visit the personal homepage
dapeng_url = "http://www.renren.com/880151247/profile"
resp = opener.open(dapeng_url)
res = resp.read().decode("utf-8")
with open("renren.html", "w") as f:
    f.write(res)
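Because the opener keeps using the same CookieJar, the session cookies captured during the login request are automatically attached to the profile request that follows. To see what was captured, the jar can be iterated directly:

for cookie in cookiejar:
    print(cookie.name, cookie.value)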
Saving cookies to a local file and loading them back through an opener:
from urllib import request
from http.cookiejar import MozillaCookieJar

# Save cookies to a local file
cookiejar = MozillaCookieJar('cookie.txt')
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
resp = opener.open("http://httpbin.org/cookies/set?corse=python")
cookiejar.save(ignore_discard=True)  # also save session cookies that would otherwise be discarded

# Load the local cookies
cookiejar = MozillaCookieJar('cookie.txt')
cookiejar.load(ignore_discard=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
resp = opener.open("http://httpbin.org/cookies")
cookiejar.save(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)
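MozillaCookieJar writes the cookies in Mozilla's cookies.txt format, so cookie.txt is an ordinary text file you can inspect. ignore_discard=True makes save() and load() keep session cookies that are marked to be discarded when the session ends; the companion flag ignore_expires=True additionally keeps cookies that have already expired.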