Request the Baidu homepage (www.baidu.com) without adding any request headers:
import urllib.request


def get_page():
    url = 'http://www.baidu.com/'
    res = urllib.request.urlopen(url=url)
    page_source = res.read().decode('utf-8')
    print(page_source)


if __name__ == '__main__':
    get_page()
The output shows the source of the Baidu homepage. However, some sites have anti-crawler measures, and the code above may come back with a 40X response code because the site has detected that a crawler is accessing it. In that case the crawler needs to disguise itself: make it imitate user behavior by setting a headers (User-Agent) attribute so the request looks like it comes from a browser.
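When a site rejects the default urllib User-Agent, urlopen raises an HTTPError that carries the status code. A minimal sketch of inspecting it (the helper name is illustrative, not from the original):

```python
import urllib.request
import urllib.error


def fetch_status(url):
    # urlopen raises HTTPError for 4XX/5XX responses; catching it
    # exposes the status code the anti-crawler check returned.
    try:
        res = urllib.request.urlopen(url)
        return res.status
    except urllib.error.HTTPError as e:
        return e.code
```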
Since the urllib.request.urlopen() function does not accept a headers parameter, a urllib.request.Request object has to be constructed to set the request headers:
import urllib.request


def get_page():
    url = 'http://www.baidu.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    res = urllib.request.urlopen(request)
    page_source = res.read().decode('utf-8')
    print(page_source)


if __name__ == '__main__':
    get_page()
Adding the headers parameter simulates the behavior of a browser.
Log in to the ChinaUnix forum, fetch the homepage source, then visit an article. First, see what happens without using cookies:
import urllib.request
import urllib.parse


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=url, data=postdata, headers=headers)
    res = urllib.request.urlopen(req)
    page_source = res.read().decode('gbk')
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    res1 = urllib.request.urlopen(url=url1)
    page_source1 = res1.read().decode('gbk')
    print(page_source1)


if __name__ == '__main__':
    get_page()
Search the returned source for the username StrivePy: the login itself succeeds, but when another article is requested, the page shows guest status, i.e. the session state is not kept. Now see the effect with cookies:
import urllib.request
import urllib.parse
import http.cookiejar


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=url, data=postdata, headers=headers)
    # Create a CookieJar object
    cjar = http.cookiejar.CookieJar()
    # Create a cookie handler with the CookieJar as its argument
    cookie = urllib.request.HTTPCookieProcessor(cjar)
    # Create an Opener with the cookie handler as its argument
    cookie_opener = urllib.request.build_opener(cookie)
    # Install the opener globally, overriding urlopen; alternatively
    # use cookie_opener.open() for one-off requests
    urllib.request.install_opener(cookie_opener)
    res = urllib.request.urlopen(req)
    page_source = res.read().decode('gbk')
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    res1 = urllib.request.urlopen(url=url1)
    page_source1 = res1.read().decode('gbk')
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows that after a successful login, other articles are displayed in logged-in state. To save the cookies to a file for the next run, a MozillaCookieJar object can be used:
import urllib.request
import urllib.parse
import http.cookiejar


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url=url, data=postdata, headers=headers)
    filename = 'cookies.txt'
    # Create a MozillaCookieJar object bound to the file
    cjar = http.cookiejar.MozillaCookieJar(filename)
    # Create a cookie handler with the CookieJar as its argument
    cookie = urllib.request.HTTPCookieProcessor(cjar)
    # Create an Opener with the cookie handler as its argument
    opener = urllib.request.build_opener(cookie)
    # Use the opener ad hoc for this request
    opener.open(req)
    # Save the cookies to the file
    cjar.save(ignore_discard=True, ignore_expires=True)


if __name__ == '__main__':
    get_page()
會在當前工做目錄生成一個名爲cookies.txt的cookie文件,下次就能夠不用登錄(若是cookie沒有失效的話)直接讀取這個文件來實現免登陸訪問。例如不進行登錄直接訪問其中一篇文章(沒登錄也能夠訪問,主要是看擡頭是否是登錄狀態):測試
import urllib.request
import http.cookiejar


def get_page():
    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    filename = 'cookies.txt'
    cjar = http.cookiejar.MozillaCookieJar(filename)
    cjar.load(ignore_discard=True, ignore_expires=True)
    cookie = urllib.request.HTTPCookieProcessor(cjar)
    opener = urllib.request.build_opener(cookie)
    res1 = opener.open(url1)
    page_source1 = res1.read().decode('gbk')
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows the article is being viewed in logged-in state.
使用代理能夠有效規避爬蟲被封。
import urllib.request


def proxy_test():
    url = 'http://myip.kkcha.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    proxy = {
        'http': '180.137.232.101:53281'
    }
    # Create a proxy Handler object
    proxy_handler = urllib.request.ProxyHandler(proxy)
    # Create an Opener with the Handler as its argument
    opener = urllib.request.build_opener(proxy_handler)
    # Install the opener globally
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen(request)
    page_source = response.read().decode('utf-8')
    print(page_source)


if __name__ == '__main__':
    proxy_test()
The fetched page should display the proxy IP. For reasons unknown, it sometimes displays correctly and sometimes redirects to a Youdao Dictionary advertisement page; this needs further investigation.
Send a GET request to the test site http://httpbin.org:
import requests


def request_test():
    url = 'http://httpbin.org/get'
    response = requests.get(url)
    print(type(response.text), response.text)
    print(type(response.content), response.content)


if __name__ == '__main__':
    request_test()
The response body is obtained directly:
<class 'str'> {"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"origin":"121.61.132.191","url":"http://httpbin.org/get"}

<class 'bytes'> b'{"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.18.4"},"origin":"121.61.132.191","url":"http://httpbin.org/get"}\n'
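Since httpbin.org returns JSON, the body can be parsed into a dict instead of being handled as a raw string; requests offers response.json() for this, which is equivalent to running json.loads() on response.text. A sketch using an abbreviated copy of the body shown above:

```python
import json

# Abbreviated body from the httpbin.org/get response above; in practice
# this would be response.text, and response.json() does the same parse.
body = '{"args":{},"headers":{"Host":"httpbin.org"},"origin":"121.61.132.191","url":"http://httpbin.org/get"}'
data = json.loads(body)
print(data['url'])              # http://httpbin.org/get
print(data['headers']['Host'])  # httpbin.org
```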
Passing parameters in a GET request:
import urllib.parse

if __name__ == '__main__':
    base_url = 'http://httpbin.org/'
    params = {
        'key1': 'value1',
        'key2': 'value2'
    }
    # A '?' must separate the path from the query string
    full_url = base_url + '?' + urllib.parse.urlencode(params)
    print(full_url)

http://httpbin.org/?key1=value1&key2=value2
import requests

if __name__ == '__main__':
    payload = {
        'key1': 'value1',
        'key2': 'value2'
    }
    response = requests.get('http://httpbin.org/get', params=payload)
    print(response.url)

The output (parameter order may vary):

http://httpbin.org/get?key2=value2&key1=value1
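Whichever way the URL is built, urllib.parse can decompose it again, which is a handy check that the parameters survived; values match the example above:

```python
from urllib.parse import urlparse, parse_qs

# parse_qs recovers the query parameters as a dict of value lists,
# independent of the order in which they appear in the URL.
url = 'http://httpbin.org/get?key2=value2&key1=value1'
params = parse_qs(urlparse(url).query)
print(params)  # {'key2': ['value2'], 'key1': ['value1']}
```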
Log in to the ChinaUnix forum, fetch the homepage source, then visit an article. First, see what happens without using a Session:
import requests


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    response = requests.post(url=url, data=data, headers=headers)
    page_source = response.text
    print(response.status_code)
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    response1 = requests.get(url=url1, headers=headers)
    page_source1 = response1.text
    print(response1.status_code)
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows guest mode when visiting other articles. Now use a session to maintain the conversation and see the effect:
import requests


def get_page():
    url = 'http://bbs.chinaunix.net/member.php?mod=logging&action=login&loginsubmit=yes&loginhash=LcN2z'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    data = {
        'username': 'StrivePy',
        'password': 'XXX'
    }
    session = requests.session()
    response = session.post(url=url, data=data, headers=headers)
    page_source = response.text
    print(response.status_code)
    print(page_source)

    url1 = 'http://bbs.chinaunix.net/thread-4263876-1-1.html'
    response1 = session.get(url=url1, headers=headers)
    page_source1 = response1.text
    print(response1.status_code)
    print(page_source1)


if __name__ == '__main__':
    get_page()
The result shows other articles displayed in logged-in state: the session is maintained. Using a session has an effect similar to urllib's ad-hoc opener, or to installing an opener globally.
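A Session can also carry default headers, so the User-Agent from the example only needs to be set once rather than on every call. A sketch (header value abbreviated):

```python
import requests

# Headers set on the session are sent with every request it makes,
# together with whatever cookies the session has accumulated.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
# session.get(...) / session.post(...) now send this header automatically.
```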
Using a proxy with the requests library:
import requests


def proxy_test():
    url = 'http://myip.kkcha.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'
    }
    proxy = {
        'https': '61.135.217.7:80'
    }
    response = requests.get(url=url, headers=headers, proxies=proxy)
    print(response.text)


if __name__ == '__main__':
    proxy_test()
The page fetched by this code still shows the local network IP, so the proxy did not take effect. A likely cause: the proxies dict only contains an 'https' entry while the requested URL uses plain http, so requests never routes the request through the proxy.
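The proxies mapping in requests is keyed by URL scheme: a request to an http:// URL consults only the 'http' entry, so a dict containing only an 'https' key is silently ignored for plain-http URLs. A sketch with both schemes covered (the proxy address is a placeholder, not a verified live proxy):

```python
import requests

# Key by scheme: an http:// request uses proxies['http'],
# an https:// request uses proxies['https'].
proxies = {
    'http': 'http://61.135.217.7:80',
    'https': 'http://61.135.217.7:80',
}
# response = requests.get('http://myip.kkcha.com/', proxies=proxies)
```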