Web Crawler Basics: The requests Module

Date: 2019-12-01

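1. Introduction to Web Crawlers

1.1 Overview

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.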

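1.2 The Value of Crawlers

On the internet, the most valuable asset is data: whoever holds the first-hand data of an industry dominates that industry. Once you master crawling, you are in effect the boss behind every internet information company; in other words, they are all supplying you with valuable data for free.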

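1.3 The robots.txt Protocol

If the owner of a site does not want crawlers to fetch the data on certain pages, they can write a robots.txt file to restrict what crawlers may fetch. For the file format, look at Taobao's robots file (visit www.taobao.com/robots.txt). Note, however, that the protocol is only the equivalent of a verbal agreement: no technical mechanism enforces it, so it restrains well-behaved crawlers but does nothing against ill-behaved ones.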
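
A well-behaved crawler can check the rules before fetching a page. The following is a minimal sketch using the standard library's urllib.robotparser; the rules, the user agent name "my-crawler", and the example.com URLs are made up for illustration and are inlined so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "my-crawler" is an assumed user agent name.
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))
```

For a live site, calling set_url() with the address of the robots.txt file and then read() loads the rules over the network instead.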

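1.4 The Basic Crawler Workflow

Send the request: use a module or library to send a request to the target site, much as a browser would. The request can carry headers, parameters, and other information. Then wait for the server to respond.
Get the response: if the server responds normally, it returns a response containing the page content, which may be HTML, JSON, or binary data (audio, video, images, and so on).
Parse the data: extract the data you need from the response text with regular expressions or with parsers such as BeautifulSoup or XPath.
Store the data: persist the parsed data to a file or to a database such as Redis or MongoDB.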
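
The four steps above can be sketched as four small functions. This is only an illustration under assumptions: the function names, the minimal User-Agent header, the 10-second timeout, and the choice of extracting the page title are all hypothetical, and the sketch stores to a JSON file rather than a database.

```python
import json
import re

import requests


def fetch(url: str) -> str:
    """Steps 1 and 2: send the request and return the response body as text."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumed minimal header
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # treat 4xx/5xx responses as errors
    return response.text


def parse_title(html: str) -> str:
    """Step 3: extract the page title with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else ""


def store(record: dict, path: str) -> None:
    """Step 4: persist the parsed data as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)


def crawl(url: str, path: str) -> None:
    """Run the whole pipeline for a single page."""
    html = fetch(url)
    store({"url": url, "title": parse_title(html)}, path)
```

Calling crawl("https://example.com/", "page.json") would run the pipeline once; both arguments are placeholders.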

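2. The requests Module

Requests is an HTTP library written in Python on top of urllib3 and released under the Apache 2.0 license. It is more convenient than urllib and saves a great deal of work. In short, requests is one of the simplest HTTP libraries to use in Python and is a good choice for crawlers. It is not installed with Python by default and must be installed separately with pip (pip install requests).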

2.1 Basic Syntax

Request methods supported by the requests module

import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

GET requests

1. Basic request

import requests
response = requests.get('https://www.jd.com/')

with open("jd.html","wb") as f:
    f.write(response.content)

2. Request with parameters

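import requests
# Option 1: write the query string directly in the URL
response = requests.get('https://s.taobao.com/search?q=手機')
# Option 2: pass a params dict and let requests build the query string
response = requests.get('https://s.taobao.com/search', params={"q": "美女"})
# The file name below is a placeholder; the original listing lost this line
with open("taobao.html", "wb") as f:
    f.write(response.content)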
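
To see what the params argument does to the URL, a request can be prepared without being sent. This is a small sketch; the "page" parameter is a hypothetical addition for illustration.

```python
import requests

# Build the request object but do not send it.
request = requests.Request(
    "GET",
    "https://s.taobao.com/search",
    params={"q": "手機", "page": 2},  # "page" is an assumed extra parameter
)
prepared = request.prepare()

# Non-ASCII values are percent-encoded as UTF-8 in the final URL.
print(prepared.url)
```

Letting requests build the query string avoids hand-encoding non-ASCII characters or special characters such as & and =.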

3. Request with headers

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}
# Anti-scraping check: some sites reject requests whose User-Agent does not look like a browser
res = requests.get("https://www.baidu.com/s",params={"wd":"劉亦菲"}, headers=headers)
with open("jay.html","wb") as f:
    f.write(res.content)
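
The reason a custom User-Agent matters is that requests identifies itself by default, which is easy for a site to filter. The sketch below compares the default value with an overridden one, without sending anything; the search keyword is an arbitrary example.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
)

# The value requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Headers passed on the request override the session defaults.
session = requests.Session()
request = requests.Request(
    "GET",
    "https://www.baidu.com/s",
    params={"wd": "requests"},  # arbitrary example keyword
    headers={"User-Agent": BROWSER_UA},
)
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```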

4. Request with cookies

import uuid
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))

res = requests.get(url, cookies=cookies)
print(res.text)

POST requests

1. The data parameter

requests.post() is used in the same way as requests.get(). The notable addition is the data parameter, which carries the request body.

import requests
res = requests.post("http://httpbin.org/post",params={"a":"10"}, data={"name":"Alex"} )
# No Content-Type is set explicitly; with a dict in data, the default is application/x-www-form-urlencoded
print(res.text)

2. Sending JSON data

import requests
# Send JSON data
res = requests.post(url="http://httpbin.org/post", json={"age":"38"})
# Default Content-Type: application/json
print(res.text)
# When both data and json are passed, data takes precedence
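
The difference between data and json can be checked offline by preparing the requests and inspecting the headers and body. This is a sketch of that check, reusing the values from the listings above.

```python
import json

import requests

url = "http://httpbin.org/post"

# data= with a dict produces a form-encoded body.
form = requests.Request("POST", url, data={"name": "Alex"}).prepare()
print(form.headers["Content-Type"], form.body)

# json= serializes the dict and sets the JSON content type.
as_json = requests.Request("POST", url, json={"age": "38"}).prepare()
print(as_json.headers["Content-Type"], json.loads(as_json.body))

# With both arguments, the json payload is ignored and data is sent.
both = requests.Request(
    "POST", url, data={"name": "Alex"}, json={"age": "38"}
).prepare()
print(both.headers["Content-Type"], both.body)
```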

The Session object

import requests

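# 1. Create a session object; its get() and post() methods are used like requests.get() and requests.post()
session = requests.session()
res1 = session.get("https://github.com/login")
# Later requests made through this session carry the cookies obtained above
res2 = session.post("https://github.com/session")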
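
What the session carries between requests is mainly its cookie jar. The sketch below simulates a cookie set by an earlier response (the name "sbid" is borrowed from the cookies listing above, and the value is made up) and shows that the next request prepared through the session includes it.

```python
import requests

session = requests.Session()

# Stand-in for a cookie that a server would set on an earlier response.
session.cookies.set("sbid", "abc123")

# Prepare (but do not send) a later request through the same session.
prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
print(prepared.headers.get("Cookie"))
```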

The Response object

1. Common attributes

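import requests
response = requests.get('https://sh.lianjia.com/ershoufang/')
# Attributes of the response object
print(response.text)                # body decoded to str
print(response.content)             # raw body as bytes
print(response.status_code)         # HTTP status code
print(response.headers)             # response headers
print(response.cookies)             # cookies as a RequestsCookieJar
print(response.cookies.get_dict())  # cookies as a plain dict
print(response.cookies.items())     # cookies as (name, value) pairs
print(response.url)                 # final URL after any redirects
print(response.history)             # responses from followed redirects
print(response.encoding)            # encoding used to decode .text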

2. Encoding issues

import requests
res = requests.get("http://www.autohome.com/news")
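# res.encoding = "gbk"
# with open("autohome.html","w") as f:
#     f.write(res.text)
# The Autohome page is encoded in gb2312, but requests falls back to ISO-8859-1 when the Content-Type header carries no charset; without setting the encoding to gbk, the Chinese text is garbled
# Alternatively, write the raw bytes to the file in "wb" mode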
with open("autohome.html","wb") as f:
    f.write(res.content)
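
The garbling can be reproduced without a network request. The sketch below asks requests which encoding it would infer from a header with no charset, then decodes the same GBK bytes both ways; the sample string is arbitrary.

```python
import requests

# A text Content-Type without a charset makes requests assume ISO-8859-1.
inferred = requests.utils.get_encoding_from_headers({"content-type": "text/html"})
print(inferred)

# With an explicit charset, the declared value is used instead.
declared = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html; charset=gbk"}
)
print(declared)

raw = "中文".encode("gbk")           # bytes as a GBK-encoded page would send them
garbled = raw.decode("ISO-8859-1")   # what .text yields with the wrong encoding
correct = raw.decode("gbk")          # what .text yields after res.encoding = "gbk"
print(repr(garbled), repr(correct))
```

When the right encoding is not known in advance, the apparent_encoding attribute of the response offers a guess based on the body, which can be assigned to encoding before reading text.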

3. Downloading binary files (audio, video, images)

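import requests
# stream=True defers downloading the body until it is read
res = requests.get("https://video.pearvideo.com/mp4/adshort/20190222/cont-1520612-13609117_adpkg-ad_hd.mp4", stream=True)
with open("lsp.mp4","wb") as f:
    # f.write(res.content)
    # If the video were, say, 100 GB, loading it all through res.content and writing it in one go would be unreasonable
    for chunk in res.iter_content(chunk_size=8192):
        f.write(chunk)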

4. Parsing JSON data

import requests
import json

response = requests.get('http://httpbin.org/get')
res1 = json.loads(response.text)  # works, but verbose
res2 = response.json()  # parse the JSON body directly
print(res1 == res2)

5. Redirection and History

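By default, requests follows all redirects automatically, except for HEAD requests. You can use the history attribute of the response object to trace redirects. Response.history is a list of the Response objects that were created to complete the request, ordered from the oldest to the most recent.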

import requests

res = requests.get("http://www.jd.com")
print(res.url)
# https://www.jd.com/
print(res.status_code)
# 200
print(res.history)
# [<Response [302]>]

You can also disable redirect handling with the allow_redirects parameter:

import requests
res = requests.get("http://www.jd.com", allow_redirects=False)
print(res.status_code)
# 302
print(res.history)
# []

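2.2 Advanced requests Usage

IP proxies

Some sites deploy anti-scraping measures. For example, many sites track how often an IP address makes requests within a given period, and if the rate is too high to look like a normal visitor, they may block that IP. You therefore need a set of proxy servers and should switch proxies at intervals, so that even if one IP is blocked you can keep crawling through another.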

import requests
res = requests.get('http://httpbin.org/ip', proxies={'http': 'http://112.17.121.88:8060'}).json()
print(res)
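
Switching proxies can be sketched with a simple round-robin pool. Everything named here is an assumption for illustration: the proxy addresses come from a range reserved for documentation and will not work, and next_proxies and fetch_via_proxy are hypothetical helpers rather than part of requests.

```python
import itertools

import requests

# Hypothetical proxy addresses (203.0.113.0/24 is reserved for documentation).
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies() -> dict:
    """Return a proxies mapping that uses the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}


def fetch_via_proxy(url: str, retries: int = 3) -> requests.Response:
    """Try the request through successive proxies until one succeeds."""
    last_error = None
    for _ in range(retries):
        try:
            return requests.get(url, proxies=next_proxies(), timeout=5)
        except requests.RequestException as exc:
            last_error = exc  # this proxy failed; move on to the next one
    raise RuntimeError("all proxy attempts failed") from last_error
```

With working proxy addresses in PROXY_POOL, fetch_via_proxy("http://httpbin.org/ip") would return a response showing the IP of whichever proxy handled the request.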


2.3 A Simple Crawler Example

Scraping the content of the GitHub home page

import re
import requests

# 1. Fetch the login page to get the token that the POST request must include
session = requests.session()
res = session.get("https://github.com/login")
token = re.findall('<input type="hidden" name="authenticity_token" value="(.*?)"',res.text)[0]
print(token)

# 2. Build the POST form data
data={
    "commit":"Sign in",
    "Sign in": "✓",
    "authenticity_token": token,
    "login": "your_username",     # placeholder: replace with your own account
    "password":"your_password"    # placeholder: never publish real credentials
}
response = session.post("https://github.com/session",data=data)
with open("github.html","wb") as f:
    f.write(response.content)
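
The token extraction step fails with an IndexError if the login page changes and the pattern no longer matches. The sketch below wraps the same regular expression in a hypothetical extract_token helper that reports the problem explicitly; the sample HTML and token value are made up so the sketch runs offline.

```python
import re

# Made-up login form fragment; the real page markup may differ.
SAMPLE_LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==" />
  <input type="text" name="login" />
</form>
"""

TOKEN_PATTERN = re.compile(
    r'<input type="hidden" name="authenticity_token" value="(.*?)"'
)


def extract_token(html: str) -> str:
    """Return the authenticity token, or raise if the form field is missing."""
    match = TOKEN_PATTERN.search(html)
    if match is None:
        raise ValueError("authenticity_token not found; the page layout may have changed")
    return match.group(1)


print(extract_token(SAMPLE_LOGIN_HTML))
```

Because the pattern depends on the exact attribute order in the markup, an HTML parser such as BeautifulSoup would be a sturdier choice for pages that change often.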