Web Scraping with requests

Date: 2019-11-30

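1. Introduction to web scraping

1.1 What is a web scraper

The most valuable thing on the internet is its resources, and a scraper's job is to fetch them: rental listings on Lianjia, job postings on Lagou, and so on.

1.2 Scraping workflow

Send request ------> receive response ------> fetch (download) the resource ------> parse the data ------> persist the data (MongoDB, Redis)

Request modules: requests, selenium

Parsing modules: BeautifulSoup, xpath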
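
The workflow above can be sketched end to end with the standard library only. This is a minimal illustration, not a definitive design: `LinkCollector` and `scrape` are hypothetical names, the fetch step is passed in as a function so that any HTTP client can be plugged in, and a JSON file stands in for the MongoDB or Redis persistence step.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Parse step: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def scrape(url, fetch, path):
    """Run fetch -> parse -> persist for a single page."""
    html = fetch(url)               # send the request, receive the response
    collector = LinkCollector()
    collector.feed(html)            # parse the data
    with open(path, "w", encoding="utf-8") as f:
        # persist the data (a JSON file stands in for a database here)
        json.dump({"url": url, "links": collector.links}, f)
    return collector.links
```

With requests installed, the fetch step could be supplied as `lambda u: requests.get(u).text`, for example `scrape("https://www.jd.com/", lambda u: requests.get(u).text, "links.json")`.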

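2. The requests module

Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There's no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3, which is embedded within Requests. (Source: official site)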

# Request methods: the most commonly used are requests.get() and requests.post()
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

Official site: the Requests documentation

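2.1 GET requests

# GET requests
GET is the default HTTP request method
     * Normally carries no request body
     * HTTP itself sets no size limit, but the data travels in the URL, and browsers and servers cap URL length, so keep it small
     * The request data is visible in the browser's address bar

Common cases of GET requests:
       1. Entering a URL directly in the browser's address bar always sends a GET request
       2. Clicking a hyperlink on a page also sends a GET request
       3. Forms are submitted with GET by default, but can be set to POST

A typical GET request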

import requests

response = requests.get("https://www.jd.com/")
with open('jingdong.html','wb') as f:
    f.write(response.content)

Or get its text content:

import requests

response = requests.get("https://www.jd.com/")
print(response.text)

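GET request with parameters:

Some sites (Baidu, for example) require us to pose as a browser, so the request must carry headers that a browser would send.

Search Baidu for pages about python: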

import requests

response=requests.get('https://www.baidu.com/s?wd=python&pn=1',
                      headers={
                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                      })
print(response.text)
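
Rather than writing the query string by hand, the same search can be expressed through the `params` argument and left to requests to encode. The sketch below only prepares the request without sending it, so the resulting URL can be inspected offline; `build_search_request` is a hypothetical helper name.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36"
)


def build_search_request(keyword, page):
    """Prepare (but do not send) a Baidu search request."""
    request = requests.Request(
        "GET",
        "https://www.baidu.com/s",
        params={"wd": keyword, "pn": page},   # requests builds the query string
        headers={"User-Agent": USER_AGENT},
    )
    return request.prepare()


prepared = build_search_request("python", 1)
print(prepared.url)
```

To actually send it, the same `params` and `headers` arguments can be passed straight to `requests.get()`.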

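# For some sites, User-Agent alone is not enough. Request headers are the key to posing as a browser; the commonly useful ones are:

Host
Referer # large sites often use this header to determine where a request came from
User-Agent # identifies the client
Cookie # cookies travel in the request headers, but requests has a dedicated cookies parameter for them, so do not put them in headers={}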
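
A small sketch of the dedicated `cookies` parameter mentioned above. The request is only prepared, not sent, so the header that requests would generate can be inspected; the cookie value `abc123` and the helper name `build_request_with_cookie` are made up for illustration.

```python
import requests


def build_request_with_cookie(url, cookies):
    """Prepare a GET request whose Cookie header is generated from a dict."""
    request = requests.Request(
        "GET",
        url,
        headers={"Referer": "https://www.zhihu.com/"},
        cookies=cookies,          # passed separately, not inside headers={}
    )
    return request.prepare()


prepared = build_request_with_cookie(
    "https://www.zhihu.com/explore", {"_xsrf": "abc123"}
)
print(prepared.headers["Cookie"])
print(prepared.headers["Referer"])
```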

Using requests.session()

import requests

# res=requests.get("https://www.zhihu.com/explore")
# print(res.cookies.get_dict())
# {'_xsrf': '4oL216OENatT6LIshv3zBlXFpvTW4lcM', 'tgw_l7_route': 'b3dca7eade474617fe4df56e6c4934a3'}


session = requests.session()
res1 = session.get("https://www.zhihu.com/explore")     # visiting this page sets the cookies
res2 = session.get("https://www.zhihu.com/question/40031734/answer/499507903",  # 302 redirect to the question
                   headers={
                       "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36"
                   }    # the request headers are still required
                   )
print(res2.text)
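
What the session does between the two calls can be seen without touching the network. In this sketch a cookie is placed in the session by hand, standing in for one that a server would have set on the first response, and the next request is prepared to show that the session attaches both its default headers and its stored cookies. The values are made up.

```python
import requests

session = requests.session()

# Headers set on the session are sent with every request it makes.
session.headers.update({"User-Agent": "Mozilla/5.0 (example)"})

# Stands in for a cookie that an earlier response would have stored.
session.cookies.set("_xsrf", "abc123")

prepared = session.prepare_request(
    requests.Request("GET", "https://www.zhihu.com/explore")
)
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

Setting the User-Agent once on `session.headers` also avoids repeating it in every call.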

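2.2 POST requests

# POST requests
  *  The data does not appear in the address bar
  *  HTTP sets no upper limit on the data size (servers may impose their own)
  *  Has a request body
  *  In a form-encoded request body, non-ASCII characters such as Chinese are URL-encoded (percent-encoded)

requests.post() is used the same way as requests.get(); what sets it apart in practice is the data parameter, which holds the request body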
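
The URL encoding of a form body can be reproduced with the standard library. This sketch shows what a form with a Chinese value looks like once encoded, and that decoding restores it; the field names are only examples.

```python
from urllib.parse import parse_qs, urlencode

form = {"kd": "中文", "pn": 1}

# Form-encode the fields the way a form-encoded POST body is built:
# non-ASCII characters become percent-encoded UTF-8 bytes.
body = urlencode(form)
print(body)

# Decoding the body gives back the original text (values come back as lists).
decoded = parse_qs(body)
print(decoded)
```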

Simulating a GitHub login and fetching the logged-in page

import requests
import re

# Request 1:
r1=requests.get('https://github.com/login')
r1_cookie=r1.cookies.get_dict() # initial cookies (not yet authorized)
authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] # extract the CSRF token from the page
print("authenticity_token",authenticity_token)
# Second request: POST to the login endpoint with the initial cookies, the token, and the account credentials
data={
    'commit':'Sign in',
    'utf8':'✓',
    'authenticity_token':authenticity_token,
    'login':'your_email@example.com',   # replace with your own account
    'password':'your_password'          # replace with your own password
}

# Request 2:
r2=requests.post('https://github.com/session',
             data=data,
             cookies=r1_cookie,
             # allow_redirects=False
             )
print(r2.status_code)      # 200
print(r2.url)              # the page after the redirect: https://github.com/
print(r2.history)          # the responses before the redirect: [<Response [302]>]
print(r2.history[0].text)  # the body of the response before the redirect

with open("result.html","wb") as f:
    f.write(r2.content)

We could also use a session here and let it carry the cookies for us
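
A sketch of that session-based variant. The form fields and URLs are taken from the example above; whether GitHub's login form still accepts exactly these fields is an assumption. `extract_token` and `login` are hypothetical helper names, and `session` is expected to be a `requests.session()` object, which stores the cookies from the first response and sends them with the second request.

```python
import re

TOKEN_PATTERN = re.compile(r'name="authenticity_token".*?value="(.*?)"', re.S)


def extract_token(html):
    """Pull the CSRF token out of the login page."""
    match = TOKEN_PATTERN.search(html)
    if match is None:
        raise ValueError("authenticity_token not found in the login page")
    return match.group(1)


def login(session, username, password):
    """Log in through a session so cookies never have to be passed by hand."""
    login_page = session.get("https://github.com/login")   # sets initial cookies
    data = {
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": extract_token(login_page.text),
        "login": username,
        "password": password,
    }
    # The session sends the cookies from the first response automatically.
    return session.post("https://github.com/session", data=data)
```

Usage would look like `r2 = login(requests.session(), 'your_email@example.com', 'your_password')`.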

3. Response

Common attributes

import requests
response=requests.get('https://sh.lianjia.com/ershoufang/')

# response attributes
print(response.text)
print(response.content)
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)
print(response.encoding)

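Downloading binary files (images, video, audio)

import requests

# stream=True: the body is fetched piece by piece instead of all at once
response=requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/'
        'tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
                      stream=True)
with open('b.mp4','wb') as f:
    # iter_content() defaults to 1-byte chunks, so ask for larger ones
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)   # write one chunk at a time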
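
The write loop can be pulled into a small helper that works on any iterable of byte chunks, which also makes it easy to try without a network connection. `save_chunks` is a hypothetical name, offered as a sketch only.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:               # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written
```

With a streamed response this would be called as `save_chunks(response.iter_content(chunk_size=8192), 'b.mp4')`.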

Parsing JSON

Some responses need to be deserialized; for those we can call response.json() directly

import requests
import json
 
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text) # the long way
res2=response.json() # get the JSON data directly
print(res1==res2)
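
Both approaches raise an exception when the body is not valid JSON, for example when a site answers with an HTML error page. A minimal guard, written against plain text so it runs offline; `parse_json_or_none` is a hypothetical helper name.

```python
import json


def parse_json_or_none(text):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:      # json.JSONDecodeError is a subclass of ValueError
        return None


print(parse_json_or_none('{"origin": "127.0.0.1"}'))
print(parse_json_or_none("<html>blocked</html>"))
```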

Response.history

Response.history is a list of the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]

# Redirect handling can also be disabled with the allow_redirects parameter
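
To see the whole redirect chain at a glance, the history list and the final response can be combined. The sketch below uses stand-in objects with the same three attributes as a Response so it runs offline; `redirect_chain` is a hypothetical helper name.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for every hop, oldest first, final response last."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-ins mirroring the github.com example above.
first = SimpleNamespace(status_code=301, url="http://github.com/", history=[])
final = SimpleNamespace(status_code=200, url="https://github.com/", history=[first])

chain = redirect_chain(final)
print(chain)
```

With `allow_redirects=False`, requests returns the 3xx response itself and its history is empty, so the chain would contain a single hop.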

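4. Scraping Lagou

On the home page, search for python positions and press F12 to inspect the search request: the response contains everything except the job listings.

Either a redirect happened, or the data we need is delivered by the backend through ajax. Checking again, the status code is 200, so there was no redirect, which leaves the second possibility.

Filtering by XHR to view the asynchronous requests reveals the data we need, along with the URL it is requested from.

Code demo: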

import requests

def parse(page):
    res = requests.post("https://www.lagou.com/jobs/positionAjax.json",
                        headers={
    'Referer':"https://www.lagou.com/jobs/list_python",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36',
    },
                        # filter conditions (sent in the query string)
                        params={
                        'px': 'new',
                        'xl': '本科',      # education: bachelor's degree
                        'city': '北京',    # city: Beijing
                        },
                        data={
                        'first': False,
                        'pn': page,       # page number
                        'kd': 'Python'
                        }
                        )
    # one file per page, so later pages do not overwrite earlier ones
    with open('lagou_%d.txt' % page,'wb') as f:
        f.write(res.content)

for i in range(1,4):
    parse(i)
