Web Scraping (1): Request Libraries (1), the requests Library

Date: 2019-12-08

The requests library

A crawler has three stages: fetching, parsing, and storing.

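1. Requests

I. Basic parameters

# 1. The request URL
https://www.cnblogs.com/linagcheng/

# 2. The request method
POST, GET, HEAD, ...

# 3. Parameters the request headers need to carry
Cookie
User-Agent    # identifies the client as a browser
Referer       # the page the request came from

# 4. The request body (form data); sent with POST requests, not GET
The password may be encrypted by the page before it is sent. One way to capture the encrypted form is to submit a wrong username together with the correct password and read the value from the failed request.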

II. URL-encoding the request

# 1. URL encoding: when a parameter contains Chinese, the real URL carries the percent-encoded form
# For example: https://www.baidu.com/s?wd=汽车
#   is actually sent as: https://www.baidu.com/s?wd=%E6%B1%BD%E8%BD%A6

from urllib.parse import urlencode

keyword = input(">>>:")
res = urlencode({'wd': keyword})
print(res)   # e.g. wd=%E6%B1%BD%E8%BD%A6 when the input is 汽车

url = 'https://www.baidu.com/s?' + res
print(url)
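
The encoding step can be checked in both directions with the standard library. The sketch below is only an illustration: it hard-codes the keyword 汽车 in place of `input()`, and it uses `parse_qs` (not part of the original) to decode the query string again.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

keyword = '汽车'  # assumption: fixed keyword instead of reading from input()

# quote() percent-encodes a single value as UTF-8 bytes
encoded_value = quote(keyword)

# urlencode() does the same for a whole dict of parameters
query = urlencode({'wd': keyword})
url = 'https://www.baidu.com/s?' + query

# parse_qs() reverses the encoding and returns a dict of lists
decoded = parse_qs(urlsplit(url).query)

print(encoded_value)
print(url)
print(decoded)
```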

III. The headers parameter: adding data to the request headers

import requests

response = requests.get(url='https://www.baidu.com/s?wd=%E6%B1%BD%E8%BD%A6',
                        headers={
                            # the header value must be a quoted string
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
                        },
                       )

IV. The params parameter: no urlencode needed

import requests

keyword = input('>>>:').strip()
response = requests.get(url='https://www.baidu.com/s?',
                        params={
                          'wd': keyword,    # query parameter; Chinese text does not need to be encoded by hand
                          'pn': 2,          # page number
                        },
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
                        },
                       )
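
To see what `params` does to the URL without touching the network, a request can be prepared but not sent. This is a minimal sketch: the fixed keyword and the shortened User-Agent are assumptions made for illustration.

```python
import requests

# Build the request without sending it, so no network access is needed
request = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': '汽车', 'pn': 2},
    headers={'User-Agent': 'Mozilla/5.0'},  # shortened User-Agent, for illustration only
)
prepared = request.prepare()

# requests has percent-encoded the Chinese keyword and appended the query string
print(prepared.url)
print(prepared.headers['User-Agent'])
```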

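V. Other requests parameters

cookies    # cookies can be passed through the cookies argument or written into the Cookie request header
allow_redirects: True/False  # requests follows a Location redirect automatically by default; allow_redirects turns that on or off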

VI. The get and post methods

requests.get(url='....', headers={...}, cookies={...}, params={...})

requests.post(url='......', headers={...}, cookies={...}, data={...})
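
The difference between the two calls shows up in where the parameters end up. The sketch below prepares one request of each kind without sending it; the URL, form fields and cookie value are all placeholders chosen for the example.

```python
import requests

url = 'https://example.com/login'  # placeholder URL, nothing is sent
fields = {'user': 'alice', 'pwd': 'secret'}  # hypothetical form fields

get_req = requests.Request('GET', url, params=fields, cookies={'sid': 'abc'}).prepare()
post_req = requests.Request('POST', url, data=fields, cookies={'sid': 'abc'}).prepare()

print(get_req.url)     # params end up in the query string
print(get_req.body)    # a GET built this way has no body
print(post_req.url)    # URL unchanged
print(post_req.body)   # data is form-encoded into the body
print(post_req.headers['Content-Type'])
print(post_req.headers['Cookie'])  # the cookies argument becomes a Cookie header
```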

VII. The overall request flow

1. GET request
The response may include a token that the second request has to carry.

2. POST request
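
One way the two steps might fit together is sketched below. The sample HTML, the field name `csrf_token` and the helper `extract_token` are all hypothetical; a real site may deliver its token in a cookie or a script instead.

```python
import re

# Sample of what the first GET might return; the field name is an assumption
login_page = '''
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="user">
</form>
'''

def extract_token(html, field='csrf_token'):
    """Return the value of a hidden input named `field`, or None if absent."""
    pattern = r'name="%s"\s+value="(.*?)"' % re.escape(field)
    match = re.search(pattern, html)
    return match.group(1) if match else None

token = extract_token(login_page)

# The second (POST) request would then carry the token in its form data
form_data = {'user': 'alice', 'pwd': 'secret', 'csrf_token': token}
print(form_data)
```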

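2. Responses

I. Basic data

# Response status codes
200   success
3xx   redirect (for example 301 or 302)

# Response headers
Location
Set-Cookie

# Response body
html     # extract data with regular expressions
json     # deserialize to get the content
binary   # open a file in 'wb' mode and write the bytes to it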

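II. Response attributes

response.status_code    # response status code

response.text    # text; the page's HTML as a string
response.content  # binary data; can be written to a file

response.history     # list of the redirect responses that were followed to reach this one

response.cookies.get_dict()   # cookies set by the response, converted to a dict

response.encoding = 'gbk'  # set the encoding used to decode response.text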
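
Why the encoding matters can be shown with plain bytes, since `response.text` is the body bytes decoded with the response's encoding. The sketch below uses only the standard library; the GBK-encoded sample stands in for what a GBK site would send.

```python
# Bytes as a GBK-encoded site would send them (what response.content would hold)
raw = '汽车'.encode('gbk')

# Decoding with the right encoding recovers the text
right = raw.decode('gbk')

# Decoding with the wrong one produces replacement characters instead
wrong = raw.decode('utf-8', errors='replace')

print(right)
print(right == '汽车', wrong == '汽车')
```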

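III. Response data

(1) Text data

response.text

(2) Binary data

# 1. response.content holds the whole body in memory at once, which uses too much memory for large downloads
with open('b.mp4','wb') as f:
    f.write(response.content)

# 2. Final version: stream the body to keep memory use low
# This only helps if the request was made with stream=True, e.g. requests.get(url, stream=True);
# otherwise the body has already been downloaded into memory.
with open('b.mp4','wb') as f:
    for chunk in response.iter_content(chunk_size=8192):     # iter_content() returns an iterator of byte chunks
        f.write(chunk)    # write chunk by chunk (not line by line)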
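
The chunked write can be tried offline by feeding it any iterable of bytes. In this sketch `save_chunks` is a hypothetical helper and `fake_chunks` stands in for `response.iter_content(...)`.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the number of bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Stand-in for response.iter_content(chunk_size=...), so the sketch runs offline
fake_chunks = [b'abc', b'', b'defg']

target = os.path.join(tempfile.mkdtemp(), 'b.mp4')
total = save_chunks(fake_chunks, target)
print(total)
```

With a real download, the same helper would presumably be called as `save_chunks(response.iter_content(chunk_size=8192), 'b.mp4')` on a response obtained with `stream=True`.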

(3) JSON data

# Parse a JSON string
res = response.json()    # roughly json.loads(response.text); deserializes the body
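
A crawler often receives an HTML error page where it expected JSON, and decoding then fails with an exception. The sketch below shows one defensive pattern using only the standard `json` module; `parse_json` is a hypothetical helper, not part of requests.

```python
import json

def parse_json(text):
    """Return the decoded object, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None

good = parse_json('{"code": 0, "data": ["a", "b"]}')
bad = parse_json('<html>not json</html>')
print(good, bad)
```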

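3. A simple crawler

I. Sequential crawling

import requests
import re

# 1. Fetch the page to crawl with requests
def get_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except Exception:
        pass    # on any failure the function returns None

# 2. Parse the fetched page for the data needed, e.g. the URLs of the video detail pages
def parse_index(index_page):
    # the pattern is left empty in the original; fill in a regex for the target page
    urls = re.findall('', index_page, re.S)    # re.S lets '.' match newlines too
    for url in urls:
        if not url.startswith('http'):
            url = 'http://www.baidu.com' + url
        yield url


# 3. Use the parsed data to fetch the next level, e.g. get the video URL from a detail page
def get_detail_page():
    # regex match: extract the video URL from the detail page
    pass

# 4. Download the video from its URL and save it to a file
def get_movie(url):
    try:    # the video request may fail and raise an exception
        response = requests.get(url)
        if response.status_code == 200:
            with open('xxx.mp4','wb') as f:
                f.write(response.content)
                print('%s downloaded successfully' % url)
    except Exception:
        pass


if __name__ == "__main__":
    pass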
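
The parsing step can be tried on a string before pointing it at a real site. In the sketch below the sample markup, the `items` class and the helper name `parse_links` are assumptions; `urljoin` replaces the manual prefix check as one possible way to make links absolute.

```python
import re
from urllib.parse import urljoin

# Hypothetical index page; the markup and class name are assumptions
index_page = '''
<div class="items"><a href="/video/1.html">one</a></div>
<div class="items">
  <a href="http://example.com/video/2.html">two</a>
</div>
'''

def parse_links(html, base_url):
    """Yield absolute URLs for the first link after each class="items" marker."""
    # re.S lets '.' cross line breaks, so the tag and its href may sit on different lines
    for href in re.findall(r'class="items".*?href="(.*?)"', html, re.S):
        # urljoin leaves absolute URLs alone and resolves relative ones
        yield urljoin(base_url, href)

links = list(parse_links(index_page, 'http://example.com/list-1.html'))
print(links)
```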

II. Concurrent crawling

import os
import requests
import re
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(50)
movie_path = r'C:\mp4'

def get_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
    except Exception:
        pass    # on any failure the function returns None

def parse_index(index_page):
    index_page = index_page.result()    # the callback receives a Future; result() gives get_page's return value
    if not index_page:    # get_page returned None, nothing to parse
        return
    urls = re.findall('class="items".*?href="(.*?)"', index_page, re.S)
    for detail_url in urls:
        if not detail_url.startswith('http'):
            detail_url = 'http://www.xiaohuar.com' + detail_url
        pool.submit(get_page, detail_url).add_done_callback(parse_detail)   # submit the task; parse_detail runs when it finishes

def parse_detail(detail_page):
    detail_page = detail_page.result()   # the callback receives a Future; result() gives the page text
    if not detail_page:
        return
    l = re.findall('id="media".*?src="(.*?)"', detail_page, re.S)
    if l:
        movie_url = l[0]
        if movie_url.endswith('mp4'):
            pool.submit(get_movie, movie_url)   # submit the download task; no callback needed

def get_movie(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # name the file after an MD5 of the current time plus the URL
            m = hashlib.md5()
            m.update(str(time.time()).encode('utf-8'))
            m.update(url.encode('utf-8'))
            filepath = os.path.join(movie_path, '%s.mp4' % m.hexdigest())
            with open(filepath,'wb') as f:
                f.write(response.content)
                print('%s downloaded successfully' % url)
    except Exception:
        pass

def main():
    base_url = 'http://www.xiaohuar.com/list-3-{page_num}.html'
    for i in range(5):
        url = base_url.format(page_num=i)
        pool.submit(get_page, url).add_done_callback(parse_index)   # submit the task; parse_index runs when it finishes

if __name__ == '__main__':
    main()
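
The submit-then-callback pattern can be tried without any downloads. In this sketch `fake_get_page` and `on_done` are hypothetical stand-ins and the URLs are placeholders; the point is only that the callback is handed a Future and has to call `result()` on it.

```python
from concurrent.futures import ThreadPoolExecutor

results = []

def fake_get_page(url):
    """Stand-in for get_page(); returns fake page text instead of downloading."""
    return 'page for %s' % url

def on_done(future):
    # The callback receives the Future, not the return value
    results.append(future.result())

# Leaving the with-block waits for all submitted tasks to finish
with ThreadPoolExecutor(max_workers=3) as pool:
    for i in range(5):
        url = 'http://example.com/list-%d.html' % i  # placeholder URLs
        pool.submit(fake_get_page, url).add_done_callback(on_done)

# Completion order is not guaranteed, so sort before printing
print(sorted(results))
```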

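4. Advanced requests usage

I. SSL verification

# HTTPS sites present a certificate, which requests verifies by default

# Ways to access them:
# 1. Skip certificate verification and silence the warning. For most sites a client certificate is optional.
import requests
import urllib3
urllib3.disable_warnings()    # turn off the warning
response = requests.get('https://www.12306.cn', verify=False)   # verify=False skips certificate verification

# 2. Present a client certificate. Some sites require one, e.g. internal or financial sites.
import requests
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))    # path to the certificate and path to the key; they may be downloadable from the site
print(response.status_code)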

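II. Using an IP proxy

# Requesting a site too frequently can get the current IP banned
# With a proxy, the request goes to the proxy host first, and the proxy host then requests the target site

# Usage
# 1. HTTP proxy: forwards HTTP traffic
import requests
response = requests.get('https://www.baidu.com',
                        proxies={
                            'http': 'http://proxy_host_ip:port',
                            'https': 'https://proxy_host_ip:port',
                        })

# 2. SOCKS proxy: can forward HTTP as well as other protocols such as FTP
# Requires the extra dependency: pip install requests[socks]
# The keys are still the scheme of the target URL; the SOCKS protocol goes in the proxy URL.
import requests
response = requests.get('https://www.baidu.com',
                        proxies={
                            'http': 'socks5://proxy_host_ip:port',
                            'https': 'socks5://proxy_host_ip:port',
                        })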

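III. Timeouts

# Without a timeout, the request waits indefinitely for a response; with one, an exception is raised as soon as the request exceeds the limit

import requests
response = requests.get('https://www.baidu.com', timeout=0.0001)   # a limit this small will almost always raise a timeout exception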

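IV. Authentication (rarely needed in general; common on company intranets)

# Some sites require a username and password before they can be accessed, e.g. a password-protected blog: without the right password the content cannot be fetched

# HTTPBasicAuth implements HTTP Basic authentication, which only base64-encodes the credentials; most sites use their own login scheme instead
# To authenticate against those you need to know the scheme, or log in manually and reuse the cookie
import requests
from requests.auth import HTTPBasicAuth
res = requests.get('xxx', auth=HTTPBasicAuth('user','password'))
print(res.status_code)

# HTTPBasicAuth can be shortened to a tuple passed to the auth parameter
import requests
res = requests.get('xxx', auth=('user','password'))
print(res.status_code)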
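
What Basic authentication actually puts on the wire can be inspected by preparing a request without sending it. The sketch below uses a placeholder URL and credentials, and rebuilds the header by hand with the standard `base64` module for comparison.

```python
import base64
import requests

# Prepare (but do not send) a request that uses the tuple shorthand
prepared = requests.Request(
    'GET', 'http://example.com/', auth=('user', 'password')  # placeholder URL and credentials
).prepare()

header = prepared.headers['Authorization']

# The same value built by hand: "Basic " + base64("user:password")
by_hand = 'Basic ' + base64.b64encode(b'user:password').decode('ascii')

# Base64 is reversible, so anyone who sees the header can read the password
recovered = base64.b64decode(header.split(' ', 1)[1]).decode('ascii')

print(header == by_hand)
print(recovered)
```

Because the credentials are only encoded, not encrypted, Basic authentication is normally used over HTTPS.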

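V. Exception handling

import requests
from requests.exceptions import ReadTimeout, RequestException   # see requests.exceptions for all exception types

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:    # the server accepted the connection but did not answer in time
    print('Read timed out')
except RequestException:    # base class of the requests exceptions, so it goes last
    print('Error')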
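
The order of the `except` clauses follows the exception hierarchy, which can be explored without making a request. In this sketch `classify` is a hypothetical helper, and the exceptions are constructed by hand purely to show which branch each one reaches.

```python
from requests import exceptions as rexc

def classify(exc):
    """Return a short label for an exception, most specific check first."""
    if isinstance(exc, rexc.Timeout):           # covers ConnectTimeout and ReadTimeout
        return 'timeout'
    if isinstance(exc, rexc.ConnectionError):   # e.g. refused connection, DNS failure
        return 'connection'
    if isinstance(exc, rexc.RequestException):  # base class of the exceptions requests defines
        return 'request'
    return 'other'

labels = [
    classify(rexc.ReadTimeout()),
    classify(rexc.ConnectTimeout()),
    classify(rexc.ConnectionError()),
    classify(rexc.HTTPError()),
    classify(ValueError()),
]
print(labels)
```

`ConnectTimeout` inherits from both `ConnectionError` and `Timeout`, which is why the timeout check comes first here.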

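VI. Uploading files

# The post method has a files parameter

import requests
with open('a.jpg','rb') as f:    # close the file once the upload finishes
    files = {'file': f}    # dict: form field name -> file object
    response = requests.post('http://httpbin.org/post', files=files)
print(response.status_code)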
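
What the `files` argument produces can be inspected offline by preparing the request and not sending it. The sketch below swaps the real file for an in-memory `io.BytesIO` object with made-up contents, which is an assumption for the sake of a self-contained example.

```python
import io
import requests

# In-memory stand-in for open('a.jpg', 'rb'), so no real file is needed
fake_image = io.BytesIO(b'not really a jpeg')

prepared = requests.Request(
    'POST',
    'http://httpbin.org/post',
    files={'file': ('a.jpg', fake_image)},  # (filename, file object)
).prepare()

content_type = prepared.headers['Content-Type']

# requests encodes the upload as multipart/form-data with a generated boundary
print(content_type.split(';')[0])
print(b'filename="a.jpg"' in prepared.body)
```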

5. Sessions (recommended)

import requests

session = requests.session()
session.get('xxx')    # when a response sets a cookie, the session object stores it, so there is no need to extract the cookie by hand for the second request
session.post('xxx')
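
The cookie persistence can be observed without a server. In the sketch below the cookie is set by hand as a stand-in for one a first response would have set, and the URL is a placeholder; the request is prepared through the session but never sent.

```python
import requests

session = requests.Session()

# Stand-in for a cookie that a first response would have set
session.cookies.set('token', 'abc123')

# Any later request prepared through the session carries the cookie
prepared = session.prepare_request(
    requests.Request('GET', 'http://example.com/profile')  # placeholder URL
)
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```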

6. The selenium module
