Python Web Crawling: Your First Crawler (the requests Library)

Date: 2019-12-06

Tags: python, web crawler, first crawler, requests. Category: Python


0. Using the requests library

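Although the urllib library is widely used and, as part of the Python standard library, needs no installation, most Python crawlers today use the requests library to handle complex HTTP requests. Its syntax is concise, its API is easy to understand, and it is gradually becoming the standard choice for most web scraping.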

1. Installing the requests library

Install it with pip by entering the following at the command prompt (cmd):

pip install requests
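
To confirm that the installation worked, one quick check is to import the library and print its version. This is a minimal sketch, assuming pip installed the package into the same interpreter you are running; the variable name `installed_version` is only for illustration.

```python
# Quick sanity check: import requests and report which version was installed.
import requests

installed_version = requests.__version__  # version string of the installed package
print('requests version:', installed_version)
```

If the import raises `ModuleNotFoundError`, pip most likely installed into a different Python environment.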

2. Example code

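We set the HTTP request headers as a simple countermeasure against anti-crawler checks, and also cover proxy settings, exception handling, and so on.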

import requests


def download(url, num_retries=2, user_agent='wswp', proxies=None):
    '''Download the given URL and return the page content.
        Args:
            url (str): the URL to download
        Keyword args:
            user_agent (str): User-Agent header value (default: 'wswp')
            proxies (dict): proxy mapping; keys are 'http' / 'https',
            values are strings of the form 'http(s)://IP'
            num_retries (int): number of retries on 5xx errors (default: 2)
            # 5xx is a server error: the server failed to fulfil an apparently valid request.
            # https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
    '''
    print('==========================================')
    print('Downloading:', url)
    # Set the User-Agent; the default one is sometimes blocked by anti-crawler checks
    headers = {'User-Agent': user_agent}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)  # plain GET request
        html = resp.text  # page content as a string
        if resp.status_code >= 400:  # 4xx/5xx: report the error and return None
            print('Download error:', resp.text)
            html = None
            if num_retries and 500 <= resp.status_code < 600:
                # 5xx server error: retry, keeping the same user agent and proxies
                return download(url, num_retries - 1, user_agent, proxies)

    except requests.exceptions.RequestException as e:  # other request errors (connection, timeout, ...)
        print('Download error:', e)
        html = None
    return html  # page content, or None on failure


print(download('http://www.baidu.com'))

Result:

Downloading: http://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK-->
...
</script>

<script>
if(navigator.cookieEnabled){
    document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>



</body>
</html>
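
The hand-written retry in `download` can also be delegated to the library. The sketch below is an alternative, not the approach used above: it mounts an `HTTPAdapter` configured with urllib3's `Retry` on a `Session`, so that 5xx responses are retried automatically. To stay runnable without Internet access it talks to a throwaway local server instead of a real site. `FlakyHandler` and `make_session` are hypothetical names introduced for illustration, and the zero backoff is an assumption suited only to a demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class FlakyHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: answers 503 twice, then 200."""
    calls = 0          # number of requests seen so far
    seen_agents = []   # User-Agent header of every request received

    def do_GET(self):
        FlakyHandler.calls += 1
        FlakyHandler.seen_agents.append(self.headers.get('User-Agent'))
        status = 503 if FlakyHandler.calls <= 2 else 200
        body = b'hello' if status == 200 else b'busy'
        self.send_response(status)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def make_session(num_retries=2, user_agent='wswp'):
    """Build a Session that retries 5xx responses up to num_retries times."""
    retry = Retry(total=num_retries,
                  status_forcelist=[500, 502, 503, 504],
                  backoff_factor=0)  # no delay between attempts (demo only)
    session = requests.Session()
    session.trust_env = False  # ignore proxy environment variables for the local demo
    session.headers['User-Agent'] = user_agent
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


server = HTTPServer(('127.0.0.1', 0), FlakyHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_address[1]
resp = make_session().get(url, timeout=5)
print(resp.status_code, resp.text, FlakyHandler.calls)

server.shutdown()
server.server_close()
```

When crawling a real site, a non-zero `backoff_factor` is usually the politer choice, since it spaces out the retries instead of sending them back to back.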