Learning Python: Web Scraping and Network Data Collection

Date: 2020-04-14

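Python has a reputation for making web scraping easy. Most of that productivity
comes from two modules: urllib (in the standard library) and requests (a third-party package).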

Network data collection with urllib

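The urllib library
Official documentation: https://docs.python.org/3/library/urllib.html
urllib is Python's built-in HTTP request library. It contains the following modules:
(1) urllib.request: the request module
(2) urllib.error: the exception-handling module
(3) urllib.parse: the URL parsing module
(4) urllib.robotparser: the robots.txt parsing module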
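
As a rough illustration of two of the modules listed above, the sketch below uses urllib.parse and urllib.robotparser without touching the network. The URL, the crawler name 'my-crawler', and the robots.txt rules are made up for the example.

```python
from urllib.parse import urlparse, parse_qs, urlencode
from urllib.robotparser import RobotFileParser

# Hypothetical URL, used only to show how urllib.parse splits it up.
url = 'http://127.0.0.1:5000/search?username=fentiao&page=1'
parts = urlparse(url)
query = parse_qs(parts.query)  # values come back as lists of strings

# The reverse direction: build a query string from a dict.
encoded = urlencode({'username': 'fentiao', 'page': 1})

# Feed robots.txt rules in directly instead of downloading them.
robots = RobotFileParser()
robots.parse([
    'User-agent: *',
    'Disallow: /private/',
])
allowed_home = robots.can_fetch('my-crawler', 'http://127.0.0.1:5000/')
allowed_private = robots.can_fetch('my-crawler', 'http://127.0.0.1:5000/private/data')

print(parts.netloc, parts.path, query)
print(encoded)
print(allowed_home, allowed_private)
```

In a real crawler the rules would normally be loaded with set_url() and read() rather than passed in by hand.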

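urllib: urlopen
urlopen performs simple requests. It does not support more complex features such as
authentication, cookies, and other advanced HTTP functionality; for those, use the
OpenerDirector object returned by build_opener().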
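
A minimal sketch of the build_opener() route, assuming cookie handling is the feature needed. The User-Agent string 'my-crawler/0.1' and the helper name fetch are hypothetical, and the final call is commented out because it needs a running server.

```python
import http.cookiejar
import urllib.request

# A cookie jar that will hold any cookies the server sets.
cookie_jar = http.cookiejar.CookieJar()

# build_opener() returns an OpenerDirector with the extra handler installed.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)
# Headers listed here are sent with every request made through this opener.
opener.addheaders = [('User-Agent', 'my-crawler/0.1')]  # hypothetical UA string


def fetch(url, timeout=10):
    """Fetch a page through the cookie-aware opener and return decoded text."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


# Example (needs a server listening on this address):
# html = fetch('http://127.0.0.1:5000/')
# print(len(cookie_jar), 'cookies stored')
```

Cookies received on one request are stored in cookie_jar and sent back automatically on later requests made through the same opener.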

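urllib: requesting a site with a spoofed User-Agent
To keep crawlers from overloading the server and bringing the site down, many sites
only respond to requests that carry certain header information. These headers can be
set through a urllib.request.Request object.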
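
A short sketch of setting the header with urllib.request.Request. The URL is the local test address used in this article's other examples, and the User-Agent value is just one sample Firefox string; the actual request is commented out since it needs a running server.

```python
import urllib.request

url = 'http://127.0.0.1:5000/'  # local test server address
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) '
                  'Gecko/20100101 Firefox/74.0',
}
request = urllib.request.Request(url, headers=headers)

# urllib stores header names via str.capitalize(), hence 'User-agent' here.
print(request.get_header('User-agent'))

# Sending the request needs a running server:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status, response.read()[:200])
```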

Network data collection with requests

The requests library
Official site: https://requests.readthedocs.io/en/master/
Requests is an elegant and simple HTTP library for Python, built for human
beings.
Summary of request methods

A Response object contains everything the server returned, along with the Request that produced it.

reqursts.py

import requests
from requests.exceptions import RequestException


def get():
    # GET can retrieve page data and can also submit non-sensitive data.
    # url = 'http://127.0.0.1:5000/'
    # url = 'http://127.0.0.1:5000/?username=fentiao&page=1&per_page=5'
    url = 'http://127.0.0.1:5000/'
    try:
        params = {
            'username': 'fentiao',
            'page': 1,
            'per_page': 5
        }
        response = requests.get(url, params=params)
        # Turn 4xx/5xx responses into requests.exceptions.HTTPError.
        response.raise_for_status()
        print(response.text, response.url)
        # print(response)
        # print(response.status_code)
        # print(response.text)
        # print(response.content)
        # print(response.encoding)
    except RequestException as e:
        # requests raises its own exceptions, not urllib.error.HTTPError.
        print("Failed to fetch %s: %s" % (url, e))


def post():
    url = 'http://127.0.0.1:5000/post'
    try:
        data = {
            'username': 'admin',
            'password': 'westos12'
        }
        response = requests.post(url, data=data)
        # Turn 4xx/5xx responses into requests.exceptions.HTTPError.
        response.raise_for_status()
        print(response.text)
    except RequestException as e:
        print("Failed to fetch %s: %s" % (url, e))


if __name__ == '__main__':
    get()
    # post()
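
The params dict passed to requests.get() ends up in the query string, which is why response.url prints the full URL. The sketch below reproduces that encoding with the standard library's urlencode. It is only an approximation of what requests does internally, though for simple values like these the result should be the same.

```python
from urllib.parse import urlencode

base_url = 'http://127.0.0.1:5000/'
params = {'username': 'fentiao', 'page': 1, 'per_page': 5}

# Join the base URL and the encoded parameters into one GET URL.
full_url = base_url + '?' + urlencode(params)
print(full_url)
```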

Advanced usage 1: adding headers

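Some sites only respond when the request carries browser information, and return an error if no headers are passed.
headers = { 'User-Agent': useragent}
response = requests.get(url, headers=headers)
The User-Agent is a string that identifies the browser, much like an ID card for it. When scraping a site,
rotating the User-Agent frequently can help avoid triggering anti-scraping mechanisms. The fake-useragent
package supports this well and is a handy tool for getting around such measures.
user_agent = UserAgent().random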

import requests
from fake_useragent import  UserAgent

def add_headers():
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
    # UserAgent fetches a full list of user agents from the network, then picks one at random.
    # https://fake-useragent.herokuapp.com/browsers/0.1.11
    ua = UserAgent()
    # By default, requests identifies itself as python-requests/<version>, e.g. python-requests/2.22.0.
    response = requests.get('http://127.0.0.1:5000', headers={'User-Agent': ua.random})
    print(response)

if __name__ == '__main__':
    add_headers()
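
If fake-useragent is unavailable (it depends on an online data source), a similar effect can be sketched with a small local pool and random.choice. The pool contents and the helper name random_headers are assumptions for illustration, not part of any library.

```python
import random

# Hypothetical pool; in practice fill it with current, real browser strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:74.0) Gecko/20100101 Firefox/74.0',
]


def random_headers(rng=random):
    """Return a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': rng.choice(USER_AGENTS)}


headers = random_headers()
print(headers)
# With requests installed: requests.get(url, headers=random_headers())
```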

Advanced usage 2: configuring an IP proxy

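While scraping, a crawler is sometimes blocked by the server. The usual remedies are to lower
the request rate and to send requests through a proxy IP. Proxy IPs can be scraped from the web or purchased online.
proxies = { "http": "http://127.0.0.1:9743", "https": "https://127.0.0.1:9743",}
response = requests.get(url, proxies=proxies)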
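
Lowering the request rate can be sketched as a small throttle that enforces a minimum gap between requests. The class name Throttle and the 0.05-second interval are arbitrary choices for this demo; a real crawler would pick an interval suited to the target site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.05)  # arbitrary demo value
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # requests.get(url, proxies=proxies) would go here
elapsed = time.monotonic() - start
print('elapsed: %.2f s' % elapsed)
```

The clock and sleep parameters exist only so the timing logic can be exercised without real waiting.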
Baidu keyword search endpoint:
https://www.baidu.com/baidu?wd=xxx&tn=monline_4_dg
360 keyword search endpoint: http://www.so.com/s?q=keyword
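
The two keyword endpoints above can be filled in with urlencode so that spaces and non-ASCII keywords are escaped correctly. The helper names below are hypothetical, and the parameter names (wd, tn, q) are taken directly from the URLs quoted in this article.

```python
from urllib.parse import urlencode


def baidu_search_url(keyword):
    """Build a Baidu search URL for the given keyword."""
    query = urlencode({'wd': keyword, 'tn': 'monline_4_dg'})
    return 'https://www.baidu.com/baidu?' + query


def so_search_url(keyword):
    """Build a 360 search URL for the given keyword."""
    return 'http://www.so.com/s?' + urlencode({'q': keyword})


print(baidu_search_url('python'))
print(so_search_url('web scraping'))
```

The resulting strings can be passed to requests.get() together with the headers and proxies shown in this section.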

import requests
from fake_useragent import UserAgent

ua = UserAgent()
proxies = {
    'http': 'http://222.95.144.65:3000',
    'https': 'https://182.92.220.212:8080'
}
response = requests.get('http://47.92.255.98:8000',
                        headers={'User-Agent': ua.random},
                        proxies=proxies
                        )

print(response)
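# The server echoes back the data submitted via GET and the IP of the requesting client.
# How to tell whether it worked: if the returned client IP matches the proxy IP, the proxy was used.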
print(response.text)
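
The check described in the comments above could be automated roughly as follows. This is a naive sketch: it assumes the server's response body contains the client IP as plain text, and the sample body is invented for the demo. The helper names are hypothetical.

```python
from urllib.parse import urlparse


def proxy_host(proxy_url):
    """Extract the host part of a proxy URL such as 'http://1.2.3.4:3000'."""
    return urlparse(proxy_url).hostname


def went_through_proxy(response_text, proxy_url):
    """Return True if the proxy's IP appears in the text the server echoed back."""
    host = proxy_host(proxy_url)
    return host is not None and host in response_text


# Made-up response body, standing in for response.text from the test server.
sample_text = 'client ip: 222.95.144.65, data: {}'
print(went_through_proxy(sample_text, 'http://222.95.144.65:3000'))
print(went_through_proxy(sample_text, 'https://182.92.220.212:8080'))
```

A plain substring match can produce false positives when one IP is a prefix of another, so a stricter version would parse the IP field out of the response first.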