[Python Web Scraping by Example] Part 2: Getting Free IP Proxies

Date: 2020-06-13

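Sites often check the client's IP address when they are being scraped, so it is worth having a source of IP proxies. After some tinkering I finally built my first proxy library; every function in it returns a list. (Note: these free proxies are updated in real time every day, and in my tests more than 60% of them were usable.) To keep the proxy library running stably over long periods, this article also adds another wrapper around the GET request of the requests library.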

Tools used

1. Python 3.6
2. The requests library
3. A free proxy website

I. Getting one free proxy via the API

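The free proxy website provides two API endpoints: one returns a single free proxy, and the other returns one page of free proxies (at most 15 per page).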

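import requests

def GetFreeProxy():
    # Get one free proxy; returns a one-element list of the form ['ip:port']
    url = 'https://www.freeip.top/api/proxy_ip'
    try:
        res = requests.get(url=url, timeout=20)
        # Parse the returned data as JSON
        result = res.json()
        return [result['data']['ip'] + ':' + str(result['data']['port'])]
    except Exception:
        print('Failed to get a proxy IP! Retrying...')
        # Retry on failure and hand the retry's result back to the caller
        # (the original discarded it and returned 0)
        return GetFreeProxy()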

II. Getting one page of free proxies via the API

from random import choice
from urllib import parse

import requests

# user_agent_list (a list of User-Agent strings) is not shown in the original
# and must be defined before calling. GET is the wrapper defined later in this article.

def GetFreeProxyListAPI(page=1, country='', isp='', order_by='validated_at', order_rule='DESC'):
    # Get one page of free proxies
    # Returns a list of 'ip:port' strings
    # Parameter     Type    Required  Description      Example
    # page          int     N         Page number      1
    # country       string  N         Country          中國,美國
    # isp           string  N         ISP              電信,阿里雲
    # order_by      string  N         Sort field       speed: response speed, validated_at: latest validation time, created_at: survival time
    # order_rule    string  N         Sort direction   DESC: descending, ASC: ascending
    data = {
        'page': str(page),
        'country': country,
        'isp': isp,
        'order_by': order_by,
        'order_rule': order_rule
    }
    url = 'https://www.freeip.top/api/proxy_ips' + '?' + parse.urlencode(data)
    headers = {
        'User-Agent': str(choice(user_agent_list))
    }
    session = requests.session()
    res = GET(session, url=url, headers=headers)
    # Parse the data
    result = res.json()
    # Iterate over the returned rows themselves; indexing up to result['data']['to']
    # breaks whenever that value differs from the number of rows on the page
    return [item['ip'] + ':' + str(item['port']) for item in result['data']['data']]
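
The parsing step in GetFreeProxyListAPI can be tried offline. The sketch below assumes the response has the shape the function indexes into (a `data` object holding a `data` list of rows with `ip` and `port` fields). The sample payload and the helper name `extract_proxies` are made up for illustration and are not taken from the site's documentation.

```python
def extract_proxies(result):
    """Return 'ip:port' strings from a decoded API response.

    Assumes the nested layout used in GetFreeProxyListAPI:
    result['data']['data'] is a list of rows with 'ip' and 'port'.
    """
    rows = result.get('data', {}).get('data', [])
    # str() covers the case where the port arrives as a number
    return [row['ip'] + ':' + str(row['port']) for row in rows]


# Hypothetical sample payload (documentation-range addresses, not real proxies)
sample = {
    'data': {
        'to': 2,
        'data': [
            {'ip': '192.0.2.10', 'port': '8080'},
            {'ip': '192.0.2.11', 'port': 3128},
        ],
    }
}

print(extract_proxies(sample))
```

Working from the rows themselves, rather than from a separate counter field, keeps the result correct even if the page holds fewer rows than expected.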

III. Getting one free proxy from the web page

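import random

import requests
from lxml import etree

# user_agent_list (a list of User-Agent strings) is not shown in the original
# and must be defined before calling. GET is the wrapper defined later in this article.

def GetFreeProxy():
    # Scrape the first proxy listed on the web page
    # Returns a one-element list of the form ['ip:port']
    headers = {
        'User-Agent': str(random.choice(user_agent_list))
    }
    session = requests.session()
    # Choose the proxy website
    homeurl = 'https://www.freeip.top/'
    url = 'https://www.freeip.top/?page='
    # Visit the home page first, then request the listing with the same session
    GET(session=session, url=homeurl, headers=headers)
    res = GET(session=session, url=url, headers=headers)
    html = etree.HTML(res.text)
    # Select the IP cell and the port cell of the first row
    IP_list_1 = html.xpath('//tr[1]/td[1]')
    IP_list_2 = html.xpath('//tr[1]/td[2]')
    return [IP_list_1[0].text + ':' + IP_list_2[0].text]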

IV. Getting a page of free proxies from the web page (recommended)

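from os import remove
from os.path import exists as _exists
from random import choice

import requests
from lxml import etree

# user_agent_list (a list of User-Agent strings) is not shown in the original
# and must be defined before calling. GET is the wrapper defined later in this article.

def GetFreeProxyList(GetType=1, protocol='https'):
    # More than 50% of these proxies are usable; recommended
    # method=2 is pure-HTTP
    # Returns a list of 'ip:port' strings
    headers = {
        'User-Agent': str(choice(user_agent_list)),
    }
    session = requests.session()
    # Choose the proxy website
    if GetType == 1:
        homeurl = 'https://www.freeip.top/'
        url = 'https://www.freeip.top/?page=1&protocol=' + protocol
        GET(session=session, url=homeurl, headers=headers)
    elif GetType == 2:
        homeurl = 'https://www.kuaidaili.com/'
        url = 'https://www.kuaidaili.com/free/inha/'
        GET(session=session, url=homeurl, headers=headers)
    else:
        print('Other methods are not supported yet!')
        # Return an empty list so callers always receive a list
        return []
    # Get the IP list
    num = 1
    if _exists('IP.txt'):
        remove('IP.txt')
    IP_list = []
    while True:
        res = GET(session=session, url=url, headers=headers)
        html = etree.HTML(res.content.decode('utf-8'))
        # Select the IP cells and the port cells
        IP_list_1 = html.xpath('//tr/td[1]')
        IP_list_2 = html.xpath('//tr/td[2]')
        # Stop at the first page that has no rows
        if not IP_list_1:
            break
        IP_list.extend(a.text + ':' + b.text for a, b in zip(IP_list_1, IP_list_2))
        if GetType != 1:
            # Only the freeip.top URL is advanced below, so stop after one page
            # instead of requesting the same page forever
            break
        # Move on to the next page (the original requested page 1 twice)
        num = num + 1
        url = 'https://www.freeip.top/?page=' + str(num) + '&protocol=' + protocol
    return IP_list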
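
The 'ip:port' strings returned by GetFreeProxyList still have to be handed to requests, and free proxies are worth checking before use. Below is a minimal sketch, assuming the proxy speaks plain HTTP and that `https://httpbin.org/ip` is an acceptable test URL. Both are assumptions, and `to_proxies` and `is_alive` are hypothetical helper names that do not appear in the original library.

```python
def to_proxies(address, scheme='http'):
    """Build the proxies mapping that requests accepts for one 'ip:port' string."""
    proxy_url = scheme + '://' + address
    # The same proxy is used for both http:// and https:// targets
    return {'http': proxy_url, 'https': proxy_url}


def is_alive(address, test_url='https://httpbin.org/ip', timeout=5):
    """Return True if a request through the proxy succeeds within the timeout."""
    # Imported here so to_proxies stays usable without requests installed
    import requests
    try:
        res = requests.get(test_url, proxies=to_proxies(address), timeout=timeout)
        return res.status_code == 200
    except requests.RequestException:
        return False


print(to_proxies('192.0.2.10:8080'))
```

A list of working proxies could then be built with something like `[p for p in GetFreeProxyList() if is_alive(p)]`, at the cost of one test request per proxy.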

V. The re-wrapped GET request

from time import sleep

# Free proxy websites are unstable, and fetching many proxies often triggers 503 errors, so requests have to be retried several times
def GET(session, url, headers, timeout=15, num=0):
    try:
        # verify=False skips TLS certificate verification
        response = session.get(url=url, headers=headers, timeout=timeout, verify=False)
        if response.status_code == 200:
            return response
        print('Server error, starting retry %i...' % (num + 1))
    except Exception:
        print('Connection error, starting retry %i...' % (num + 1))
    sleep(0.8)
    # Retry with the same timeout (the original dropped it) and pass the result back
    return GET(session=session, url=url, headers=headers, timeout=timeout, num=num + 1)
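
Because GET retries by calling itself with no upper limit, a site that stays down will eventually exhaust Python's recursion limit. One possible variation, not the author's method, is a loop with a retry cap. In the sketch below, `get_with_retry`, its `max_retries` parameter, and the `FakeResponse` stand-in are all hypothetical names introduced for illustration.

```python
from time import sleep


def get_with_retry(fetch, max_retries=5, delay=0.8):
    """Call fetch() until it returns a 200 response or retries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute. Returns the response, or None on giving up.
    """
    for attempt in range(1, max_retries + 1):
        try:
            response = fetch()
            if response.status_code == 200:
                return response
            print('Server error on attempt %i' % attempt)
        except Exception:
            print('Connection error on attempt %i' % attempt)
        # Wait before the next attempt, but not after the last one
        if attempt < max_retries:
            sleep(delay)
    return None


class FakeResponse:
    """Stand-in for a requests response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a flaky server: two 503 replies, then a 200
replies = iter([FakeResponse(503), FakeResponse(503), FakeResponse(200)])
result = get_with_retry(lambda: next(replies), delay=0)
print(result.status_code)
```

With a real session the call would look like `get_with_retry(lambda: session.get(url, headers=headers, timeout=15))`, and the caller has to handle a `None` result.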