Building Your Own Proxy IP Pool in Python

Date 2019-12-19


Code

GitHub

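Purpose

While crawling, a site's anti-scraping measures often force you to switch IPs at regular intervals. So I built a pool of working proxy IPs to use in later crawling jobs.

Approach

Crawl the Xici free proxy site (xicidaili.com), filter out the proxies that actually work, and store them in a database.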

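Dependencies

requests: HTTP requests
pyquery: a jQuery-like library for Python, used to parse HTML elements
PyMySQL: MySQL client. This example stores its data in MySQL, since a database makes the data much easier to work with.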

Implementation

Crawl the pages and extract the data

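import requests
from pyquery import PyQuery as pq

# UA is the author's own config dict of User-Agent strings (not shown in this excerpt).
def getProxy(protocal, link, page=1):
    try:
        url = f'https://www.xicidaili.com/{link}/{page}'
        res = requests.get(url, headers={'User-Agent': UA['PC']})
        if res and res.status_code == 200:
            html = pq(res.text)('#ip_list tr')

            for i in range(html.length):
                # Assumed reconstruction: the excerpt uses tds without defining it.
                tds = pq(html[i])('td')
                if len(tds) < 3:
                    continue  # the header row has no <td> cells
                host = pq(tds[1]).text()
                port = pq(tds[2]).text()
                # (the rest of the loop is omitted in the original excerpt)
    except Exception as e:
        print('getProxy error:', e)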

The code above (an excerpt) parses the Xici free proxy site and extracts each proxy's IP and port.

Check that the IP and port work

Many of the IPs listed on the Xici free proxy site do not work, so they have to be filtered before being stored.

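def checkProxy(proxylink):
    try:
        # Request Baidu through the proxy; give up after 5 seconds.
        ret = requests.get(
            'https://www.baidu.com',
            proxies={'https': proxylink},
            timeout=5,
        )
        if ret and ret.status_code == 200:
            print(proxylink)
            return True
    except Exception:
        pass
    # Any exception or non-200 response counts as an invalid proxy.
    return False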

The function above requests Baidu through the proxy to check whether the proxy works.
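Because each check can wait up to the 5-second timeout, checking candidates one at a time gets slow. One possible way to speed this up, not taken from the original code, is to run the checks in a thread pool. In the sketch below, filter_valid and fake_check are hypothetical names, and fake_check stands in for a real network check such as checkProxy so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_valid(proxylinks, check, max_workers=8):
    """Return the links for which check(link) is truthy, in input order."""
    links = list(proxylinks)
    if not links:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with links.
        results = list(pool.map(check, links))
    return [link for link, ok in zip(links, results) if ok]


def fake_check(link):
    # Stand-in for a real check; pretends only port 8080 proxies work.
    return link.endswith(':8080')


candidates = ['http://10.0.0.1:8080', 'http://10.0.0.2:3128']
print(filter_valid(candidates, fake_check))
```

Passing the checker in as an argument keeps the filtering logic independent of the network, which also makes it easy to test.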

Store valid IPs in the database, and remove IPs that are already stored but no longer work

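# Connect to the database
def connect():
    try:
        # MYSQL is the author's own config dict (not shown here).
        # Keyword arguments are used because newer PyMySQL releases
        # no longer accept these parameters positionally.
        db = pymysql.connect(
            host=MYSQL['host'],
            user=MYSQL['username'],
            password=MYSQL['password'],
            database=MYSQL['dbname'],
        )
        cursor = db.cursor()
        return {'db': db, 'cursor': cursor}
    except Exception as e:
        print('connect error:', e)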

*****
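mysql = connect()
# Check whether the proxy address already exists in the DB.
# Values are passed as query parameters instead of being formatted
# into the SQL string, so scraped text cannot alter the statement.
mysql['cursor'].execute(
    'select count(*) from proxy where host = %s and port = %s',
    (host, port),
)
# If the proxy is valid and not yet in the DB, insert it
mysql['cursor'].execute(
    'insert into proxy(host,port,protocal) values(%s,%s,%s)',
    (host, port, protocal),
)
mysql['db'].commit()
# If the proxy is invalid but exists in the DB, delete it
mysql['cursor'].execute(
    'delete from proxy where host = %s and port = %s',
    (host, port),
)
mysql['db'].commit()
# Close the connection
mysql['db'].close()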
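The snippets above show the three statements separately; the branching between them is elided in the original. As a rough sketch of how that decision could be wired together, the example below uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL so that it runs without a server. The function name sync_proxy and the table definition are assumptions; only the proxy table name and its host, port, and protocal columns come from the original. Note that sqlite3 uses ? placeholders, whereas PyMySQL uses %s.

```python
import sqlite3


def sync_proxy(conn, host, port, protocal, is_valid):
    """Insert a valid proxy that is missing, or delete an invalid one that is stored."""
    cur = conn.cursor()
    cur.execute(
        'select count(*) from proxy where host = ? and port = ?',
        (host, port),
    )
    exists = cur.fetchone()[0] > 0
    if is_valid and not exists:
        cur.execute(
            'insert into proxy(host, port, protocal) values (?, ?, ?)',
            (host, port, protocal),
        )
    elif not is_valid and exists:
        cur.execute(
            'delete from proxy where host = ? and port = ?',
            (host, port),
        )
    conn.commit()


# Assumed schema, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute('create table proxy (host text, port text, protocal text)')

sync_proxy(conn, '10.0.0.1', '8080', 'https', True)   # valid and new: inserted
sync_proxy(conn, '10.0.0.1', '8080', 'https', True)   # already stored: no duplicate
sync_proxy(conn, '10.0.0.2', '3128', 'https', False)  # invalid and not stored: no-op
sync_proxy(conn, '10.0.0.1', '8080', 'https', False)  # invalid and stored: deleted
print(conn.execute('select count(*) from proxy').fetchone()[0])
```

Checking for existence first keeps the table free of duplicate rows even without a unique constraint, which matches the count query in the original snippets.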