Crawling Usable Proxy IPs with Python

Normally, once a crawler exceeds a certain request frequency or count, the corresponding public IP gets banned. To crawl large amounts of data reliably, people usually buy large batches of proxy IPs from Taobao, typically around 10 yuan for 100,000 IPs per day. However, a large share of these IPs are invalid, so you have to keep retrying or verifying them yourself; in fact, the sellers themselves scrape these IPs from public proxy sites.

That being the case, why not crawl them ourselves? The basic idea is quite simple:

(1) Find a site that publishes free proxy IPs and parse the IPs out of its pages

(2) Verify that each IP actually works

(3) Store the valid IPs, or turn them into a service (a minimal storage sketch follows the demo below)

A demo follows:

import requests
from bs4 import BeautifulSoup
import re
import socket
import logging

logging.basicConfig(level=logging.DEBUG)


def proxy_spider(page_num):
    """Crawl page_num pages of the xicidaili free-proxy list and check every entry."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    for i in range(page_num):
        url = 'http://www.xicidaili.com/wt/' + str(i + 1)
        r = requests.get(url=url, headers=headers)
        soup = BeautifulSoup(r.text, "html.parser")
        # The empty alternative in the regex matches any class value, so both the
        # 'odd' rows and the unnamed even rows of the proxy table are selected.
        datas = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
        for data in datas:
            proxy_contents = data.find_all(name='td')
            ip = str(proxy_contents[1].string)        # column 1: IP address
            port = str(proxy_contents[2].string)      # column 2: port
            protocol = str(proxy_contents[5].string)  # column 5: HTTP / HTTPS
            wan_proxy_check(ip, port, protocol)


def local_proxy_check(ip, port, protocol):
    """Cheap reachability check: can we even open a TCP connection to ip:port?
    This does not prove the proxy forwards traffic; wan_proxy_check below is stricter.
    The protocol argument is kept only so both checks share the same signature."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.settimeout(1)
        s.connect((ip, int(port)))
        s.shutdown(socket.SHUT_RDWR)
        logging.debug("{} {}".format(ip, port))
        return True
    except OSError:
        logging.debug("-------- {} {}".format(ip, port))
        return False
    finally:
        s.close()


"""
幾種在Linux下查詢外網IP的辦法
https://my.oschina.net/epstar/blog/513186
各大巨頭電商提供的IP庫API接口-新浪、搜狐、阿里
http://zhaoshijie.iteye.com/blog/2205033
"""


def wan_proxy_check(ip, port, protocol):
    """Strict check: fetch a WAN-IP echo page through the proxy and confirm that the
    IP the remote side sees is the proxy's own IP."""
    proxy = {protocol.lower(): '%s:%s' % (ip, port)}
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    try:
        result = requests.get("http://pv.sohu.com/cityjson", headers=headers, proxies=proxy, timeout=1).text.strip("\n")
        wan_ip = re.findall(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", result)[0]
        if wan_ip == ip:
            logging.info("{} {} {}".format(protocol, wan_ip, port))
            logging.debug("========================")
            return True
        logging.debug("//// Proxy bad: {} {}".format(wan_ip, port))
        return False
    except Exception as e:
        logging.debug("#### Exception: {}".format(str(e)))
        return False


if __name__ == '__main__':
    proxy_spider(1)
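
The demo above only logs results. As a minimal sketch of step (3), and assuming wan_proxy_check returns True/False as in the version above, the valid entries can be collected into a JSON file and then picked at random when crawling. The names collect_valid_proxies, get_with_random_proxy and proxies.json are illustrative choices, not part of the original demo.

import json
import random

def collect_valid_proxies(candidates, out_file='proxies.json'):
    """Filter (ip, port, protocol) candidates with wan_proxy_check and save the survivors."""
    valid = [{'ip': ip, 'port': port, 'protocol': protocol}
             for ip, port, protocol in candidates
             if wan_proxy_check(ip, port, protocol)]
    with open(out_file, 'w') as f:
        json.dump(valid, f, indent=2)
    return valid

def get_with_random_proxy(url, proxies_file='proxies.json'):
    """Fetch a URL through a proxy chosen at random from the saved pool."""
    with open(proxies_file) as f:
        pool = json.load(f)
    p = random.choice(pool)
    # The proxy itself is reached over plain HTTP even when it tunnels HTTPS traffic.
    proxies = {p['protocol'].lower(): 'http://%s:%s' % (p['ip'], p['port'])}
    return requests.get(url, proxies=proxies, timeout=5)

To feed candidates in, proxy_spider's inner loop can be changed to yield (ip, port, protocol) tuples instead of calling the check itself.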

