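Crawlers often get their IP banned, and the most effective countermeasure is to route requests through proxy IPs. You can buy proxy IPs from some platforms, but they are fairly expensive. Many proxy sites also publish free proxy IPs, so we can scrape those and serve them through a web API.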
The complete project code is hosted on GitHub: https://github.com/panjings/p...
Let's start the analysis from the program's entry point, run.py:
    from proxypool.api import app
    from proxypool.schedule import Schedule


    def main():
        s = Schedule()
        # Start the scheduler
        s.run()
        # Start the web API
        app.run()


    if __name__ == '__main__':
        main()
As run.py shows, the program first starts a scheduler and then starts the web API.
The scheduler, schedule.py:
    # Standard-library imports this excerpt relies on.
    import time
    from multiprocessing import Process

    # RedisClient, ValidityTester, PoolAdder and the upper-case constants
    # are defined elsewhere in the project.


    class Schedule(object):
        @staticmethod
        def valid_proxy(cycle=VALID_CHECK_CYCLE):
            """
            Periodically re-test half of the proxies stored in Redis.
            """
            conn = RedisClient()
            tester = ValidityTester()
            while True:
                print('Refreshing ip')
                count = int(0.5 * conn.queue_len)
                if count == 0:
                    # Nothing to test yet; wait for the pool to be filled.
                    print('Waiting for adding')
                    time.sleep(cycle)
                    continue
                raw_proxies = conn.get(count)
                tester.set_raw_proxies(raw_proxies)
                tester.test()
                time.sleep(cycle)

        @staticmethod
        def check_pool(lower_threshold=POOL_LOWER_THRESHOLD,
                       upper_threshold=POOL_UPPER_THRESHOLD,
                       cycle=POOL_LEN_CHECK_CYCLE):
            """
            If the number of proxies is less than lower_threshold, add proxies.
            """
            conn = RedisClient()
            adder = PoolAdder(upper_threshold)
            while True:
                if conn.queue_len < lower_threshold:
                    adder.add_to_queue()
                time.sleep(cycle)

        def run(self):
            print('Ip processing running')
            # Run validation and pool refilling in two separate processes.
            valid_process = Process(target=Schedule.valid_proxy)
            check_process = Process(target=Schedule.check_pool)
            valid_process.start()
            check_process.start()
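Schedule first defines valid_proxy(), which checks whether the proxies in the pool are still usable. The test_single_proxy() method of the ValidityTester class is the key to running these checks asynchronously.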
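To give a feel for what asynchronous checking can look like, here is a minimal sketch that uses only asyncio. It is not the project's ValidityTester: check_single_proxy() (playing the role of test_single_proxy()), filter_valid(), fake_probe() and the timeout value are hypothetical names introduced for illustration, and a real probe would send an HTTP request through the proxy instead of sleeping.

```python
import asyncio


async def check_single_proxy(proxy, probe, timeout=5.0):
    """Return True if `probe(proxy)` succeeds within `timeout` seconds."""
    try:
        return bool(await asyncio.wait_for(probe(proxy), timeout))
    except (asyncio.TimeoutError, OSError):
        # Slow or unreachable proxies count as invalid.
        return False


async def filter_valid(proxies, probe, timeout=5.0):
    """Check all proxies concurrently and keep the ones that pass."""
    results = await asyncio.gather(
        *(check_single_proxy(p, probe, timeout) for p in proxies)
    )
    return [p for p, ok in zip(proxies, results) if ok]


async def fake_probe(proxy):
    """Hypothetical stand-in for a real request sent through the proxy."""
    await asyncio.sleep(0.01)
    if proxy.endswith(":0"):
        raise OSError("connection refused")
    return True
```

Calling asyncio.run(filter_valid(proxies, fake_probe)) checks every proxy at once, so the total time is roughly that of the slowest single check rather than the sum of all of them.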
Next, check_pool() takes three parameters: the lower and upper bounds of the pool size, and a check interval. The add_to_queue() method of PoolAdder relies on FreeProxyGetter, a class that scrapes IPs from websites and is defined in getter.py.
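The threshold logic can be sketched as follows. This is an illustration under assumptions, not the project's PoolAdder: InMemoryPool is a hypothetical stand-in for the Redis-backed queue, and refill(), fetch_batch and max_rounds are names made up for this example.

```python
from collections import deque


class InMemoryPool:
    """Hypothetical stand-in for the Redis-backed proxy queue."""

    def __init__(self):
        self._queue = deque()

    @property
    def queue_len(self):
        return len(self._queue)

    def put(self, proxy):
        self._queue.append(proxy)


def refill(pool, fetch_batch, lower_threshold, upper_threshold, max_rounds=10):
    """Top the pool up to upper_threshold once it drops below lower_threshold.

    `fetch_batch` is any callable returning a list of proxies.
    Returns the number of proxies added.
    """
    if pool.queue_len >= lower_threshold:
        return 0  # the pool is still healthy, nothing to do
    added = 0
    for _ in range(max_rounds):
        if pool.queue_len >= upper_threshold:
            break
        for proxy in fetch_batch():
            if pool.queue_len >= upper_threshold:
                break
            pool.put(proxy)
            added += 1
    return added
```

Using two thresholds instead of one means the pool is refilled in occasional large batches rather than on every small dip, which keeps the number of scraping runs down.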
The web API, api.py:
    from flask import Flask, g

    from .db import RedisClient

    __all__ = ['app']

    app = Flask(__name__)


    def get_conn():
        """
        Opens a new redis connection if there is none yet for the
        current application context.
        """
        if not hasattr(g, 'redis_client'):
            g.redis_client = RedisClient()
        return g.redis_client


    @app.route('/')
    def index():
        return '<h2>Welcome to Proxy Pool System</h2>'


    @app.route('/get')
    def get_proxy():
        """
        Get a proxy
        """
        conn = get_conn()
        return conn.pop()


    @app.route('/count')
    def get_counts():
        """
        Get the count of proxies
        """
        conn = get_conn()
        return str(conn.queue_len)


    if __name__ == '__main__':
        app.run()
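As you can see, api.py uses Flask's route decorators to define the endpoints: / returns a welcome page, /get returns one proxy, and /count returns the number of proxies currently in the pool.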
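On the consuming side, a crawler could use the /get endpoint roughly like this. The sketch makes two assumptions that are not confirmed by the code above: that the service listens on Flask's default address, http://127.0.0.1:5000, and that each proxy is returned as a plain host:port string. The helper names fetch_proxy(), as_proxies() and opener_for() are hypothetical.

```python
from urllib.request import ProxyHandler, build_opener, urlopen


def fetch_proxy(base_url="http://127.0.0.1:5000"):
    """Ask the proxy pool service for one proxy (assumed to be 'host:port')."""
    with urlopen(base_url + "/get", timeout=5) as resp:
        return resp.read().decode("utf-8").strip()


def as_proxies(proxy):
    """Build the scheme-to-URL mapping that HTTP clients expect."""
    return {"http": "http://" + proxy, "https": "http://" + proxy}


def opener_for(proxy):
    """Return a urllib opener that sends its requests through `proxy`."""
    return build_opener(ProxyHandler(as_proxies(proxy)))
```

A crawler would then call opener_for(fetch_proxy()).open(url) and, if the request fails, simply fetch another proxy and retry.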
For the full implementation, see the GitHub repository.