[Crawler Craft] A Veteran Scraper Walks You Through Crawling Proxy IPs

Date: 2019-12-05


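Sometimes, while reading a novel on a website, a notice such as "Suspected malicious crawling by a bot; access is temporarily unavailable" pops up out of nowhere, and you have to refresh the page or enter a CAPTCHA to get back in. This happens now and then, and most of you have probably run into it. The cause is that the site we are browsing has anti-crawler measures in place. This matters especially when writing a crawler: once a single IP requests pages too many times within a unit of time, the server refuses service. That is an IP ban triggered by access frequency, and waiting for the ban to be lifted is not a good solution. So the idea is to disguise our own IP when requesting pages, which is today's topic: using proxy IPs.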

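There are many proxy IP providers online, both free and paid, for example Xici Proxy (西刺代理), Wandou Proxy (豌豆代理), and Kuaidaili (快代理). The free ones cost nothing, but few of them work and they are unstable; the paid ones may be somewhat better. Today, though, I only crawl the free Xici proxies, test whether each one is usable, and store the usable IPs in MongoDB so they are easy to retrieve next time.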

Platform: Windows

Python version: Python 3.6

**IDE:** Sublime Text

Other: Chrome browser

The workflow in brief:

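Step 1: Understand how to use proxies with requests

Step 2: Crawl IPs and ports from the Xici proxy pages

Step 3: Check whether the crawled IPs are usable

Step 4: Store the usable proxies in MongoDB

Step 5: Randomly pick an IP from the database of usable IPs and return it once it passes a test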

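With requests, setting a proxy is fairly simple: you only need to pass the proxies parameter.

Note that here I have the packet-capture tool Fiddler installed locally and used it to create an HTTP proxy service on local port 8888 (with the Chrome extension SwitchyOmega), so the proxy service is 127.0.0.1:8888. Once this proxy is set, the local IP is switched to the IP of the server that the proxy software connects to.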

import requests

proxy = '127.0.0.1:8888'
proxies = {
    'http':'http://' + proxy,
    'https':'http://' + proxy
}

try:
    response = requests.get('http://httpbin.org/get',proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error',e.args)

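Here I use http://httpbin.org/get as the test site. Visiting this page returns information about the request; the origin field is the client IP, so we can tell from the response whether the proxy worked. The result looks like this: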

{
    "args":{},
    "headers":{
        "Accept":"*/*",
        "Accept-Encoding":"gzip, deflate",
        "Connection":"close",
        "Host":"httpbin.org",
        "User-Agent":"python-requests/2.18.4"
    },
    "origin":"xx.xxx.xxx.xxx",
    "url":"http://httpbin.org/get"
}
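To make the check on `origin` concrete, here is a minimal sketch that compares the `origin` value of such a response with your real public IP. The helper name `proxy_took_effect` and the sample addresses are my own illustrative assumptions, not part of the original script. The sketch splits on commas in case `origin` lists more than one address.

```python
import json

def proxy_took_effect(response_text, real_ip):
    """Return True if real_ip does not appear in the httpbin `origin` field."""
    data = json.loads(response_text)
    # Split on commas in case several addresses are reported
    origins = [part.strip() for part in data.get("origin", "").split(",")]
    origins = [o for o in origins if o]
    return bool(origins) and real_ip not in origins

# Made-up sample response and addresses, for illustration only
sample = '{"args": {}, "origin": "203.0.113.7", "url": "http://httpbin.org/get"}'
print(proxy_took_effect(sample, "198.51.100.23"))  # real IP is hidden
print(proxy_took_effect(sample, "203.0.113.7"))    # real IP leaked through
```

In practice you would pass `response.text` from the request above and the IP that httpbin reports when no proxy is set.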

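Next we start crawling Xici Proxy. First, open the page in Chrome and locate the elements that hold the IP and port.

As you can see, Xici Proxy stores IP addresses and related information in a table, so BeautifulSoup can extract them easily. One thing to watch for: the crawled IPs may well contain duplicates, especially when we crawl several proxy pages and store them in the same list, so we can use a set to remove duplicate IPs.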

def scrawl_xici_ip(num):
    '''
    Crawl proxy IP addresses; the proxy source is Xici Proxy.
    '''
    ip_list = []
    for num_page in range(1, num + 1):  # pages 1..num inclusive
        url = url_ip + str(num_page)
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            content = response.text
            soup = BeautifulSoup(content, 'lxml')
            trs = soup.find_all('tr')
            for tr in trs[1:]:  # skip the table header row
                tds = tr.find_all('td')
                if len(tds) < 3:
                    continue  # not a data row
                ip_item = tds[1].text.strip() + ':' + tds[2].text.strip()
                ip_list.append(ip_item)
            time.sleep(count_time)  # pause count_time seconds between pages
    return list(set(ip_list))  # remove possible duplicate IPs
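Free proxy tables sometimes contain malformed rows, and converting to a set loses the page order. As a hedged sketch of an alternative cleanup step, the helpers below validate each `ip:port` string with the standard `ipaddress` module and remove duplicates while keeping first-seen order. The names `is_valid_proxy` and `clean_proxy_list` and the sample addresses are assumptions for illustration; they are not in the original script.

```python
import ipaddress

def is_valid_proxy(item):
    """Check that item looks like 'ip:port' with a valid IP and port 1-65535."""
    host, sep, port = item.strip().rpartition(":")
    if not sep or not port.isdecimal():
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return 1 <= int(port) <= 65535

def clean_proxy_list(items):
    """Strip whitespace, drop invalid entries, and dedupe keeping first-seen order."""
    stripped = (item.strip() for item in items)
    # dict.fromkeys keeps insertion order, unlike set()
    return list(dict.fromkeys(s for s in stripped if is_valid_proxy(s)))

# Made-up sample data using documentation-only address ranges
raw = ["192.0.2.10:8080", "192.0.2.10:8080", " 198.51.100.4:3128 ",
       "not-an-ip:80", "203.0.113.9:99999"]
print(clean_proxy_list(raw))
```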

Once the IPs from the requested pages have been crawled and stored in a list, test them one by one.

def ip_test(url_for_test, ip_info):
    '''
    Test the crawled IPs; store each one that passes in MongoDB.
    '''
    for ip_for_test in ip_info:
        # Set up the proxy
        proxies = {
            'http': 'http://' + ip_for_test,
            'https': 'http://' + ip_for_test,
        }
        print(proxies)
        try:
            response = requests.get(url_for_test, headers=headers, proxies=proxies, timeout=10)
            if response.status_code == 200:
                ip = {'ip': ip_for_test}
                print(response.text)
                print('Test passed')
                write_to_MongoDB(ip)
        except Exception as e:
            # Dead or slow proxies raise here; report and move on
            print(e)
            continue
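Testing proxies one at a time with a 10-second timeout gets slow as the list grows. One possible speed-up, sketched here under my own assumptions, is to run the checks in a thread pool from the standard `concurrent.futures` module. `build_proxies`, `filter_working`, and `fake_check` are hypothetical names; `fake_check` is a stand-in so the sketch runs without network access, and a real checker would call `requests.get(url_for_test, proxies=build_proxies(ip_port), timeout=10)` instead.

```python
from concurrent.futures import ThreadPoolExecutor

def build_proxies(ip_port):
    """Build the proxies dict that requests expects for one 'ip:port' string."""
    return {'http': 'http://' + ip_port, 'https': 'http://' + ip_port}

def filter_working(ip_list, check, max_workers=10):
    """Run check(ip_port) for every entry in parallel; keep entries that return True."""
    def safe_check(ip_port):
        try:
            return bool(check(ip_port))
        except Exception:
            return False  # any error counts as a failed proxy
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_check, ip_list))  # map preserves input order
    return [ip for ip, ok in zip(ip_list, results) if ok]

def fake_check(ip_port):
    """Stand-in checker for demonstration only (no network access)."""
    if ip_port.endswith(':0'):
        raise ConnectionError('unreachable')
    return ip_port.startswith('192.0.2.')

candidates = ['192.0.2.10:8080', '198.51.100.4:3128', '192.0.2.11:0']
print(filter_working(candidates, fake_check))
```

Threads suit this job because each check spends nearly all its time waiting on the network.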

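This uses the requests proxy setup described above. We use http://httpbin.org/ip as the test site, which returns our IP address directly; once the test passes, the IP is stored in the MongoDB database.

Storing data in MongoDB was already covered in the previous post on crawling Qiushibaike (糗事百科). Connect to the database, specify the database and collection, insert the data, and you are done.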

def write_to_MongoDB(proxies):
    '''
    Store an IP that passed the test in MongoDB.
    '''
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.PROXY
    collection = db.proxies
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    result = collection.insert_one(proxies)
    print(result.inserted_id)
    print('Stored in MongoDB')
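Running the crawler repeatedly inserts the same proxy more than once. A minimal sketch of one way around that, assuming you are happy to add a timestamp field, is an upsert keyed on the `ip` field. The function name `save_proxy` and the `checked_at` field are my assumptions and not part of the original script; `update_one(..., upsert=True)` is the standard PyMongo call.

```python
import time

def save_proxy(collection, ip_port):
    """Insert the proxy if it is new, otherwise only refresh its timestamp."""
    return collection.update_one(
        {'ip': ip_port},                        # match on the stored 'ip' field
        {'$set': {'checked_at': time.time()}},  # hypothetical extra field
        upsert=True,                            # create the document when no match exists
    )

# Usage with a running MongoDB server (left commented so the sketch runs offline):
# import pymongo
# collection = pymongo.MongoClient(host='localhost', port=27017).PROXY.proxies
# save_proxy(collection, '192.0.2.10:8080')
```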

Finally, run it and look at the result.


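Hold on: after it had run for a while, I was lucky enough to see three tests pass in a row, so I quickly took a screenshot. The truth is that these are free proxies, so few of them work and they really do not live long. Still, if you crawl enough of them you can find usable ones, and for practice that is just about sufficient. Now let's look at what is stored in the database.

Because I crawled only a few pages, few of the IPs were valid, and I did not crawl for long, the database does not hold many IPs yet, but at least these ones are saved. Now let's see how to take one out at random.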

def get_random_ip():
    '''
    Pick one IP at random, retest it, and return it if it still works.
    '''
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.PROXY
    collection = db.proxies
    items = list(collection.find())
    if not items:
        print('No proxies left in MongoDB')
        return None
    item = random.choice(items)
    useful_proxy = item['ip'].replace('\n', '')
    proxy = {
        'http': 'http://' + useful_proxy,
        'https': 'http://' + useful_proxy,
    }
    try:
        response = requests.get(url_for_test, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return useful_proxy
    except requests.exceptions.RequestException as e:
        print(e)
    # The proxy failed the retest: remove it and try another one
    print('Proxy {ip} is no longer valid'.format(ip=useful_proxy))
    collection.delete_one({'_id': item['_id']})
    print('Removed from MongoDB')
    return get_random_ip()

Because an IP may stop working after sitting in the database for a while, I test it again before taking it out: if the test succeeds, the IP is returned; if not, it is removed from the database.

This way, whenever we need a proxy, we can pull one from the database.
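The pick, retest, and remove logic can also be written as a bounded loop, which avoids deep recursion when many stored proxies have died. The sketch below keeps the database and network out of the picture by taking callables. `pick_working_proxy`, `is_alive`, `on_dead`, and `max_attempts` are hypothetical names chosen for illustration, and the sample addresses are made up.

```python
import random

def pick_working_proxy(candidates, is_alive, on_dead, max_attempts=5):
    """Try random candidates until one passes is_alive; report failures via on_dead."""
    pool = list(candidates)
    for _ in range(max_attempts):
        if not pool:
            break
        choice = random.choice(pool)
        if is_alive(choice):
            return choice
        pool.remove(choice)  # do not retry a proxy that just failed
        on_dead(choice)      # e.g. delete it from MongoDB
    return None

removed = []
stored = ['192.0.2.10:8080', '192.0.2.11:8080', '192.0.2.12:8080']
alive = {'192.0.2.12:8080'}  # pretend only this one still works
print(pick_working_proxy(stored, lambda p: p in alive, removed.append))
```

In the real script, `is_alive` would wrap the `requests.get` retest and `on_dead` would delete the document from the collection.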

The complete code:

import random
import requests
import time
import pymongo
from bs4 import BeautifulSoup

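# URL of the proxy list to crawl; Xici Proxy is used here
url_ip = "http://www.xicidaili.com/nt/"

# Wait time setting, in seconds
set_timeout = 5

# Number of proxy pages to crawl; 2 means crawl 2 pages of IP addresses
num = 2

# Seconds to pause between page requests (passed to time.sleep)
count_time = 5

# Build the headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

# URL used to test an IP
url_for_test = 'http://httpbin.org/ip'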

def scrawl_xici_ip(num):
    '''
    Crawl proxy IP addresses; the proxy source is Xici Proxy.
    '''
    ip_list = []
    for num_page in range(1, num + 1):  # pages 1..num inclusive
        url = url_ip + str(num_page)
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            content = response.text
            soup = BeautifulSoup(content, 'lxml')
            trs = soup.find_all('tr')
            for tr in trs[1:]:  # skip the table header row
                tds = tr.find_all('td')
                if len(tds) < 3:
                    continue  # not a data row
                ip_item = tds[1].text.strip() + ':' + tds[2].text.strip()
                ip_list.append(ip_item)
            time.sleep(count_time)  # pause count_time seconds between pages
    return list(set(ip_list))  # remove possible duplicate IPs

def ip_test(url_for_test, ip_info):
    '''
    Test the crawled IPs; store each one that passes in MongoDB.
    '''
    for ip_for_test in ip_info:
        # Set up the proxy
        proxies = {
            'http': 'http://' + ip_for_test,
            'https': 'http://' + ip_for_test,
        }
        print(proxies)
        try:
            response = requests.get(url_for_test, headers=headers, proxies=proxies, timeout=10)
            if response.status_code == 200:
                ip = {'ip': ip_for_test}
                print(response.text)
                print('Test passed')
                write_to_MongoDB(ip)
        except Exception as e:
            # Dead or slow proxies raise here; report and move on
            print(e)
            continue

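def write_to_MongoDB(proxies):
    '''
    Store an IP that passed the test in MongoDB.
    '''
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.PROXY
    collection = db.proxies
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    result = collection.insert_one(proxies)
    print(result.inserted_id)
    print('Stored in MongoDB')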

def get_random_ip():
    '''
    Pick one IP at random, retest it, and return it if it still works.
    '''
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.PROXY
    collection = db.proxies
    items = list(collection.find())
    if not items:
        print('No proxies left in MongoDB')
        return None
    item = random.choice(items)
    useful_proxy = item['ip'].replace('\n', '')
    proxy = {
        'http': 'http://' + useful_proxy,
        'https': 'http://' + useful_proxy,
    }
    try:
        response = requests.get(url_for_test, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return useful_proxy
    except requests.exceptions.RequestException as e:
        print(e)
    # The proxy failed the retest: remove it and try another one
    print('Proxy {ip} is no longer valid'.format(ip=useful_proxy))
    collection.delete_one({'_id': item['_id']})
    print('Removed from MongoDB')
    return get_random_ip()

def main():
    ip_info = scrawl_xici_ip(num)
    ip_test(url_for_test, ip_info)
    finally_ip = get_random_ip()
    if finally_ip:
        print('Selected IP: ' + finally_ip)
    else:
        print('No usable proxy found')

if __name__ == '__main__':
    main()
