An out-of-the-box high-anonymity proxy crawler

golang-proxy v3.0


golang-proxy is an out-of-the-box high-anonymity proxy crawler, and it is language-agnostic.
Project URL: https://github.com/storyicon/golang-proxy

golang-proxy

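Chinese Document

Golang-Proxy is a simple and efficient free proxy crawler. It crawls free proxies published on the web to maintain its own pool of high-anonymity proxies, for use in web crawlers, resource downloads, and similar tasks.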

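What's new in v3.0

  1. Still provides a highly flexible API. After starting the main program, you can open localhost:9999/all or localhost:9999/random in a browser to get the crawled proxies directly. You can even use localhost:9999/sql?query= to run simple SQL statements and define your own proxy filtering rules.
  2. Still provides out-of-the-box builds for Windows, Linux, and Mac
    Download Release v3.0
  3. Supports automatic detection of the proxy type: schemeType tells you how far a proxy supports http and https
  4. Supports MySQL; for details, see Config
  5. Supports starting services individually. When launching the compiled binary, use -mode= to choose which of producer/consumer/assessor/service to start on its own
  6. Redesigned the data tables. Note that this means the API has changed
  7. Redesigned the data structure of sources and removed fields such as filter. Note that sources written for v2.0 may cause problems when given directly to v3.0
  8. Updated some sources
  9. The -source startup parameter is no longer supported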

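How to use golang-proxy

1. Use the out-of-the-box version

The Release page provides archives for each system environment; extract one and run it.

Download the out-of-the-box version: Download Release v3.0

After downloading, extract the binary and the source directory from the archive into the same location and start the binary. The program starts the following services:

  1. producer : periodically crawls the sources defined in the source directory and writes the crawled proxies to the crude_proxy table
  2. consumer : periodically reads a number of proxies from crude_proxy, determines their proxy type and availability, and writes them to the proxy table
  3. assessor : periodically reads a number of proxies from the proxy table and evaluates their quality
  4. service : the HTTP API of golang-proxy, which lets you filter and fetch the proxies in the crude_proxy and proxy tables through three endpoints: localhost:9999/all, localhost:9999/random, and localhost:9999/sql?query=

When you start the compiled binary, these services start one after another by default. In v3.0, you can add the -mode startup parameter to start a single service, for example: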

golang-proxy -mode=service

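Run this way, only the service module starts. Once service is running, you can visit the following endpoints in a browser to get proxies:

url                                      description
localhost:9999/all                       Get all crawled proxies in the proxy table
localhost:9999/all?table=proxy           Get all crawled proxies in the proxy table
localhost:9999/all?table=crude_proxy     Get all crawled proxies in the crude_proxy table
localhost:9999/random                    Get one random proxy from the proxy table
localhost:9999/random?table=proxy        Get one random proxy from the proxy table
localhost:9999/random?table=crude_proxy  Get one random proxy from the crude_proxy table
localhost:9999/sql?query=                Append a SQL statement after query= to get its result; only fairly simple queries are supported

Note that crude_proxy is only a temporary storage table for crawled proxies, and their quality is not guaranteed. Proxies in the proxy table are continually evaluated by assessor. The score field in the proxy table gives a fairly complete picture of a proxy's quality, and proxies whose quality falls too low are deleted.

API example: localhost:9999/sql

For example, visiting localhost:9999/sql?query=SELECT * FROM PROXY WHERE SCORE > 5 ORDER BY SCORE DESC returns all proxies in the proxy table with a score greater than 5, sorted by score from high to low.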

{
    "error": "",
    "message": [
        {
            "id": 2,
            "ip": "45.113.69.177",
            "port": "1080",
            // scheme_type takes one of the following values:
            // 0: the proxy supports http only
            // 1: the proxy supports https only
            // 2: the proxy supports both http and https
            "scheme_type": 0,
            "content": "45.113.69.177:1080",
            // number of evaluations
            "assess_times": 9,
            // number of successful evaluations; success_times/assess_times gives the connection success rate
            "success_times": 9,
            // average response time
            "avg_response_time": 0.098,
            // number of consecutive failures
            "continuous_failed_times": 0,
            // score; proxies scoring above 5 are recommended
            "score": 68.45106053570785,
            "insert_time": 1540793312,
            "update_time": 1540797880
        }
    ]
}
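
A client written in Go could decode this response roughly as follows. This is only a sketch: the Proxy and Response types, parseResponse, and SuccessRate are hypothetical names chosen for illustration, the field types are inferred from the sample above, and the real response carries no comments (they are annotations in this document only).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Proxy mirrors one element of the "message" array in the sample response.
// The Go types are inferred from that sample, not taken from the project.
type Proxy struct {
	ID                    int     `json:"id"`
	IP                    string  `json:"ip"`
	Port                  string  `json:"port"`
	SchemeType            int     `json:"scheme_type"`
	Content               string  `json:"content"`
	AssessTimes           int     `json:"assess_times"`
	SuccessTimes          int     `json:"success_times"`
	AvgResponseTime       float64 `json:"avg_response_time"`
	ContinuousFailedTimes int     `json:"continuous_failed_times"`
	Score                 float64 `json:"score"`
	InsertTime            int64   `json:"insert_time"`
	UpdateTime            int64   `json:"update_time"`
}

// Response is the assumed envelope: an error string plus a list of proxies.
type Response struct {
	Error   string  `json:"error"`
	Message []Proxy `json:"message"`
}

// SuccessRate returns success_times/assess_times, or 0 if never assessed.
func (p Proxy) SuccessRate() float64 {
	if p.AssessTimes == 0 {
		return 0
	}
	return float64(p.SuccessTimes) / float64(p.AssessTimes)
}

// parseResponse decodes a raw response body into a Response.
func parseResponse(body []byte) (Response, error) {
	var r Response
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	sample := []byte(`{"error":"","message":[{"id":2,"ip":"45.113.69.177",
		"port":"1080","scheme_type":0,"content":"45.113.69.177:1080",
		"assess_times":9,"success_times":9,"avg_response_time":0.098,
		"continuous_failed_times":0,"score":68.45106053570785,
		"insert_time":1540793312,"update_time":1540797880}]}`)
	r, err := parseResponse(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, p := range r.Message {
		fmt.Printf("%s score=%.2f success=%.0f%%\n", p.Content, p.Score, p.SuccessRate()*100)
	}
}
```

If the service returns a differently shaped message when the error field is non-empty, the decoding step would need to handle that case separately.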

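2. Build from source

go get -u github.com/storyicon/golang-proxy

Enter the golang-proxy directory, run go build main.go, and then run the generated binary.

Note:

The ./source folder in the project root is required at run time; it stores the definitions of the website sources. All other folders contain only project source code. So after building the main binary, you can move it together with the source folder to any location, and you can rename the binary freely.

Why use Golang-Proxy

  1. Stable and fast.
    The crawl module can reach 1000 pages per second with single-core concurrency.
  2. Highly configurable and extensible.
    You do not need to write any code; filling in a configuration file takes a minute or two and adds a new website source.
  3. Evaluation.
    The Assessor module periodically tests proxy quality and computes a composite score from independent factors such as test success rate, anonymity, number of tests, volatility, and response speed. The algorithm is highly configurable, and the weight of each factor can be adjusted independently to suit your project.
  4. A highly flexible API. After starting the main program, you can open localhost:9999/all or localhost:9999/random in a browser to get the crawled proxies directly. You can even use localhost:9999/sql?query= to run SQL statements and define your own proxy filtering rules.
  5. No dependency on any database server: one download and it works out of the box.

How to configure a new source

Every yml file under ./source/ is a source. You can add sources, make the program ignore a source by prefixing its file name with a ., or delete the file to remove the source permanently. The Source parameters are described below: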

# Page settings
page:
    entry: "https://xxx/1.html"
    template: "https://xxx/{page}.html"
    from: 2
    to: 10
# producer first crawls entry, i.e. https://xxx/1.html
# and then, based on template, from and to, crawls in order:
#   https://xxx/2.html
#   https://xxx/3.html
#   https://xxx/4.html
#   ...
#   https://xxx/10.html
# Selector settings
selector:
    iterator: ".table tbody tr"
    ip: "td:nth-child(1)"
    port: "td:nth-child(2)"
# The settings above crawl an HTML structure like this:
# <table class="table">
#     <tbody>
#         <tr>
#             <td>187.3.0.1</td>
#             <td>8080</td>
#             <td>HTTP</td>
#         </tr>
#         <tr>
#             <td>164.23.1.2</td>
#             <td>80</td>
#             <td>HTTPS</td>
#         </tr>
#         <tr>
#             <td>131.9.2.3</td>
#             <td>8080</td>
#             <td>HTTP</td>
#         </tr>
#     </tbody>
# </table>
# The selectors are ordinary jQuery-style selectors. iterator selects the
# repeated element, for example the rows of a table with one proxy per row.
# ip and port are then looked up as child elements of each iterator match.
category:
    # degree of parallelism
    parallelnumber: 1
    # for this source, after each page is crawled,
    # wait a random 5~20s before crawling the next page
    delayRange: [5, 20]
    # how often this source is run
    # @every 10s , @every 10h...
    interval: "@every 10m"
debug: true
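
The page settings describe a simple expansion rule: the entry URL first, then the template with {page} replaced by every number from from to to. The Go sketch below illustrates that rule only; expandPages is a hypothetical helper written for this document and is not taken from the golang-proxy source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandPages returns the entry URL followed by the template URLs for
// pages from..to (inclusive), mirroring the order described in the comments.
func expandPages(entry, template string, from, to int) []string {
	urls := []string{entry}
	for p := from; p <= to; p++ {
		urls = append(urls, strings.ReplaceAll(template, "{page}", strconv.Itoa(p)))
	}
	return urls
}

func main() {
	// Values copied from the example source definition.
	for _, u := range expandPages("https://xxx/1.html", "https://xxx/{page}.html", 2, 10) {
		fmt.Println(u)
	}
}
```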

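Request for comments

  1. Open an issue for any problem you run into
  2. If you find a good new source, you are welcome to submit it and share
  3. Since you're here, leave a Star before you go : )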

English Document

Golang-proxy is an efficient free proxy crawler that ensures the captured proxies are highly anonymous while also guaranteeing their quality. You can use these captured proxies to download network resources while keeping your own identity private.

1. Features

  • Very fast crawling: the crawler can download 1000 pages per second.
  • The proxy sources are customizable, and the configuration file is extremely simple.
  • A compiled version is provided; it ships with a SQLite database and also supports MySQL.
  • Comes with an API, so every feature is available out of the box.
  • A proxy evaluation system maintains the quality of the proxy pool.

2. How to use

golang-proxy provides compiled binaries, so you do not need Go on the machine. Download the compressed binary package from the Release page.
Choose the package for your system type, unzip it, and run it. After a few minutes, you can open localhost:9999/all in a browser to see the crawled proxies.

Before I go into the detailed introduction of golang-proxy, I think it's best to tell you the most useful information first.

API interface

After you start the binary, you can access the following interfaces in a browser to get proxies.

url                                      description
localhost:9999/all                       Get all highly available proxies
localhost:9999/all?table=proxy           Get all highly available proxies
localhost:9999/random                    Get one random highly available proxy
localhost:9999/all?table=crude_proxy     Get the proxies in the temporary table (their quality cannot be guaranteed)
localhost:9999/random?table=crude_proxy  Get one random proxy from the temporary table (its quality cannot be guaranteed)
localhost:9999/sql?query=                Write the SQL statement you want to execute after query= to define your own filter rules
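
A browser encodes the statement after query= for you, but a program calling the sql interface normally has to percent-encode it itself, because SQL contains spaces and symbols such as * and >. The Go sketch below shows one way to build such a URL; sqlURL is a hypothetical helper, and the base address simply reuses localhost:9999 from the table.

```go
package main

import (
	"fmt"
	"net/url"
)

// sqlURL builds the /sql endpoint URL with the statement encoded as the
// "query" parameter.
func sqlURL(base, query string) string {
	v := url.Values{}
	v.Set("query", query)
	return base + "/sql?" + v.Encode()
}

func main() {
	stmt := "SELECT * FROM proxy WHERE score > 5 ORDER BY score DESC"
	fmt.Println(sqlURL("http://localhost:9999", stmt))
}
```

The resulting string can be passed to http.Get or any other HTTP client.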

With the above, you can already use about 50% of golang-proxy's functionality. The last interface, however, lets you execute custom SQL statements, and for that you need to know at least the structure of the tables, which is described below.

3. Advanced

golang-proxy consists of the following parts:

  • two data tables
  • one configuration file
  • one source folder
  • four modules

two data tables

1. Table crude_proxy

The crude_proxy table stores temporary proxies. It is defined as follows.

field        type    example          description
id           int     -                -
ip           string  192.168.0.1      -
port         string  255              -
content      string  192.168.0.1:255  -
insert_time  int     1540798717       -
update_time  int     1540798717       -

The crude_proxy table stores proxies exactly as they were crawled, so their quality cannot be guaranteed.

2. Table proxy

When a proxy in the crude_proxy table passes the pre-assessment (which roughly verifies that the proxy is available and tests its support for https and http), it enters the proxy table.

field                    type    example          description
id                       int     -                -
ip                       string  192.168.0.1      -
port                     string  255              -
scheme_type              int     2                How far the proxy supports http and https. 0: http only, 1: https only, 2: both https and http
content                  string  192.168.0.1:255  -
assess_times             int     5                Number of times the proxy has been evaluated
success_times            int     5                Number of times the proxy passed the evaluation
avg_response_time        float   0.001            -
continuous_failed_times  int     0                Number of consecutive failures during evaluation
score                    float   25               The higher the better
insert_time              int     1540798717       -
update_time              int     1540798717       -

The proxies in the proxy table are evaluated periodically and their scores are updated. Proxies with low scores are deleted.
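
As an illustration of how a row from the proxy table might be consumed, the Go sketch below builds an http.Client that sends its requests through one proxy. It makes several assumptions: the proxy is an HTTP proxy reachable as http://<content>, and the constant names, supportsHTTPS, and proxyClient are all hypothetical names that do not come from golang-proxy.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// scheme_type values as described in the table above.
const (
	SchemeHTTP  = 0 // http only
	SchemeHTTPS = 1 // https only
	SchemeBoth  = 2 // http and https
)

// supportsHTTPS reports whether a scheme_type value covers https targets.
func supportsHTTPS(schemeType int) bool {
	return schemeType == SchemeHTTPS || schemeType == SchemeBoth
}

// proxyClient returns an http.Client that routes requests through the proxy
// named by a content value such as "192.168.0.1:255".
// Assumption: the proxy speaks plain HTTP proxying.
func proxyClient(content string, timeout time.Duration) (*http.Client, error) {
	u, err := url.Parse("http://" + content)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(u)},
		Timeout:   timeout,
	}, nil
}

func main() {
	c, err := proxyClient("192.168.0.1:255", 10*time.Second)
	if err != nil {
		fmt.Println("bad proxy address:", err)
		return
	}
	fmt.Println("client timeout:", c.Timeout, "https ok:", supportsHTTPS(SchemeBoth))
}
```

Checking scheme_type before choosing a proxy avoids sending https requests through a proxy that was only verified for http.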

one configuration file

For convenience, the proxy in golang-proxy is stored in the portable database sqlite by default. You can make golang-proxy use the mysql database by adding the config.yml file in the executable directory.

For details, see Config page.

one source folder

golang-proxy needs source to define its crawling contents and rules. Therefore, the run directory of golang-proxy needs at least one source folder, and the source folder should have at least one source in yml format.
The source is defined as follows:

page: 
    entry: "http://www.xxx.com/http/?page=1"
    template: "http://www.xxx.com/http/?page={page}"
    from: 1
    to: 2000
selector:
    iterator: ".list .item"
    ip: ".ip"
    port: ".port"
category:
    parallelnumber: 3
    delayRange: [10, 30]
    interval: "@every 10m"
debug: true

In the definition above, producer will first crawl the entry page, then crawl:

http://www.xxx.com/http/?page=1      
http://www.xxx.com/http/?page=2      
http://www.xxx.com/http/?page=3      
...      
http://www.xxx.com/http/?page=2000

This source definition expects pages in the following format:

<html>
    ...
    <div class="list">
        <div class="item">
            <div class="ip"> 127.0.0.1 </div>
            <div class="port"> 80 </div>
            ...
        </div>
        <div class="item">
            <div class="ip"> 125.4.0.1 </div>
            <div class="port"> 8080 </div>
            ...
        </div>
        ...
    </div>
    ...
</html>

When producer parses a single page, it always traverses the nodes defined by iterator first, and then gets the elements defined by ip and port selectors from these nodes. The source definition above is still valid for the following HTML structure.

<html>
    ...
    <div class="list">
        <div class="item">
            <div class="ip"> 127.0.0.1:80 </div>
        </div>
        <div class="item">
            <div class="ip"> 125.4.0.1:8080</div>
        </div>
        ...
    </div>
    ...
</html>

This works because, when the port selector yields no content, producer tries to parse the port from the text selected by the ip selector.

Once a source definition is stored in the source folder in yml format, it is complete. golang-proxy reads it the next time it starts and adds it to the crawl list.
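
The fallback for a missing port can be sketched in Go as below. This is an illustration of the idea rather than the project's actual parsing code; resolveAddr is a hypothetical helper, and it assumes the ip selector text has the form ip:port when no separate port is found.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// resolveAddr returns the ip and port for one iterator node. ipText and
// portText are the raw texts matched by the ip and port selectors.
func resolveAddr(ipText, portText string) (ip, port string, err error) {
	ip = strings.TrimSpace(ipText)
	port = strings.TrimSpace(portText)
	if port != "" {
		return ip, port, nil
	}
	// Fallback: the port selector matched nothing, so split "ip:port".
	return net.SplitHostPort(ip)
}

func main() {
	ip, port, err := resolveAddr(" 127.0.0.1 ", " 80 ")
	fmt.Println(ip, port, err)

	ip, port, err = resolveAddr(" 125.4.0.1:8080", "")
	fmt.Println(ip, port, err)
}
```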

If a source file name starts with a . , the source will not be read.
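
The rule for ignoring sources can be expressed as a small file filter. The Go sketch below is only an illustration: isActiveSource and activeSources are hypothetical helpers, and whether golang-proxy also accepts other extensions such as .yaml is not stated here, so the sketch accepts .yml only.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isActiveSource reports whether a file name counts as an enabled source:
// a .yml file whose name does not start with ".".
func isActiveSource(name string) bool {
	return !strings.HasPrefix(name, ".") && filepath.Ext(name) == ".yml"
}

// activeSources lists the enabled source files directly inside dir.
func activeSources(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, e := range entries {
		if !e.IsDir() && isActiveSource(e.Name()) {
			out = append(out, filepath.Join(dir, e.Name()))
		}
	}
	return out, nil
}

func main() {
	files, err := activeSources("./source")
	if err != nil {
		fmt.Println("cannot read source folder:", err)
		return
	}
	fmt.Println(files)
}
```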

four modules

golang-proxy consists of four modules that work together to crawl, verify, evaluate, and serve proxies.

module name  description
producer     Periodically fetches the sources defined in the source directory and writes the fetched proxies to the crude_proxy table.
consumer     Periodically reads a number of proxies from crude_proxy, determines their scheme type and availability, and writes them to the proxy table.
assessor     Periodically reads a number of proxies from the proxy table and evaluates their quality.
service      Serves the HTTP API of golang-proxy, which lets you filter and fetch the proxies in the crude_proxy and proxy tables through localhost:9999/all, localhost:9999/random, and localhost:9999/sql.

When you start the golang-proxy executable, these modules start in turn. You can also add the -mode startup parameter after the executable name to make golang-proxy start only one module, as shown below:

golang-proxy -mode=service

This will only start the HTTP API interface service.

At this point, you have mastered about 95% of golang-proxy's functionality. To learn more, you can read the source code linked above and improve it.

Request for comments

You are welcome to submit issues. If golang-proxy helps you, please give it a star or watch it. Thanks!
