Batch-scraping free proxies with C# and verifying that they work

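A while ago I noticed that on a certain company's official website, an article's view count went up by one every time the page was refreshed. That does not leave a good impression: an official company site with such an obvious flaw. When I sent a batch of requests, the pages started returning errors. This is a company of 100+ people, and that is what refreshing an article on its official site gets you. The company had come to our school to recruit, and when I searched job postings on cnblogs I found it had once been hiring for Xamarin, which made me curious, so I kept an eye on it. Enough of that, it was just idle chatter. Back to the topic: CSDN article view counts can be refreshed by setting a proxy IP, so the first thing to do is the subject of this post, "verifying proxy IP validity with C#".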

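Of course, the proxy IPs come from free sources, so efficiency is mediocre. Proxy IPs scraped from free proxy listing pages are not necessarily usable, so we need to verify the ones we scrape. Their lifetime is also limited, anywhere from ten-odd seconds to one hour, and most are very short-lived. So if, say, we need 100 proxy IPs per minute, we fetch once a minute, 100 at a time (this is the ideal case, where every scraped proxy IP is valid). In principle, a proxy should be used immediately after it is scraped.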

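This post is fairly basic. I have always found crawlers interesting, but I am really a beginner in this area and this is just a simple record. If anything here is wrong, I would welcome suggestions. Once the following questions are answered, we can build the check for proxy IP validity.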

1. Which pages can free proxy IPs be scraped from?

http://www.xicidaili.com

http://www.ip3366.net

http://www.66ip.cn

Searching Baidu for 「免費代理ip」 (free proxy IP) turns up plenty more.

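2. Are the proxy IPs stable? What are they used for?

Free proxy IPs like these are neither long-lived nor reliable. For the three free proxy sites above, lifetimes range from roughly ten-odd seconds to one hour, so you generally have to verify them yourself before use to improve the hit rate. They can be used to hide your own IP from a website (some sites do not allow proxy IPs at all, Douban for example, which is a bit awkward; is the content really that precious?). Typical uses are guestbook comments, inflating site traffic, paid online tasks, batch account registration, and so on. As long as there are no other restrictions, they work wherever the IP has to change frequently.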

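3. Is an IP valid just because it can be pinged? How do you verify that a proxy works?

This is a bit of a truism, but testing the port is what actually works: being able to ping the host does not mean the proxy works, and failing to ping it does not necessarily mean the proxy is unusable. You can use HttpWebRequest or a Socket; of course, HttpWebRequest is slower than connecting to the proxy's ip and port with a Socket.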
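As a rough illustration of the Socket-style port test mentioned above, here is a minimal sketch using TcpClient. The names ProxyPortProbe and IsPortOpenAsync are made up for this example and are not part of the code in this post. Note that an open port only shows that something is listening; it does not prove the proxy actually forwards requests, so it is best treated as a cheap pre-filter before the HTTP check.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper: a quick TCP connect test against a proxy's ip:port.
public static class ProxyPortProbe
{
    // Returns true if a TCP connection to host:port is established within timeoutMs.
    public static async Task<bool> IsPortOpenAsync(string host, int port, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            try
            {
                Task connectTask = client.ConnectAsync(host, port);
                Task finished = await Task.WhenAny(connectTask, Task.Delay(timeoutMs));
                if (finished != connectTask)
                {
                    // Timed out: observe any later failure so it is not left unobserved.
                    _ = connectTask.ContinueWith(
                        t => { var ignored = t.Exception; },
                        TaskContinuationOptions.OnlyOnFaulted);
                    return false;
                }
                await connectTask; // rethrows a connection error, if any
                return client.Connected;
            }
            catch (SocketException)
            {
                // Connection refused, host unreachable, and similar errors.
                return false;
            }
        }
    }
}
```

A caller could run this first and only pass the addresses that return true on to the slower HttpWebRequest check.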

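4. How many proxies should be fetched at a time?

Proxy IPs are short-lived and not very reliable, so the only option is to fetch them in batches from proxy sites on a schedule. Some proxies also limit how often they can be used within a minute, so there are quite a few constraints.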

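5. What is the difference between an http proxy and an https proxy?

To visit an https site, such as Baidu, you need an https proxy; to visit an http site, an http proxy will do. This is not a 100% rule.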

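The steps for checking proxy IP validity are:

1. Use HttpWebRequest and HttpWebResponse to request the proxy listing pages and get the page content containing the proxies.

2. Use HtmlAgilityPack or a regular expression to extract the proxies from the scraped content and save them to a proxy collection.

3. Take the proxy collection and send http requests from multiple threads, for example to Baidu, to see whether they succeed; those that succeed are stored in Redis.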


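  • Sending requests with HttpWebRequest

Request.cs is shown below. It has two main methods. One verifies whether a proxy IP is valid: it sets the Proxy property of HttpWebRequest and requests Baidu. Most articles I have seen read the response body and treat the proxy as valid if the content matches the requested site, but in practice HttpStatusCode 200 is enough to decide. The other method sends a plain http request.

[Note] This is a console application and it uses async, so create it as a .NET Core project with C# language version 7.1. See "How to use async in a C# console program".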

using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

public class Request
{
    /// <summary>
    /// Verify that a proxy IP works
    /// </summary>
    /// <param name="proxyIp">Proxy IP</param>
    /// <param name="proxyPort">Proxy port</param>
    /// <param name="timeout">Request timeout (milliseconds)</param>
    /// <param name="url">URL to request</param>
    /// <param name="success">Callback on success</param>
    /// <param name="fail">Callback on failure</param>
    /// <returns></returns>
    public static async Task getAsync(string proxyIp, int proxyPort, int timeout, string url, Action success, Action<string> fail)
    {
        HttpWebRequest request = null;
        HttpWebResponse response = null;
        try
        {
            request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = timeout;
            request.KeepAlive = false;
            // Route the request through the proxy under test
            request.Proxy = new WebProxy(proxyIp, proxyPort);
            response = await request.GetResponseAsync() as HttpWebResponse;
            if (response.StatusCode == HttpStatusCode.OK)
            {
                success();
            }
            else
            {
                fail(response.StatusCode + ":" + response.StatusDescription);
            }
        }
        catch (Exception ex)
        {
            // Timeouts, refused connections and non-success status codes end up here
            fail("Request failed: " + ex.Message);
        }
        finally
        {
            if (request != null)
            {
                request.Abort();
                request = null;
            }
            if (response != null)
            {
                response.Close();
            }
        }
    }

    /// <summary>
    /// Send an http request
    /// </summary>
    /// <param name="url">URL to request</param>
    /// <param name="success">Callback on success, receives the response body</param>
    /// <param name="fail">Callback on failure</param>
    public static void get(string url, Action<string> success, Action<string> fail)
    {
        StreamReader reader = null;
        Stream stream = null;
        WebRequest request = null;
        HttpWebResponse response = null;
        try
        {
            request = WebRequest.Create(url);
            request.Timeout = 2000;
            response = (HttpWebResponse)request.GetResponse();
            if (response.StatusCode == HttpStatusCode.OK)
            {
                stream = response.GetResponseStream();
                reader = new StreamReader(stream);
                string result = reader.ReadToEnd();
                success(result);
            }
            else
            {
                fail(response.StatusCode + ":" + response.StatusDescription);
            }
        }
        catch (Exception ex)
        {
            fail(ex.ToString());
        }
        finally
        {
            if (reader != null)
                reader.Close();
            if (stream != null)
                stream.Close();
            if (response != null)
                response.Close();
            if (request != null)
                request.Abort();
        }
    }
}
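For comparison, the same "status code 200 is enough" check can be sketched with HttpClient instead of HttpWebRequest. This is an alternative and not the code used in this post; the names HttpClientProxyCheck and IsProxyUsableAsync are invented for the example. It asks only for the response headers, since the body is never inspected.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical alternative to Request.getAsync, built on HttpClient.
public static class HttpClientProxyCheck
{
    // Returns true when the URL answers 200 OK through the given proxy within timeoutMs.
    public static async Task<bool> IsProxyUsableAsync(string proxyIp, int proxyPort, string url, int timeoutMs)
    {
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy(proxyIp, proxyPort),
            UseProxy = true
        };
        // Disposing the client also disposes the handler.
        using (var client = new HttpClient(handler) { Timeout = TimeSpan.FromMilliseconds(timeoutMs) })
        {
            try
            {
                // Headers are enough to read the status code; the body is not downloaded.
                using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
                {
                    return response.StatusCode == HttpStatusCode.OK;
                }
            }
            catch (HttpRequestException)
            {
                return false; // proxy refused the connection, DNS failure, and so on
            }
            catch (TaskCanceledException)
            {
                return false; // the timeout elapsed
            }
        }
    }
}
```

Because each proxy needs its own handler, a new HttpClient is created per check here, which is acceptable for short-lived probes but not something to copy for ordinary request code.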
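  • Scraping free proxies and checking whether they work

ProxyIpHelper.cs has four main methods: CheckProxyIpAsync checks whether an ip is usable, GetXicidailiProxy scrapes proxies from xicidaili.com, GetIp3366Proxy scrapes proxies from ip3366.net, and Get66ipProxy scrapes proxies from 66ip.cn. To scrape more sites, just write more methods like these.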

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class ProxyIpHelper
{
    private static string address_xicidaili = "http://www.xicidaili.com/wn/{0}";
    private static string address_66ip = "http://www.66ip.cn/nmtq.php?getnum=20&isp=0&anonymoustype=0&start=&ports=&export=&ipaddress=&area=1&proxytype=1&api=66ip";
    private static string address_ip3366 = "http://www.ip3366.net/?stype=1&page={0}";

    /// <summary>
    /// Check whether a proxy IP is usable
    /// </summary>
    /// <param name="ipAddress">Proxy address in ip:port form</param>
    /// <param name="success">Callback on success</param>
    /// <param name="fail">Callback on failure</param>
    /// <returns></returns>
    public static async Task CheckProxyIpAsync(string ipAddress, Action success, Action<string> fail)
    {
        int index = ipAddress.IndexOf(":");
        string proxyIp = ipAddress.Substring(0, index);
        int proxyPort = int.Parse(ipAddress.Substring(index + 1));
        await Request.getAsync(proxyIp, proxyPort, 3000, "https://www.baidu.com/", success, fail);
    }

    /// <summary>
    /// Get proxy IPs from the xicidaili.com pages; supports paging
    /// </summary>
    /// <param name="page">Number of pages to fetch</param>
    /// <returns></returns>
    public static List<string> GetXicidailiProxy(int page)
    {
        List<string> list = new List<string>();
        for (int p = 1; p <= page; p++)
        {
            string url = string.Format(address_xicidaili, p);
            Request.get(url, (docText) =>
            {
                if (!string.IsNullOrWhiteSpace(docText))
                {
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(docText);
                    // Null-safe: the table may be missing if the page layout changes
                    var trNodes = doc.DocumentNode.SelectSingleNode("//table[@id='ip_list']")?.SelectNodes("./tr");
                    if (trNodes != null && trNodes.Count > 0)
                    {
                        // Skip row 0 (assumed to be the header row)
                        for (int i = 1; i < trNodes.Count; i++)
                        {
                            var tds = trNodes[i].SelectNodes("./td");
                            string ipAddress = tds[1].InnerText + ":" + int.Parse(tds[2].InnerText);
                            list.Add(ipAddress);
                        }
                    }
                }
            }, (error) =>
            {
                Console.WriteLine(error);
            });
        }
        return list;
    }

    /// <summary>
    /// Get proxy IPs from the ip3366.net pages; supports paging
    /// </summary>
    /// <param name="page">Number of pages to fetch</param>
    /// <returns></returns>
    public static List<string> GetIp3366Proxy(int page)
    {
        List<string> list = new List<string>();
        for (int p = 1; p <= page; p++)
        {
            string url = string.Format(address_ip3366, p);
            Request.get(url, (docText) =>
            {
                if (!string.IsNullOrWhiteSpace(docText))
                {
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(docText);
                    // Rows of the first table's tbody; null if the layout changes
                    var trNodes = doc.DocumentNode.SelectSingleNode("//table")?.SelectSingleNode(".//tbody")?.SelectNodes("./tr");
                    if (trNodes != null && trNodes.Count > 0)
                    {
                        for (int i = 1; i < trNodes.Count; i++)
                        {
                            var tds = trNodes[i].SelectNodes("./td");
                            // Keep only the proxies listed as HTTPS
                            if (tds[3].InnerHtml == "HTTPS")
                            {
                                string ipAddress = tds[0].InnerText + ":" + int.Parse(tds[1].InnerText);
                                list.Add(ipAddress);
                            }
                        }
                    }
                }
            }, (error) =>
            {
                Console.WriteLine(error);
            });
        }
        return list;
    }

    /// <summary>
    /// Get proxy IPs from 66ip.cn; no paging needed
    /// </summary>
    /// <returns></returns>
    public static List<string> Get66ipProxy()
    {
        List<string> list = new List<string>();
        Request.get(address_66ip,
        (docText) =>
        {
            int count = 0;
            if (string.IsNullOrWhiteSpace(docText) == false)
            {
                // Matches ip:port pairs such as 1.2.3.4:8080
                string regex = "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\:\\d{1,5}";
                Match mstr = Regex.Match(docText, regex);
                while (mstr.Success && count < 20)
                {
                    string tempIp = mstr.Groups[0].Value;
                    list.Add(tempIp);
                    mstr = mstr.NextMatch();
                    count++;
                }
            }
        },
        (error) =>
        {
            Console.WriteLine(error);
        });
        return list;
    }
}
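The regular expression in Get66ipProxy accepts any digits in the right shape, so something like 999.1.1.1:80 or a port above 65535 would also pass. A slightly stricter extraction could look like the sketch below. The names ProxyTextParser and ExtractEndpoints are hypothetical, and this is one possible way to tighten the match, not the code used above.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: pull well-formed ip:port pairs out of arbitrary text.
public static class ProxyTextParser
{
    private static readonly Regex EndpointPattern =
        new Regex(@"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b", RegexOptions.Compiled);

    // Returns at most 'max' distinct ip:port strings, in the order they appear.
    public static List<string> ExtractEndpoints(string text, int max)
    {
        var result = new List<string>();
        var seen = new HashSet<string>();
        foreach (Match m in EndpointPattern.Matches(text ?? string.Empty))
        {
            if (result.Count >= max) break;
            // IPAddress.TryParse rejects octets above 255.
            if (!IPAddress.TryParse(m.Groups[1].Value, out IPAddress ip)) continue;
            // Ports must fit the valid TCP range.
            if (!int.TryParse(m.Groups[2].Value, out int port) || port < 1 || port > 65535) continue;
            string endpoint = ip + ":" + port;
            if (seen.Add(endpoint)) result.Add(endpoint);
        }
        return result;
    }
}
```

Calling ProxyTextParser.ExtractEndpoints(docText, 20) would stand in for the while loop in Get66ipProxy, with duplicates and impossible addresses dropped before any network check is spent on them.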
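  • Fetching on a Timer, checking, and saving successes to Redis

C# has three kinds of timers. The one used here comes from the System.Threading namespace; this Timer runs its callback on a thread-pool thread rather than the main thread. Three Timer objects are defined, one for each of the three sites. Each run keeps the collection from the previous fetch; before checking, the new batch is compared against it and only the new, non-duplicate entries are checked. The share of valid IPs is low. I did not collect statistics here; with a bit more optimization the code could do that. Here is the program's entry point, the final implementation: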

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static bool timer_ip3366_isCompleted = true;
    static bool timer_xicidaili_isCompleted = true;
    static bool timer_66ip_isCompleted = true;
    static Timer timer_ip3366, timer_xicidaili, timer_66ip;
    // Proxies from the previous fetch; the next fetch is compared against them
    // and only the new entries are checked
    private static List<string> lastListip3366, lastList66ip, lastListxicidaili;

    static async Task Main(string[] args)
    {
        System.Net.ServicePointManager.DefaultConnectionLimit = 2000;
        Console.WriteLine("hello proxyIp");
        // Press Enter to start the timers
        Console.ReadLine();
        lastList66ip = new List<string>();
        lastListip3366 = new List<string>();
        lastListxicidaili = new List<string>();
        timer_ip3366 = new Timer(async (state) =>
        {
            await TimerIp3366Async();
        }, "processing timer_ip3366 event", 0, 1000 * 30);
        timer_xicidaili = new Timer(async (state) =>
        {
            await TimerXicidailiAsync();
        }, "processing timer_xicidaili event", 0, 1000 * 60);
        timer_66ip = new Timer(async (state) =>
        {
            await Timer66ipAsync();
        }, "processing timer_66ip event", 0, 1000 * 30);

        // Press Enter again to exit
        Console.ReadLine();
    }

    private static async Task Timer66ipAsync()
    {
        if (timer_66ip_isCompleted)
        {
            timer_66ip_isCompleted = false;
            List<string> checkList = new List<string>();
            var listProxyIp = ProxyIpHelper.Get66ipProxy();

            if (listProxyIp.Count > 0)
            {
                Console.ForegroundColor = ConsoleColor.DarkCyan;
                Console.WriteLine("66ip.cn: fetched " + listProxyIp.Count + " records, comparing with the previous batch...");
                listProxyIp.ForEach(f =>
                {
                    if (!lastList66ip.Contains(f))
                    {
                        checkList.Add(f);
                    }
                });
                lastList66ip = listProxyIp;
                if (checkList.Count > 0)
                {
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("66ip.cn: " + checkList.Count + " records to check, verifying...");
                    for (int i = 0; i < checkList.Count; i++)
                    {
                        string ipAddress = checkList[i];
                        await ProxyIpHelper.CheckProxyIpAsync(ipAddress, () =>
                        {
                            bool insertSuccess = RedisHelper.InsertSet(ipAddress);
                            Console.ForegroundColor = ConsoleColor.White;
                            Console.WriteLine("66ip.cn");
                            if (insertSuccess)
                            {
                                Console.WriteLine("success " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                            }
                            else
                            {
                                // The original printed this line unconditionally; it belongs in the else branch
                                Console.WriteLine("duplicate insert " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                            }
                        }, (error) =>
                        {
                            Console.ForegroundColor = ConsoleColor.Green;
                            Console.WriteLine("66ip.cn");
                            Console.WriteLine("error: " + ipAddress + " " + error + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                        });
                    }
                    timer_66ip_isCompleted = true;
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("66ip.cn: " + checkList.Count + " records checked, waiting for the next round");
                }
                else
                {
                    timer_66ip_isCompleted = true;
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("66ip.cn: no new proxy IPs to check");
                }
            }
            else
            {
                timer_66ip_isCompleted = true;
                Console.ForegroundColor = ConsoleColor.DarkCyan;
                Console.WriteLine("66ip.cn: no proxy IPs fetched");
            }
        }
    }

    private static async Task TimerXicidailiAsync()
    {
        if (timer_xicidaili_isCompleted)
        {
            // Pick out the addresses that need checking: if the first fetch returns 100 records,
            // checkList holds all 100; if only 10 of the next 100 are new,
            // only those 10 are checked the second time
            timer_xicidaili_isCompleted = false;
            List<string> checkList = new List<string>();
            var listProxyIp = ProxyIpHelper.GetXicidailiProxy(1);
            if (listProxyIp.Count > 0)
            {
                Console.WriteLine("xicidaili.com: fetched " + listProxyIp.Count + " records, comparing with the previous batch...");
                listProxyIp.ForEach(f =>
                {
                    if (!lastListxicidaili.Contains(f))
                    {
                        checkList.Add(f);
                    }
                });
                lastListxicidaili = listProxyIp;
                if (checkList.Count > 0)
                {
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("xicidaili.com: " + checkList.Count + " records to check, verifying...");
                    for (int i = 0; i < checkList.Count; i++)
                    {
                        string ipAddress = checkList[i];
                        await ProxyIpHelper.CheckProxyIpAsync(ipAddress, () =>
                        {
                            bool insertSuccess = RedisHelper.InsertSet(ipAddress);
                            Console.ForegroundColor = ConsoleColor.White;
                            Console.WriteLine("xicidaili.com");
                            if (insertSuccess)
                            {
                                Console.WriteLine("success " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                            }
                            else
                                Console.WriteLine("duplicate insert " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                        }, (error) =>
                        {
                            Console.ForegroundColor = ConsoleColor.Red;
                            Console.WriteLine("xicidaili.com");
                            Console.WriteLine("error: " + ipAddress + " " + error + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                        });
                    }
                    timer_xicidaili_isCompleted = true;
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("xicidaili.com: " + checkList.Count + " records checked, waiting for the next round");
                }
                else
                {
                    timer_xicidaili_isCompleted = true;
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("xicidaili.com: no new proxy IPs to check");
                }
            }
            else
            {
                timer_xicidaili_isCompleted = true;
                Console.ForegroundColor = ConsoleColor.DarkCyan;
                Console.WriteLine("xicidaili.com: no proxy IPs fetched");
            }
        }
    }

    private static async Task TimerIp3366Async()
    {
        if (timer_ip3366_isCompleted)
        {
            timer_ip3366_isCompleted = false;
            List<string> checkList = new List<string>();
            var listProxyIp = ProxyIpHelper.GetIp3366Proxy(4);
            if (listProxyIp.Count > 0)
            {
                Console.ForegroundColor = ConsoleColor.DarkCyan;
                Console.WriteLine("ip3366.net: fetched " + listProxyIp.Count + " records, comparing with the previous batch...");
                listProxyIp.ForEach(f =>
                {
                    if (!lastListip3366.Contains(f))
                    {
                        checkList.Add(f);
                    }
                });
                lastListip3366 = listProxyIp;
                if (checkList.Count != 0)
                {
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("ip3366.net: " + checkList.Count + " records to check, verifying...");
                    for (int i = 0; i < checkList.Count; i++)
                    {
                        string ipAddress = checkList[i];
                        await ProxyIpHelper.CheckProxyIpAsync(ipAddress, () =>
                        {
                            bool insertSuccess = RedisHelper.InsertSet(ipAddress);
                            Console.ForegroundColor = ConsoleColor.White;
                            Console.WriteLine("ip3366.net");
                            if (insertSuccess)
                            {
                                Console.WriteLine("success " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                            }
                            else
                            {
                                Console.ForegroundColor = ConsoleColor.Red;
                                Console.WriteLine("duplicate insert " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                            }
                        }, (error) =>
                        {
                            Console.ForegroundColor = ConsoleColor.Yellow;
                            Console.WriteLine("ip3366.net");
                            Console.WriteLine("error " + ipAddress + " task #" + i + " thread " + Thread.CurrentThread.ManagedThreadId);
                        });
                    }
                    timer_ip3366_isCompleted = true;
                    Console.WriteLine("ip3366.net: " + checkList.Count + " records checked, waiting for the next round");
                }
                else
                {
                    timer_ip3366_isCompleted = true;
                    Console.ForegroundColor = ConsoleColor.DarkCyan;
                    Console.WriteLine("ip3366.net: no new proxy IPs to check");
                }
            }
            else
            {
                timer_ip3366_isCompleted = true;
                Console.ForegroundColor = ConsoleColor.DarkCyan;
                Console.WriteLine("ip3366.net: no proxy IPs fetched");
            }
        }
    }
}
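The program above awaits each proxy check one after another inside the for loop, so a batch of slow proxies is checked sequentially. Two of its steps, picking out the entries that were not in the previous batch and then checking them, could be sketched as below with a bounded number of checks running at once. The names ProxyBatch, NewSince and FilterAsync are made up for this illustration, and the check function is supplied by the caller; this is a possible variation, not the implementation used in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helpers for one timer round: diff against the last batch, then check concurrently.
public static class ProxyBatch
{
    // Entries of 'current' that were not in 'previous' (duplicates collapsed, order kept).
    public static List<string> NewSince(IEnumerable<string> previous, IEnumerable<string> current)
    {
        return current.Except(previous).ToList();
    }

    // Runs 'check' on every candidate with at most maxParallel checks in flight and
    // returns the candidates for which it returned true, in input order.
    public static async Task<List<string>> FilterAsync(
        IEnumerable<string> candidates, Func<string, Task<bool>> check, int maxParallel)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            List<Task<string>> tasks = candidates.Select(async candidate =>
            {
                await gate.WaitAsync();
                try
                {
                    // Keep the candidate only when the check succeeds.
                    return (await check(candidate)) ? candidate : null;
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            string[] results = await Task.WhenAll(tasks);
            return results.Where(r => r != null).ToList();
        }
    }
}
```

A caller would pass a function that wraps the proxy check and returns true on success, then write the returned addresses to Redis. If the check function throws, Task.WhenAll rethrows, so in this sketch the function is expected to report failure by returning false.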

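For the third-party Redis library I use StackExchange.Redis, from the Stack Overflow team. A proxy IP must not be stored twice, so the data structure is a Set. The stored value is very simple, just the ip plus the port. More information could be stored alongside it, but that does not seem necessary: even with the extra information, it would hardly be useful. RedisHelper.cs is shown below: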

using StackExchange.Redis;

public class RedisHelper
{
    private static readonly object Locker = new object();
    private static ConnectionMultiplexer _redis;
    private const string CONNECTTIONSTRING = "127.0.0.1:6379,DefaultDatabase=3";
    public const string REDIS_SET_KET_SUCCESS = "set_success_ip";

    // Shared connection, created lazily on first use
    private static ConnectionMultiplexer Manager
    {
        get
        {
            if (_redis == null)
            {
                lock (Locker)
                {
                    if (_redis != null) return _redis;
                    _redis = GetManager();
                    return _redis;
                }
            }
            return _redis;
        }
    }

    private static ConnectionMultiplexer GetManager(string connectionString = null)
    {
        if (string.IsNullOrEmpty(connectionString))
        {
            connectionString = CONNECTTIONSTRING;
        }
        return ConnectionMultiplexer.Connect(connectionString);
    }

    // Adds the value to the set; returns false if it was already present
    public static bool InsertSet(string value)
    {
        var db = Manager.GetDatabase();
        return db.SetAdd(REDIS_SET_KET_SUCCESS, value);
    }
}

 

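Summary

I will follow up tomorrow with the post on refreshing page view counts. The code is not good enough yet, the share of valid IPs is still low, and I am not yet very fluent with multithreading.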

--------

Update, July 6: refreshing CSDN article view counts is done. See "Using proxy IPs in C# to refresh CSDN article view counts".

 
