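A while ago I noticed that the view counter on a company's official site went up by one every time the page was refreshed. It left a poor impression: an official site with such an obvious flaw, and when I sent requests in bulk the pages started returning errors. This is a company of 100-plus people. It had come to our school to recruit, and when I searched for job posts on cnblogs I found it had been hiring for Xamarin, so I was curious and kept an eye on it. Enough digression. Back to the topic: the view count of CSDN articles can be raised by refreshing through proxy IPs, so the first step, and the subject of this post, is "validating proxy IPs with C#".
Naturally the proxy IPs come from free sources, so the quality is mediocre. Proxies scraped from free listing pages are not all usable, so each one has to be validated. Their lifetime is also limited, anywhere from ten-odd seconds to an hour, and for most of them it is very short. So if we need, say, 100 proxies per minute, we fetch once a minute, 100 at a time (this is the ideal case, where every scraped proxy works). In principle a proxy should be used immediately after it is fetched.
This post is fairly basic. I have always found crawlers fun, but I am a beginner at them and this is only a simple record; if you spot mistakes, suggestions are welcome. Working through the following questions is enough to build the proxy validity check.
Searching Baidu for 「免費代理ip」 ("free proxy IP") turns up plenty.
Free proxies like these are weak on both freshness and reliability. For the three free proxy sites used in this post (xicidaili.com, ip3366.net and 66ip.cn), a proxy stays alive for roughly ten-odd seconds to an hour, so you generally have to validate proxies yourself before use to improve the hit rate. They can be used to hide your IP from a website (some sites, Douban for example, do not allow proxy IPs at all, which is a bit awkward; is the content really that precious?), and are commonly used for guestbook posting, inflating site traffic, paid online tasks, bulk account registration and so on. As long as there are no other restrictions, they fit any job that needs frequent IP changes.
This may sound obvious, but testing the port is the most effective check. Being able to ping a host does not mean the proxy works, and failing to ping it does not necessarily mean the proxy is unusable. You can use HttpWebRequest or a Socket; HttpWebRequest is of course slower than connecting to the proxy's IP and port with a Socket.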
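A minimal sketch of the port test, assuming a plain TCP connect with a timeout is good enough as a first filter. PortProbe and CanConnectAsync are names made up for this example and are not part of the project in this post. An open port only shows that something is listening; it does not prove that a working proxy is behind it.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class PortProbe
{
    // Returns true if a TCP connection to host:port is established within timeoutMs.
    // Hypothetical helper for illustration only.
    public static async Task<bool> CanConnectAsync(string host, int port, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            try
            {
                Task connect = client.ConnectAsync(host, port);
                Task finished = await Task.WhenAny(connect, Task.Delay(timeoutMs));
                if (finished != connect)
                {
                    // Timed out: observe any later failure of the abandoned attempt, then give up.
                    _ = connect.ContinueWith(t => t.Exception, TaskContinuationOptions.OnlyOnFaulted);
                    return false;
                }
                await connect; // rethrows the connection failure, if there was one
                return client.Connected;
            }
            catch (SocketException)
            {
                return false; // refused, unreachable, or the host name did not resolve
            }
        }
    }
}
```

A call such as `await PortProbe.CanConnectAsync("127.0.0.1", 8080, 2000)` could be used to discard dead entries cheaply before the slower HTTP check.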
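Proxies are short-lived and unreliable, so the only option is to fetch them in batches on a timer from proxy listing sites. Some proxies also limit how often they can be used within a minute, so there are quite a few constraints.
To visit an https site, Baidu for example, you need an https proxy; for an http site an http proxy will do. This is not a 100% rule.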
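The steps for checking proxy validity are:
1. Use HttpWebRequest and HttpWebResponse to request the proxy listing pages and get the page content that contains the proxies.
2. Use HtmlAgilityPack or regular expressions to extract the proxies from that content and save them to a list.
3. Take the list and issue http requests from multiple threads, for example to Baidu, and see whether each succeeds; proxies that succeed are stored in Redis.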
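Step 3 can be sketched as a bounded-concurrency filter. This is only an illustration under assumptions: ProxyFilter and KeepWorkingAsync are hypothetical names, and the `check` delegate stands in for whatever validity test is used (it is expected to return false on failure rather than throw).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ProxyFilter
{
    // Runs `check` on every candidate with at most maxConcurrency checks in flight
    // and returns the candidates that passed, in their original order.
    public static async Task<List<string>> KeepWorkingAsync(
        IEnumerable<string> candidates,
        Func<string, Task<bool>> check,
        int maxConcurrency)
    {
        List<string> list = candidates.ToList();
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            List<Task<bool>> tasks = list.Select(async candidate =>
            {
                await gate.WaitAsync(); // wait for a free slot
                try
                {
                    return await check(candidate);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            bool[] passed = await Task.WhenAll(tasks);
            return list.Where((candidate, index) => passed[index]).ToList();
        }
    }
}
```

The semaphore caps the number of simultaneous requests, which matters because free proxies are slow and most checks end in a timeout.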
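The result looks like this:
Request.cs is shown below. It has two main methods. One validates a proxy by setting the Proxy property of an HttpWebRequest and requesting Baidu; the other issues a plain http request and returns the page content. Many articles read the response body and treat the proxy as valid only if the content matches the requested site, but in practice an HttpStatusCode of 200 is enough to decide.
[Note] This is a console program that uses async, so create it as a .NET Core project with C# language version 7.1 (needed for async Main). See: How to use async in a C# console program.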
    using System;
    using System.IO;
    using System.Net;
    using System.Threading.Tasks;

    public class Request
    {
        /// <summary>
        /// Checks whether a proxy works by requesting a URL through it.
        /// </summary>
        /// <param name="proxyIp">Proxy IP</param>
        /// <param name="proxyPort">Proxy port</param>
        /// <param name="timeout">Request timeout in milliseconds</param>
        /// <param name="url">URL to request</param>
        /// <param name="success">Callback on success</param>
        /// <param name="fail">Callback on failure</param>
        public static async Task getAsync(string proxyIp, int proxyPort, int timeout, string url, Action success, Action<string> fail)
        {
            HttpWebRequest request = null;
            HttpWebResponse response = null;
            try
            {
                request = (HttpWebRequest)WebRequest.Create(url);
                request.Timeout = timeout;
                request.KeepAlive = false;
                request.Proxy = new WebProxy(proxyIp, proxyPort);
                response = await request.GetResponseAsync() as HttpWebResponse;
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    success();
                }
                else
                {
                    fail(response.StatusCode + ":" + response.StatusDescription);
                }
            }
            catch (Exception ex)
            {
                fail("request failed: " + ex.Message);
            }
            finally
            {
                // Release the response first, then the request
                if (response != null)
                {
                    response.Close();
                }
                if (request != null)
                {
                    request.Abort();
                }
            }
        }

        /// <summary>
        /// Issues an http GET request and passes the response body to the success callback.
        /// </summary>
        /// <param name="url">URL to request</param>
        /// <param name="success">Callback on success, receives the response body</param>
        /// <param name="fail">Callback on failure</param>
        public static void get(string url, Action<string> success, Action<string> fail)
        {
            try
            {
                WebRequest request = WebRequest.Create(url);
                request.Timeout = 2000;
                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    if (response.StatusCode != HttpStatusCode.OK)
                    {
                        fail(response.StatusCode + ":" + response.StatusDescription);
                        return;
                    }
                    using (var reader = new StreamReader(response.GetResponseStream()))
                    {
                        success(reader.ReadToEnd());
                    }
                }
            }
            catch (Exception ex)
            {
                fail(ex.ToString());
            }
        }
    }
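For comparison, here is a sketch of the same status-code check written with HttpClient instead of HttpWebRequest. It is not the code this post uses; ProxyCheck and IsUsableAsync are hypothetical names, and the choice to read only the response headers is my assumption about what "status code 200 is enough" implies.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ProxyCheck
{
    // Returns true if a GET to `url` through the proxy answers with 200 OK within `timeout`.
    // Any connection error or timeout counts as "not usable".
    public static async Task<bool> IsUsableAsync(string proxyHost, int proxyPort, string url, TimeSpan timeout)
    {
        var handler = new HttpClientHandler
        {
            Proxy = new WebProxy(proxyHost, proxyPort),
            UseProxy = true
        };
        // The proxy is a property of the handler, so each proxy gets its own client here.
        using (var client = new HttpClient(handler) { Timeout = timeout })
        {
            try
            {
                // ResponseHeadersRead: the status code is all we need, so skip buffering the body.
                using (HttpResponseMessage response =
                    await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
                {
                    return response.StatusCode == HttpStatusCode.OK;
                }
            }
            catch (HttpRequestException)
            {
                return false; // proxy refused the connection or the request failed
            }
            catch (OperationCanceledException)
            {
                return false; // the timeout elapsed
            }
        }
    }
}
```

Returning a bool instead of taking success/fail callbacks makes the check easy to await and combine; the callback style in Request.cs works just as well for the console output in this post.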
ProxyIpHelper.cs has four main methods: CheckProxyIpAsync checks whether a proxy is usable, GetXicidailiProxy scrapes proxies from xicidaili.com, GetIp3366Proxy scrapes ip3366.net, and Get66ipProxy scrapes 66ip.cn. To scrape more sites, add more methods.
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    public class ProxyIpHelper
    {
        private static string address_xicidaili = "http://www.xicidaili.com/wn/{0}";
        private static string address_66ip = "http://www.66ip.cn/nmtq.php?getnum=20&isp=0&anonymoustype=0&start=&ports=&export=&ipaddress=&area=1&proxytype=1&api=66ip";
        private static string address_ip3366 = "http://www.ip3366.net/?stype=1&page={0}";

        /// <summary>
        /// Checks whether a proxy is usable.
        /// </summary>
        /// <param name="ipAddress">Proxy in "ip:port" form</param>
        /// <param name="success">Callback on success</param>
        /// <param name="fail">Callback on failure</param>
        public static async Task CheckProxyIpAsync(string ipAddress, Action success, Action<string> fail)
        {
            int index = ipAddress.IndexOf(":");
            string proxyIp = ipAddress.Substring(0, index);
            int proxyPort = int.Parse(ipAddress.Substring(index + 1));
            await Request.getAsync(proxyIp, proxyPort, 3000, "https://www.baidu.com/", success, fail);
        }

        /// <summary>
        /// Scrapes proxies from xicidaili.com; supports paging.
        /// </summary>
        /// <param name="page">Number of pages to fetch</param>
        public static List<string> GetXicidailiProxy(int page)
        {
            List<string> list = new List<string>();
            for (int p = 1; p <= page; p++)
            {
                string url = string.Format(address_xicidaili, p);
                Request.get(url, (docText) =>
                {
                    if (string.IsNullOrWhiteSpace(docText)) return;
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(docText);
                    // SelectSingleNode/SelectNodes return null when nothing matches
                    var table = doc.DocumentNode.SelectSingleNode("//table[@id='ip_list']");
                    var trNodes = table?.SelectNodes("./tr");
                    if (trNodes == null) return;
                    for (int i = 1; i < trNodes.Count; i++) // skip the first row (table header)
                    {
                        var tds = trNodes[i].SelectNodes("./td");
                        if (tds == null || tds.Count < 3) continue;
                        list.Add(tds[1].InnerText + ":" + int.Parse(tds[2].InnerText));
                    }
                }, (error) =>
                {
                    Console.WriteLine(error);
                });
            }
            return list;
        }

        /// <summary>
        /// Scrapes proxies from ip3366.net; supports paging.
        /// </summary>
        /// <param name="page">Number of pages to fetch</param>
        public static List<string> GetIp3366Proxy(int page)
        {
            List<string> list = new List<string>();
            for (int p = 1; p <= page; p++)
            {
                string url = string.Format(address_ip3366, p);
                Request.get(url, (docText) =>
                {
                    if (string.IsNullOrWhiteSpace(docText)) return;
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(docText);
                    // ".//tbody" is relative to the first table; "//tbody" would search the whole document
                    var table = doc.DocumentNode.SelectSingleNode("//table");
                    var trNodes = table?.SelectSingleNode(".//tbody")?.SelectNodes("./tr");
                    if (trNodes == null) return;
                    for (int i = 1; i < trNodes.Count; i++) // starts at 1, so the first row of the tbody is skipped
                    {
                        var tds = trNodes[i].SelectNodes("./td");
                        if (tds == null || tds.Count < 4) continue;
                        if (tds[3].InnerHtml == "HTTPS")
                        {
                            list.Add(tds[0].InnerText + ":" + int.Parse(tds[1].InnerText));
                        }
                    }
                }, (error) =>
                {
                    Console.WriteLine(error);
                });
            }
            return list;
        }

        /// <summary>
        /// Fetches proxies from 66ip.cn; no paging needed.
        /// </summary>
        public static List<string> Get66ipProxy()
        {
            List<string> list = new List<string>();
            Request.get(address_66ip, (docText) =>
            {
                if (string.IsNullOrWhiteSpace(docText)) return;
                // Matches "ip:port", e.g. 1.2.3.4:8080; at most 20 entries are taken
                string regex = @"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}";
                int count = 0;
                Match mstr = Regex.Match(docText, regex);
                while (mstr.Success && count < 20)
                {
                    list.Add(mstr.Groups[0].Value);
                    mstr = mstr.NextMatch();
                    count++;
                }
            }, (error) =>
            {
                Console.WriteLine(error);
            });
            return list;
        }
    }
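The regex in Get66ipProxy accepts strings such as 999.1.1.1:99999, and CheckProxyIpAsync throws on input without a colon. A stricter variant could look like the sketch below. ProxyText, Extract and TryParse are hypothetical names, and the validation rules (IPv4 only, octets 0 to 255, port 1 to 65535) are my assumptions about what counts as a usable entry.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ProxyText
{
    private static readonly Regex Endpoint =
        new Regex(@"\b\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}\b", RegexOptions.Compiled);

    // Pulls up to `max` well-formed "ip:port" entries out of arbitrary text.
    public static List<string> Extract(string text, int max)
    {
        var found = new List<string>();
        foreach (Match match in Endpoint.Matches(text ?? string.Empty))
        {
            if (found.Count >= max) break;
            if (TryParse(match.Value, out string host, out int port))
            {
                found.Add(host + ":" + port);
            }
        }
        return found;
    }

    // Splits "ip:port" and range-checks both parts; never throws on bad input.
    public static bool TryParse(string endpoint, out string host, out int port)
    {
        host = null;
        port = 0;
        if (string.IsNullOrWhiteSpace(endpoint)) return false;

        int colon = endpoint.LastIndexOf(':');
        if (colon <= 0 || colon == endpoint.Length - 1) return false;

        string hostPart = endpoint.Substring(0, colon).Trim();
        string portPart = endpoint.Substring(colon + 1).Trim();
        if (!int.TryParse(portPart, NumberStyles.None, CultureInfo.InvariantCulture, out int parsedPort)
            || parsedPort < 1 || parsedPort > 65535)
        {
            return false;
        }

        string[] octets = hostPart.Split('.');
        if (octets.Length != 4) return false;
        foreach (string octet in octets)
        {
            if (!int.TryParse(octet, NumberStyles.None, CultureInfo.InvariantCulture, out int value)
                || value > 255)
            {
                return false;
            }
        }

        host = hostPart;
        port = parsedPort;
        return true;
    }
}
```

Filtering malformed entries at this stage saves a 3-second timeout per bad entry later on.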
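C# has three kinds of timers; the one used here is from the System.Threading namespace, and its callbacks run on thread-pool threads rather than on the main thread. Three Timer objects are defined, one for each of the three sites. Each run keeps the list from the previous fetch; before checking, the new list is compared with it and only the entries that did not appear last time are checked. The share of valid proxies is low. No statistics are collected here, though with some further cleanup the code could do that. The program entry point, and the final implementation, is as follows: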
    using System;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        // true while no run for that site is in progress
        static bool timer_ip3366_isCompleted = true;
        static bool timer_xicidaili_isCompleted = true;
        static bool timer_66ip_isCompleted = true;
        static Timer timer_ip3366, timer_xicidaili, timer_66ip;
        // Proxies from the previous fetch; the next fetch is compared with these so only new entries are checked
        private static List<string> lastListip3366, lastList66ip, lastListxicidaili;

        static async Task Main(string[] args)
        {
            System.Net.ServicePointManager.DefaultConnectionLimit = 2000;
            Console.WriteLine("hello proxyIp");
            Console.ReadLine(); // press Enter to start
            lastList66ip = new List<string>();
            lastListip3366 = new List<string>();
            lastListxicidaili = new List<string>();
            timer_ip3366 = new Timer(async (state) => await TimerIp3366Async(), "processing timer_ip3366 event", 0, 1000 * 30);
            timer_xicidaili = new Timer(async (state) => await TimerXicidailiAsync(), "processing timer_xicidaili event", 0, 1000 * 60);
            timer_66ip = new Timer(async (state) => await Timer66ipAsync(), "processing timer_66ip event", 0, 1000 * 30);
            Console.ReadLine(); // press Enter to exit
        }

        private static async Task Timer66ipAsync()
        {
            if (!timer_66ip_isCompleted) return; // the previous run has not finished yet
            timer_66ip_isCompleted = false;
            try
            {
                var checkList = TakeNew("66ip.cn", ProxyIpHelper.Get66ipProxy(), ref lastList66ip);
                await CheckAndStoreAsync("66ip.cn", checkList, ConsoleColor.Green);
            }
            finally
            {
                timer_66ip_isCompleted = true; // reset even if a check throws
            }
        }

        private static async Task TimerXicidailiAsync()
        {
            if (!timer_xicidaili_isCompleted) return;
            timer_xicidaili_isCompleted = false;
            try
            {
                var checkList = TakeNew("xicidaili.com", ProxyIpHelper.GetXicidailiProxy(1), ref lastListxicidaili);
                await CheckAndStoreAsync("xicidaili.com", checkList, ConsoleColor.Red);
            }
            finally
            {
                timer_xicidaili_isCompleted = true;
            }
        }

        private static async Task TimerIp3366Async()
        {
            if (!timer_ip3366_isCompleted) return;
            timer_ip3366_isCompleted = false;
            try
            {
                var checkList = TakeNew("ip3366.net", ProxyIpHelper.GetIp3366Proxy(4), ref lastListip3366);
                await CheckAndStoreAsync("ip3366.net", checkList, ConsoleColor.Yellow);
            }
            finally
            {
                timer_ip3366_isCompleted = true;
            }
        }

        // Returns the entries of this fetch that were not in the previous one.
        // If the first fetch has 100 entries, all 100 are checked; if only 10 of the next 100 are new, only those 10 are checked.
        private static List<string> TakeNew(string source, List<string> fetched, ref List<string> last)
        {
            Console.ForegroundColor = ConsoleColor.DarkCyan;
            if (fetched.Count == 0)
            {
                Console.WriteLine(source + ": no proxies fetched");
                return new List<string>();
            }
            Console.WriteLine(source + ": fetched " + fetched.Count + " entries, comparing...");
            List<string> previous = last; // a ref parameter cannot be captured by the lambda below
            List<string> checkList = fetched.FindAll(f => !previous.Contains(f));
            last = fetched;
            if (checkList.Count == 0)
            {
                Console.WriteLine(source + ": no proxies need checking");
            }
            return checkList;
        }

        // Validates each proxy in turn and stores the working ones in Redis.
        private static async Task CheckAndStoreAsync(string source, List<string> checkList, ConsoleColor errorColor)
        {
            if (checkList.Count == 0) return;
            Console.ForegroundColor = ConsoleColor.DarkCyan;
            Console.WriteLine(source + ": " + checkList.Count + " entries to check, validating...");
            for (int i = 0; i < checkList.Count; i++)
            {
                string ipAddress = checkList[i];
                int taskNo = i;
                await ProxyIpHelper.CheckProxyIpAsync(ipAddress, () =>
                {
                    bool insertSuccess = RedisHelper.InsertSet(ipAddress);
                    Console.ForegroundColor = ConsoleColor.White;
                    Console.WriteLine(source);
                    Console.WriteLine((insertSuccess ? "success " : "duplicate insert ") + ipAddress
                        + " task no.: " + taskNo + " thread: " + Thread.CurrentThread.ManagedThreadId);
                }, (error) =>
                {
                    Console.ForegroundColor = errorColor;
                    Console.WriteLine(source);
                    Console.WriteLine("error: " + ipAddress + " " + error
                        + " task no.: " + taskNo + " thread: " + Thread.CurrentThread.ManagedThreadId);
                });
            }
            Console.ForegroundColor = ConsoleColor.DarkCyan;
            Console.WriteLine(source + ": " + checkList.Count + " entries checked, waiting for the next run");
        }
    }
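The two pieces of per-site state in Program, the "is a run in progress" flag and the list from the previous fetch, could be wrapped in a small class. The sketch below is one possible shape, not the code used above: SourceState and its members are hypothetical names, and swapping the bool flag for Interlocked and the List lookup for a HashSet are my substitutions. Timer callbacks may overlap on different thread-pool threads, which a plain bool check does not fully guard against.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class SourceState
{
    private HashSet<string> _last = new HashSet<string>();
    private int _running; // 0 = idle, 1 = a run is in progress

    // Atomically claims the run; returns false if another run is still in progress.
    public bool TryBegin()
    {
        return Interlocked.CompareExchange(ref _running, 1, 0) == 0;
    }

    // Marks the run as finished; call this from a finally block.
    public void End()
    {
        Interlocked.Exchange(ref _running, 0);
    }

    // Returns the entries that were not in the previous fetch (input order kept,
    // duplicates inside the fetch dropped) and remembers this fetch for next time.
    public List<string> TakeNew(IEnumerable<string> fetched)
    {
        var current = new HashSet<string>();
        var fresh = new List<string>();
        foreach (string item in fetched)
        {
            if (current.Add(item) && !_last.Contains(item))
            {
                fresh.Add(item);
            }
        }
        _last = current;
        return fresh;
    }
}
```

A timer callback would then start with `if (!state.TryBegin()) return;` and end with `state.End()` in a finally block. TakeNew is meant to be called only by the run that holds the claim, so it needs no locking of its own.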
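The third-party Redis library is StackExchange.Redis, from Stack Overflow (Stack Exchange). A proxy must not be stored twice, so the data structure is a Set. The stored value is very simple, just the ip plus the port. More information could be stored along with it, but that seems unnecessary, and it would be hard to put to use anyway. RedisHelper.cs is as follows: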
    using StackExchange.Redis;

    public class RedisHelper
    {
        private static readonly object Locker = new object();
        private static ConnectionMultiplexer _redis;
        private const string CONNECTTIONSTRING = "127.0.0.1:6379,DefaultDatabase=3";
        public const string REDIS_SET_KET_SUCCESS = "set_success_ip";

        // Lazily creates one shared connection (double-checked locking)
        private static ConnectionMultiplexer Manager
        {
            get
            {
                if (_redis == null)
                {
                    lock (Locker)
                    {
                        if (_redis == null)
                        {
                            _redis = GetManager();
                        }
                    }
                }
                return _redis;
            }
        }

        private static ConnectionMultiplexer GetManager(string connectionString = null)
        {
            if (string.IsNullOrEmpty(connectionString))
            {
                connectionString = CONNECTTIONSTRING;
            }
            return ConnectionMultiplexer.Connect(connectionString);
        }

        // Adds the value to the set; returns false if it was already present
        public static bool InsertSet(string value)
        {
            var db = Manager.GetDatabase();
            return db.SetAdd(REDIS_SET_KET_SUCCESS, value);
        }
    }
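The hand-written double-checked lock in Manager could also be expressed with Lazy<T>, which gives the same create-once behaviour. The sketch below uses a stand-in FakeConnection class so that it runs without a Redis server; in RedisHelper the factory would call ConnectionMultiplexer.Connect instead. ConnectionHolder and FakeConnection are names invented for this example.

```csharp
using System;
using System.Threading;

// Stand-in for an expensive shared connection object (hypothetical, for illustration).
public sealed class FakeConnection
{
    public static int Created; // counts how many instances were ever constructed

    public FakeConnection()
    {
        Interlocked.Increment(ref Created);
    }
}

public static class ConnectionHolder
{
    // ExecutionAndPublication: the factory runs at most once, even if several
    // threads read Instance at the same time.
    private static readonly Lazy<FakeConnection> Shared =
        new Lazy<FakeConnection>(() => new FakeConnection(),
                                 LazyThreadSafetyMode.ExecutionAndPublication);

    public static FakeConnection Instance
    {
        get { return Shared.Value; }
    }
}
```

Every read of `ConnectionHolder.Instance` returns the same object, so the Locker field and the null checks are no longer needed.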
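I will add the post on refreshing page views tomorrow. The code is not good enough yet, the share of valid proxies is still low, and I am not yet fluent with multithreading.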
--------
Update, July 6: refreshing the view count of CSDN articles is done; see: Refreshing CSDN article view counts with proxy IPs in C#