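Preface:
First, my apologies: I have been busy since the Spring Festival and have not kept this series up to date.
Recently, as the number of site sources we monitor keeps growing, some of them have put anti-crawling measures in place. As a result, the small crawler service in our SupportYun system often gets its IP banned and cannot collect data.
This is where the IP proxies that a fellow cnblogs reader mentioned earlier come on stage.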
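IP proxy pool design:
After reviewing and comparing a range of material, I decided to start by crawling the free proxies published on the major IP proxy websites and to build our own IP proxy pool from them.
In the end I crawled five relatively good IP proxy sites:
1. Xici Daili (西刺代理)
2. Kuaidaili (快代理)
3. Bige Daili (逼格代理)
4. proxy360
5. 66 Free Proxy (66免費代理)
The IP proxy pool design is as follows:
Simply put, tag the site sources that are already known to have anti-crawling measures, then modify every crawler service so that, when it meets a source carrying this tag, it first fetches a usable proxy IP at random from the IP proxy pool and only then crawls the data.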
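The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the SupportYun implementation: SiteSource, its HasAntiCrawl flag, and ProxyRouter are hypothetical names, and the pool is passed in as a delegate (assumed to return "address:port") so the sketch does not depend on Redis.

```csharp
using System;
using System.Net;

// Hypothetical model of a monitored site source; not from the original project.
public class SiteSource
{
    public string Url { get; set; }

    // The "tag" from the text: true when the site is known to block crawlers.
    public bool HasAntiCrawl { get; set; }
}

public static class ProxyRouter
{
    // Returns true when the source can be crawled:
    //   proxy == null  -> untagged source, connect directly
    //   proxy != null  -> tagged source, go through this proxy
    // Returns false when a proxy is required but the pool could not supply a valid one.
    public static bool TrySelectProxy(SiteSource source, Func<string> getProxyFromPool, out WebProxy proxy)
    {
        proxy = null;
        if (source == null) throw new ArgumentNullException("source");
        if (getProxyFromPool == null) throw new ArgumentNullException("getProxyFromPool");

        if (!source.HasAntiCrawl) return true;

        // Assumption: pool entries are stored as "address:port".
        string entry = getProxyFromPool();
        if (string.IsNullOrEmpty(entry)) return false;

        string[] parts = entry.Split(':');
        int port;
        if (parts.Length != 2 || !int.TryParse(parts[1], out port)) return false;

        proxy = new WebProxy(parts[0], port);
        return true;
    }
}
```

Returning false rather than falling back to a direct connection is a deliberate choice in this sketch: crawling a tagged source without a proxy is exactly what gets the IP banned.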
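Installing Redis:
First, we need a server to host our Redis service (clustering is out of scope for now).
I have never liked popping up a little black window and typing one command after another. Personally, I think the GUI is one of the major factors behind the rapid development of computing (if you disagree, please go easy on me).
After some searching, I found a simple Redis installer (Windows version, extremely easy to install) at the following address:
http://download.csdn.net/detail/cb511612371/9784687
I also found a post on cnblogs that introduces the Redis configuration file; here it is for reference: http://www.cnblogs.com/kreo/p/4423362.html
I only changed the memory limit, allowed connections from outside networks, and set a password; I left everything else alone.
Note that the configuration file is in the installation directory and is named: Redis.window-server.conf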
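For reference, the three changes mentioned above usually correspond to directives like the following in the Redis configuration file. The values are placeholders chosen for illustration (assumptions), not the author's actual settings.

```
# Cap the memory Redis may use (placeholder value)
maxmemory 256mb

# Listen on all interfaces so other machines can connect
bind 0.0.0.0

# Require clients to authenticate (placeholder password)
requirepass myRedisPassword
```

Since the instance accepts outside connections, it is worth restricting port 6379 at the firewall to the crawler machines.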
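As anyone familiar with it knows, ServiceStack.Redis, a C# driver for Redis, can be installed directly from NuGet. The catch is that it went commercial after version 4.0, with a limit of 6000 requests per hour, so you either download version 3.9 or consider another driver, such as StackExchange.Redis.
I use ServiceStack V3.9; the download link is: http://download.csdn.net/detail/cb511612371/9784626
Below is the RedisManageService I wrote on top of ServiceStack. The business logic is simple, so it only uses a handful of APIs; please bear with it.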
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using ServiceStack.Redis;

/// <summary>
/// Redis operation management service based on ServiceStack.
/// Only set storage is used at the moment.
/// </summary>
public class RedisManageService
{
    private static readonly string redisAddress = ConfigurationManager.AppSettings["RedisAddress"];
    private static readonly string redisPassword = "myRedisPassword";

    /// <summary>
    /// Get one random item from a set
    /// </summary>
    /// <param name="setName"></param>
    /// <returns></returns>
    public static string GetRandomItemFromSet(RedisSetNameEnum setName)
    {
        using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
        {
            var result = client.GetRandomItemFromSet(setName.ToString());
            if (result == null)
            {
                throw new Exception("Redis set " + setName.ToString() + " has no data left!");
            }
            return result;
        }
    }

    /// <summary>
    /// Remove the given item from a set
    /// </summary>
    /// <param name="setName"></param>
    /// <param name="value"></param>
    /// <returns></returns>
    public static void RemoveItemFromSet(RedisSetNameEnum setName, string value)
    {
        using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
        {
            client.RemoveItemFromSet(setName.ToString(), value);
        }
    }

    /// <summary>
    /// Add one item to a set
    /// </summary>
    /// <param name="setName"></param>
    /// <param name="value"></param>
    public static void AddItemToSet(RedisSetNameEnum setName, string value)
    {
        using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
        {
            client.AddItemToSet(setName.ToString(), value);
        }
    }

    /// <summary>
    /// Add a list of items to a set
    /// </summary>
    /// <param name="setName"></param>
    /// <param name="values"></param>
    public static void AddItemListToSet(RedisSetNameEnum setName, List<string> values)
    {
        using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
        {
            client.AddRangeToSet(setName.ToString(), values);
        }
    }

    /// <summary>
    /// Check whether a value already exists in a set
    /// </summary>
    /// <param name="setName"></param>
    /// <param name="value"></param>
    /// <returns></returns>
    public static bool JudgeItemInSet(RedisSetNameEnum setName, string value)
    {
        using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
        {
            // Note: this enumerates the set's members on the client side and compares them one by one.
            return client.Sets[setName.ToString()].Any(t => t == value);
        }
    }

    /// <summary>
    /// Get the number of items in a set
    /// </summary>
    /// <param name="setName"></param>
    /// <returns></returns>
    public static long GetSetCount(RedisSetNameEnum setName)
    {
        using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
        {
            return client.GetSetCount(setName.ToString());
        }
    }
}
Implementing the free proxy IP crawling service:
We start by designing the simplest possible IpProxy object:
/// <summary>
/// IP proxy object
/// </summary>
public class IpProxy
{
    /// <summary>
    /// IP address
    /// </summary>
    public string Address { get; set; }

    /// <summary>
    /// Port
    /// </summary>
    public int Port { get; set; }
}
Then we implement a Redis-backed service for operating on the IP proxy pool:
using System;

/// <summary>
/// Redis-backed proxy pool management service
/// </summary>
public class PoolManageService
{
    /// <summary>
    /// Get one random proxy from the pool
    /// </summary>
    /// <returns></returns>
    public static string GetProxy()
    {
        string result = string.Empty;

        try
        {
            result = RedisManageService.GetRandomItemFromSet(RedisSetNameEnum.ProxyPool);
            if (result != null)
            {
                // Entries are stored as "address:port"
                string[] parts = result.Split(new[] { ':' });
                if (!HttpHelper.IsAvailable(parts[0], int.Parse(parts[1])))
                {
                    // Dead proxy: remove it and try another one (recursive retry)
                    DeleteProxy(result);
                    return GetProxy();
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to get a proxy from the proxy pool", e));
        }
        return result;
    }

    /// <summary>
    /// Delete one proxy from the pool
    /// </summary>
    /// <param name="value"></param>
    public static void DeleteProxy(string value)
    {
        try
        {
            RedisManageService.RemoveItemFromSet(RedisSetNameEnum.ProxyPool, value);
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to delete a proxy from the proxy pool", e));
        }
    }

    /// <summary>
    /// Add one proxy to the pool (only if it is currently usable)
    /// </summary>
    /// <param name="proxy"></param>
    public static void Add(IpProxy proxy)
    {
        try
        {
            if (HttpHelper.IsAvailable(proxy.Address, proxy.Port))
            {
                RedisManageService.AddItemToSet(RedisSetNameEnum.ProxyPool, proxy.Address + ":" + proxy.Port.ToString());
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to add a proxy to the proxy pool", e));
        }
    }
}
It provides three simple methods: add a proxy IP, delete a proxy IP, and get one random proxy IP.
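GetProxy retries by calling itself, so a pool full of dead proxies costs one stack frame per dead entry. A loop-based variant might look like the sketch below. It is an illustration under assumptions: InMemoryProxyPool is a hypothetical stand-in for the Redis set, and the availability check is injected as a delegate in place of HttpHelper.IsAvailable, whose implementation is not shown in this article. The class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the Redis-backed pool (illustration only).
public class InMemoryProxyPool
{
    private readonly HashSet<string> _entries = new HashSet<string>();
    private readonly Random _random = new Random();
    private readonly Func<string, int, bool> _isAvailable;

    public InMemoryProxyPool(Func<string, int, bool> isAvailable)
    {
        if (isAvailable == null) throw new ArgumentNullException("isAvailable");
        _isAvailable = isAvailable;
    }

    public int Count
    {
        get { return _entries.Count; }
    }

    // Mirrors Add in the article: only proxies that pass the check are stored.
    public void Add(string address, int port)
    {
        if (_isAvailable(address, port))
        {
            _entries.Add(address + ":" + port);
        }
    }

    // Returns a live "address:port" entry, or an empty string when none is left.
    public string GetProxy()
    {
        while (_entries.Count > 0)
        {
            string entry = _entries.ElementAt(_random.Next(_entries.Count));
            string[] parts = entry.Split(':');
            int port;
            if (parts.Length == 2 && int.TryParse(parts[1], out port) && _isAvailable(parts[0], port))
            {
                return entry;
            }

            // Dead or malformed entry: evict it and pick another one.
            _entries.Remove(entry);
        }
        return string.Empty;
    }
}
```

The loop ends either with a live proxy or with an empty pool, so callers can keep treating an empty string as "no proxy available".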
We also need a crawler service to collect the free proxy IP data:
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

/// <summary>
/// IP pool crawling spider
/// TODO: proxy list sites change quickly; keep an eye on the logs
/// </summary>
public class IpPoolSpider
{
    public void Initial()
    {
        // Note: Downkuaidaili is defined below but is not queued here.
        ThreadPool.QueueUserWorkItem(Downloadproxy360);
        ThreadPool.QueueUserWorkItem(DownloadproxyBiGe);
        ThreadPool.QueueUserWorkItem(Downloadproxy66);
        ThreadPool.QueueUserWorkItem(Downloadxicidaili);
    }

    // Download the HTML pages of Xici Daili
    public void Downloadxicidaili(object state)
    {
        try
        {
            List<string> list = new List<string>()
            {
                "http://www.xicidaili.com/nt/",
                "http://www.xicidaili.com/nn/",
                "http://www.xicidaili.com/wn/",
                "http://www.xicidaili.com/wt/"
            };
            foreach (var urlItem in list)
            {
                for (int i = 1; i < 5; i++)
                {
                    string url = urlItem + i.ToString();

                    // This site is itself fetched through a proxy from the pool
                    var ipProxy = PoolManageService.GetProxy();
                    if (string.IsNullOrEmpty(ipProxy))
                    {
                        LogUtils.ErrorLog(new Exception("No usable proxy IP in the IP proxy pool"));
                        return;
                    }
                    var ip = ipProxy;
                    WebProxy webproxy;
                    if (ipProxy.Contains(":"))
                    {
                        ip = ipProxy.Split(new[] { ':' })[0];
                        var port = int.Parse(ipProxy.Split(new[] { ':' })[1]);
                        webproxy = new WebProxy(ip, port);
                    }
                    else
                    {
                        webproxy = new WebProxy(ip);
                    }
                    string html = HttpHelper.DownloadHtml(url, webproxy);
                    if (string.IsNullOrEmpty(html))
                    {
                        LogUtils.ErrorLog(new Exception("Proxy list URL: " + url + " could not be accessed"));
                        continue;
                    }

                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(html);
                    HtmlNode node = doc.DocumentNode;
                    string xpathstring = "//tr[@class='odd']";
                    HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                    if (collection == null)
                    {
                        // SelectNodes returns null when nothing matches
                        continue;
                    }
                    foreach (var item in collection)
                    {
                        var proxy = new IpProxy();
                        string xpath = "td[2]";
                        proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                        xpath = "td[3]";
                        proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                        Task.Run(() =>
                        {
                            PoolManageService.Add(proxy);
                        });
                    }
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the Xici Daili IP list", e));
        }
    }

    // Download Kuaidaili
    public void Downkuaidaili(object state)
    {
        try
        {
            string url = "http://www.kuaidaili.com/proxylist/";
            for (int i = 1; i < 4; i++)
            {
                string html = HttpHelper.DownloadHtml(url + i.ToString() + "/", null);
                if (string.IsNullOrEmpty(html))
                {
                    LogUtils.ErrorLog(new Exception("Proxy list URL: " + url + i.ToString() + "/ could not be accessed"));
                    continue;
                }
                string xpath = "//tbody/tr";
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                HtmlNode node = doc.DocumentNode;
                HtmlNodeCollection collection = node.SelectNodes(xpath);
                if (collection == null)
                {
                    continue;
                }
                foreach (var item in collection)
                {
                    var proxy = new IpProxy();
                    proxy.Address = item.FirstChild.InnerHtml;
                    xpath = "td[2]";
                    proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                    Task.Run(() =>
                    {
                        PoolManageService.Add(proxy);
                    });
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the Kuaidaili IP list", e));
        }
    }

    // Download proxy360
    public void Downloadproxy360(object state)
    {
        try
        {
            string url = "http://www.proxy360.cn/default.aspx";
            string html = HttpHelper.DownloadHtml(url, null);
            if (string.IsNullOrEmpty(html))
            {
                LogUtils.ErrorLog(new Exception("Proxy list URL: " + url + " could not be accessed"));
                return;
            }
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            string xpathstring = "//div[@class='proxylistitem']";
            HtmlNode node = doc.DocumentNode;
            HtmlNodeCollection collection = node.SelectNodes(xpathstring);
            if (collection == null)
            {
                return;
            }

            foreach (var item in collection)
            {
                var proxy = new IpProxy();
                var childnode = item.ChildNodes[1];
                xpathstring = "span[1]";
                proxy.Address = childnode.SelectSingleNode(xpathstring).InnerHtml.Trim();
                xpathstring = "span[2]";
                proxy.Port = int.Parse(childnode.SelectSingleNode(xpathstring).InnerHtml);
                Task.Run(() =>
                {
                    PoolManageService.Add(proxy);
                });
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the proxy360 IP list", e));
        }
    }

    // Download Bige Daili
    public void DownloadproxyBiGe(object state)
    {
        try
        {
            List<string> list = new List<string>()
            {
                "http://www.bigdaili.com/dailiip/1/{0}.html",
                "http://www.bigdaili.com/dailiip/2/{0}.html",
                "http://www.bigdaili.com/dailiip/3/{0}.html",
                "http://www.bigdaili.com/dailiip/4/{0}.html"
            };
            foreach (var urlItem in list)
            {
                for (int i = 1; i < 5; i++)
                {
                    string url = String.Format(urlItem, i);
                    string html = HttpHelper.DownloadHtml(url, null);
                    if (string.IsNullOrEmpty(html))
                    {
                        LogUtils.ErrorLog(new Exception("Proxy list URL: " + url + " could not be accessed"));
                        continue;
                    }

                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(html);
                    HtmlNode node = doc.DocumentNode;
                    string xpathstring = "//tbody/tr";
                    HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                    if (collection == null)
                    {
                        continue;
                    }
                    foreach (var item in collection)
                    {
                        var proxy = new IpProxy();
                        var xpath = "td[1]";
                        proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                        xpath = "td[2]";
                        proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                        Task.Run(() =>
                        {
                            PoolManageService.Add(proxy);
                        });
                    }
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the Bige Daili IP list", e));
        }
    }

    // Download 66 Free Proxy
    public void Downloadproxy66(object state)
    {
        try
        {
            List<string> list = new List<string>()
            {
                "http://www.66ip.cn/areaindex_35/index.html",
                "http://www.66ip.cn/areaindex_35/2.html",
                "http://www.66ip.cn/areaindex_35/3.html"
            };
            foreach (var urlItem in list)
            {
                string url = urlItem;
                string html = HttpHelper.DownloadHtml(url, null);
                if (string.IsNullOrEmpty(html))
                {
                    LogUtils.ErrorLog(new Exception("Proxy list URL: " + url + " could not be accessed"));
                    break;
                }

                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                HtmlNode node = doc.DocumentNode;
                string xpathstring = "//table[@bordercolor='#6699ff']/tr";
                HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                if (collection == null)
                {
                    continue;
                }
                foreach (var item in collection)
                {
                    var proxy = new IpProxy();
                    var xpath = "td[1]";
                    proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                    if (proxy.Address.Contains("ip"))
                    {
                        // Skip the table header row
                        continue;
                    }
                    xpath = "td[2]";
                    proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                    Task.Run(() =>
                    {
                        PoolManageService.Add(proxy);
                    });
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the 66 Free Proxy IP list", e));
        }
    }
}
There is nothing particularly deep in this code, so I won't explain it in detail.
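One thing worth pointing out is that every download method ends with the same step: read two cell values and turn them into an address and a port, with int.Parse throwing on any unexpected cell (a header row, stray whitespace, nested markup). A shared helper along the following lines could make that step tolerant. ProxyRowParser and TryParse are hypothetical names introduced for illustration; the sketch only accepts dotted IPv4 addresses, which is an assumption about what these sites list.

```csharp
using System;
using System.Net;

// Hypothetical helper (illustration only): validates raw cell text from a proxy list row.
public static class ProxyRowParser
{
    // Returns false instead of throwing when the row is a header or malformed.
    public static bool TryParse(string addressCell, string portCell, out string address, out int port)
    {
        address = null;
        port = 0;
        if (addressCell == null || portCell == null) return false;

        string candidate = addressCell.Trim();
        IPAddress parsed;

        // Require four dot-separated parts: IPAddress.TryParse alone also accepts inputs such as "1".
        if (candidate.Split('.').Length != 4 || !IPAddress.TryParse(candidate, out parsed))
        {
            return false;
        }

        int parsedPort;
        if (!int.TryParse(portCell.Trim(), out parsedPort) || parsedPort < 1 || parsedPort > 65535)
        {
            return false;
        }

        address = candidate;
        port = parsedPort;
        return true;
    }
}
```

With a helper like this, a single bad row would be skipped rather than aborting the whole page through the outer catch block.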
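As mentioned earlier, my crawler services are all deployed as Windows services. I used to rely on a Timer to loop at a fixed interval; this time I brought in the Quartz.NET job scheduling framework, which makes the code look a bit cleaner.
Quartz.NET can be installed directly from NuGet.
First, write the overall scheduling job class for the proxy pool, ProxyPoolTotalJob, which implements the IJob interface: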
using Quartz;

/// <summary>
/// Overall scheduling job for the proxy pool
/// </summary>
class ProxyPoolTotalJob : IJob
{
    public void Execute(IJobExecutionContext context)
    {
        var spider = new IpPoolSpider();
        spider.Initial();
    }
}
Next is the Run() method, which is called from OnStart:
// Requires: using System; using Quartz; using Quartz.Impl;
private static void Run()
{
    try
    {
        StdSchedulerFactory factory = new StdSchedulerFactory();
        IScheduler scheduler = factory.GetScheduler();
        scheduler.Start();
        IJobDetail job = JobBuilder.Create<ProxyPoolTotalJob>().WithIdentity("job1", "group1").Build();
        ITrigger trigger = TriggerBuilder.Create()
            .WithIdentity("trigger1", "group1")
            .StartNow()
            .WithSimpleSchedule(
                x => x
                    .WithIntervalInMinutes(28) // once every 28 minutes
                    .RepeatForever()
            ).Build();
        scheduler.ScheduleJob(job, trigger);
    }
    catch (SchedulerException se)
    {
        Console.WriteLine(se);
    }
}
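Finally, when fetching HTML pages from sources with anti-crawling measures, use a proxy IP. I trust everyone knows how: just set the Proxy property of the webRequest.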
webRequest.Proxy = new WebProxy(ip, port);
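In slightly fuller form, that could look like the following sketch. ProxiedRequestFactory and the default timeout are assumptions added for illustration; the request is only built here, not sent.

```csharp
using System;
using System.Net;

// Hypothetical helper (illustration only): builds a request routed through a proxy.
public static class ProxiedRequestFactory
{
    // Builds, but does not send, a GET request that goes through the given proxy.
    public static HttpWebRequest Create(string url, string ip, int port, int timeoutMs = 10000)
    {
        var webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Method = "GET";

        // Free proxies are often slow or dead, so an explicit timeout keeps the crawler from hanging.
        webRequest.Timeout = timeoutMs;
        webRequest.Proxy = new WebProxy(ip, port);
        return webRequest;
    }
}
```

When a request through a given proxy fails, the caller can delete that proxy from the pool and retry with another one.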
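With that, we have a free proxy IP pool built on Redis. Our crawler service, whose IPs kept getting banned, is back at full health and off collecting new data.
This is an original article, and the code is pasted from my own project. Please credit the source when reposting.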