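A Record of Upgrading an Enterprise-Grade Crawler System (Part 6): A Free IP Proxy Pool Built on Redis

 Preface:

  First, my apologies: I have been busy since the Spring Festival and have not kept this series up to date.

  Recently, as we monitor more and more source sites, a few of them have added anti-crawling measures. As a result, the small crawler services in our SupportYun system often get their IPs blocked and cannot collect data.

  This is where the IP proxies that readers suggested earlier come into play.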

 

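IP proxy pool design:

  After reviewing and researching a range of material, I decided to start by crawling the free proxies published on the major IP proxy websites and to build our own IP proxy pool from them.

  In the end I crawled five of the better-quality IP proxy sites:

    1. 西刺代理 (xicidaili.com)

    2. 快代理 (kuaidaili.com)

    3. 逼格代理 (bigdaili.com)

    4. proxy360

    5. 66免費代理 (66ip.cn)

  The IP proxy pool design is as follows: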

 

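  Put simply: we tag the source sites that are known to have anti-crawling measures, and modify every crawler service so that, when it meets a tagged source, it first fetches a random usable proxy IP from the pool and only then crawls the data.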
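  The tagging rule can be sketched roughly as below. This is only an illustration under assumptions: SiteSource, HasAntiCrawl and ProxySelector are hypothetical names that do not come from the SupportYun code, and the pool read is passed in as a delegate so the sketch stays independent of Redis.

```csharp
using System;

// Hypothetical model of a monitored source site; the property names are assumptions.
public class SiteSource
{
    public string Url { get; set; }

    // The "tag": true when the site is known to have anti-crawling measures.
    public bool HasAntiCrawl { get; set; }
}

public static class ProxySelector
{
    // Returns an "ip:port" string for tagged sources, or null when the
    // source can be crawled directly. pickRandomProxy stands in for
    // whatever reads a random entry from the proxy pool.
    public static string ChooseProxy(SiteSource source, Func<string> pickRandomProxy)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (pickRandomProxy == null) throw new ArgumentNullException(nameof(pickRandomProxy));

        if (!source.HasAntiCrawl)
        {
            return null; // untagged source: no proxy needed
        }
        return pickRandomProxy();
    }
}
```

  A crawler would call ChooseProxy once per request and attach a proxy only when the result is not null, so untagged sources keep their direct (and usually faster) connection.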

  

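Installing Redis:

  First we need a server on which to deploy our Redis service (leaving clustering aside for now).

  I have never liked popping up a little black window and typing command after command. Personally, I think the GUI is one of the major factors behind the rapid growth of computing (if you disagree, please be gentle).

  After some digging I found a simple Redis installer package (Windows version, extremely easy to install), available here:

  http://download.csdn.net/detail/cb511612371/9784687

  On cnblogs I found a post that explains the Redis configuration file; I am linking it for reference: http://www.cnblogs.com/kreo/p/4423362.html

  For my part, I only changed the memory limit, allowed connections from outside the machine, and set a password. I did not change much else.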
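  For reference, those three changes usually map onto the standard directives below. This is a sketch rather than my actual file: the memory size and the password are placeholder values, and binding to all interfaces is only one way to allow outside connections.

```conf
# Cap the memory Redis may use (placeholder value)
maxmemory 256mb

# Listen on all network interfaces so other machines can connect
bind 0.0.0.0

# Require clients to authenticate before issuing commands (placeholder password)
requirepass myRedisPassword
```

  Once Redis accepts outside connections, it is worth restricting the port to the crawler machines with a firewall rule, since a password alone is weak protection for an exposed instance.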

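  Note that the configuration file sits in the installation directory and is named: redis.windows-service.conf

  As anyone familiar with it knows, the C# Redis driver ServiceStack.Redis can be installed directly from NuGet. The catch is that it went commercial from version 4.0 onward and the free quota is limited to 6,000 requests per hour, so you either download version 3.9 or consider another driver such as StackExchange.Redis.

  I use ServiceStack V3.9; here is the download link: http://download.csdn.net/detail/cb511612371/9784626

  Below is the RedisManageService I wrote on top of ServiceStack. The business logic is simple, so it uses only a handful of APIs; please bear with it.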

    using System;
    using System.Collections.Generic;
    using System.Configuration;
    using ServiceStack.Redis;

    /// <summary>
    /// Redis operations service built on ServiceStack.
    /// Only set storage is used at the moment.
    /// RedisSetNameEnum is the project's enum of set names (ProxyPool is the member used later).
    /// </summary>
    public class RedisManageService
    {
        private static readonly string redisAddress = ConfigurationManager.AppSettings["RedisAddress"];
        private static readonly string redisPassword = "myRedisPassword";

        /// <summary>
        /// Gets one random item from a set.
        /// </summary>
        /// <param name="setName">Name of the set</param>
        /// <returns>The randomly chosen item</returns>
        public static string GetRandomItemFromSet(RedisSetNameEnum setName)
        {
            using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
            {
                var result = client.GetRandomItemFromSet(setName.ToString());
                if (result == null)
                {
                    throw new Exception("Redis set " + setName.ToString() + " has no data left!");
                }
                return result;
            }
        }

        /// <summary>
        /// Removes the given item from a set.
        /// </summary>
        /// <param name="setName">Name of the set</param>
        /// <param name="value">Item to remove</param>
        public static void RemoveItemFromSet(RedisSetNameEnum setName, string value)
        {
            using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
            {
                client.RemoveItemFromSet(setName.ToString(), value);
            }
        }

        /// <summary>
        /// Adds one item to a set.
        /// </summary>
        /// <param name="setName">Name of the set</param>
        /// <param name="value">Item to add</param>
        public static void AddItemToSet(RedisSetNameEnum setName, string value)
        {
            using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
            {
                client.AddItemToSet(setName.ToString(), value);
            }
        }

        /// <summary>
        /// Adds a list of items to a set.
        /// </summary>
        /// <param name="setName">Name of the set</param>
        /// <param name="values">Items to add</param>
        public static void AddItemListToSet(RedisSetNameEnum setName, List<string> values)
        {
            using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
            {
                client.AddRangeToSet(setName.ToString(), values);
            }
        }

        /// <summary>
        /// Checks whether a value already exists in a set.
        /// </summary>
        /// <param name="setName">Name of the set</param>
        /// <param name="value">Value to look for</param>
        /// <returns>True when the value is a member of the set</returns>
        public static bool JudgeItemInSet(RedisSetNameEnum setName, string value)
        {
            using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
            {
                // Ask Redis for membership directly instead of pulling the
                // whole set to the client and scanning it with LINQ.
                return client.SetContainsItem(setName.ToString(), value);
            }
        }

        /// <summary>
        /// Gets the number of items in a set.
        /// </summary>
        /// <param name="setName">Name of the set</param>
        /// <returns>Item count</returns>
        public static long GetSetCount(RedisSetNameEnum setName)
        {
            using (RedisClient client = new RedisClient(redisAddress, 6379, redisPassword))
            {
                return client.GetSetCount(setName.ToString());
            }
        }
    }

 

Implementing the free proxy IP crawling service:

  We first design the simplest possible IpProxy object:

    /// <summary>
    /// IP proxy object
    /// </summary>
    public class IpProxy
    {
        /// <summary>
        /// IP address
        /// </summary>
        public string Address { get; set; }

        /// <summary>
        /// Port
        /// </summary>
        public int Port { get; set; }
    }

  Next we implement a Redis-based service for operating on the IP proxy pool:

    using System;

    /// <summary>
    /// Redis-based proxy pool management service
    /// </summary>
    public class PoolManageService
    {
        /// <summary>
        /// Gets one random proxy from the pool, as an "ip:port" string.
        /// </summary>
        /// <returns>A usable proxy, or an empty string on failure</returns>
        public static string GetProxy()
        {
            string result = string.Empty;

            try
            {
                result = RedisManageService.GetRandomItemFromSet(RedisSetNameEnum.ProxyPool);
                if (result != null)
                {
                    // Entries are stored as "ip:port"; split once and reuse the parts
                    string[] parts = result.Split(':');
                    if (!HttpHelper.IsAvailable(parts[0], int.Parse(parts[1])))
                    {
                        // The proxy is dead: remove it and try another one
                        DeleteProxy(result);
                        return GetProxy();
                    }
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to get a proxy from the proxy pool", e));
                // Do not hand back an entry that was never verified
                result = string.Empty;
            }
            return result;
        }

        /// <summary>
        /// Deletes one proxy from the pool.
        /// </summary>
        /// <param name="value">The "ip:port" entry to delete</param>
        public static void DeleteProxy(string value)
        {
            try
            {
                RedisManageService.RemoveItemFromSet(RedisSetNameEnum.ProxyPool, value);
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to delete a proxy from the proxy pool", e));
            }
        }

        /// <summary>
        /// Adds one proxy to the pool, but only if it is currently usable.
        /// </summary>
        /// <param name="proxy">The proxy to add</param>
        public static void Add(IpProxy proxy)
        {
            try
            {
                if (HttpHelper.IsAvailable(proxy.Address, proxy.Port))
                {
                    RedisManageService.AddItemToSet(RedisSetNameEnum.ProxyPool, proxy.Address + ":" + proxy.Port.ToString());
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to add a proxy to the proxy pool", e));
            }
        }
    }

  It offers three simple methods: add a proxy IP, delete a proxy IP, and fetch a random proxy IP.
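  One thing to watch in the fetch method is that it calls itself until a live proxy turns up, which can recurse a long way when the pool is full of dead entries. A bounded loop is one alternative. The sketch below is not the project's code: ProxyPicker, PickAvailable and TryParseEndpoint are hypothetical names, and the three delegates are stand-ins for the pool read, the availability check and the pool delete.

```csharp
using System;

public static class ProxyPicker
{
    // Tries at most maxAttempts candidates and returns the first one that
    // passes the availability check, or null when none does.
    public static string PickAvailable(
        Func<string> getRandom,
        Func<string, int, bool> isAvailable,
        Action<string> remove,
        int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            string candidate = getRandom();
            if (string.IsNullOrEmpty(candidate))
            {
                return null; // the pool is empty
            }

            string host;
            int port;
            if (TryParseEndpoint(candidate, out host, out port) && isAvailable(host, port))
            {
                return candidate;
            }

            remove(candidate); // malformed or dead entry: drop it and try again
        }
        return null;
    }

    // Splits "ip:port" into its parts; returns false for anything else.
    public static bool TryParseEndpoint(string text, out string host, out int port)
    {
        host = null;
        port = 0;
        if (string.IsNullOrEmpty(text)) return false;

        int colon = text.LastIndexOf(':');
        if (colon <= 0 || colon == text.Length - 1) return false;

        int value;
        if (!int.TryParse(text.Substring(colon + 1), out value)) return false;
        if (value < 1 || value > 65535) return false;

        host = text.Substring(0, colon);
        port = value;
        return true;
    }
}
```

  Passing the pool operations in as delegates keeps the retry logic testable without a Redis server; in the real service they would simply wrap the RedisManageService and HttpHelper calls.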

  We also need a crawler service to crawl the free proxy IP data we want:

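    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Threading;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    /// <summary>
    /// Spider that crawls proxies for the IP pool
    /// TODO: proxy sites change quickly; keep an eye on the log monitoring
    /// </summary>
    public class IpPoolSpider
    {
        public void Initial()
        {
            // Each site is crawled on its own thread-pool thread.
            // Note: Downkuaidaili (defined below) is not queued here.
            ThreadPool.QueueUserWorkItem(Downloadproxy360);
            ThreadPool.QueueUserWorkItem(DownloadproxyBiGe);
            ThreadPool.QueueUserWorkItem(Downloadproxy66);
            ThreadPool.QueueUserWorkItem(Downloadxicidaili);
        }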
 14 
        // Download and parse the HTML pages of 西刺代理
        public void Downloadxicidaili(object state)
        {
            try
            {
                List<string> list = new List<string>()
                {
                    "http://www.xicidaili.com/nt/",
                    "http://www.xicidaili.com/nn/",
                    "http://www.xicidaili.com/wn/",
                    "http://www.xicidaili.com/wt/"
                };
                foreach (var urlItem in list)
                {
                    // Pages 1 to 4 of each list
                    for (int i = 1; i < 5; i++)
                    {
                        string url = urlItem + i.ToString();
                        // This site is fetched through a proxy taken from the pool
                        var ipProxy = PoolManageService.GetProxy();
                        if (string.IsNullOrEmpty(ipProxy))
                        {
                            LogUtils.ErrorLog(new Exception("No usable proxy IP in the IP proxy pool"));
                            return;
                        }
                        WebProxy webproxy;
                        if (ipProxy.Contains(":"))
                        {
                            string[] parts = ipProxy.Split(':');
                            webproxy = new WebProxy(parts[0], int.Parse(parts[1]));
                        }
                        else
                        {
                            webproxy = new WebProxy(ipProxy);
                        }
                        string html = HttpHelper.DownloadHtml(url, webproxy);
                        if (string.IsNullOrEmpty(html))
                        {
                            LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be fetched"));
                            continue;
                        }

                        HtmlDocument doc = new HtmlDocument();
                        doc.LoadHtml(html);
                        HtmlNode node = doc.DocumentNode;
                        string xpathstring = "//tr[@class='odd']";
                        HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                        if (collection == null)
                        {
                            // SelectNodes returns null when nothing matches
                            continue;
                        }
                        foreach (var item in collection)
                        {
                            var proxy = new IpProxy();
                            string xpath = "td[2]";
                            proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                            xpath = "td[3]";
                            proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                            // Availability is checked in the background before adding
                            Task.Run(() =>
                            {
                                PoolManageService.Add(proxy);
                            });
                        }
                    }
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to download the 西刺代理 IP pool", e));
            }
        }
 83 
        // Download 快代理
        public void Downkuaidaili(object state)
        {
            try
            {
                string url = "http://www.kuaidaili.com/proxylist/";
                // Pages 1 to 3
                for (int i = 1; i < 4; i++)
                {
                    string pageUrl = url + i.ToString() + "/";
                    string html = HttpHelper.DownloadHtml(pageUrl, null);
                    if (string.IsNullOrEmpty(html))
                    {
                        LogUtils.ErrorLog(new Exception("Proxy site URL: " + pageUrl + " could not be fetched"));
                        continue;
                    }
                    string xpath = "//tbody/tr";
                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(html);
                    HtmlNode node = doc.DocumentNode;
                    HtmlNodeCollection collection = node.SelectNodes(xpath);
                    if (collection == null)
                    {
                        // SelectNodes returns null when nothing matches
                        continue;
                    }
                    foreach (var item in collection)
                    {
                        var proxy = new IpProxy();
                        proxy.Address = item.FirstChild.InnerHtml;
                        xpath = "td[2]";
                        proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                        Task.Run(() =>
                        {
                            PoolManageService.Add(proxy);
                        });
                    }
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to download the 快代理 IP pool", e));
            }
        }
116 
        // Download proxy360
        public void Downloadproxy360(object state)
        {
            try
            {
                string url = "http://www.proxy360.cn/default.aspx";
                string html = HttpHelper.DownloadHtml(url, null);
                if (string.IsNullOrEmpty(html))
                {
                    LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be fetched"));
                    return;
                }
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                string xpathstring = "//div[@class='proxylistitem']";
                HtmlNode node = doc.DocumentNode;
                HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                if (collection == null)
                {
                    // SelectNodes returns null when nothing matches
                    return;
                }

                foreach (var item in collection)
                {
                    var proxy = new IpProxy();
                    var childnode = item.ChildNodes[1];
                    xpathstring = "span[1]";
                    proxy.Address = childnode.SelectSingleNode(xpathstring).InnerHtml.Trim();
                    xpathstring = "span[2]";
                    proxy.Port = int.Parse(childnode.SelectSingleNode(xpathstring).InnerHtml);
                    Task.Run(() =>
                    {
                        PoolManageService.Add(proxy);
                    });
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to download the proxy360 IP pool", e));
            }
        }
154 
        // Download 逼格代理
        public void DownloadproxyBiGe(object state)
        {
            try
            {
                List<string> list = new List<string>()
                {
                    "http://www.bigdaili.com/dailiip/1/{0}.html",
                    "http://www.bigdaili.com/dailiip/2/{0}.html",
                    "http://www.bigdaili.com/dailiip/3/{0}.html",
                    "http://www.bigdaili.com/dailiip/4/{0}.html"
                };
                foreach (var urlItem in list)
                {
                    // Pages 1 to 4 of each list
                    for (int i = 1; i < 5; i++)
                    {
                        string url = String.Format(urlItem, i);
                        string html = HttpHelper.DownloadHtml(url, null);
                        if (string.IsNullOrEmpty(html))
                        {
                            LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be fetched"));
                            continue;
                        }

                        HtmlDocument doc = new HtmlDocument();
                        doc.LoadHtml(html);
                        HtmlNode node = doc.DocumentNode;
                        string xpathstring = "//tbody/tr";
                        HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                        if (collection == null)
                        {
                            // SelectNodes returns null when nothing matches
                            continue;
                        }
                        foreach (var item in collection)
                        {
                            var proxy = new IpProxy();
                            var xpath = "td[1]";
                            proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                            xpath = "td[2]";
                            proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                            Task.Run(() =>
                            {
                                PoolManageService.Add(proxy);
                            });
                        }
                    }
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to download the 逼格代理 IP pool", e));
            }
        }
204 
        // Download 66免費代理
        public void Downloadproxy66(object state)
        {
            try
            {
                List<string> list = new List<string>()
                {
                    "http://www.66ip.cn/areaindex_35/index.html",
                    "http://www.66ip.cn/areaindex_35/2.html",
                    "http://www.66ip.cn/areaindex_35/3.html"
                };
                foreach (var urlItem in list)
                {
                    string url = urlItem;
                    string html = HttpHelper.DownloadHtml(url, null);
                    if (string.IsNullOrEmpty(html))
                    {
                        LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be fetched"));
                        break;
                    }

                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(html);
                    HtmlNode node = doc.DocumentNode;
                    string xpathstring = "//table[@bordercolor='#6699ff']/tr";
                    HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                    if (collection == null)
                    {
                        // SelectNodes returns null when nothing matches
                        continue;
                    }
                    foreach (var item in collection)
                    {
                        var proxy = new IpProxy();
                        var xpath = "td[1]";
                        proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                        // Skip the table header row, whose first cell contains "ip"
                        if (proxy.Address.Contains("ip"))
                        {
                            continue;
                        }
                        xpath = "td[2]";
                        proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                        Task.Run(() =>
                        {
                            PoolManageService.Add(proxy);
                        });
                    }
                }
            }
            catch (Exception e)
            {
                LogUtils.ErrorLog(new Exception("Failed to download the 66免費代理 IP pool", e));
            }
        }
    }

  There is nothing especially deep in this code, so I will not walk through it in detail.
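  One detail that may be worth hardening: the scraped cells go straight into int.Parse, so a single odd row (extra markup, a header, an empty cell) throws and ends the crawl of that whole site. A small validation step is one way around it. The sketch below uses a hypothetical helper, ScrapedProxyParser.TryParseCells, that is not part of the project; it relies only on IPAddress.TryParse and int.TryParse from the base class library.

```csharp
using System;
using System.Net;

public static class ScrapedProxyParser
{
    // Returns true and fills address/port only when both table cells look valid.
    // IPAddress.TryParse is fairly lenient (it accepts some short numeric forms),
    // so this is a sanity filter rather than a strict validator.
    public static bool TryParseCells(string addressCell, string portCell,
                                     out string address, out int port)
    {
        address = null;
        port = 0;
        if (addressCell == null || portCell == null) return false;

        string candidate = addressCell.Trim();
        IPAddress parsed;
        if (!IPAddress.TryParse(candidate, out parsed)) return false;

        int value;
        if (!int.TryParse(portCell.Trim(), out value)) return false;
        if (value < 1 || value > 65535) return false;

        address = candidate;
        port = value;
        return true;
    }
}
```

  Inside each foreach loop the spider could call TryParseCells on the two InnerHtml values and simply continue to the next row on false, instead of letting one bad row abort the page.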

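  As mentioned earlier, all of my crawler services are deployed as Windows services. I used to rely on Timer to repeat work at a fixed interval; this time I brought in the Quartz.NET job scheduling framework, which makes the code look a little more elegant.

  Quartz.NET can be installed directly from NuGet.

  First, write the overall scheduling job class for the proxy pool, ProxyPoolTotalJob, which implements the IJob interface: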

    using Quartz;

    /// <summary>
    /// Overall scheduling job for the proxy pool
    /// </summary>
    class ProxyPoolTotalJob : IJob
    {
        // Quartz.NET 2.x signature; in 3.x Execute returns a Task instead
        public void Execute(IJobExecutionContext context)
        {
            var spider = new IpPoolSpider();
            spider.Initial();
        }
    }

  Next is the Run() method, which is called from OnStart:

    // Requires: using System; using Quartz; using Quartz.Impl;
    private static void Run()
    {
        try
        {
            StdSchedulerFactory factory = new StdSchedulerFactory();
            IScheduler scheduler = factory.GetScheduler();
            scheduler.Start();
            IJobDetail job = JobBuilder.Create<ProxyPoolTotalJob>().WithIdentity("job1", "group1").Build();
            ITrigger trigger = TriggerBuilder.Create()
                .WithIdentity("trigger1", "group1")
                .StartNow()
                .WithSimpleSchedule(x => x
                    .WithIntervalInMinutes(28) // once every 28 minutes
                    .RepeatForever())
                .Build();
            scheduler.ScheduleJob(job, trigger);
        }
        catch (SchedulerException se)
        {
            // A Windows service has no visible console, so write to the log instead
            LogUtils.ErrorLog(new Exception("Failed to schedule the proxy pool job", se));
        }
    }

 

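  Finally, when fetching HTML pages from sites with anti-crawling measures, use a proxy IP. I expect everyone knows how: just set the Proxy property of the webRequest.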

  webRequest.Proxy = new WebProxy(ip, port);
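  A slightly fuller sketch of that line, for context. ProxiedRequestFactory and the timeoutMs parameter are my own illustrative names rather than code from the project; the timeout is there because free proxies are often slow. On newer .NET versions, where WebRequest.Create is marked obsolete, assigning the same WebProxy to HttpClientHandler.Proxy is the usual equivalent.

```csharp
using System;
using System.Net;

public static class ProxiedRequestFactory
{
    // Builds an HTTP request that is routed through the given proxy.
    public static HttpWebRequest Create(string url, string ip, int port, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Proxy = new WebProxy(ip, port); // route the request via ip:port
        request.Timeout = timeoutMs;            // give up quickly on slow proxies
        return request;
    }
}
```

  The caller then reads the response as usual with request.GetResponse(); if that throws a WebException, the proxy is a reasonable candidate for removal from the pool.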

 

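  With that, we have a free proxy IP pool built on Redis. Our crawler services whose IPs had been blocked are back at full health and off collecting new data again.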

 

Original article; all the code is pasted from my own project. Please credit the source when reposting. Thanks!
