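I don't know how many people use net-disk search engines these days, but I have put a great deal of effort into 去轉盤網, and it now has a decent number of users. Net-disk resources all share one problem: a shared link may expire. Yet many search engines never check for expired links, especially the ones built on Google Custom Search. Those have little technical content; their webmasters concentrate on making money and give little thought to user experience. This post is another of my open technical write-ups. I have already published nearly all of the technical details behind 去轉盤網, and this post adds one more piece: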
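First, a recap of the earlier posts: Baidu net-disk crawler, Java word-segmentation algorithm, automatic database backup, crawling through proxy servers, and invite-a-friend registration.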
    # coding: utf-8
    """
    @author: haoning
    @create time: 2015.8.5
    """
    # Python 3: urllib.request replaces the Python 2 urllib2 module.
    import time
    import urllib.request

    # Database settings used by the wider crawler; not needed by this snippet.
    DB_HOST = '127.0.0.1'
    DB_USER = 'root'
    DB_PASS = 'root'


    def gethtml(url):
        """Fetch url and return the page as text, or None if the request fails."""
        try:
            print("url", url)
            req = urllib.request.Request(url)
            # 8-second timeout; a proxy should be added here
            response = urllib.request.urlopen(req, None, 8)
            html = response.read().decode("utf-8", "replace")
            return html
        except Exception as e:
            print("e", e)


    if __name__ == '__main__':

        while 1:
            # url = 'http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
            url = "http://pan.baidu.com/s/1qXQD2Pm"
            html = gethtml(url)
            print(html)
            time.sleep(1)  # pause between requests instead of hammering the server
Result: e HTTP Error 403: Forbidden. In other words, Baidu blocks crawlers on this URL. After looking around a lot of sites, I happened to try the following link:
http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442
    if __name__ == '__main__':

        while 1:
            url = 'http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
            # url = "http://pan.baidu.com/s/1qXQD2Pm"
            html = gethtml(url)
            print(html)
            time.sleep(1)  # pause between requests instead of hammering the server
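Result: <title>百度云 网盘-链接不存在</title> ("Baidu Cloud net disk: link does not exist"). Any page with this title is a share that has already expired. More importantly, the request went through, so Baidu does not block crawlers on this form of URL. Nice.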
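The check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `extract_title` and `is_expired` are hypothetical, and the marker text 链接不存在 is taken from the `baidu` rule in the rule.properties file shown later in this post.

```python
import re

# Marker that appears in the <title> of an expired Baidu share page.
# Assumption: same text as the decoded flag of the "baidu" rule.
EXPIRED_MARKER = "链接不存在"


def extract_title(html):
    """Return the text of the first <title> element, or "" if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def is_expired(html):
    """Return True/False for a fetched page, or None when the fetch failed."""
    if html is None:
        return None  # no page to inspect, so the state is unknown
    return EXPIRED_MARKER in extract_title(html)


if __name__ == "__main__":
    print(is_expired("<title>百度云 网盘-链接不存在</title>"))
```

Returning None for a failed fetch keeps "could not check" separate from "checked and still alive", which matters when the server answers 403.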
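Baidu net-disk shares actually have two kinds of entry URL:
One is http://pan.baidu.com/s/1qXQD2Pm, which ends in a short code.
The other is http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442, where the key parts are shareid and uk.
The first form is known to block crawlers; the second currently does not. So, after testing with Python, I translated the code into Java, because 去轉盤 is written in Java. Here is the code: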
    package com.tray.common.utils;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Random;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    import org.junit.Test;

    /**
     * Resource validation utility.
     *
     * @author hui
     */
    public class ResourceCheckUtil {
        private static Map<String, String[]> rules;
        static {
            loadRule();
        }

        /**
         * Load the rule set from rule.properties.
         * Each value has the form tag|operator|flag.
         */
        public static void loadRule() {
            rules = new HashMap<String, String[]>();
            try (InputStream in = ResourceCheckUtil.class.getClassLoader()
                    .getResourceAsStream("rule.properties")) {
                Properties p = new Properties();
                p.load(in);
                for (String key : p.stringPropertyNames()) {
                    rules.put(key, p.getProperty(key).split("\\|"));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Fetch a page with a randomly chosen User-Agent.
         *
         * @return the page body, or null if the request fails
         */
        public static String httpRequest(String url) {
            try {
                URL u = new URL(url);
                Random random = new Random();
                HttpURLConnection connection = (HttpURLConnection) u
                        .openConnection();
                connection.setConnectTimeout(3000); // 3-second timeout
                connection.setReadTimeout(3000);
                connection.setUseCaches(false);
                connection.setRequestMethod("GET");

                String[] userAgents = {
                        "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
                        "Opera/9.25 (Windows NT 5.1; U; en)",
                        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                        "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
                        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
                        "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
                        "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                        "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0"
                };
                // Use the array length so every entry can be picked
                // (the original nextInt(7) never chose the last one).
                int index = random.nextInt(userAgents.length);
                /*connection.setRequestProperty("Content-Type",
                        "text/html;charset=UTF-8");*/
                connection.setRequestProperty("User-Agent", userAgents[index]);
                /*connection.setRequestProperty("Accept-Encoding","gzip, deflate, sdch");
                connection.setRequestProperty("Accept-Language","zh-CN,zh;q=0.8");
                connection.setRequestProperty("Connection","keep-alive");
                connection.setRequestProperty("Host","pan.baidu.com");
                connection.setRequestProperty("Cookie","");
                connection.setRequestProperty("Upgrade-Insecure-Requests","1");*/

                // try-with-resources closes the reader and the underlying stream
                try (BufferedReader br = new BufferedReader(new InputStreamReader(
                        connection.getInputStream(), "utf-8"))) {
                    StringBuilder sb = new StringBuilder();
                    String line = null;
                    while ((line = br.readLine()) != null) {
                        sb.append(line);
                    }
                    return sb.toString();
                }

            } catch (MalformedURLException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }

            return null;
        }

        @Test
        public void test7() throws Exception {
            System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyq",
                    "baidu"));
            System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyqa",
                    "baidu"));

            System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTd", "360"));
            System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTdd",
                    "360"));

            System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57", "weiyun"));
            // The original passed "360" here; a Weiyun URL should use the weiyun rule.
            System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57dd",
                    "weiyun"));

            System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSes", "leshi"));
            System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSesdd",
                    "leshi"));
        }

        /**
         * Get the text of a tag on the given page.
         *
         * @param url
         * @param tagName
         *            tag name
         * @return the text of the first matching element, or "" if none
         */
        private static String getHtmlContent(String url, String tagName) {
            String html = httpRequest(url);
            if (html == null) {
                return "";
            }
            Document doc = Jsoup.parse(html);
            //System.out.println("doc======"+doc);
            Elements tag = null;
            if (tagName.equals("<h3>")) { // for Weiyun
                tag = doc.select("h3");
            } else if (tagName.equals("class")) { // for 360
                tag = doc.select("div[class=tip]");
            } else {
                tag = doc.getElementsByTag(tagName);
            }
            //System.out.println("tag======"+tag);
            String content = "";
            if (tag != null && !tag.isEmpty()) {
                content = tag.get(0).text();
            }
            return content;
        }

        /**
         * Apply the named rule to the page at url.
         *
         * @return 1 if the rule judges the resource to exist, 0 if not
         *         (or if an exception occurs)
         */
        public static int isExistResource(String url, String ruleName) {
            try {
                String[] rule = rules.get(ruleName);
                String tagName = rule[0];
                String opt = rule[1];
                String flag = rule[2];
                /*System.out.println("ruleName"+ruleName);
                System.out.println("tagName"+tagName);
                System.out.println("opt"+opt);
                System.out.println("flag"+flag);
                System.out.println("url"+url);*/
                String content = getHtmlContent(url, tagName);
                //System.out.println("content="+content);
                if (ruleName.equals("baidu")) {
                    // Baidu's "upgrading" page. The original comment says to treat it
                    // as non-existent, but the code returns 1; kept as written.
                    if (content.contains("百度云升级")) {
                        return 1;
                    }
                }
                String regex = null;
                if ("eq".equals(opt)) {            // content equals flag
                    regex = "^" + flag + "$";
                } else if ("bg".equals(opt)) {     // content begins with flag
                    regex = "^" + flag + ".*$";
                } else if ("ed".equals(opt)) {     // content ends with flag
                    regex = "^.*" + flag + "$";
                } else if ("like".equals(opt)) {   // flag appears anywhere
                    regex = "^.*" + flag + ".*$";
                } else if ("contain".equals(opt)) {
                    // For "contain" the flag marks a dead share.
                    if (content.contains(flag)) {
                        return 0;
                    } else {
                        return 1;
                    }
                }
                if (regex != null && content.matches(regex)) {
                    return 1;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            return 0;
        }

        // public static void main(String[] args)throws Exception {
        //     final Path p = Paths.get("C:/Users/hui/Desktop/6-14/");
        //     final WatchService watchService =
        //             FileSystems.getDefault().newWatchService();
        //     p.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);
        //     new Thread(new Runnable() {
        //
        //         public void run() {
        //             while(true){
        //                 System.out.println("watching...");
        //                 try {
        //                     WatchKey watchKey = watchService.take();
        //                     List<WatchEvent<?>> watchEvents = watchKey.pollEvents();
        //
        //                     for(WatchEvent<?> event : watchEvents){
        //                         //TODO handle each event type differently
        //                         System.out.println("["+p.getFileName()+"/"+event.context()+"] file event ["+event.kind()+"]");
        //                     }
        //                     watchKey.reset();
        //
        //                 } catch (Exception e) {
        //                     e.printStackTrace();
        //                 }
        //             }
        //         }
        //     }).start();
        // }

        // @Test
        // public void testName() throws Exception {
        //     System.out.println(new String("\u8BF7\u8F93\u5165\u63D0\u53D6\u7801".getBytes("utf-8"), "utf-8"));
        // }

    }
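Note that the code was written to handle 360, Weipan, and other net disks as well. Some of those services have since shut down, as everyone knows, but the code should stay: thinking in terms of extensibility is how a programmer ought to work. Note also that the code reads a configuration file, rule.properties, so I am including it too: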
360=class|contain|\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB
baidu=title|contain|\u94FE\u63A5\u4E0D\u5B58\u5728
weiyun=<h3>|contain|\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664
leshi=title|ed|\u63D0\u53D6\u6587\u4EF6
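Sorry, the flags are Unicode-escaped, so you will have to decode them yourself. If you don't know how, search Baidu for "Unicode conversion tool".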
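For readers who would rather not reach for an online tool, here is a minimal Python sketch that decodes the escapes and applies a rule. It is an illustration, not the site's code: `decode_escapes`, `parse_rules`, and `resource_exists` are hypothetical names, and unlike the Java version (which builds a regular expression from the flag) this sketch compares the flag as plain text.

```python
import re

# The four rules from rule.properties, copied verbatim (raw string keeps the \u escapes).
RULE_LINES = r"""
360=class|contain|\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB
baidu=title|contain|\u94FE\u63A5\u4E0D\u5B58\u5728
weiyun=<h3>|contain|\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664
leshi=title|ed|\u63D0\u53D6\u6587\u4EF6
"""


def decode_escapes(text):
    """Turn each \\uXXXX escape into the character it stands for."""
    return re.sub(r"\\u([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)


def parse_rules(lines):
    """Map rule name -> (tag, operator, flag), with the flag decoded."""
    rules = {}
    for line in lines.strip().splitlines():
        name, _, value = line.partition("=")
        tag, op, flag = decode_escapes(value).split("|")
        rules[name.strip()] = (tag, op, flag)
    return rules


def resource_exists(content, op, flag):
    """Apply one operator to the tag text; True means the share looks alive."""
    if op == "contain":          # flag marks a dead share
        return flag not in content
    if op == "eq":
        return content == flag
    if op == "bg":
        return content.startswith(flag)
    if op == "ed":
        return content.endswith(flag)
    if op == "like":
        return flag in content
    raise ValueError("unknown operator: " + op)


if __name__ == "__main__":
    for name, rule in parse_rules(RULE_LINES).items():
        print(name, rule)
```

Supporting another net disk then only means adding one line to the rule text, which is the extensibility the configuration file is meant to provide.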
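That completes the check 去轉盤網 uses to decide whether a link has expired; the code is now fully public. If you like this post, please bookmark it and follow me.
I have set up a QQ group where everyone is welcome to discuss technology; group number: 512245829. If you prefer Weibo, follow 轉盤娛樂.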