My boss asked me to crawl NetEase Cloud Music data and run similarity extraction over song comments to generate copy for multiple songs, so I wrote this crawler. Recording it here!
1、Analyzing the NetEase Cloud Music API
To ease the load on its servers, NetEase Cloud Music has anti-crawling measures. I opened a song page, hit F12, and could not find the data I wanted in the HTML. That explained it: the page must issue separate requests for lyrics and comment data after it loads. So I went looking for the APIs I needed online.
This write-up analyzes how the API request parameters are encrypted, and does it well: https://www.zhanghuanglong.com/detail/csharp-version-of-netease-cloud-music-api-analysis-(with-source-code)
A few of the APIs used in this project:
Song info (no lyrics) | http://music.163.com/m/song?id=123 | GET |
Lyrics | http://music.163.com/api/song/lyric?os=pc&lv=-1&kv=-1&tv=-1&id=123 | GET |
Comments | http://music.163.com/weapi/v1/resource/comments/R_SO_4_123 (123 is the song id) | POST |
2、Deep Web Crawling
Because NetEase protects its data, you cannot crawl it the usual way: fetch page --> extract useful data --> save it --> add extracted links to the task queue --> fetch the next page.
Instead I decided to crawl by id, pushing ids, say 100000000~200000000, onto the task queue. For id = 123, the song info, lyrics, and comments are all linked together through that song_id.
For speed, the crawler is multi-threaded via a Java thread pool.
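The id-queue plus thread-pool setup can be sketched as follows. This is a minimal sketch: `songTask` is a hypothetical stand-in for the real Runnable tasks, and the id range is illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerBootstrap {
    // Counts completed tasks so the sketch can verify the pool drained.
    static final AtomicInteger done = new AtomicInteger();

    // Stand-in for the real song-info Runnable; the actual task would
    // fetch http://music.163.com/m/song?id=<uid> and parse it.
    static Runnable songTask(long uid) {
        return () -> done.incrementAndGet();
    }

    // Submit one task per song id in [fromId, toId) and wait for the pool to drain.
    public static int crawlRange(long fromId, long toId, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (long id = fromId; id < toId; id++) {
            pool.submit(songTask(id));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

A fixed pool bounds concurrency, which matters here because every task opens a network connection.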
Here I will only cover the concrete implementation of the crawler.
3、Custom Task Classes
A Java task class implements the Runnable interface and overrides its run() method. The run() method does the fetching, parsing, and database writes.
一、Song-info task class
```java
@Override
public void run() {
    try {
        Response execute;
        if (uid % 2 == 0) {
            execute = Jsoup.connect("http://music.163.com/m/song?id=" + uid)
                    .header("User-Agent",
                            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                    .header("Cache-Control", "no-cache").timeout(2000000000)
                    // .proxy(IpProxy.ipEntitys.get(i).getIp(), IpProxy.ipEntitys.get(i).getPort())
                    .execute();
        } else {
            execute = Jsoup.connect("http://music.163.com/m/song?id=" + uid)
                    .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; W…) Gecko/20100101 Firefox/56.0")
                    .header("Cache-Control", "no-cache")
                    .timeout(2000000000).execute();
        }
        String body = execute.body();
        // "很抱歉,你要查找的網頁找不到" is the site's "page not found" message
        if (body.contains("很抱歉,你要查找的網頁找不到")) {
            System.out.println("Song id " + uid + " ============= page not found");
            return;
        }
        Document parse = execute.parse();

        // Parse the song name
        Elements elementsByClass = parse.getElementsByClass("f-ff2");
        Element element = elementsByClass.get(0);
        Node childNode = element.childNode(0);
        String song_name = childNode.toString();

        // Parse the artist name
        Elements elements = parse.getElementsByClass("s-fc7");
        Element singerElement = elements.get(1);
        Node singerChildNode = singerElement.childNode(0);
        String songer_name = singerChildNode.toString();

        // Parse the album name
        Element albumElement = elements.get(2);
        Node albumChildNode = albumElement.childNode(0);
        String album_name = albumChildNode.toString();

        // Song URL
        String song_url = "http://music.163.com/m/song?id=" + uid;

        // Fetch the lyrics
        String lyric = getSongLyricBySongId(uid);

        // Persist the song
        dbUtils.insert_song(uid, song_name, songer_name, lyric, song_url, album_name);

    } catch (Exception e) {
        // Swallow the error and move on to the next id
    }
}

/*
 * Fetch the lyrics for a given song id
 */
private String getSongLyricBySongId(long id) {
    try {
        Response data = Jsoup.connect("http://music.163.com/api/song/lyric?os=pc&lv=-1&kv=-1&tv=-1&id=" + id)
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                .header("Cache-Control", "no-cache") // .timeout(20000)
                .execute();

        String body = data.body();

        JsonObject jsonObject = new Gson().fromJson(body, JsonObject.class);
        jsonObject = (JsonObject) jsonObject.get("lrc");

        JsonElement jsonElement = jsonObject.get("lyric");
        String lyric = jsonElement.getAsString();
        // Strip time tags like [00:12.34]
        // String regex = "\\[\\d{2}\\:\\d{2}\\.\\d{2}\\]";
        String regex = "\\[\\d+\\:\\d+\\.\\d+\\]";
        lyric = lyric.replaceAll(regex, "");
        String regex2 = "\\[\\d+\\:\\d+\\]";
        lyric = lyric.replaceAll(regex2, "");
        lyric = lyric.replaceAll("'", "");
        lyric = lyric.replaceAll("\"", "");

        return lyric;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return "";
}
```
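The two timestamp regexes in the lyric cleanup can be exercised in isolation. A minimal, self-contained version of that step:

```java
public class LyricCleaner {
    // Strips LRC time tags such as [00:12.34] and [01:05] from lyric text,
    // plus the quote characters, mirroring the cleanup in getSongLyricBySongId.
    public static String clean(String lyric) {
        return lyric
                .replaceAll("\\[\\d+\\:\\d+\\.\\d+\\]", "") // [mm:ss.xx]
                .replaceAll("\\[\\d+\\:\\d+\\]", "")        // [mm:ss]
                .replaceAll("'", "")
                .replaceAll("\"", "");
    }
}
```

For example, `LyricCleaner.clean("[00:12.34]hello [01:05]world")` yields `"hello world"`. Using `\d+` rather than `\d{2}` also catches tags like `[1:05]` that do not zero-pad.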
二、Hot-comment task class
A song has roughly 0~20 hot comments, and all of them come back in a single POST request, so they are handled separately from regular comments. The params and encSecKey parameters are encrypted; see the link above for details.
```java
@Override
public void run() {
    try {
        String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
        String data = CenterUrl.getDataByUrl(url, "{\"offset\":0,\"limit\":10};");
        System.out.println(data);
        JsonParseUtil<CommentBean> commentData = new JsonParseUtil<>();
        CommentBean jsonData = commentData.getJsonData(data, CommentBean.class);
        List<HotComments> hotComments = jsonData.getHotComments();
        for (HotComments comment : hotComments) {
            // Assemble the fields
            Long comment_id = comment.getCommentId();
            String comment_content = comment.getContent();
            comment_content = comment_content.replaceAll("'", "").replaceAll("\"", "");
            Long liked_count = comment.getLikedCount();
            String commenter_name = comment.getUser().getNickname();
            int is_hot_comment = 1;
            Long create_time = comment.getTime();
            // Insert into the database
            dbUtils.insert_hot_comments(uid, comment_id, comment_content, liked_count, commenter_name, is_hot_comment, create_time);
        }
    } catch (Exception e) {
        logger.error(e.getMessage());
    }
}
```
三、Regular-comment task class
Regular comments have to be paged through, so there is a loop inside; the number of regular comments fetched per song is configurable.
```java
@Override
public void run() {
    long pageSize = 0;
    int dynamicPage = 105; // ~1050 comments, leaving headroom for failed fetches
    for (long i = 0; i <= pageSize && i < dynamicPage; i++) { // 1000 regular comments
        try {
            String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
            String data = CenterUrl.getDataByUrl(url, "{\"offset\":" + i * 10 + ",\"limit\":" + 10 + "};");

            // The request failed (e.g. network trouble): retry this page
            if (data.trim().equals("HTTP/1.1 400 Bad Request") || data.contains("用戶的數據無效")) {
                i--;
                if (pageSize == 0) { // the very first request failed...
                    pageSize = dynamicPage;
                }
                System.out.println("~~ song_id = " + uid + ", i(Page)=" + i + ", reason = " + data);
                continue;
            }
            // This page errored out: skip it
            if (data.contains("網絡超時") || data.equals("")) {
                continue;
            }

            JsonParseUtil<CommentBean> commentData = new JsonParseUtil<>();
            CommentBean jsonData = commentData.getJsonData(data, CommentBean.class);
            long total = jsonData.getTotal();
            pageSize = total / 10;
            List<Comments> comments = jsonData.getComments();
            for (Comments comment : comments) {
                try {
                    // Assemble the fields
                    Long comment_id = comment.getCommentId();
                    String comment_content = comment.getContent();
                    comment_content = comment_content.replaceAll("'", "").replaceAll("\"", "");
                    Long liked_count = comment.getLikedCount();
                    String commenter_name = comment.getUser().getNickname();
                    int is_hot_comment = 0;
                    Long create_time = comment.getTime();
                    // Insert into the database
                    dbUtils.insert_tmp_comments(uid, comment_id, comment_content, liked_count, commenter_name, is_hot_comment, create_time);
                } catch (Exception e) {
                    System.out.println(">>>>>>>> insert failed: " + uid);
                }
            }
        } catch (Exception e) {
            System.err.println("^^^" + e.getMessage());
        }
    }
}
```
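The paging loop is driven by two small computations: the offset posted for page i, and the last page index derived from the response's total. Pulled out as pure helpers (hypothetical names; 10 comments per page, as in the code):

```java
public class CommentPaging {
    public static final int PAGE_SIZE = 10;

    // Offset sent in {"offset":...,"limit":10} for the i-th page.
    public static long offsetForPage(long page) {
        return page * PAGE_SIZE;
    }

    // pageSize = total / 10 as in the task above; the loop runs while i <= pageSize,
    // so a song with 235 comments is fetched as pages 0..23.
    public static long lastPageIndex(long total) {
        return total / PAGE_SIZE;
    }
}
```

Note that with `i <= pageSize` the final, partially filled page is still requested, which is why the division is not rounded up.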
四、POST request
The crawl volume is large, and when I used my local IP, within a few minutes the browser could no longer load comments on NetEase Cloud Music, and a few minutes after that songs stopped loading too. So NetEase apparently blocks IPs it judges to be crawlers from calling the relevant endpoints. I therefore built a proxy IP pool: when a response contains "Cheating" or similar, the current proxy is dropped and a fresh one is taken from the pool.
```java
public static String getDataByUrl(String url, String encrypt) {
    try {
        System.out.println("**************************** proxy in use: " + ip + " ***** port " + port + " ****************");
        String data = "";
        // Encrypt the parameters
        String secKey = new BigInteger(100, new SecureRandom()).toString(32).substring(0, 16); // limit
        String encText = EncryptUtils.aesEncrypt(EncryptUtils.aesEncrypt(encrypt, "0CoJUm6Qyw8W8jud"), secKey);
        String encSecKey = EncryptUtils.rsaEncrypt(secKey);
        // Set the request headers
        Response execute = Jsoup.connect(url + "?csrf_token=6b9af67aaac0a2d1deb5683987d059e1")
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.32 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                .header("Cache-Control", "max-age=60").header("Accept", "*/*").header("Accept-Encoding", "gzip, deflate, br")
                .header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8").header("Connection", "keep-alive")
                .header("Referer", "https://music.163.com/song?id=1324447466")
                .header("Origin", "https://music.163.com").header("Host", "music.163.com")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .data("params", encText)
                .data("encSecKey", encSecKey)
                .method(Method.POST).ignoreContentType(true)
                .timeout(1000000)
                .proxy(ip, port)
                .execute();
        data = execute.body().toString();
        // If the current IP has been blacklisted, grab a fresh one
        if (data.contains("Cheating") || data.contains("指定 product id") || data.contains("無效用戶")) {
            // Drop the dead ipEntity
            if (IpProxy.ipEntitys.contains(ipEntity))
                IpProxy.ipEntitys.remove(ipEntity);
            ipEntity = getIpEntityByRandom();
            ip = ipEntity.getIp();
            port = ipEntity.getPort();
            return "用戶的數據無效!!!"; // "invalid user data", matched by the comment task above
        }
        return data;
    } catch (Exception e) {
        // Drop the dead ipEntity
        if (IpProxy.ipEntitys.contains(ipEntity))
            IpProxy.ipEntitys.remove(ipEntity);
        ipEntity = getIpEntityByRandom();
        ip = ipEntity.getIp();
        port = ipEntity.getPort();
        System.err.println("timeout cause: " + e.getMessage());
        if (e.getMessage().contains("Connection refused: connect")
                || e.getMessage().contains("No route to host: connect")) {
            IpProxy.ipEntitys.clear();
            IpProxy.getZDaYeProxyIp();
        }
        return "網絡超時"; // "network timeout", matched by the comment task above
    }
}

/*
 * Pick a random ipEntity from the list
 */
private static IpEntity getIpEntityByRandom() {
    try {
        int size = IpProxy.ipEntitys.size();
        if (size == 0) {
            Thread.sleep(20000);
            IpProxy.getZDaYeProxyIp();
            size = IpProxy.ipEntitys.size(); // recompute after refilling the pool
        }
        int i = (int) (Math.random() * size);
        if (size > 0 && i < size)
            return IpProxy.ipEntitys.get(i);
    } catch (Exception e) {
        System.err.println("failed to pick a random proxy ip!!!!!!!");
    }
    return null;
}
```
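EncryptUtils itself is not shown in this post. A sketch of what its aesEncrypt could look like, following the analysis linked in section 1: AES-128-CBC with PKCS5 padding and Base64 output. The fixed IV "0102030405060708" is an assumption taken from that write-up, not from this project's code. getDataByUrl calls aesEncrypt twice, first with the fixed key "0CoJUm6Qyw8W8jud" and then with the random 16-character secKey.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncryptSketch {
    // IV assumed from the linked API analysis, not from this project's source.
    private static final byte[] IV = "0102030405060708".getBytes(StandardCharsets.UTF_8);

    // AES-128-CBC/PKCS5 over the UTF-8 plaintext, Base64-encoded.
    public static String aesEncrypt(String plain, String key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec(IV));
        return Base64.getEncoder().encodeToString(cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8)));
    }

    // Inverse operation, handy for checking the encryption round-trips.
    public static String aesDecrypt(String cipherText, String key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec(IV));
        return new String(cipher.doFinal(Base64.getDecoder().decode(cipherText)), StandardCharsets.UTF_8);
    }
}
```

Both keys are exactly 16 bytes, which is what AES-128 requires; that is why secKey is cut to 16 characters in getDataByUrl.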
4、Proxy IP Pool
Among the free proxy providers, 西刺代理 (Xici) is the most usable, and its IPs are fresh. The downside is instability: its site goes down a lot.
There is also this one: https://www.ip-adress.com/proxy-list
Parse the Xici proxy page and pull the data out of its DOM nodes into the proxy IP pool.
```java
public static List<IpEntity> getProxyIp(String url) throws Exception {
    Response execute = Jsoup.connect(url)
            .header("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
            .header("Cache-Control", "max-age=60").header("Accept", "*/*")
            .header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6").header("Connection", "keep-alive")
            .header("Referer", "http://music.163.com/song?id=186016")
            .header("Origin", "http://music.163.com").header("Host", "music.163.com")
            .header("Content-Type", "application/x-www-form-urlencoded")
            .header("Cookie",
                    "UM_distinctid=15e9863cf14335-0a09f939cd2af9-6d1b137c-100200-15e9863cf157f1; vjuids=414b87eb3.15e9863cfc1.0.ec99d6f660d09; _ntes_nnid=4543481cc76ab2fd3110ecaafd5f1288,1505795231854; _ntes_nuid=4543481cc76ab2fd3110ecaafd5f1288; __s_=1; __gads=ID=6cbc4ab41878c6b9:T=1505795247:S=ALNI_MbCe-bAY4kZyMbVKlS4T2BSuY75kw; usertrack=c+xxC1nMphjBCzKpBPJjAg==; NTES_CMT_USER_INFO=100899097%7Cm187****4250%7C%7Cfalse%7CbTE4NzAzNDE0MjUwQDE2My5jb20%3D; P_INFO=m18703414250@163.com|1507178162|2|mail163|00&99|CA&1506163335&mail163#hun&430800#10#0#0|187250&1|163|18703414250@163.com; vinfo_n_f_l_n3=8ba0369be425c0d2.1.7.1505795231863.1507950353704.1508150387844; vjlast=1505795232.1508150167.11; Province=0450; City=0454; _ga=GA1.2.1044198758.1506584097; _gid=GA1.2.763458995.1508907342; JSESSIONID-WYYY=Zm%2FnBG6%2B1vb%2BfJp%5CJP8nIyBZQfABmnAiIqMM8fgXABoqI0PdVq%2FpCsSPDROY1APPaZnFgh14pR2pV9E0Vdv2DaO%2BKkifMncYvxRVlOKMEGzq9dTcC%2F0PI07KWacWqGpwO88GviAmX%2BVuDkIVNBEquDrJ4QKhTZ2dzyGD%2Bd2T%2BbiztinJ%3A1508946396692; _iuqxldmzr_=32; playerid=20572717; MUSIC_U=39d0b2b5e15675f10fd5d9c05e8a5d593c61fcb81368d4431bab029c28eff977d4a57de2f409f533b482feaf99a1b61e80836282123441c67df96e4bf32a71bc38be3a5b629323e7bf122d59fa1ed6a2; __remember_me=true; __csrf=2032a8f34f1f92412a49ba3d6f68b2db; __utma=94650624.1044198758.1506584097.1508939111.1508942690.40; __utmb=94650624.20.10.1508942690; __utmc=94650624; __utmz=94650624.1508394258.18.4.utmcsr=xujin.org|utmccn=(referral)|utmcmd=referral|utmcct=/")
            .method(Method.GET).ignoreContentType(true)
            .timeout(2099999999).execute();
    Document pageJson = execute.parse();
    Element body = pageJson.body();
    List<Node> childNodes = body.childNode(11).childNode(3).childNode(5).childNode(1).childNodes();
    // ipEntitys.clear(); // clear before adding

    for (int i = 2; i < childNodes.size(); i += 2) {
        IpEntity ipEntity = new IpEntity();
        Node node = childNodes.get(i);
        List<Node> nodes = node.childNodes();
        String ip = nodes.get(3).childNode(0).toString();
        int port = Integer.parseInt(nodes.get(5).childNode(0).toString());
        ipEntity.setIp(ip);
        ipEntity.setPort(port);
        ipEntitys.add(ipEntity);
    }
    return ipEntitys;
}
```
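The pool itself is just a shared list plus the pick/evict operations used above. A thread-safe sketch (IpEntity reduced to its two fields; the provider refill call is left out):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;

public class ProxyPool {
    public static class IpEntity {
        public final String ip;
        public final int port;
        public IpEntity(String ip, int port) { this.ip = ip; this.port = port; }
    }

    // CopyOnWriteArrayList tolerates concurrent evict() calls from worker threads.
    private final List<IpEntity> entries = new CopyOnWriteArrayList<>();

    public void add(String ip, int port) { entries.add(new IpEntity(ip, port)); }

    public int size() { return entries.size(); }

    // Random pick, as in getIpEntityByRandom; null signals an empty pool
    // (the real code sleeps and refetches from the provider instead).
    public IpEntity pick() {
        int size = entries.size();
        if (size == 0) return null;
        return entries.get(ThreadLocalRandom.current().nextInt(size));
    }

    // Drop a proxy once a response looks like "Cheating" / a ban.
    public void evict(IpEntity e) { entries.remove(e); }
}
```

A CopyOnWriteArrayList avoids ConcurrentModificationException when many worker threads evict proxies at once, which a plain ArrayList would not.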
But to save myself the trouble, I eventually bought proxy IP service from 站大爺 (ZhanDaYe), 17 yuan a day. The service is pretty good!
5、Summary
This crawler took quite a while to write, and I ran into plenty of problems along the way. For example, calls to the NetEase API kept returning 460; it turned out the proxy IPs were not being refreshed. The ZhanDaYe API handed back nothing but useless IPs, because I had not bound my own public IP. And the crawler would often stall after ten minutes or so: the scheduled task kept refreshing the thread pool, yet the threads stopped making progress. My guess is that the for loop in the regular-comment task blocked all of my worker threads, though that still needs verifying.
A crawler is simple, but doing it well is hard. Keep at it!