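Preface: I recently bought a Kindle and found the book catalogue on Amazon to be quite good. Every now and then a high-quality book goes on sale at a low price, but Amazon has no good deal-alert feature, and checking the site by hand every day is tiring. So I wrote a simple Java crawler that monitors Amazon's books and automatically emails a specified address whenever a particularly good deal shows up.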
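Here is a brief outline of how it works. This post only covers the approach; if you want the complete project, skip to the end of the post.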
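The URL class and the URLConnection object it returns are used to access the site and fetch the data. (A small trick here: if the request to Amazon does not carry the header Accept-Encoding: gzip, deflate, br, access gets refused after a few requests (the server returns 503). Once the header is added, the returned data is GZIP-compressed, so it has to be read through a GZIPInputStream, otherwise what you read is garbage.) Only part of the code is excerpted, so some pieces are missing; it is enough if the idea comes through.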
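    import java.io.BufferedInputStream;
    import java.io.ByteArrayOutputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    // Excerpt from inside the crawler class; url and rawData are fields.
    this.url = new URL("https://www.amazon.cn/gp/bestsellers/digital-text");
    // Open a connection
    URLConnection connection = this.url.openConnection();
    // Set the request headers to avoid being answered with 503
    connection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
    connection.setRequestProperty("Accept-Encoding", "gzip, deflate, br");
    connection.setRequestProperty("Accept-Language", "zh-CN,zh;q=0.9");
    connection.setRequestProperty("Host", "www.amazon.cn");
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36");
    // Start the connection
    connection.connect();
    // Fetch the data. The server sends it GZIP-compressed,
    // so it has to be read through the matching stream.
    BufferedInputStream bis = new BufferedInputStream(new GZIPInputStream(connection.getInputStream()));
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // Read buffer and byte count (declarations omitted in the original excerpt)
    byte[] tmp = new byte[1024];
    int len;
    // Read the data
    while ((len = bis.read(tmp)) != -1) {
        baos.write(tmp, 0, len);
    }
    this.rawData = new String(baos.toByteArray(), StandardCharsets.UTF_8);
    bis.close();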
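The excerpt above always wraps the response in a GZIPInputStream, which only works while the server really answers with gzip. As a minimal sketch (not the project's actual code), the decoding step could check the Content-Encoding value first. The class name BodyDecoderSketch and the helpers decodeBody and gzip are made up for this illustration, and the in-memory bytes stand in for a real response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BodyDecoderSketch {

    // Hypothetical helper: unwraps the stream with GZIPInputStream only when
    // the Content-Encoding value is "gzip", then reads the body as UTF-8.
    public static String decodeBody(InputStream raw, String contentEncoding) {
        try (InputStream body = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] tmp = new byte[4096];
            int len;
            while ((len = body.read(tmp)) != -1) {
                baos.write(tmp, 0, len);
            }
            return new String(baos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper for the demo: gzip a string in memory.
    public static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String page = "<li>sample</li>";
        // Compressed body, as the server would send it with Content-Encoding: gzip
        System.out.println(decodeBody(new ByteArrayInputStream(gzip(page)), "gzip"));
        // Uncompressed body, no Content-Encoding header at all
        System.out.println(decodeBody(
                new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)), null));
    }
}
```

With a real connection this would be called as decodeBody(connection.getInputStream(), connection.getContentEncoding()). Note that the request header in the excerpt also advertises deflate and br; if the server ever picked one of those, neither GZIPInputStream nor this sketch could decode the body.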
    import java.util.Date;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // First use a regular expression to pick out each individual li tag
    Pattern p1 = Pattern.compile("<li class=\"zg-item-immersion\"[\\s\\S]+?</li>");
    // Pattern for the title and price inside a single li tag
    Pattern p2 = Pattern.compile("alt=\"([\\u4E00-\\u9FA5:—,0-9a-zA-Z]+)[\\s\\S]+?¥(\\d{1,2}\\.\\d{2})");
    Matcher m1 = p1.matcher(this.rawData == null ? "" : this.rawData);
    while (m1.find()) {
        // Extract the title and price from this li tag
        Matcher m2 = p2.matcher(m1.group());
        while (m2.find()) {
            // Take the title first
            String name = m2.group(1);
            // Then take the price
            double price = Double.parseDouble(m2.group(2));
            // If a book with the same title is already recorded, keep only the lower price
            if (this.destData.containsKey(name)) {
                double oldPrice = this.destData.get(name).getPrice();
                price = Math.min(oldPrice, price);
            }
            // Put the data into the Map
            this.destData.put(name, new Price(price, new Date()));
        }
    }
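To make the two-step regex extraction above easier to try in isolation, here is a minimal self-contained sketch that runs the same two patterns over a small hand-written HTML fragment. The fragment, the class name PriceExtractorSketch and the method extractLowestPrices are invented for this example and do not reflect Amazon's real markup; the project's Price class is replaced by a plain Double. The fullwidth characters from the original pattern are written as \u escapes so the file compiles regardless of source encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceExtractorSketch {

    // Made-up sample markup; \uFFE5 is the fullwidth yen sign used in the original pattern.
    public static final String SAMPLE_HTML =
            "<li class=\"zg-item-immersion\"><img alt=\"BookA\" src=\"a.jpg\"/><span>\uFFE512.99</span></li>\n"
          + "<li class=\"zg-item-immersion\"><img alt=\"BookB\" src=\"b.jpg\"/><span>\uFFE55.00</span></li>\n"
          + "<li class=\"zg-item-immersion\"><img alt=\"BookA\" src=\"a2.jpg\"/><span>\uFFE59.90</span></li>\n";

    // Step 1: isolate each list item.
    private static final Pattern ITEM =
            Pattern.compile("<li class=\"zg-item-immersion\"[\\s\\S]+?</li>");

    // Step 2: title from the alt attribute, then the first price after it.
    // \uFF1A, \u2014 and \uFF0C are the fullwidth colon, the dash and the
    // fullwidth comma that appear in the original character class.
    private static final Pattern TITLE_AND_PRICE = Pattern.compile(
            "alt=\"([\\u4E00-\\u9FA5\\uFF1A\\u2014\\uFF0C0-9a-zA-Z]+)[\\s\\S]+?\\uFFE5(\\d{1,2}\\.\\d{2})");

    // Returns title -> lowest price seen, in order of first appearance.
    public static Map<String, Double> extractLowestPrices(String html) {
        Map<String, Double> result = new LinkedHashMap<>();
        Matcher items = ITEM.matcher(html == null ? "" : html);
        while (items.find()) {
            Matcher m = TITLE_AND_PRICE.matcher(items.group());
            while (m.find()) {
                String name = m.group(1);
                double price = Double.parseDouble(m.group(2));
                Double old = result.get(name);
                // Keep only the lower price for duplicate titles
                if (old == null || price < old) {
                    result.put(name, price);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractLowestPrices(SAMPLE_HTML));
    }
}
```

Two limits of the pattern are worth keeping in mind: \d{1,2} only matches prices below 100, and the title character class has no space, so a title containing one is cut off at the first space.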
I have put the complete project on my GitHub. For more details (how to configure it, how to use it), anyone interested is welcome to drop by!
Repository: https://github.com/horvey/Amazon-BookMonitor