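Ziroom renders the price field on its listing detail pages as an image. This post works out how to decode it, as another entry in my crawler notes series.
First, open a listing detail page and take a look:
The page source does not contain the price as text. The price is drawn from a single background image containing the ten digits 0-9. Each digit of the price is its own element whose background is set to that image, and a background-position offset selects the digit that element shows: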
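Taking the example above, its background image is:
The image is 30 * 10 = 300px wide, so each digit is 30px wide. For each element of the price on the page, the index within the image of the digit it actually shows is given by: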
Math.abs(style_background-position_value) / 30
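Plugging in this listing's price:
First digit: background-position: -30px gives 1, i.e. the digit at index 1 in the background image (indexes start at 0), which is 1.
Second digit: background-position: -60px gives 2, i.e. the digit at index 2, which is 9.
Third digit: background-position: -90px gives 3, i.e. the digit at index 3, which is 3.
Fourth digit: background-position: -240px gives 8, i.e. the digit at index 8, which is 0.
Concatenating them gives the final price, 1930, which matches the price shown on the page.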
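The mapping above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the names `SpriteIndex`, `digitIndex` and `decode` are made up here, the 30px digit width is the one from this example, and the digit order string `8193652704` is hypothetical except for the four positions read off above (index 1 is 1, index 2 is 9, index 3 is 3, index 8 is 0).

```java
class SpriteIndex {
    /** Width of one digit in the background image, taken from this example (30px). */
    static final int DIGIT_WIDTH_PX = 30;

    /** Maps a background-position value such as "-240px" or "-240px 0px" to a digit index. */
    static int digitIndex(String backgroundPosition) {
        // Only the horizontal offset matters; it is the first token of the value.
        String x = backgroundPosition.trim().split("\\s+")[0];
        if (x.endsWith("px")) {
            x = x.substring(0, x.length() - 2);
        }
        return Math.abs(Integer.parseInt(x)) / DIGIT_WIDTH_PX;
    }

    /** Picks one character per offset from the digit order of the background image. */
    static String decode(String spriteDigits, String... positions) {
        StringBuilder price = new StringBuilder();
        for (String position : positions) {
            price.append(spriteDigits.charAt(digitIndex(position)));
        }
        return price.toString();
    }

    public static void main(String[] args) {
        // Hypothetical digit order; only indexes 1, 2, 3 and 8 come from the example above.
        String spriteDigits = "8193652704";
        System.out.println(decode(spriteDigits, "-30px", "-60px", "-90px", "-240px")); // prints 1930
    }
}
```

The digit order of the image still has to come from somewhere (reading it by eye, or recognizing the image), so this only covers the offset-to-index step.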
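In fact it is simpler than that. There is no need to compute each digit's index in the image from the CSS yourself, because the indexes are returned by the detail page's API:
price is an array. Its first element is the small version of the background image, its second element is the large version, and its third element gives, for each digit of the price, the index of that digit in the background image. That is enough to recover the price: cut the digits the price needs out of the background image, recognize each one, concatenate the results in order, and convert the string to a number.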
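The cutting step can be sketched with nothing but the JDK. Note that this is a simplification and not what the demo below does: it slices the image into ten equal-width cells, whereas the demo splits the image with the OCR tool's `ImageSplitImpl`. The class name `DigitCropper`, the method `cropDigits` and the equal-width assumption are all introduced here for illustration.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

class DigitCropper {
    /** Number of digit cells in the background image (0-9). */
    static final int CELL_COUNT = 10;

    /** Returns one sub-image per index, in the order the indexes are given. */
    static List<BufferedImage> cropDigits(BufferedImage sprite, int[] indexes) {
        // Assumption: all ten digits occupy cells of equal width.
        int cellWidth = sprite.getWidth() / CELL_COUNT;
        List<BufferedImage> cells = new ArrayList<>();
        for (int index : indexes) {
            if (index < 0 || index >= CELL_COUNT) {
                throw new IllegalArgumentException("digit index out of range: " + index);
            }
            cells.add(sprite.getSubimage(index * cellWidth, 0, cellWidth, sprite.getHeight()));
        }
        return cells;
    }

    public static void main(String[] args) {
        // A blank 300x30 image stands in for the downloaded background image.
        BufferedImage sprite = new BufferedImage(300, 30, BufferedImage.TYPE_INT_RGB);
        List<BufferedImage> cells = cropDigits(sprite, new int[]{6, 3, 8, 1});
        System.out.println(cells.size() + " cells, each " + cells.get(0).getWidth() + "px wide");
    }
}
```

Equal-width slicing is enough when the digits sit on a fixed grid; splitting by content, as the demo does, is more tolerant of padding differences between images.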
Below is a small demo that recognizes the price field. It depends on commons-simple-character-ocr, a small character-image recognition tool I wrote earlier.
Source:
package cc11001100.crawler.ziroom;

import cc11001100.ocr.OcrUtil;
import cc11001100.ocr.clean.SingleColorFilterClean;
import cc11001100.ocr.split.ImageSplitImpl;
import cc11001100.ocr.util.ImageUtil;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.jsoup.Jsoup;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import static com.alibaba.fastjson.JSON.parseObject;
import static java.util.stream.Collectors.joining;

/**
 * Ziroom shows rent prices as images; this is an example of parsing the price out of the image.
 *
 * <a>http://www.ziroom.com/z/vr/250682.html</a>
 *
 * @author CC11001100
 */
public class ZiRoomPriceGrab {

    private static final Logger log = LogManager.getLogger(ZiRoomPriceGrab.class);

    private static SingleColorFilterClean singleColorFilterClean = new SingleColorFilterClean(0XFFA000);
    private static ImageSplitImpl imageSplit = new ImageSplitImpl();

    // Maps the hash code of a single-character image to the digit it shows
    private static Map<Integer, String> dictionaryMap = new HashMap<>();

    static {
        dictionaryMap.put(-2132100338, "0");
        dictionaryMap.put(-458583857, "1");
        dictionaryMap.put(913575273, "2");
        dictionaryMap.put(803609598, "3");
        dictionaryMap.put(-1845065635, "4");
        dictionaryMap.put(1128997321, "5");
        dictionaryMap.put(-660564186, "6");
        dictionaryMap.put(-1173287820, "7");
        dictionaryMap.put(1872761224, "8");
        dictionaryMap.put(-1739426700, "9");
    }

    public static JSONObject getHouseInfo(String id, String houseId) {
        String url = "http://www.ziroom.com/detail/info?id=" + id + "&house_id=" + houseId;
        String respJson = downloadText(url);
        if (respJson == null) {
            throw new RuntimeException("response null, id=" + id + ", houseId=" + houseId);
        }
        return parseObject(respJson);
    }

    private static int extractPrice(JSONObject houseInfo) throws IOException {
        JSONArray priceInfo = houseInfo.getJSONObject("data").getJSONArray("price");
        String priceRawImgUrl = "http:" + priceInfo.getString(0);
        System.out.println("priceRawImgUrl: " + priceRawImgUrl);
        JSONArray priceImgCharIndexArray = priceInfo.getJSONArray(2);
        System.out.println("priceImgCharIndexArray: " + priceImgCharIndexArray);

        BufferedImage img = downloadImg(priceRawImgUrl);
        if (img == null) {
            throw new RuntimeException("img download failed, url=" + priceRawImgUrl);
        }
        List<BufferedImage> priceCharImgList = extractNeedCharImg(img, priceImgCharIndexArray);
        String priceStr = priceCharImgList.stream().map(charImg -> {
            int charImgHashCode = ImageUtil.imageHashCode(charImg);
            return dictionaryMap.get(charImgHashCode);
        }).collect(joining());
        return Integer.parseInt(priceStr);
    }

    // The price is usually 4 digits while the returned image contains 10 digits (0-9),
    // so the first step is to cut out the characters the price needs
    // (alternatively, recognize the whole image as a string first and then pick characters from it by index)
    private static List<BufferedImage> extractNeedCharImg(BufferedImage img, JSONArray charImgIndexArray) {
        List<BufferedImage> allCharImgList = imageSplit.split(singleColorFilterClean.clean(img));
        List<BufferedImage> needCharImg = new ArrayList<>();
        for (int i = 0; i < charImgIndexArray.size(); i++) {
            int index = charImgIndexArray.getInteger(i);
            needCharImg.add(allCharImgList.get(index));
        }
        return needCharImg;
    }

    // Downloads the URL as bytes, trying up to 3 times; returns null if every attempt fails
    private static byte[] downloadBytes(String url) {
        for (int i = 0; i < 3; i++) {
            long start = System.currentTimeMillis();
            try {
                byte[] responseBody = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36")
                        .ignoreContentType(true)
                        .execute()
                        .bodyAsBytes();
                long cost = System.currentTimeMillis() - start;
                log.info("request ok, tryTimes={}, url={}, cost={}", i, url, cost);
                return responseBody;
            } catch (Exception e) {
                long cost = System.currentTimeMillis() - start;
                log.info("request failed, tryTimes={}, url={}, cost={}, cause={}", i, url, cost, e.getMessage());
            }
        }
        return null;
    }

    private static String downloadText(String url) {
        byte[] respBytes = downloadBytes(url);
        if (respBytes == null) {
            return null;
        } else {
            return new String(respBytes);
        }
    }

    private static BufferedImage downloadImg(String url) throws IOException {
        byte[] imgBytes = downloadBytes(url);
        if (imgBytes == null) {
            return null;
        }
        return ImageIO.read(new ByteArrayInputStream(imgBytes));
    }

    private static void init() {
        // OcrUtil ocrUtil = new OcrUtil().setImageClean(new SingleColorFilterClean(0XFFA000));
        // ocrUtil.init("H:/test/crawler/ziroom/raw/", "H:/test/crawler/ziroom/char/");
        OcrUtil.genAndPrintDictionaryMap("H:/test/crawler/ziroom/char/", "dictionaryMap", filename -> filename.substring(0, 1));
    }

    public static void main(String[] args) throws IOException {
        // init();

        JSONObject o = getHouseInfo("61718150", "60273500");
        int price = extractPrice(o);
        System.out.println("price: " + price); // 1930

        // output:
        // 2018-12-15 20:24:59.206 INFO cc11001100.crawler.ziroom.ZiRoomPriceGrab 103 downloadBytes - request ok, tryTimes=0, url=http://www.ziroom.com/detail/info?id=61718150&house_id=60273500, cost=559
        // priceRawImgUrl: http://static8.ziroom.com/phoenix/pc/images/price/ba99db25b3be2abed93c50c7f55c332cs.png
        // priceImgCharIndexArray: [6,3,8,1]
        // 2018-12-15 20:24:59.538 INFO cc11001100.crawler.ziroom.ZiRoomPriceGrab 103 downloadBytes - request ok, tryTimes=0, url=http://static8.ziroom.com/phoenix/pc/images/price/ba99db25b3be2abed93c50c7f55c332cs.png, cost=146
        // price: 1930
    }
}
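The demo recognizes each digit by looking up the hash of its image in a precomputed dictionary. The sketch below illustrates that idea with the JDK only. It is not the implementation of `ImageUtil.imageHashCode`: `pixelHash` is a hypothetical stand-in that hashes raw pixels, and `GlyphDictionary`, `learn` and `recognize` are names invented here. Unlike the demo, it fails loudly when a glyph is not in the dictionary rather than passing a null on to the price string.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class GlyphDictionary {
    private final Map<Integer, Character> digitByHash = new HashMap<>();

    /** Hypothetical stand-in for an image hash: hashes every pixel plus the width. */
    static int pixelHash(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[] pixels = img.getRGB(0, 0, w, h, null, 0, w);
        return 31 * Arrays.hashCode(pixels) + w;
    }

    /** Registers a known glyph image for a digit. */
    void learn(BufferedImage glyph, char digit) {
        digitByHash.put(pixelHash(glyph), digit);
    }

    /** Exact-match lookup; any pixel difference from the learned glyph is a miss. */
    char recognize(BufferedImage glyph) {
        int hash = pixelHash(glyph);
        Character digit = digitByHash.get(hash);
        if (digit == null) {
            throw new IllegalStateException("no dictionary entry for glyph hash " + hash);
        }
        return digit;
    }

    public static void main(String[] args) {
        // Two tiny synthetic "glyphs" that differ in which pixel is lit.
        BufferedImage one = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        one.setRGB(0, 0, 0xFFA000);
        BufferedImage nine = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        nine.setRGB(1, 1, 0xFFA000);

        GlyphDictionary dict = new GlyphDictionary();
        dict.learn(one, '1');
        dict.learn(nine, '9');
        System.out.println("" + dict.recognize(one) + dict.recognize(nine)); // prints 19
    }
}
```

Exact hashing only works while the site serves pixel-identical glyphs; if the rendering changes even slightly, the dictionary has to be regenerated.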
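Ziroom's price image is similar to the way Newegg displays product prices. This kind of anti-crawling measure is easy to break, and the more serious problem is that the way to break it is generic: you can bolt on any image recognition library and be done. They would be better off developing a reasonably complex JavaScript encryption scheme, which forces anyone who wants to crawl efficiently to spend a long time analyzing the JS. An anti-crawling mechanism is only worthwhile if the way to defeat it is not generic and is costly. If a crawler author can break it with a small investment of time and money, the measure might as well not exist.
Related resources: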
2. commons-simple-character-ocr