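Much of the content in crawled web pages is near-duplicate, so it has to be filtered out during crawling. I first considered the VSM algorithm, but that turned out to be the wrong fit: there is too much to compare. Then I came across the simHash algorithm. I can't be bothered to copy its explanation here. simhash handles short texts poorly, but my texts are long anyway, so in it goes!
There are plenty of source implementations online, but they all seem to be the same one, and it's written unclearly. It more or less produces the right result, but what am I supposed to do with code I don't understand?
After studying the description of the simhash algorithm carefully, I replaced the string hash in it with FNV-1, which is documented at http://www.isthe.com/chongo/tech/comp/fnv/; the specific constants are all listed on that site. I also replaced the bits of processing in the original code that didn't match the algorithm.
First, set up IKAnalyzer to segment the text and count the frequency of each word: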
package com.cnblogs.zxub.lucene.similarity;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class WordsSpliter {

    // Segments the text with IKAnalyzer and returns each word with its frequency.
    public static Map<String, Integer> getSplitedWords(String str) throws IOException {
        // str = str.replaceAll("[0-9a-zA-Z]", "");
        Analyzer analyzer = new IKAnalyzer();
        Reader r = new StringReader(str);
        TokenStream ts = analyzer.tokenStream("searchValue", r);
        CharTermAttribute ta = ts.addAttribute(CharTermAttribute.class);
        Map<String, Integer> result = new HashMap<String, Integer>();
        // Lucene 4+ requires reset() before the first incrementToken() call.
        ts.reset();
        while (ts.incrementToken()) {
            String word = ta.toString();
            if (!result.containsKey(word)) {
                result.put(word, 0);
            }
            result.put(word, result.get(word) + 1);
        }
        ts.end();
        ts.close();
        return result;
    }
}
Then add the SimHash algorithm:
package com.cnblogs.zxub.lucene.similarity;

import java.io.IOException;
import java.math.BigInteger;
import java.util.Map;

public class SimHash {

    private static final int HASH_BITS = 64;
    private static final BigInteger FNV_64_INIT = new BigInteger("14695981039346656037");
    private static final BigInteger FNV_64_PRIME = new BigInteger("1099511628211");
    private static final BigInteger MASK_64 =
            BigInteger.ONE.shiftLeft(HASH_BITS).subtract(BigInteger.ONE);

    private String hash;          // signature as a 64-character string of 0s and 1s
    private BigInteger signature; // signature as an unsigned 64-bit number

    public SimHash(String content) throws IOException {
        this.analysis(content);
    }

    public String getHash() {
        return this.hash;
    }

    public BigInteger getSignature() {
        return this.signature;
    }

    private void analysis(String content) throws IOException {
        Map<String, Integer> wordInfos = WordsSpliter.getSplitedWords(content);
        int[] featureVector = new int[HASH_BITS];
        for (Map.Entry<String, Integer> entry : wordInfos.entrySet()) {
            BigInteger wordhash = this.fnv1_64_hash(entry.getKey());
            int weight = entry.getValue();
            // Index 0 of the vector corresponds to the most significant bit.
            for (int i = 0; i < HASH_BITS; i++) {
                if (wordhash.testBit(HASH_BITS - i - 1)) {
                    featureVector[i] += weight;
                } else {
                    featureVector[i] -= weight;
                }
            }
        }
        BigInteger signature = BigInteger.ZERO;
        StringBuilder hashBuffer = new StringBuilder();
        for (int i = 0; i < HASH_BITS; i++) {
            if (featureVector[i] >= 0) {
                signature = signature.setBit(HASH_BITS - i - 1);
                hashBuffer.append("1");
            } else {
                hashBuffer.append("0");
            }
        }
        this.hash = hashBuffer.toString();
        this.signature = signature;
    }

    // FNV-1 hash: converts a string to a 64-bit hash value.
    // Note: this feeds in 16-bit chars, not bytes as in the FNV reference.
    private BigInteger fnv1_64_hash(String str) {
        BigInteger hash = FNV_64_INIT;
        int len = str.length();
        for (int i = 0; i < len; i++) {
            // Masking on every round keeps the intermediate value at 64 bits.
            hash = hash.multiply(FNV_64_PRIME).and(MASK_64);
            hash = hash.xor(BigInteger.valueOf(str.charAt(i)));
        }
        return hash;
    }

    // Hamming distance between two numeric signatures: number of 1 bits in the XOR.
    public int getHammingDistance(BigInteger targetSignature) {
        return this.getSignature().xor(targetSignature).bitCount();
    }

    // Hamming distance between two 0/1 strings; returns -1 if the lengths differ.
    public int getHashDistance(String targetHash) {
        int distance;
        if (this.getHash().length() != targetHash.length()) {
            distance = -1;
        } else {
            distance = 0;
            for (int i = 0; i < this.getHash().length(); i++) {
                if (this.getHash().charAt(i) != targetHash.charAt(i)) {
                    distance++;
                }
            }
        }
        return distance;
    }
}
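For anyone who wants to poke at the idea without setting up IKAnalyzer, here is a rough, self-contained sketch of the same steps. It is not the class above: it assumes whitespace-separated words instead of real segmentation, hashes UTF-8 bytes instead of chars, and keeps the signature in a plain long (bit 0 is the least significant bit here, which does not affect distances). The name SimHashSketch is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: whitespace tokens, FNV-1 over UTF-8 bytes, long signature.
final class SimHashSketch {
    private static final long FNV_64_INIT = 0xcbf29ce484222325L; // 14695981039346656037
    private static final long FNV_64_PRIME = 0x100000001b3L;     // 1099511628211

    // FNV-1: multiply first, then XOR the next byte. long arithmetic wraps modulo 2^64.
    static long fnv1(String s) {
        long h = FNV_64_INIT;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h *= FNV_64_PRIME;
            h ^= (b & 0xff);
        }
        return h;
    }

    // Builds the 64-bit signature: each word votes +freq or -freq on every bit position.
    static long signature(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        int[] votes = new int[64];
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            long h = fnv1(e.getKey());
            for (int i = 0; i < 64; i++) {
                if (((h >>> i) & 1L) != 0) {
                    votes[i] += e.getValue();
                } else {
                    votes[i] -= e.getValue();
                }
            }
        }
        long sig = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] >= 0) { // same tie rule as the class above: ties become 1
                sig |= 1L << i;
            }
        }
        return sig;
    }

    // Hamming distance: number of differing bits.
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

With a single distinct word the signature is just that word's hash, and word order never matters, which makes for an easy sanity check.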
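The database only needs to store the signature. For the distance calculation I originally planned to pull everything out and compute it in code, but then I discovered Oracle's bitand function and went with that! XOR the two signatures, convert the result to a binary string, strip the 0s, take the length, then count the rows where that length is less than 4; that count is the number of highly similar documents. Later I'll move the calculation onto a cache; for now I'm being lazy.
Here is the Oracle function part (note that Oracle's length function never returns 0; it returns NULL for an empty string, so the result has to be wrapped in nvl at the end. Also, bitand overflows when the values get too large and gives wrong results, so use utl_raw.bit_and instead. And in the last two functions the string can't be 64 characters; changing it to 128 did the trick. It could probably be smaller, but I didn't bother):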
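create or replace function bitxor(a in number, b in number) return number is
  v_and number;
begin
  -- Convert to 16-digit hex first: to_char(a) alone yields decimal digits,
  -- which would be read as hex digits when converted to RAW.
  v_and := to_number(
             rawtohex(utl_raw.bit_and(
               hextoraw(to_char(a, 'FM0XXXXXXXXXXXXXXX')),
               hextoraw(to_char(b, 'FM0XXXXXXXXXXXXXXX')))),
             'XXXXXXXXXXXXXXXX');
  -- a xor b = a + b - 2 * (a and b)
  return a + b - 2 * v_and;
end;
/

create or replace function dec2bit(v_num number) return varchar2 is
  v_rtn varchar2(128);
  v_n1  number;
  v_n2  number;
begin
  -- Repeated division by 2, prepending each remainder.
  v_n1 := v_num;
  loop
    v_n2 := mod(v_n1, 2);
    v_n1 := trunc(v_n1 / 2);
    v_rtn := to_char(v_n2) || v_rtn;
    exit when v_n1 = 0;
  end loop;
  return v_rtn;
end;
/

create or replace function hm_distance(a in number, b in number) return number is
  v_dis number;
  v_xor number;
  v_bit varchar2(128);
begin
  v_xor := bitxor(a, b);
  v_bit := dec2bit(v_xor);
  -- length() returns NULL when no 1 bits remain, hence the nvl.
  v_dis := length(replace(v_bit, '0', ''));
  return nvl(v_dis, 0);
end;
/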
Run select hm_distance(1108937774045716955,1108937774045721051) from dual and the result is 1. OK, done.
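The same check can be reproduced on the Java side, which is roughly what the "pull everything out and compute it" plan would look like. This is only a sketch: NearDuplicates, countSimilar and the threshold parameter are names I made up, and it assumes the signatures are held as Java long values (a signature with the top bit set would have to be read with something like Long.parseUnsignedLong and shows up as a negative long, which does not change the bit count).

```java
// Illustrative sketch: Hamming distance on long signatures and a near-duplicate count.
final class NearDuplicates {

    // XOR leaves a 1 wherever the two signatures differ; bitCount counts those 1s.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Counts stored signatures whose distance to the target is below the threshold
    // (the post uses "less than 4").
    static int countSimilar(long target, long[] stored, int threshold) {
        int count = 0;
        for (long s : stored) {
            if (hammingDistance(target, s) < threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long a = 1108937774045716955L;
        long b = 1108937774045721051L;
        System.out.println(hammingDistance(a, b)); // prints 1
    }
}
```

The two values differ by exactly 4096, that is, in a single bit, so the distance of 1 matches what the Oracle query reports.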
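Later, in real use, I found that FNV-1 happened to run straight into a bizarre catch-all case; switching to FNV-1a fixed it. I'm not going to update the code here...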
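Since the code above stays on FNV-1, here is a minimal sketch of what the FNV-1a variant looks like. The only difference is the order of the two steps: FNV-1a XORs the byte in first and multiplies afterwards. This version assumes UTF-8 bytes and a plain long rather than the BigInteger and char handling used above, and the class name Fnv1a64 is just for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of 64-bit FNV-1a over the UTF-8 bytes of a string.
final class Fnv1a64 {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L; // 14695981039346656037
    private static final long PRIME = 0x100000001b3L;             // 1099511628211

    static long hash(String s) {
        long h = OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff); // FNV-1a: XOR first...
            h *= PRIME;      // ...then multiply (wraps modulo 2^64)
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash("a"))); // prints af63dc4c8601ec8c
    }
}
```

In the SimHash class this would amount to swapping the multiply and xor lines inside fnv1_64_hash.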