elasticsearch min_hash: application analysis

The requirement: similar-text (near-duplicate) queries.

 

For page deduplication in a crawler one would normally reach for simhash, so the first idea was to use the simhash algorithm.

 

But applying simhash on top of the existing data set (an elasticsearch cluster) is costly. The simhash value itself is easy to compute; whether via an external API or a custom ES token filter, either is straightforward to implement. The real difficulty is the querying and the similarity computation: elasticsearch's scoring logic would have to be rewritten to score by simhash distance.

 

If keyword weights are not taken into account, minhash and simhash give similar results.

 

The current elasticsearch release (5.5) supports minhash natively, but the official documentation is very terse:

 

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-minhash-tokenfilter.html

 

So, working from the minhash algorithm and the ES source code, I investigated and tested minhash. To keep things simple, ik is used for Chinese word segmentation.

 

The filter takes four parameters:

Setting / Description

hash_count
The number of hashes to hash the token stream with. Defaults to 1.

bucket_count
The number of buckets to divide the minhashes into. Defaults to 512.

hash_set_size
The number of minhashes to keep per bucket. Defaults to 1.

with_rotation
Whether or not to fill empty buckets with the value of the first non-empty bucket to its circular right. Only takes effect if hash_set_size is equal to one. Defaults to true if bucket_count is greater than one, else false.
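
These four settings map directly onto the constructor of Lucene's MinHashFilter, which is what the ES token filter wraps. A minimal standalone sketch of driving the filter from Java (the constructor parameter order is my reading of the Lucene 6.x sources shipped with ES 5.5, so treat it as an assumption and check it against your version):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.minhash.MinHashFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MinHashDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new WhitespaceTokenizer();
                // hash_count = 1, bucket_count = 512, hash_set_size = 1, with_rotation = true
                TokenStream sink = new MinHashFilter(source, 1, 512, 1, true);
                return new TokenStreamComponents(source, sink);
            }
        };
        try (TokenStream ts = analyzer.tokenStream("content", "tokens from the ik analyzer would go here")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // every emitted term is one encoded minhash value
                System.out.println(term);
            }
            ts.end();
        }
        analyzer.close();
    }
}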

 

 

The code is organized as a three-level structure (sketched below rather than drawn as a diagram):

[ hashcount_0:[
   bucket_count_0:[hash_set_size_0,hash_set_size_1...],
   bucket_count_1:[hash_set_size_0,hash_set_size_1...],
   ...
   bucket_count_511:[hash_set_size_0,hash_set_size_1...]
 ],
 hashcount_1:[
   bucket_count_0:[hash_set_size_0,hash_set_size_1...],
   bucket_count_1:[hash_set_size_0,hash_set_size_1...],
   ...
   bucket_count_511:[hash_set_size_0,hash_set_size_1...]
 ]
]
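
In Java terms that nesting corresponds roughly to the declaration below. This is a simplified sketch: the real Lucene code keeps a fixed-size tree set per bucket, and LongPair is its 128-bit murmur3 result type, assumed here to be comparable.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// level 1: one entry per hash function (hash_count)
// level 2: one entry per bucket (bucket_count)
// level 3: the smallest hashes kept for that bucket (at most hash_set_size)
// LongPair: the 128-bit hash pair type from org.apache.lucene.analysis.minhash.MinHashFilter
static List<List<TreeSet<LongPair>>> newMinHashSets(int hashCount, int bucketCount) {
    List<List<TreeSet<LongPair>>> minHashSets = new ArrayList<>(hashCount);
    for (int i = 0; i < hashCount; i++) {
        List<TreeSet<LongPair>> buckets = new ArrayList<>(bucketCount);
        for (int j = 0; j < bucketCount; j++) {
            buckets.add(new TreeSet<>());
        }
        minHashSets.add(buckets);
    }
    return minHashSets;
}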

 

Source: https://my.oschina.net/pathenon/blog/65210

Approach 1: use multiple hash functions

To estimate the probability that sets A and B share the same minimum hash value, we can pick a certain number of hash functions, say K of them. Applying these K hash functions to sets A and B gives K minimum values for each set, e.g. Min(A)k = {a1, a2, ..., ak} and Min(B)k = {b1, b2, ..., bk}.

The similarity of A and B is then |Min(A)k ∩ Min(B)k| / |Min(A)k ∪ Min(B)k|, i.e. the number of elements the two minimum sets have in common divided by the total number of elements.

Approach 2: use a single hash function

The first approach has an obvious drawback: its computational cost is high. How does a single hash function avoid this? Earlier we defined hmin(S) as the element of set S with the smallest hash value; likewise we can define hmink(S) as the K elements of S with the smallest hash values. Each set then only needs to be hashed once, keeping the K smallest values. The similarity of two sets A and B is the size of the intersection of A's K smallest elements and B's K smallest elements, divided by the size of their union.
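
To make the quoted description concrete, here is a minimal Java sketch of both approaches. The hash(seed, element) helper is a hypothetical stand-in for a family of hash functions; none of this is ES or Lucene code.

import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical family of hash functions: mix a per-function seed into the element's hash.
static long hash(int seed, String element) {
    long h = element.hashCode() * 0x9E3779B97F4A7C15L;
    return Long.rotateLeft(h, seed & 63) ^ (seed * 0xC2B2AE3D27D4EB4FL);
}

// Approach 1: K hash functions, keep the minimum of each one.
static Set<Long> signatureKFunctions(Set<String> set, int k) {
    Set<Long> mins = new HashSet<>();
    for (int i = 0; i < k; i++) {
        long min = Long.MAX_VALUE;
        for (String e : set) {
            min = Math.min(min, hash(i, e));
        }
        mins.add(min);
    }
    return mins;
}

// Approach 2: a single hash function, keep the K smallest values.
static Set<Long> signatureBottomK(Set<String> set, int k) {
    TreeSet<Long> smallest = new TreeSet<>();
    for (String e : set) {
        smallest.add(hash(0, e));
        if (smallest.size() > k) {
            smallest.pollLast(); // drop the largest so only the K smallest remain
        }
    }
    return smallest;
}

// Similarity estimate used by both: |A ∩ B| / |A ∪ B| over the signatures.
static double jaccard(Set<Long> a, Set<Long> b) {
    Set<Long> inter = new HashSet<>(a);
    inter.retainAll(b);
    Set<Long> union = new HashSet<>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
}

The fragment that follows is from Lucene's MinHashFilter itself; the text after it explains how ES mixes the two ideas.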
for (int i = 0; i < hashCount; i++) {
    byte[] bytes = current.getBytes("UTF-16LE");
    LongPair hash = new LongPair();
    murmurhash3_x64_128(bytes, 0, bytes.length, 0, hash);
    LongPair rehashed = combineOrdered(hash, getIntHash(i));
    minHashSets.get(i).get((int) ((rehashed.val2 >>> 32) / bucketSize)).add(rehashed);
}

ES (more precisely, Lucene's MinHashFilter) combines these two approaches.

 

hash_count is the number of hash functions. getIntHash derives a different offset value from i, and combining the token's hash with these different offsets achieves the effect of computing different hash values with different hash functions, without actually running the expensive hash again and again, which cuts down the computation. Java developers will recognize this kind of optimization, like making use of the high bits of a hash (here rehashed.val2 >>> 32 picks the bucket); the two-level segment hashing of ConcurrentHashMap before JDK 1.7 is similar. (As an aside, for hashing documents in ES there is an official MurmurHash plugin, which performs much better than traditional hashes.)
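
A toy sketch of that idea (illustrative only; the real versions are combineOrdered and getIntHash in Lucene's MinHashFilter): the i-th "hash function" is simulated by mixing a cheap per-index constant into the token's base hash rather than implementing hash_count genuinely different hash functions.

// One base hash per token, then hashCount cheap derivations from it.
static long[] deriveHashes(String token, int hashCount) {
    // stand-in for the murmur3 hash actually computed for the token
    long base = token.hashCode() * 0x9E3779B97F4A7C15L;
    long[] variants = new long[hashCount];
    for (int i = 0; i < hashCount; i++) {
        // stand-in for getIntHash(i): a cheap per-function constant mixed into the base hash
        long salt = (i + 1) * 0xC2B2AE3D27D4EB4FL;
        variants[i] = Long.rotateLeft(base ^ salt, 31);
    }
    return variants;
}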

 

bucket_count is the number of buckets into which each hash function's values over all elements of the set are divided; it plays the role of the k in hmink(S).

 

The final number of minhash values is therefore

hash_count * bucket_count

(with the default hash_set_size of 1; in general it is hash_count * bucket_count * hash_set_size).

 

 

When the filter initializes, it builds this three-level structure:

while (input.incrementToken()) {
    found = true;
    String current = new String(termAttribute.buffer(), 0, termAttribute.length());

    // for each of the hashCount "hash functions": combine the term's murmur3 hash
    // with a per-function offset and drop the result into the matching bucket
    for (int i = 0; i < hashCount; i++) {
        byte[] bytes = current.getBytes("UTF-16LE");
        LongPair hash = new LongPair();
        murmurhash3_x64_128(bytes, 0, bytes.length, 0, hash);
        LongPair rehashed = combineOrdered(hash, getIntHash(i));
        minHashSets.get(i).get((int) ((rehashed.val2 >>> 32) / bucketSize)).add(rehashed);
    }
    endOffset = offsetAttribute.endOffset();
}
exhausted = true;
input.end();
// We need the end state so an underlying shingle filter can have its state restored correctly.
endState = captureState();
if (!found) {
    return false;
}

It iterates over all tokens coming from the previous filter, hashes each one, and fills the three-level structure.

 

while (hashPosition < hashCount) {
    if (hashPosition == -1) {
        hashPosition++;
    } else {
        while (bucketPosition < bucketCount) {
            if (bucketPosition == -1) {
                bucketPosition++;
            } else {
                // take the next minhash value out of the current bucket
                LongPair hash = minHashSets.get(hashPosition).get(bucketPosition).pollFirst();
                if (hash != null) {
                    // encode the 128-bit hash (prefixed with the hash-function index
                    // when hashCount > 1) as the term text of the emitted token
                    termAttribute.setEmpty();
                    if (hashCount > 1) {
                        termAttribute.append(int0(hashPosition));
                        termAttribute.append(int1(hashPosition));
                    }
                    long high = hash.val2;
                    termAttribute.append(long0(high));
                    termAttribute.append(long1(high));
                    termAttribute.append(long2(high));
                    termAttribute.append(long3(high));
                    long low = hash.val1;
                    termAttribute.append(long0(low));
                    termAttribute.append(long1(low));
                    if (hashCount == 1) {
                        termAttribute.append(long2(low));
                        termAttribute.append(long3(low));
                    }
                    posIncAttribute.setPositionIncrement(positionIncrement);
                    offsetAttribute.setOffset(0, endOffset);
                    typeAttribute.setType(MIN_HASH_TYPE);
                    posLenAttribute.setPositionLength(1);
                    return true;
                } else {
                    // bucket exhausted, move on to the next one
                    bucketPosition++;
                }
            }
        }
        bucketPosition = -1;
        hashPosition++;
    }
}

The loop keeps pulling minhash values out of the three-level structure, builds them into terms, and passes them on to the next filter.

 

with_rotation controls the filling of empty buckets.

If bucket_count is 512 and with_rotation is false, then for 65 terms the final min_hashes are:

"0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64"

 

With with_rotation set to true, a possible result is:

"65,1,1,1,1,1,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,9,9,10,11,12,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,16,16,17,17,17,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,19,19,19,20,20,20,20,21,21,22,22,22,22,22,22,22,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,25,25,25,26,26,26,26,26,26,27,27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,35,36,36,36,36,37,37,37,37,38,38,39,39,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,43,43,44,44,44,44,45,45,46,47,47,48,48,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,53,53,53,53,53,53,54,54,54,54,55,55,55,55,55,55,55,55,55,55,55,55,55,56,56,56,56,56,57,57,57,57,57,58,58,59,60,60,60,60,60,60,61,61,61,61,62,62,62,62,62,62,62,62,63,63,63,63,63,63,63,64,64,64,64,64,64,64,65,65,65,65,65,65"

 

 

That is all for the code analysis of the min_hash filter in elasticsearch (strictly speaking, in Lucene); next time I will start testing it in an application.
