The simhash algorithm: deduplicating massive datasets (tens of millions of records)
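References on the simhash algorithm and its principles:
A simple, easy-to-follow explanation of the simhash algorithm (hash): https://blog.csdn.net/le_le_name/article/details/51615931
A brief introduction to the simhash algorithm and its principles: https://blog.csdn.net/lengye7/article/details/79789206
Using SimHash to deduplicate massive amounts of text: https://www.cnblogs.com/maybe2030/p/5203186.html#_label3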
Python implementations:
Text similarity comparison with simhash in Python (full code shown): http://www.javashuo.com/article/p-ritxpaev-kg.html
A Python implementation of simhash: https://blog.csdn.net/gzt940726/article/details/80460419
Using the Python simhash library
For details, see: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
(1) View the simhash value
>>> from simhash import Simhash
>>> print('%x' % Simhash('I am very happy'.split()).value)
9f8fd7efdb1ded7f
Simhash() accepts a sequence of tokens, also called a feature sequence.
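To make the idea of a fingerprint built from a token sequence more concrete, here is a minimal sketch of the usual simhash construction: hash every token, let each hash vote on every bit position, and keep the sign of each tally. It is not the library's actual code. The function name `simhash_sketch`, the use of MD5, the 64-bit width, and the equal weight given to every token are all assumptions made for illustration, so its output is not expected to match the value printed above.

```python
import hashlib


def simhash_sketch(tokens, bits=64):
    """Illustrative fingerprint: every token votes +1 or -1 on each bit position."""
    votes = [0] * bits
    for token in tokens:
        # Hash the token to a large integer; MD5 is an arbitrary choice for this sketch.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes outnumber the negative ones.
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


print('%x' % simhash_sketch('I am very happy'.split()))
```

Because each token only adds to or subtracts from per-bit tallies, the order of the tokens does not matter, and texts that share most of their tokens tend to end up with fingerprints that differ in only a few bits.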
(2) Compute the distance between two simhash values
>>> hash1 = Simhash('I am very happy'.split())
>>> hash2 = Simhash('I am very sad'.split())
>>> print(hash1.distance(hash2))
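The distance between two simhash values is the Hamming distance, that is, the number of bit positions in which the two fingerprints differ. A minimal sketch on plain integers follows; `hamming_distance` is a hypothetical helper written for illustration, not a function of the simhash library.

```python
def hamming_distance(a, b):
    """Count the bit positions at which two integer fingerprints differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(a ^ b).count("1")


print(hamming_distance(0b1011, 0b0010))  # 2
```

A small distance means the two texts share most of their features; a deduplication pipeline treats pairs below some chosen threshold as near-duplicates.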
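(3) Build an index
simhash is used for deduplication. Comparing simhash values pair by pair will not hold up once the dataset gets large. There is a dedicated data structure for this; see: http://www.cnblogs.com/maybe2030/p/5203186.html#_label4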
from simhash import Simhash, SimhashIndex

# Build the index
data = {
    '1': 'How are you I Am fine . blar blar blar blar blar Thanks .'.lower().split(),
    '2': 'How are you i am fine .'.lower().split(),
    '3': 'This is simhash test .'.lower().split(),
}
objs = [(key, Simhash(tokens)) for key, tokens in data.items()]
index = SimhashIndex(objs, k=10)  # k is the tolerance; the larger k is, the more similar texts are retrieved

# Query for near-duplicates
s1 = Simhash('How are you . blar blar blar blar blar Thanks'.lower().split())
print(index.get_near_dups(s1))

# Add a new entry to the index
index.add('4', s1)
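One common way to avoid pairwise comparison relies on the pigeonhole principle: if two fingerprints differ in at most k bits and each fingerprint is cut into k + 1 blocks, at least one block must be identical in both. Indexing every fingerprint under each of its blocks therefore narrows a query down to a few buckets. The sketch below illustrates that idea only. The class `BlockIndexSketch` and its internals are hypothetical, and it is an assumption that the index in the example above works along these lines.

```python
from collections import defaultdict


class BlockIndexSketch:
    """Toy near-duplicate index over integer fingerprints (illustrative only)."""

    def __init__(self, k=3, bits=64):
        self.k = k                  # maximum Hamming distance still counted as a near-duplicate
        self.bits = bits
        self.blocks = k + 1         # pigeonhole: at most k differing bits leave one block untouched
        self.buckets = defaultdict(set)

    def _keys(self, fp):
        # Cut the fingerprint into k + 1 blocks of equal width (rounded up).
        width = -(-self.bits // self.blocks)
        mask = (1 << width) - 1
        for i in range(self.blocks):
            yield (i, (fp >> (i * width)) & mask)

    def add(self, obj_id, fp):
        # Store the entry under every one of its block keys.
        for key in self._keys(fp):
            self.buckets[key].add((obj_id, fp))

    def get_near_dups(self, fp):
        found = set()
        for key in self._keys(fp):
            # Only entries sharing at least one block are candidates; verify each one.
            for obj_id, other in self.buckets[key]:
                if bin(fp ^ other).count("1") <= self.k:
                    found.add(obj_id)
        return sorted(found)


index = BlockIndexSketch(k=3)
index.add("a", 0b1111)
index.add("b", 0b1110)
index.add("c", 0xFFFF0000FFFF0000)
print(index.get_near_dups(0b1111))  # ['a', 'b']
```

The trade-off sits in k: a larger k means more, narrower blocks, so each bucket holds more candidates and more of them have to be checked with a full distance computation.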