Comparing document similarity with docsim / doc2vec / LSH

When doing text processing, we often need to decide whether two documents are similar, or to find the documents most similar to a given input document.

Please credit the source if you repost this article.

Fortunately, gensim provides tools for exactly this. The overall approach is: for Chinese text, first run word segmentation; build a dictionary from the segmented results; use that dictionary to convert the original documents into vectors; then build a similarity index over those vectors and query it. The gensim documentation describes the Similarity class as follows:

The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like 「Tell me how similar is this query document to each document in the index?」. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.
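One thing all of the examples below rely on is a word-segmentation helper, util_words_cut, which is not shown in this post; judging from the jieba messages in the console output it is a thin wrapper around jieba. A minimal sketch of roughly what it does, assuming a simple stop-word filter (the real helper may differ):

# util_words_cut.py -- a minimal sketch; the real module is not shown in this post
import jieba

STOP_WORDS = {'的', '了', '是', '我', '他', '你', ',', '。', '?'}  # hypothetical stop-word list

def get_class_words_list(text):
    # segment text with jieba and drop stop words, returning a list of tokens
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]

def get_class_words_with_space(text):
    # same segmentation, but return the tokens joined by spaces (for TfidfVectorizer)
    return ' '.join(get_class_words_list(text))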

Method 1: docsim (recommended; the results are fairly stable)

Example code. To make the results easier to read, the training samples are numbered:

# the segmentation helper util_words_cut wraps jieba (see the sketch above)
from gensim import corpora
from gensim.similarities import Similarity

import util_words_cut

# training samples
raw_documents = [
    '0無償居間介紹買賣毒品的行爲應如何定性',
    '1吸毒男動態持有大量毒品的行爲該如何認定',
    '2如何區分是非法種植毒品原植物罪仍是非法制造毒品罪',
    '3爲毒販販賣毒品提供幫助構成販賣毒品罪',
    '4將本身吸食的毒品原價轉讓給朋友吸食的行爲該如何認定',
    '5爲獲報酬幫人購買毒品的行爲該如何認定',
    '6毒販出獄後再次夠買毒品途中被抓的行爲認定',
    '7虛誇毒品功效勸人吸食毒品的行爲該如何認定',
    '8妻子下落不明丈夫又與他人登記結婚是否爲無效婚姻',
    '9一方未簽字辦理的結婚登記是否有效',
    '10夫妻雙方1990年按農村習俗舉辦婚禮沒有結婚證 一方能否起訴離婚',
    '11結婚前對方父母出資購買的住房寫咱們二人的名字有效嗎',
    '12身份證被別人冒用沒法登記結婚怎麼辦?',
    '13同居後又與他人登記結婚是否構成重婚罪',
    '14未辦登記只舉辦結婚儀式可起訴離婚嗎',
    '15同居多年未辦理結婚登記,是否能夠向法院起訴要求離婚'
]
corpora_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_list(item_text)
    corpora_documents.append(item_str)

# build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpora_documents)
corpus = [dictionary.doc2bow(text) for text in corpora_documents]

# build the similarity index; num_features must cover the vocabulary size
# (len(dictionary) would be the tight value -- 400 simply leaves headroom)
similarity = Similarity('-Similarity-index', corpus, num_features=400)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
test_corpus_1 = dictionary.doc2bow(test_cut_raw_1)
similarity.num_best = 5  # only return the top-5 most similar documents
print(similarity[test_corpus_1])  # most similar training samples, as (index_of_document, similarity) tuples

print('################################')

test_data_2 = '家人因涉嫌運輸毒品被抓,她只是去朋友家探望朋友的,結果就被抓了,還在朋友家收出毒品,可家人的身上和行李中都沒有。如今已經拘留10多天了,請問會被判刑嗎'
test_cut_raw_2 = util_words_cut.get_class_words_list(test_data_2)
test_corpus_2 = dictionary.doc2bow(test_cut_raw_2)
similarity.num_best = 5
print(similarity[test_corpus_2])  # most similar training samples, as (index_of_document, similarity) tuples

The console output is as follows:

/usr/bin/python3.4 /data/work/python-workspace/test_doc_similarity.py
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
Loading model cost 0.521 seconds.
Loading model cost 0.521 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(61 unique tokens: ['丈夫', '法院', '結婚', '住房', '出資']...) from 16 documents (total 89 corpus positions)
starting similarity index under -Similarity-index
[(14, 0.40824830532073975), (15, 0.40824830532073975), (10, 0.35355338454246521)]
################################
creating sparse index
creating sparse matrix from corpus
PROGRESS: at document #0/16
created <16x400 sparse matrix of type '<class 'numpy.float32'>'
	with 86 stored elements in Compressed Sparse Row format>
creating sparse shard #0
saving index shard to -Similarity-index.0
saving SparseMatrixSimilarity object under -Similarity-index.0, separately None
loading SparseMatrixSimilarity object from -Similarity-index.0
[(6, 0.50395262241363525), (2, 0.47140452265739441), (4, 0.33333337306976318), (1, 0.29814240336418152), (5, 0.29814240336418152)]

Process finished with exit code 0

For the first test question, documents 14, 15 and 10 in the training set are the most similar; the second number in each tuple is the similarity score.

For the second test question, documents 6, 2, 4, 1 and 5 are the most similar, again with their similarity scores.
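As a side note, the index above is built directly on raw bag-of-words counts. If you want term weighting closer to the TF-IDF used in the LSH example further down, gensim's TfidfModel can be applied before building the index; a quick sketch (not something measured in this post):

# optional: build the docsim index over TF-IDF weighted vectors instead of raw counts
from gensim.models import TfidfModel

tfidf = TfidfModel(corpus)
similarity_tfidf = Similarity('-Similarity-tfidf-index', tfidf[corpus],
                              num_features=len(dictionary))
similarity_tfidf.num_best = 5
print(similarity_tfidf[tfidf[test_corpus_1]])  # same query, now against TF-IDF vectors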

Method 2: doc2vec

I read the official gensim documentation for this (it is not written very well). Testing with the same data as above, the code and results are:

# use doc2vec instead
import multiprocessing

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()
print(cores)
corpora_documents = []
for i, item_text in enumerate(raw_documents):
    words_list = util_words_cut.get_class_words_list(item_text)
    document = TaggedDocument(words=words_list, tags=[i])
    corpora_documents.append(document)

print(corpora_documents[:2])

# note: the parameter names below follow the older gensim API used in this post;
# newer releases use vector_size= and epochs= instead of size= and iter=
model = Doc2Vec(size=89, min_count=1, iter=10)
model.build_vocab(corpora_documents)
model.train(corpora_documents)  # newer gensim also requires total_examples= and epochs= here

print('#########', model.vector_size)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
print(test_cut_raw_1)
inferred_vector = model.infer_vector(test_cut_raw_1)
print(inferred_vector)
sims = model.docvecs.most_similar([inferred_vector], topn=3)  # (tag, cosine similarity) pairs
print(sims)

The relevant console output:

Pattern library is not installed, lemmatization won't be available.
'pattern' package not found; tag filters are not available for English
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
4
Loading model cost 0.513 seconds.
Loading model cost 0.513 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
collected 61 word types and 16 unique tags from a corpus of 16 examples and 89 words
min_count=1 retains 61 unique words (drops 0)
min_count leaves 89 word corpus (100% of original 89)
deleting the raw counts dictionary of 61 items
sample=0 downsamples 0 most-common words
downsampling leaves estimated 89 word corpus (100.0% of prior 89)
estimated required memory for 61 words and 89 dimensions: 91828 bytes
constructing a huffman tree from 61 words
built huffman tree with maximum node depth 7
resetting layer weights
training model with 1 workers on 61 vocabulary and 89 features, using sg=0 hs=1 sample=0 negative=0
expecting 16 sentences, matching count from corpus used for vocabulary survey
[TaggedDocument(words=['無償', '居間', '介紹', '買賣', '毒品', '定性'], tags=[0]), TaggedDocument(words=['吸毒', '動態', '持有', '毒品', '認定'], tags=[1])]
worker thread finished; awaiting finish of 0 more threads
training on 890 raw words (1050 effective words) took 0.0s, 506992 effective words/s
under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
######### 89
['離婚', '孩子', '自動', '生效', '離婚']
[  2.54629389e-03   1.87756249e-03  -9.76708368e-04  -5.15014399e-03
  -7.54948880e-04  -3.74549557e-03   5.37392031e-03   3.35739669e-03
  -3.50345811e-03   2.63415743e-03  -1.32059853e-03  -4.15759953e-03
  -2.39425618e-03  -6.20105816e-03  -1.42006821e-03  -4.64246795e-03
   3.78829846e-03   1.47493952e-03   4.49652784e-03  -5.57655795e-03
  -1.40081509e-04  -7.10823014e-03  -5.34327468e-04  -4.21888893e-03
  -2.96280603e-03   6.52066898e-04   5.98943839e-03  -4.01164964e-03
   2.49637989e-03  -9.08742077e-04   4.65002051e-03   9.24886088e-04
   1.67128560e-03  -1.93383044e-03  -4.58135502e-03   1.78024184e-03
  -9.60796722e-04   7.26479106e-04   4.50814469e-03   2.58095766e-04
  -4.53767460e-03  -1.72883295e-03  -3.89566552e-03   4.85864235e-03
   5.90517826e-04   4.30173194e-03   3.37816169e-03  -1.08716707e-03
   1.85196218e-03   1.94042712e-03   1.20989932e-03  -4.69703926e-03
  -5.35873650e-03  -1.35291950e-03  -4.62053996e-03   2.15436472e-03
   4.05823253e-03   8.01778078e-05  -3.84314684e-03   1.11574796e-03
  -4.36050585e-03  -3.31182266e-03  -2.15692003e-03  -2.09038518e-03
   4.50274721e-03  -1.85286190e-04  -5.09306230e-03  -1.12043330e-04
   8.25022871e-04   2.60405545e-03  -1.73542544e-03   5.14509249e-03
  -9.16058663e-04   1.01291772e-03  -7.90049613e-04   4.20650374e-03
  -3.00139328e-03   3.34924040e-03  -2.11520446e-03   4.79168072e-03
   2.11459701e-03  -3.07943812e-03  -5.09956060e-03  -2.34926818e-03
   7.30032055e-03  -5.31428820e-03  -2.96888268e-03   4.95154131e-03
   3.09590902e-03]
[(15, 0.2670447528362274), (14, 0.18831682205200195), (10, 0.07022987306118011)]
precomputing L2-norms of doc weight vectors

The doc2vec results are not very stable. It may be that I am not using it correctly, but I did not find much useful guidance in the official documentation either.
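If you want to experiment with making it more repeatable, the usual suspects are a fixed seed, a single worker and more training/inference passes; here is a sketch using the same old-style gensim parameters as above (even then, doc2vec is only fully deterministic if PYTHONHASHSEED is also pinned):

# a sketch of settings that tend to make the doc2vec results more repeatable
model = Doc2Vec(size=89, min_count=1, iter=50,  # more passes over such a tiny corpus
                seed=1, workers=1)              # fixed seed + single worker
model.build_vocab(corpora_documents)
model.train(corpora_documents)

# more inference passes for the query as well (newer gensim calls this parameter epochs)
inferred_vector = model.infer_vector(test_cut_raw_1, steps=50)
print(model.docvecs.most_similar([inferred_vector], topn=3))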

The documentation is here: https://radimrehurek.com/gensim/models/doc2vec.html

Method 3: LSH (for the theory behind LSH, see Baidu or your favourite search engine)

scikit-learn provides an LSH implementation (there are LSH implementations on GitHub as well); what scikit-learn offers is an LSH forest. The official description:

LSH Forest: Locality Sensitive Hashing forest [1] is an alternative method for vanilla approximate nearest neighbor search methods. LSH forest data structure has been implemented using sorted arrays and binary search and 32 bit fixed-length hashes. Random projection is used as the hash family which approximates cosine distance.
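The "random projection is used as the hash family which approximates cosine distance" part works like this: each hash bit is the sign of the document vector's dot product with a random direction, so two vectors with a small angle between them agree on most bits and land in the same buckets. A toy illustration in plain numpy, independent of scikit-learn:

# toy illustration of random-projection (cosine) LSH
import numpy as np

rng = np.random.RandomState(42)

def rp_hash(vec, planes):
    # hash a vector to one bit per random hyperplane: the sign of the projection
    return tuple((vec @ planes.T) > 0)

dim, n_bits = 100, 16
planes = rng.randn(n_bits, dim)   # the random projection directions
a = rng.randn(dim)
b = a + 0.1 * rng.randn(dim)      # b is close to a, so most hash bits should agree
c = rng.randn(dim)                # c is unrelated to a

print(sum(x == y for x, y in zip(rp_hash(a, planes), rp_hash(b, planes))))  # close to 16
print(sum(x == y for x, y in zip(rp_hash(a, planes), rp_hash(c, planes))))  # around 8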

The same test data is used again; the code is as follows:

# use LSH instead
# note: LSHForest was deprecated and later removed from scikit-learn, so this needs an older version
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())

distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)

The console output is below; it is basically consistent with docsim. Note that kneighbors returns (cosine) distances rather than similarities, so smaller means more similar, and the three nearest neighbours, 10, 15 and 14, are the same documents docsim found:

[[ 0.42264973  0.42264973  0.48875208]]
[[10 15 14]]

Those are the implementations I have found for comparing document similarity. In general, though, LSH is better suited to comparing short texts.
