Document Similarity

1. Environment

Single machine, Windows, Python 3.6, gensim module

References:

https://pypi.org/project/gensim/

https://radimrehurek.com/gensim/

https://www.jianshu.com/p/6e07729c6c5b

2. Installing gensim  https://pypi.org/project/gensim/

gensim can usually be installed directly with: pip install -U gensim

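If the machine has no network access, download the corresponding package and install it offline (its dependencies will be missing, so they have to be downloaded and installed one by one).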

3. Computing document similarity with gensim  https://radimrehurek.com/gensim/similarities/docsim.html

3.1 Cosine similarity

 a) gensim.similarities.docsim.MatrixSimilarity (dense matrix, computed in memory)

 b) gensim.similarities.docsim.Similarity (computed dynamically; use it when the data set is too large for MatrixSimilarity or SparseMatrixSimilarity to handle)

 c) gensim.similarities.docsim.SparseMatrixSimilarity (sparse vector input, computed in memory)
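To make clear what these index classes measure, here is a minimal plain-Python sketch of cosine similarity over raw term counts. The function name cosine_similarity is chosen for illustration and is not part of gensim; the gensim classes operate on vectorized corpora and are far more efficient.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words the two documents share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    # Euclidean length of each count vector
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


text = [["love", "China"], ["weather", "sunny"]]
same = cosine_similarity(text[0], text[0])        # identical documents
different = cosine_similarity(text[0], text[1])   # no shared words
partial = cosine_similarity(["love", "China"], ["love", "sunny"])  # one shared word
```

Identical documents score close to 1, documents with no words in common score 0, and partial overlap falls in between.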

3.2 WMD (Word Mover's Distance) similarity

 gensim.similarities.docsim.WmdSimilarity  

 

4. Simple code

Input data (text): the documents after tokenization, each one a list of words, e.g. [["love","China"], ["weather", "sunny"]]
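As a rough illustration of how raw strings could be turned into this format, the sketch below splits on whitespace. The raw_docs list is hypothetical sample data; for Chinese text a word segmenter would be needed instead of split().

```python
# Hypothetical raw documents, one string per document
raw_docs = ["love China", "weather sunny"]

# Whitespace tokenization: each document becomes a list of words
text = [doc.split() for doc in raw_docs]
```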

from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity, Similarity, MatrixSimilarity, SparseMatrixSimilarity

from gensim import corpora, models
# Input documents (already tokenized)
text = [["love","China"], ["weather", "sunny"]]

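# Convert the similarity index into a list.
# Iterating over an index yields one array of similarities per indexed document.
def index2list(index):
    doc_sim_list = []
    for s in index:
        try:
            doc_sim_list.append(s)
        except Exception:
            print("there is something wrong at index : {0}".format(s))
    return doc_sim_list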

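## WmdSimilarity
# Train a word-vector model
model = Word2Vec(text, min_count=1)
# Compute WmdSimilarity
# (in gensim 4.x, pass the KeyedVectors, model.wv, instead of the model)
index = WmdSimilarity(text, model)
doc_sim_list = index2list(index)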


## Cosine similarity
# Build the word dictionary
dictionary = corpora.Dictionary(text)
# Convert each document into a bag-of-words vector
corpus = [dictionary.doc2bow(t) for t in text]

# SparseMatrixSimilarity (sparse vectors, in memory)
index = SparseMatrixSimilarity(corpus, num_features=len(dictionary))
doc_sim_list = index2list(index)

# MatrixSimilarity (dense matrix, in memory)
index = MatrixSimilarity(corpus, num_features=len(dictionary))
doc_sim_list = index2list(index)


# Similarity
# Compute TF-IDF weights
tfidf_model = models.TfidfModel(corpus)

tfidf = tfidf_model[corpus]

# The first argument is the file-name prefix used for the index files on disk
index = Similarity("Similarity-index", tfidf, num_features=len(dictionary))
doc_sim_list = index2list(index)
