SVD Document Clustering in Python: Singular Value Decomposition, Document Similarity, and LSI (Latent Semantic Analysis)

Please credit the source when reposting: UESTC EClab, 落葉花開, http://www.cnblogs.com/nlp-yekai/p/3848528.html

SVD, singular value decomposition, is used in natural language processing for latent semantic analysis, known as LSI or LSA. It was first described in the paper:

An introduction to latent semantic analysis

I pulled together material on SVD from several expert blogs, then wrote my own Python version, which I'm sharing here.

For an explanation of SVD itself, see this blog post:

Posted by LeftNotEasy at http://leftnoteasy.cnblogs.com. The article may be reposted in full or in part, but please credit the source; for questions, contact wheeleast@gmail.com.

 

Python's extension packages numpy and scipy can both compute the SVD. I wrote a program based on numpy that runs SVD over a document set: each document is vectorized first, then the SVD of the vectorized collection is computed, and the resulting matrix U is analyzed. Here is the code:

#coding=utf-8
import numpy as np

def f_file_open(trace_string):
    """Open the document set (one document per line); return the lines as a list."""
    f = open(trace_string, 'r')
    txt = f.readlines()
    f.close()
    return txt

def f_vector_found(txt):
    """Collect every distinct word in the document set (build the word space)."""
    word_list = []
    for line in txt:
        for word in line.split():
            if word not in word_list:
                word_list.append(word)
    return word_list

def f_document_vector(document, word_list):
    """Turn one document into a term-count vector over word_list."""
    document_clean = document.split()
    return [document_clean.count(word) for word in word_list]

def f_svd_calculate(document_array):
    """Compute the SVD and return the three matrices."""
    U, S, V = np.linalg.svd(document_array)
    return (U, S, V)

def f_process_matric_U(matric_U, Save_N_Singular_value):
    """Truncate U by the first N singular values: keep the first N columns."""
    document_matric_U = []
    for line in matric_U:
        document_matric_U.append(line[:Save_N_Singular_value])
    return document_matric_U

def f_process_matric_S(matric_S, Save_information_value):
    """Keep the largest singular values until the requested fraction of information is retained."""
    matricS_new = []
    S_self = 0
    N_count = 0
    Threshold = sum(matric_S) * float(Save_information_value)
    for value in matric_S:              # singular values are returned in descending order
        if S_self <= Threshold:
            matricS_new.append(value)
            S_self += value
            N_count += 1
        else:
            break
    print("the %d largest singular values keep %s of the information" % (N_count, Save_information_value))
    return (N_count, matricS_new)

def f_process_matric_V(matric_V, Save_N_Singular_value):
    """Truncate V by the first N singular values: keep the first N rows."""
    return matric_V[:Save_N_Singular_value]

def f_combine_U_S_V(matric_u, matric_s, matric_v):
    """Rebuild the document matrix from the truncated factors."""
    return np.dot(np.dot(matric_u, np.diag(matric_s)), matric_v)

def f_matric_to_document(document_matric, word_list_self):
    """Turn the reconstructed matrix back into documents, keeping words above a threshold."""
    new_document = []
    for line in document_matric:
        count = 0
        for word in line:
            if float(word) >= 0.9:      # threshold for keeping a word in the rebuilt document
                new_document.append(word_list_self[count] + " ")
            count += 1
        new_document.append("\n")
    return new_document

def f_save_file(trace, document):
    f = open(trace, 'a')
    for line in document:
        for word in line:
            f.write(word)
    f.close()                           # close so the output is flushed to disk

trace_open = "/home/alber/experiment/test.txt"
trace_save = "/home/alber/experiment/20140715/svd_result1.txt"
txt = f_file_open(trace_open)
word_vector = f_vector_found(txt)
print(len(word_vector))

document = []                           # transform the document set into a matrix
for line in txt:
    document.append(f_document_vector(line, word_vector))
print(len(document))

U, S, V = f_svd_calculate(document)
print(sum(S))
N_count, document_matric_S = f_process_matric_S(S, 0.9)
document_matric_U = f_process_matric_U(U, N_count)
document_matric_V = f_process_matric_V(V, N_count)
print(len(document_matric_U[1]))
print(len(document_matric_V))
new_document_matric = f_combine_U_S_V(document_matric_U, document_matric_S, document_matric_V)
print(sorted(new_document_matric[1], reverse=True))
new_document = f_matric_to_document(new_document_matric, word_vector)
f_save_file(trace_save, new_document)
print("the new document has been saved in %s" % trace_save)
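As a quick sanity check on what np.linalg.svd returns (a standalone sketch, independent of the program above): with full_matrices left at its default, an m-by-n matrix yields a square U, the singular values in descending order, and a square V, and the original matrix is recovered from the first len(S) rows of V:

```python
import numpy as np

A = np.array([[1., 0., 1.],
              [0., 1., 1.]])          # 2 documents x 3 words
U, S, Vt = np.linalg.svd(A)           # S comes back sorted in descending order
print(U.shape, S.shape, Vt.shape)     # (2, 2) (2,) (3, 3)
A_rebuilt = U @ np.diag(S) @ Vt[:len(S)]
print(np.allclose(A, A_rebuilt))      # True
```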

 

The reconstructed vector for the first document comes out as follows (sorted descending, not listed in full):

[1.0557039715196566, 1.0302828340480468, 1.0177955652284856, 1.0059864028992798, 0.99050787479103541, 0.93109816291875147, 0.70360233131357808, 0.22614603502510683, 0.10577134907675778, 0.098346889985350489, 0.091221506093784849, 0.085227549911874326, 0.052355994530275715, 0.049805639460153352, 0.046430974364203001, 0.046430974364203001, 0.045655634442695908, 0.043471974743277547, 0.041953839699628029, 0.041483792741663243, 0.039635143169293147, 0.03681955156197822, 0.034893319065413916, 0.0331697465114036, 0.029874818442883051, 0.029874818442883051, 0.028506042937487715, 0.028506042937487715, 0.027724455461901349, 0.026160357130229708, 0.023821284531034687, 0.023821284531034687, 0.017212073571417009, 0.016793815602261938, 0.016793815602261938, 0.016726955476865021, 0.015012207148054771, 0.013657280765244915...

Given a result like this, we can analyze the decomposed matrix. The larger a value, the more the word at that position contributes to the document; words with very small values carry little meaning. The next step is therefore to set a threshold and keep only each document's feature words. The threshold can be chosen in several ways; one is to sort all the values and take the elbow point, as in the figure below (not plotted from the result above):

[Figure omitted: left panel, sorted word weights; right panel, singular value spectrum]

Clearly, only the values before the elbow contribute much to the document, while the values after it fall to roughly zero. In this way, the document-word matrix is reduced in dimension through SVD.
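For readers who want to automate the elbow choice rather than eyeball it, one common heuristic (an illustrative sketch, not part of the program above; `elbow_index` is a name I made up) is to take the point with the largest discrete second difference of the descending-sorted values:

```python
import numpy as np

def elbow_index(values):
    """Number of values to keep: the elbow of a descending-sorted sequence,
    approximated as the point of largest discrete second difference."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    second_diff = np.diff(v, n=2)            # v[i] - 2*v[i+1] + v[i+2]
    return int(np.argmax(second_diff) + 1)   # +1 because diff shifts indices

vals = [1.05, 1.03, 1.01, 0.99, 0.70, 0.10, 0.05, 0.04]
print(elbow_index(vals))                     # keeps the values before the flat tail
```

This is only a first-order approximation of "maximum curvature"; on noisy weight vectors it may help to smooth the sequence before differencing.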

There are two manually chosen parameters in this process. The first is the number of singular values to keep: as the figure above (right panel) shows, the singular values fall off quickly, and the first N already capture most of the information in the matrix. In my program, the number kept is controlled by the fraction of information to retain (90%, 95%, and so on).
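The loop in f_process_matric_S can also be written as a couple of vectorized numpy lines (a minimal sketch; `choose_k` is a hypothetical helper name, it measures "information" as the plain sum of singular values just as the program does, and its boundary handling may differ from the loop by one value when the threshold is met exactly):

```python
import numpy as np

def choose_k(singular_values, ratio=0.9):
    """Smallest k such that the first k singular values sum to >= ratio of the total."""
    s = np.asarray(singular_values, dtype=float)   # np.linalg.svd returns these descending
    kept = np.cumsum(s) / s.sum()                  # fraction retained by the first k values
    return int(np.searchsorted(kept, ratio) + 1)

S = np.array([5.0, 3.0, 1.0, 0.5, 0.5])
print(choose_k(S, 0.9))   # 3, since 5 + 3 + 1 = 9 is 90% of the total 10
```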

The other parameter concerns the figure above (left panel): when the reconstructed matrix replaces the original document matrix, the words have to be selected, and as noted above, taking the elbow value is one way to do it.

That is essentially all there is to the SVD decomposition of a word-document matrix. Corrections and comments are welcome.
