Please credit the source when reposting: University of Electronic Science and Technology of China, EClab, 落葉花開, http://www.cnblogs.com/nlp-yekai/p/3848528.html
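SVD (singular value decomposition) is used in natural language processing for latent semantic analysis, known as LSI or LSA. Its earliest use is described in the article referenced in the original post.
I gathered material on SVD from the blogs of several experts, then wrote my own Python version, which I am posting here to share.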
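For an explanation of SVD itself, see the following blog:
This article was published by LeftNotEasy at http://leftnoteasy.cnblogs.com. It may be reposted in full or used in part, provided the source is credited. For questions, contact wheeleast@gmail.com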
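Both numpy and scipy, two Python extension packages, can compute the SVD. I wrote a program based on numpy that applies SVD to a set of documents. Each document is first turned into a vector, SVD is then applied to the set of vectorized documents, and the resulting matrix U is taken for analysis. The code comes first: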
    #coding=utf-8
    import numpy as np

    def f_file_open(trace_string):
        """Open the document set and return its lines as a list."""
        with open(trace_string, 'r') as f:
            txt = f.readlines()
        return txt

    def f_vector_found(txt):
        """Collect every distinct word in the document set (build the word space)."""
        word_list = []
        for line in txt:
            line_clean = line.split()
            for word in line_clean:
                if word not in word_list:
                    word_list.append(word)
        return word_list

    def f_document_vector(document, word_list):
        """Turn one document into a word-count vector over word_list."""
        vector = []
        document_clean = document.split()
        for word in word_list:
            a = document_clean.count(word)
            vector.append(a)
        return vector

    def f_svd_calculate(document_array):
        """Compute the SVD and return the three matrices."""
        # np.linalg.svd returns the third factor already transposed (V^T)
        U, S, V = np.linalg.svd(document_array)
        return (U, S, V)

    def f_process_matric_U(matric_U, Save_N_Singular_value):
        """Keep the first N columns of U, matching the N retained singular values."""
        document_matric_U = []
        for line in matric_U:
            line_new = line[:Save_N_Singular_value]
            document_matric_U.append(line_new)
        return document_matric_U

    def f_process_matric_S(matric_S, Save_information_value):
        """Pick how many of the largest singular values to keep for the requested share of information."""
        matricS_new = []
        S_self = 0
        N_count = 0
        Threshold = sum(matric_S) * float(Save_information_value)
        for value in matric_S:
            # keep adding values until the running sum passes the threshold
            if S_self <= Threshold:
                matricS_new.append(value)
                S_self += value
                N_count += 1
            else:
                break
        print("the %d largest singular values keep the %s information " % (N_count, Save_information_value))
        return (N_count, matricS_new)

    def f_process_matric_V(matric_V, Save_N_Singular_value):
        """Keep the first N rows of V, matching the N retained singular values."""
        document_matric_V = matric_V[:Save_N_Singular_value]
        return document_matric_V

    def f_combine_U_S_V(matric_u, matric_s, matirc_v):
        """Recompute the document matrix from the truncated U, S and V."""
        new_document_matric = np.dot(np.dot(matric_u, np.diag(matric_s)), matirc_v)
        return new_document_matric

    def f_matric_to_document(document_matric, word_list_self):
        """Turn the reconstructed matrix back into documents."""
        new_document = []
        for line in document_matric:
            count = 0
            for word in line:
                if float(word) >= 0.9:  # threshold for keeping a word in the converted document
                    new_document.append(word_list_self[count] + " ")
                count += 1
            new_document.append("\n")
        return new_document

    def f_save_file(trace, document):
        """Append the converted documents to the file at trace."""
        with open(trace, 'a') as f:
            for line in document:
                f.write(line)

    trace_open = "/home/alber/experiment/test.txt"
    trace_save = "/home/alber/experiment/20140715/svd_result1.txt"
    txt = f_file_open(trace_open)
    word_vector = f_vector_found(txt)
    print(len(word_vector))

    document = []
    for line in txt:  # transform the document set into a matrix
        document_vector = f_document_vector(line, word_vector)
        document.append(document_vector)
    print(len(document))
    U, S, V = f_svd_calculate(document)
    print(sum(S))
    N_count, document_matric_S = f_process_matric_S(S, 0.9)
    document_matric_U = f_process_matric_U(U, N_count)
    document_matric_V = f_process_matric_V(V, N_count)
    print(len(document_matric_U[1]))
    print(len(document_matric_V))
    new_document_matric = f_combine_U_S_V(document_matric_U, document_matric_S, document_matric_V)
    print(sorted(new_document_matric[1], reverse=True))
    new_document = f_matric_to_document(new_document_matric, word_vector)
    f_save_file(trace_save, new_document)
    print("the new document has been saved in %s" % trace_save)
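The truncation and recombination steps can also be written with array slicing. The following is only a minimal sketch, not a replacement for the program: `truncated_reconstruction` is a hypothetical helper name, the toy matrix is made-up data, and `full_matrices=False` is my own choice so that numpy returns the thin SVD.

```python
import numpy as np

def truncated_reconstruction(doc_term, k):
    """Rebuild doc_term from its k largest singular values (hypothetical helper)."""
    # Thin SVD: U is m x r, Vh is r x n, with r = min(m, n)
    U, S, Vh = np.linalg.svd(np.asarray(doc_term, dtype=float), full_matrices=False)
    # Keep the first k columns of U, the first k singular values, the first k rows of Vh
    return U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]

# Toy document-word count matrix (made-up data): 3 documents, 4 words
toy = np.array([[2, 1, 0, 0],
                [1, 2, 0, 0],
                [0, 0, 1, 1]])
approx = truncated_reconstruction(toy, 2)
print(np.round(approx, 3))
```

With k equal to the full rank the product reproduces the input matrix; with a smaller k the smallest singular directions are dropped, which is the same effect the slicing functions in the program aim for.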
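The reconstructed vector that the program prints for one document (row index 1 of new_document_matric, i.e. the second document) is shown below (sorted, truncated):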
[1.0557039715196566, 1.0302828340480468, 1.0177955652284856, 1.0059864028992798, 0.99050787479103541, 0.93109816291875147, 0.70360233131357808, 0.22614603502510683, 0.10577134907675778, 0.098346889985350489, 0.091221506093784849, 0.085227549911874326, 0.052355994530275715, 0.049805639460153352, 0.046430974364203001, 0.046430974364203001, 0.045655634442695908, 0.043471974743277547, 0.041953839699628029, 0.041483792741663243, 0.039635143169293147, 0.03681955156197822, 0.034893319065413916, 0.0331697465114036, 0.029874818442883051, 0.029874818442883051, 0.028506042937487715, 0.028506042937487715, 0.027724455461901349, 0.026160357130229708, 0.023821284531034687, 0.023821284531034687, 0.017212073571417009, 0.016793815602261938, 0.016793815602261938, 0.016726955476865021, 0.015012207148054771, 0.013657280765244915。。。。。
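Given such a result, the reconstructed matrix needs to be analyzed. As the output above shows, the larger a value, the more the word at that position contributes to the document, and the smaller the value, the less the word means for it. The next step is therefore to set a threshold and pick the feature words of each document. There are many ways to set the threshold; one is to sort all the values and take the inflection point. See the figure (not produced from the result above):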
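Clearly, only the values before the inflection point contribute much to the document, and the values after the inflection point are set to 0. In this way the dimension of a document-word matrix is reduced through SVD.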
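One way to make the inflection-point idea concrete is to cut at the largest drop between neighbouring sorted values. This is only a sketch under that assumption: the largest-gap rule and the helper names `knee_threshold` and `apply_threshold` are mine, and the sample row uses a few values rounded from the output shown earlier.

```python
import numpy as np

def knee_threshold(values):
    """Return the smallest value kept when cutting at the largest drop.

    Hypothetical helper: the largest-gap rule is an assumption,
    one of many possible ways to locate the inflection point.
    """
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    if ordered.size == 0:
        return 0.0
    if ordered.size == 1:
        return float(ordered[0])
    gaps = ordered[:-1] - ordered[1:]   # drop between neighbouring values
    i = int(np.argmax(gaps))            # position of the largest drop
    return float(ordered[i])            # last value before that drop

def apply_threshold(row, cutoff):
    """Set every entry below cutoff to 0 and keep the rest."""
    row = np.asarray(row, dtype=float)
    return np.where(row >= cutoff, row, 0.0)

# A few values rounded from the sorted output above, in arbitrary order
row = [1.056, 0.05, 1.03, 0.226, 0.99, 0.931, 0.704, 0.106]
cutoff = knee_threshold(row)
print(cutoff, apply_threshold(row, cutoff))
```

Note that the program itself uses a fixed cutoff of 0.9 in f_matric_to_document; a data-driven cutoff like this one would keep a different set of words.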
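This process has two manually set parameters. The first is the choice of singular values. As the figure above (right) shows, the singular values fall off quickly, and the first N of them already carry most of the information in the whole matrix. In my program, the number of singular values is controlled by setting the share of information to keep (90%, 95%, or some other value).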
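The selection of N from a retained-information ratio can be sketched with a cumulative sum. This is an illustrative version, not the program's own loop: `count_for_ratio` is a hypothetical name, the singular values are made up, and it uses the plain sum of singular values as the measure of information, as the program does.

```python
import numpy as np

def count_for_ratio(singular_values, ratio):
    """Smallest N whose leading singular values reach `ratio` of their total sum."""
    s = np.asarray(singular_values, dtype=float)
    cumulative = np.cumsum(s) / s.sum()   # share of the total kept by the first i+1 values
    # Index of the first share that reaches ratio, converted to a count
    n = int(np.searchsorted(cumulative, ratio)) + 1
    return min(n, s.size)                 # never ask for more values than exist

S = np.array([6.0, 2.5, 1.0, 0.3, 0.2])   # made-up values, sorted in descending order
print(count_for_ratio(S, 0.9))
```

Apart from how ties at the threshold are handled, this should pick the same N as the loop in f_process_matric_S.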
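The second parameter relates to the figure above (left). For the reconstructed matrix to stand in for the original document matrix, words have to be selected from it. As mentioned above, cutting at the inflection point is one way to do that.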
That is basically all there is to the SVD of a word-document matrix. Corrections and comments are welcome.