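TF-IDF calculation:
TF-IDF reflects how important a word is to a document within a document collection, and it is often used as a weighting factor in text data mining and information retrieval. Within a given document, term frequency (TF) is how often a given term occurs in that document. Inverse document frequency (IDF) measures the general importance of a term across the collection. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of that quotient.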
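The definitions above can be written compactly as follows. The symbols are notation assumed here for illustration, not taken from the original: n_{t,d} is the number of times term t occurs in document d, and D is the document collection.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{|D|}{|\{\, d \in D : t \in d \,\}|}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

For example, if a term occurs 3 times in a 100-word document, then tf = 0.03. If 10 out of 1000 documents contain the term, then with the natural logarithm (which is what Java's Math.log computes) idf = ln(100) ≈ 4.605, so the TF-IDF weight is roughly 0.138.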
Relevant code:
    import java.util.List;
    import java.util.regex.Pattern;

    // Delimiter characters used to split a document into terms.
    private static Pattern r = Pattern.compile("([ \\t{}()\",:;. \n])");
    private static List<String> documentCollection;

    // Calculates the TF-IDF weight for term t in document d.
    private static float findTFIDF(String document, String term) {
        float tf = findTermFrequency(document, term);
        float idf = findInverseDocumentFrequency(term);
        return tf * idf;
    }

    // Occurrences of the term divided by the number of tokens in the document.
    private static float findTermFrequency(String document, String term) {
        int count = getFrequencyInOneDoc(document, term);
        return (float) count / (float) r.split(document).length;
    }

    // Counts case-insensitive occurrences of the term in one document.
    private static int getFrequencyInOneDoc(String document, String term) {
        int count = 0;
        for (String s : r.split(document)) {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return count;
    }

    private static float findInverseDocumentFrequency(String term) {
        // Count the documents in the collection that contain the term
        // (each document counts once, however often the term occurs in it).
        int count = 0;
        for (String doc : documentCollection) {
            if (getFrequencyInOneDoc(doc, term) > 0) {
                count++;
            }
        }
        // Avoid dividing by zero when no document contains the term.
        // A smoothed alternative is Math.log(documentCollection.size() / (1.0 + count)).
        if (count == 0) {
            return 0;
        }
        // Log of the ratio of total documents to documents containing the term.
        return (float) Math.log((float) documentCollection.size() / (float) count);
    }

Build a vector space model (Vector Space Model) of the documents and compute the cosine similarity.
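The original does not show how documents become vectors, while findCosineSimilarity expects each document as a float[]. A minimal sketch of that step might look like the following. The class TfIdfVectors and its method names are hypothetical, and the tokenizer is an assumption: it splits on runs of the same delimiter characters and lower-cases each token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; names are illustrative, not from the original post.
class TfIdfVectors {

    // Splits on runs of delimiter characters and lower-cases each token.
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String s : document.split("[ \\t{}()\",:;.\\n]+")) {
            if (!s.isEmpty()) {
                tokens.add(s.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Collects every distinct term in the collection, in first-seen order.
    static List<String> buildVocabulary(List<String> documents) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String doc : documents) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // Returns one TF-IDF vector per document; component j belongs to vocabulary term j.
    static float[][] vectorize(List<String> documents) {
        List<String> vocab = buildVocabulary(documents);
        List<List<String>> tokenized = new ArrayList<>();
        for (String doc : documents) {
            tokenized.add(tokenize(doc));
        }

        float[][] vectors = new float[documents.size()][vocab.size()];
        for (int j = 0; j < vocab.size(); j++) {
            String term = vocab.get(j);

            // Number of documents containing the term (at least 1, since it is in the vocabulary).
            int docsWithTerm = 0;
            for (List<String> tokens : tokenized) {
                if (tokens.contains(term)) {
                    docsWithTerm++;
                }
            }
            float idf = (float) Math.log((double) documents.size() / docsWithTerm);

            for (int i = 0; i < documents.size(); i++) {
                List<String> tokens = tokenized.get(i);
                if (tokens.isEmpty()) {
                    continue; // an empty document keeps an all-zero vector
                }
                int count = Collections.frequency(tokens, term);
                vectors[i][j] = idf * count / tokens.size();
            }
        }
        return vectors;
    }
}
```

Because every vector shares the same vocabulary order, any two rows of the result have equal length and can be passed straight to findCosineSimilarity. A term that occurs in every document gets an IDF of zero and so contributes nothing to the similarity.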
Relevant code:
    // Cosine similarity of two vectors of equal length.
    public static float findCosineSimilarity(float[] vecA, float[] vecB) {
        float dotProduct = dotProduct(vecA, vecB);
        float magnitudeOfA = magnitude(vecA);
        float magnitudeOfB = magnitude(vecB);
        float result = dotProduct / (magnitudeOfA * magnitudeOfB);
        // 0 divided by 0 yields NaN (at least one vector is all zeros); return 0 in that case.
        if (Float.isNaN(result)) {
            return 0;
        }
        return result;
    }

    // Sum of the pairwise products; assumes vecB is at least as long as vecA.
    public static float dotProduct(float[] vecA, float[] vecB) {
        float dotProduct = 0;
        for (int i = 0; i < vecA.length; i++) {
            dotProduct += (vecA[i] * vecB[i]);
        }
        return dotProduct;
    }

    // The magnitude of a vector is the square root of its dot product with itself.
    public static float magnitude(float[] vector) {
        return (float) Math.sqrt(dotProduct(vector, vector));
    }

Notes:
Stop-word filtering
Stop-word list:
ftp://ftp.cs.cornell.edu/pub/smart/english.stop
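Stop-word filtering can be applied to the token list before any term frequencies are computed. The sketch below is one possible approach, not the original author's code. The class StopWordFilter and its methods are hypothetical, and fromFile assumes a local copy of the list saved with one word per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; names are illustrative, not from the original post.
class StopWordFilter {
    private final Set<String> stopWords = new HashSet<>();

    // Builds the filter from a sequence of words; blank entries are ignored.
    StopWordFilter(Iterable<String> words) {
        for (String w : words) {
            String trimmed = w.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                stopWords.add(trimmed);
            }
        }
    }

    // Assumption: the stop-word list has been downloaded to a local file, one word per line.
    static StopWordFilter fromFile(Path path) throws IOException {
        return new StopWordFilter(Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    // Returns the tokens that are not stop words, compared case-insensitively.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t.toLowerCase(Locale.ROOT))) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

Filtering before counting matters because TF divides by the document's token count: removing very common words changes the denominator as well as the vocabulary.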
For more on TF-IDF, see here: