Computing Term Frequency-Inverse Document Frequency (TF-IDF) for Documents

TF-IDF computation:

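TF-IDF reflects how important a word is to a document within a collection of documents, and it is often used as a weighting factor in text data mining and information retrieval. Within a given document, term frequency (TF) is the frequency with which a given term appears in that document. Inverse document frequency (IDF) measures how important a term is in general across the collection. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that quotient.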
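The definitions above can be written compactly. This is a sketch in notation introduced here, not taken from the original post: f_{t,d} is the number of times term t occurs in document d, N is the total number of documents, and n_t is the number of documents containing t. It assumes term frequency is normalized by document length and that the final weight is the product of the two factors.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

As a worked example under these assumptions, with the natural logarithm: a term that occurs 3 times in a 100-token document and appears in 10 of 1000 documents has tf = 0.03 and idf = ln(100) ≈ 4.605, so its weight is about 0.138.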

Related code:

    import java.util.List;
    import java.util.regex.Pattern;

    // Token delimiters: whitespace and common punctuation.
    // The "+" collapses runs of delimiters so that no empty tokens inflate the token count.
    private static Pattern r = Pattern.compile("[ \\t{}()\",:;.\\n]+");

    // The document collection; must be populated before any IDF is computed
    private static List<String> documentCollection;

    // Calculates the TF-IDF weight for term t in document d
    private static float findTFIDF(String document, String term)
    {
        float tf = findTermFrequency(document, term);
        float idf = findInverseDocumentFrequency(term);
        return tf * idf;
    }

    // Occurrences of the term divided by the number of tokens in the document
    private static float findTermFrequency(String document, String term)
    {
        int count = getFrequencyInOneDoc(document, term);
        return (float) count / (float) r.split(document).length;
    }

    // Counts case-insensitive occurrences of the term in one document
    private static int getFrequencyInOneDoc(String document, String term)
    {
        int count = 0;
        for (String s : r.split(document))
        {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return count;
    }

    private static float findInverseDocumentFrequency(String term)
    {
        // Count the documents in the collection that contain the term at least once
        int count = 0;
        for (String doc : documentCollection)
        {
            if (getFrequencyInOneDoc(doc, term) > 0) {
                count++;
            }
        }
        // A term that appears in no document gets IDF 0, which avoids dividing by zero.
        // An alternative is smoothing: Math.log(documentCollection.size() / (1.0 + count)).
        if (count == 0) {
            return 0;
        }
        // Log of the ratio of the total number of documents to the number of documents containing the term
        return (float) Math.log((float) documentCollection.size() / (float) count);
    }

Build a Vector Space Model of the documents and compute their cosine similarity.

Related code:

    public static float findCosineSimilarity(float[] vecA, float[] vecB)
    {
        float dotProduct = dotProduct(vecA, vecB);
        float magnitudeOfA = magnitude(vecA);
        float magnitudeOfB = magnitude(vecB);
        float result = dotProduct / (magnitudeOfA * magnitudeOfB);
        // When 0 is divided by 0 the result is NaN, so return 0 in that case.
        if (Float.isNaN(result))
            return 0;
        else
            return result;
    }

    // Assumes both vectors have the same length.
    public static float dotProduct(float[] vecA, float[] vecB)
    {
        float dotProduct = 0;
        for (int i = 0; i < vecA.length; i++)
        {
            dotProduct += (vecA[i] * vecB[i]);
        }
        return dotProduct;
    }

    // The magnitude of a vector is the square root of its dot product with itself.
    public static float magnitude(float[] vector)
    {
        return (float) Math.sqrt(dotProduct(vector, vector));
    }

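The cosine code takes two float arrays but the post does not show how a document becomes such an array. One possible way, offered only as a minimal sketch: fix a sorted vocabulary over the whole collection and give each document one TF-IDF component per vocabulary term. The class name VectorSpaceModelSketch and the methods tokenize, buildVocabulary, buildVectors, and cosine are hypothetical names introduced for illustration. The small cosine method is a local copy included so that the block runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper class (not from the original post): builds TF-IDF vectors.
class VectorSpaceModelSketch {

    // Lower-cases the text and splits on whitespace and common punctuation.
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String s : document.toLowerCase(Locale.ROOT).split("[ \\t{}()\",:;.\\n]+")) {
            if (!s.isEmpty()) {
                tokens.add(s);
            }
        }
        return tokens;
    }

    // Sorted list of distinct terms; a term's index is its vector dimension.
    static List<String> buildVocabulary(List<String> documents) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (String doc : documents) {
            vocabulary.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocabulary);
    }

    // Returns one TF-IDF vector per document, one component per vocabulary term.
    static float[][] buildVectors(List<String> documents, List<String> vocabulary) {
        int n = documents.size();
        List<List<String>> tokenized = new ArrayList<>();
        for (String doc : documents) {
            tokenized.add(tokenize(doc));
        }
        float[][] vectors = new float[n][vocabulary.size()];
        for (int j = 0; j < vocabulary.size(); j++) {
            String term = vocabulary.get(j);
            int docsWithTerm = 0;
            for (List<String> tokens : tokenized) {
                if (tokens.contains(term)) {
                    docsWithTerm++;
                }
            }
            // Vocabulary terms come from the documents themselves, so docsWithTerm >= 1.
            float idf = (float) Math.log((double) n / docsWithTerm);
            for (int i = 0; i < n; i++) {
                List<String> tokens = tokenized.get(i);
                int count = 0;
                for (String t : tokens) {
                    if (t.equals(term)) {
                        count++;
                    }
                }
                float tf = tokens.isEmpty() ? 0f : (float) count / tokens.size();
                vectors[i][j] = tf * idf;
            }
        }
        return vectors;
    }

    // Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
    static float cosine(float[] a, float[] b) {
        float dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        float result = dot / (float) Math.sqrt(normA * normB);
        return Float.isNaN(result) ? 0f : result;
    }
}
```

With this layout, two documents that share no terms have a dot product of 0 and therefore a cosine similarity of 0. A term that occurs in every document gets an IDF of 0 and contributes nothing to any vector.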
Notes

Stop-word filtering

Stop-word list:

ftp://ftp.cs.cornell.edu/pub/smart/english.stop
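Stop-word filtering would normally happen after tokenizing and before counting term frequencies. The following is a minimal sketch of that step, not the post's own code. The class StopWordFilterSketch, the method removeStopWords, and the tiny STOP_WORDS set are all hypothetical. The set is a placeholder for illustration and is not the contents of the list linked above, which would typically be loaded from a file with one word per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class (not from the original post): drops stop words from a token list.
class StopWordFilterSketch {

    // Placeholder set for illustration only; a real list would be loaded from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "and", "to", "in"));

    // Returns the tokens that are not stop words, compared case-insensitively.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Filtering before counting matters because term frequency is divided by the document's token count, so removing stop words changes the denominator as well as the vocabulary.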

For more on TF-IDF, see:

Link: http://en.wikipedia.org/wiki/Tf*idf
