from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Notes:
# 1. Two classes are used: CountVectorizer() and TfidfTransformer().
# 2. CountVectorizer.fit_transform() converts the texts into a term-frequency matrix:
#    1) element weight[i][j] is the number of times term j occurs in document i;
#    2) get_feature_names() lists every term, and toarray() shows the term-frequency matrix
#       (on scikit-learn >= 1.2 use get_feature_names_out() instead).
# 3. TfidfTransformer also has a fit_transform() method; it computes the tf-idf values.
# Purpose of the test:
# 1) check whether transforming the test documents one at a time with CountVectorizer gives
#    the same result as transforming them together as one list,
# 2) i.e. whether the result for test matches the results for test_a and test_b,
# 3) i.e. whether the way the data is batched affects the tf-idf computed by
#    TfidfTransformer and CountVectorizer.
# Conclusions:
# 1) as long as train and test go through the same fitted CountVectorizer (the same
#    vocabulary), transforming documents one at a time and as one list gives identical results;
# 2) so that the test set can reuse the training vocabulary, the CountVectorizer must be
#    saved together with the model.

train = ['This is the first document.', 'This is the second second document.']
test = ['And the third one.', 'Is this the first document?']
test_a = ['And the third one.']
test_b = ['Is this the first document?']

vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()

# Note: once vectorizer.fit_transform() has been called, the vocabulary is fixed.
count_train = vectorizer.fit_transform(train)
print('count:')
print(vectorizer.vocabulary_)
print('feature_names:')
print(vectorizer.get_feature_names())
print(count_train.toarray())

tfidf = tfidftransformer.fit_transform(count_train)
train_weight = tfidf.toarray()
print(tfidf.shape)
print(train_weight)

# Note: test, test_a and test_b are transformed with the vocabulary fixed above.
count_test = vectorizer.transform(test)
count_test_a = vectorizer.transform(test_a)
count_test_b = vectorizer.transform(test_b)
print('count_train:')
print(vectorizer.get_feature_names())
print('Term-frequency matrices compared:')
print(count_test.toarray())
print(count_test_a.toarray())
print(count_test_b.toarray())

test_tfidf = tfidftransformer.transform(count_test)
test_weight = test_tfidf.toarray()
test_weight_a = tfidftransformer.transform(count_test_a).toarray()
test_weight_b = tfidftransformer.transform(count_test_b).toarray()
print('tf-idf compared:')
print(test_weight)
print(test_weight_a)
print(test_weight_b)
The output is as follows:
count:
{'first': 1, 'the': 4, 'is': 2, 'second': 3, 'this': 5, 'document': 0}
feature_names:
['document', 'first', 'is', 'second', 'the', 'this']
[[1 1 1 0 1 1]
[1 0 1 2 1 1]]
(2, 6)
[[0.4090901 0.57496187 0.4090901 0. 0.4090901 0.4090901 ]
[0.28986934 0. 0.28986934 0.81480247 0.28986934 0.28986934]]
count_train:
['document', 'first', 'is', 'second', 'the', 'this']
Term-frequency matrices compared:
[[0 0 0 0 1 0]
[1 1 1 0 1 1]]
[[0 0 0 0 1 0]]
[[1 1 1 0 1 1]]
tf-idf compared:
[[0. 0. 0. 0. 1. 0. ]
[0.4090901 0.57496187 0.4090901 0. 0.4090901 0.4090901 ]]
[[0. 0. 0. 0. 1. 0.]]
[[0.4090901 0.57496187 0.4090901 0. 0.4090901 0.4090901 ]]
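
To make conclusion 1 explicit, the visual comparison above can be turned into an assertion. This is a minimal sketch, assuming it runs right after the script above (it reuses test_weight, test_weight_a and test_weight_b); the only addition is NumPy.

import numpy as np

# Stack the two single-document results in the same order as the documents in `test`,
# then compare with the batch result. Because the same fitted CountVectorizer and
# TfidfTransformer are used in both cases, the rows are identical.
single_weight = np.vstack([test_weight_a, test_weight_b])
assert np.allclose(test_weight, single_weight)
print('batch transform == per-document transform:', np.allclose(test_weight, single_weight))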
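
Conclusion 2 says the fitted CountVectorizer has to be stored alongside the model so that, at prediction time, new documents are mapped onto the training vocabulary. A minimal sketch of that, assuming joblib is available; the file names tfidf_vectorizer.pkl and tfidf_transformer.pkl are just placeholders.

import joblib

# Persist the fitted objects: the vocabulary lives inside the CountVectorizer,
# the idf weights inside the TfidfTransformer.
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(tfidftransformer, 'tfidf_transformer.pkl')

# At prediction time, load them back and only call transform() (never fit_transform()),
# so new documents reuse the training vocabulary and idf weights.
loaded_vectorizer = joblib.load('tfidf_vectorizer.pkl')
loaded_transformer = joblib.load('tfidf_transformer.pkl')
new_weight = loaded_transformer.transform(loaded_vectorizer.transform(test_b)).toarray()
print(new_weight)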