TF-IDF is a widely used statistical method in NLP for evaluating how important a word is to one document within a document collection or corpus. It is commonly used to extract features from text, i.e. keywords. A word's importance increases proportionally with the number of times it appears in the document, but is offset by how frequently the word appears across the corpus.
In NLP, TF-IDF is computed as follows:

tf-idf = tf × idf
Here, tf is the term frequency and idf is the inverse document frequency.
tf is the term frequency, i.e. how often a word occurs in a document. If a word appears i times in a document that contains N words in total, then tf = i/N.
idf is the inverse document frequency. If the corpus contains n documents and a word appears in k of them, then

idf = log₂(n / k)
Of course, the exact idf formula varies slightly from place to place. Some versions add 1 to the denominator k to prevent division by zero, and others add 1 to both the numerator and the denominator, which is a smoothing technique. This article sticks with the original idf formula, since it matches the one used in gensim.
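For reference, these variants can be sketched as follows. This is a minimal illustration: only `idf_raw` is the formula used in this article; the other two are the smoothed alternatives mentioned above.

```python
import math

def idf_raw(n, k):
    # the original form used in this article: log2(n / k)
    return math.log2(n / k)

def idf_denom_plus_one(n, k):
    # add 1 to the denominator so k == 0 cannot cause division by zero
    return math.log2(n / (k + 1))

def idf_smooth(n, k):
    # add 1 to both numerator and denominator (smoothing)
    return math.log2((n + 1) / (k + 1))

# with 3 documents and a word appearing in 1 of them:
print(idf_raw(3, 1))             # log2(3) ≈ 1.585
print(idf_denom_plus_one(3, 1))  # log2(3/2) ≈ 0.585
print(idf_smooth(3, 1))          # log2(4/2) = 1.0
```

Note how the smoothed variants shrink the idf of rare words slightly; for large corpora the difference is negligible.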
Assume the corpus contains D documents. Then the tf-idf value of word i in document j is

tf-idf(i, j) = tf(i, j) × idf(i) = (n(i, j) / N(j)) × log₂(D / d(i)),

where n(i, j) is the number of occurrences of word i in document j, N(j) is the total number of words in document j, and d(i) is the number of documents that contain word i.
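As a quick sanity check of the formula, here is a toy calculation in Python (the numbers are made up for illustration, not taken from the example texts below):

```python
import math

# hypothetical corpus statistics
D = 3      # documents in the corpus
d_i = 1    # documents containing word i
n_ij = 5   # occurrences of word i in document j
N_j = 100  # total words in document j

tf_ij = n_ij / N_j          # 0.05
idf_i = math.log2(D / d_i)  # log2(3) ≈ 1.585
tfidf_ij = tf_ij * idf_i
print(round(tfidf_ij, 5))   # 0.07925
```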
That is all there is to computing TF-IDF.
We will use the following three sample texts:
text1 = """
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""
These three passages introduce football, basketball, and volleyball respectively; together they form our document collection.
Next comes text preprocessing.
First we strip newline characters from the text, then split it into sentences and tokens, and finally remove punctuation. The complete Python code is as follows; the input parameter is the article text:
import nltk
import string

# Text preprocessing
# Function: split the text into sentences and tokens, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens
Next, we remove stopwords from the text and count the occurrences of each word. The complete Python code is as follows; the input parameter is the article text:
from collections import Counter
from nltk.corpus import stopwords  # stopwords

# Remove stopwords from the raw text
# and build a count dictionary of word occurrences
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]  # remove stopwords
    count = Counter(filtered)
    return count
Taking text3 as an example, the resulting count dictionary is:
Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})
After preprocessing, each of the three sample texts yields a count dictionary of per-word occurrence counts. Next, we use gensim's built-in TF-IDF model to output the three words with the highest tf-idf values in each article, along with those values. The complete code is as follows:
from nltk.corpus import stopwords  # stopwords
from gensim import corpora, models, matutils

# training by gensim's Tfidf Model
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v: k for k, v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d" % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))
The output is as follows:
Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
The output matches our expectations reasonably well: the football article yields the keywords football and rugby, the basketball article yields play and cm, and the volleyball article yields net and teammate.
With this understanding of the TF-IDF model, we can implement it ourselves, which is the best way to learn an algorithm!
Below is my own TF-IDF implementation (continuing from the preprocessing code above):
import math

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())

# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))  # logarithm base 2

# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # type=list
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))
The output is as follows:
Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113
As we can see, the keywords extracted by my own TF-IDF implementation agree with gensim's. The last two words for the basketball article differ only because those words share the same tf-idf value, so which of them land in the top three is an arbitrary tie-break. There is one real discrepancy, though: the computed tf-idf values themselves differ. What causes this?
Consulting the source code of gensim's tf-idf computation (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py):
In other words, gensim normalizes the resulting tf-idf vector, converting it into a unit vector. We therefore need to add this normalization step to our earlier code:
import numpy as np

# normalize the vector, i.e. convert it to a unit vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # type=list
    sorted_words = unitvec(sorted_words)  # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))
The output is as follows:
Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888
The output now matches gensim's results!
The complete code for this article is as follows:
import nltk
import math
import string
from nltk.corpus import stopwords  # stopwords
from collections import Counter  # word counting
from gensim import corpora, models, matutils

text1 = """
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Function: split the text into sentences and tokens, and remove punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation:  # remove punctuation
                tokens.append(word)
    return tokens

# Remove stopwords from the raw text
# and build a count dictionary of word occurrences
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]  # remove stopwords
    count = Counter(filtered)
    return count

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())

# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))  # logarithm base 2

# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# normalize the vector, i.e. convert it to a unit vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # type=list
    sorted_words = unitvec(sorted_words)  # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

# training by gensim's Tfidf Model
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v: k for k, v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d" % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

"""
Output:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
"""