NLP: Exploring the Principles of TF-IDF

Introduction to TF-IDF

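  TF-IDF is a statistical method widely used in NLP to evaluate how important a word is to one document within a document collection or corpus. It is generally used to extract the features of a text, that is, its keywords. A word's importance increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus.
  In NLP, TF-IDF is computed as follows: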

tfidf = tf*idf.

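  Here tf is the term frequency and idf is the inverse document frequency.
  tf is the frequency with which a word occurs in a document: if a word occurs i times in a document that contains N words in total, then tf = i/N.
  idf is the inverse document frequency: if the corpus contains n articles and the word occurs in k of them, then idf is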

idf=\log_{2}(\frac{n}{k}).

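  Of course, the idf formula varies slightly from place to place. Some implementations add 1 to k in the denominator so that the denominator cannot be 0, and others add 1 to both the numerator and the denominator, which is a smoothing technique. This article uses the original idf formula, because it matches the one used in gensim.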
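  The three idf variants mentioned above can be compared side by side. The following is only a minimal sketch: the function names idf_plain, idf_plus_one and idf_smooth are made up for this illustration and are not taken from any library.

```python
import math

def idf_plain(n, k):
    # original formula: log2(n / k); fails when k == 0
    return math.log2(n / k)

def idf_plus_one(n, k):
    # 1 added to the denominator, so k == 0 no longer divides by zero
    return math.log2(n / (k + 1))

def idf_smooth(n, k):
    # 1 added to both the numerator and the denominator (smoothing)
    return math.log2((n + 1) / (k + 1))

n = 3  # number of articles in the corpus
for k in (1, 2, 3):
    print(k, round(idf_plain(n, k), 5), round(idf_plus_one(n, k), 5), round(idf_smooth(n, k), 5))
```

  Note that with the plus-one variant a word that occurs in every article gets a negative idf (log2(3/4) here), whereas the plain and smoothed variants both give 0 for such a word.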
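  Suppose the corpus contains D articles. Then the tfidf value of word i in article j is

weight_{i,j} = frequency_{i,j} \cdot \log_{2}(\frac{D}{document\_freq_{i}}),

  where frequency_{i,j} is the number of times word i occurs in article j and document_freq_{i} is the number of articles that contain word i. (This is the TF-IDF formula as given in gensim.)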

  That is how TF-IDF is computed.

Sample Texts and Preprocessing
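  As a quick worked example of tfidf = tf*idf, take a word that occurs 3 times in a 100-word article and appears in 1 of 4 articles in the corpus. The numbers are invented for illustration only; they give tf = 0.03, idf = log2(4) = 2 and tfidf = 0.06.

```python
import math

i, N = 3, 100   # the word occurs 3 times in a 100-word article (toy numbers)
n, k = 4, 1     # the corpus has 4 articles and the word occurs in 1 of them

tf = i / N                  # term frequency
idf = math.log2(n / k)      # inverse document frequency, base-2 logarithm
tfidf = tf * idf
print(tf, idf, tfidf)
```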

  We will use the following three sample texts:

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

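  These three articles introduce football, basketball and volleyball respectively; together they form our corpus.
  Next comes the text preprocessing.
  First we strip the line breaks from the text, then split it into sentences, tokenize each sentence into words, and remove the punctuation. The complete Python code is below; its input parameter is the article text: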

import nltk
import string

# Text preprocessing
# Function: split the text into sentences, tokenize them, and drop punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens
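  One detail of get_tokens is worth spelling out: `word not in string.punctuation` is a substring test against the ASCII punctuation string, so non-ASCII marks such as ’ and multi-character tokens such as `` are not filtered out. The sketch below compares that test with a stricter check; is_all_punct is a hypothetical helper written for this illustration, not part of NLTK.

```python
import string
import unicodedata

def is_punct_substring(token):
    # same test as in get_tokens: is the token a substring of string.punctuation?
    return token in string.punctuation

def is_all_punct(token):
    # stricter alternative (hypothetical helper): every character is ASCII
    # punctuation or belongs to a Unicode punctuation category (P*)
    return bool(token) and all(
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
        for ch in token
    )

samples = [",", "(", "’", "``", "three-point"]
for tok in samples:
    print(repr(tok), is_punct_substring(tok), is_all_punct(tok))
```

  Whether the stricter check is preferable depends on the corpus; the rest of this article keeps the original substring test so that the printed results stay reproducible.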

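  Next we remove the stop words from the article and count how many times each remaining word occurs. The complete Python code is below; its input parameter is the article text: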

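from collections import Counter       # counting
from nltk.corpus import stopwords     # stop words

# Remove the stop words from the raw text
# and build the count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]    # remove stop words
    count = Counter(filtered)
    return count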

  Taking text3 as an example, the resulting count dictionary is:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})
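  The count dictionary above still contains 'A' and 'To'. The stop-word comparison is case-sensitive and the stop-word list is lowercase, so capitalized stop words slip through. The sketch below shows the effect with a tiny stand-in stop-word set (an assumption made so the example runs without NLTK data; it is not NLTK's actual list) and a hypothetical helper count_without_stopwords.

```python
from collections import Counter

# tiny stand-in stop-word set, NOT the NLTK list
STOP_WORDS = {"a", "to", "the", "is", "of"}

def count_without_stopwords(tokens, lowercase=False):
    # optionally lowercase the tokens before comparing against the stop words
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return Counter(t for t in tokens if t not in STOP_WORDS)

tokens = ["A", "team", "is", "allowed", "To", "prevent", "the", "team"]
case_sensitive = count_without_stopwords(tokens)
case_insensitive = count_without_stopwords(tokens, lowercase=True)
print(case_sensitive)
print(case_insensitive)
```

  Lowercasing would also merge 'Volleyball' with 'volleyball', which changes the counts, so the rest of this article keeps the case-sensitive behaviour that produced the results shown.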

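TF-IDF in Gensim

  After preprocessing, each of the three sample texts yields a count dictionary holding the number of occurrences of every word in that text. Next we use the TF-IDF model already implemented in gensim to output the three words with the highest TF-IDF values in each article, together with those values. The complete code is: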

from nltk.corpus import stopwords     # stop words
from gensim import corpora, models, matutils

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))

  The output is:

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

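  The output largely matches our expectations: the football article yields the keywords football and rugby, the basketball article yields play and cm, and the volleyball article yields net and teammate.

Implementing the TF-IDF Model Yourself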
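  For readers unfamiliar with the bag-of-words format used in the gensim code above, here is a pure-Python sketch of the shape of the data: each document becomes a list of (token_id, count) pairs. The helpers build_vocab and to_bow are hypothetical stand-ins written for this illustration; they do not reproduce gensim's implementation, and gensim may assign different ids.

```python
from collections import Counter

def build_vocab(docs):
    # assign an integer id to every distinct token, in first-seen order
    vocab = {}
    for doc in docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    # sorted (token_id, count) pairs for one tokenized document
    counts = Counter(doc)
    return sorted((vocab[tok], n) for tok, n in counts.items())

docs = [["net", "ball", "net"], ["ball", "court"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
id_to_token = {i: tok for tok, i in vocab.items()}  # plays the role of new_dict
print(bows)
```

  The reversed mapping id_to_token serves the same purpose as new_dict in the code above: turning the ids in each scored pair back into words for printing.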

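  With the understanding of TF-IDF developed above, we can implement it ourselves, which is the best way to learn an algorithm!
  Here is the author's implementation of TF-IDF (it continues from the preprocessing code):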

import math

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())
# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))    # base-2 logarithm
# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

  The output is:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113
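  The scores for document 3 can be checked by hand from the count dictionary for text3 shown earlier. The counts there sum to 62 words, and net, teammate and across occur only in the volleyball text, so each has idf = log2(3/1). A small sketch of that arithmetic:

```python
import math

counts = {"net": 4, "teammate": 3, "across": 2}  # taken from the text3 count dictionary
total_words = 62      # sum of all counts in that dictionary: 4 + 4 + 3 + 8*2 + 35*1
n_docs = 3            # football, basketball and volleyball articles
docs_with_word = 1    # each of these words occurs only in the volleyball article

scores = {
    word: round(c / total_words * math.log2(n_docs / docs_with_word), 5)
    for word, c in counts.items()
}
print(scores)
```

  The rounded values agree with the three scores printed for document 3 above.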

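  As you can see, the keywords extracted by the hand-written TF-IDF model agree with gensim's. The last two words for the basketball article differ only because several words share the same tfidf value, so which of the tied words are listed depends on their ordering. There is one problem, however: the computed tfidf values are different. Why?
  Let us look at the gensim source code that computes the tf-idf values (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py):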

 
[Figure: parameters of the TfidfModel class]

[Figure: description of the normalize parameter]

  In other words, gensim normalizes the resulting tf-idf vector, converting it into a unit vector. We therefore need to add this normalization step to the code above:

import numpy as np

# normalize the vector to unit (L2) length
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

  The output is:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888

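  The tfidf values now agree with those obtained from gensim! (Where the listed words differ, the words are tied on the same value.)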
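  Normalization also explains why it does not matter whether tf is taken as the raw count i or as the relative frequency i/N: dividing every component by the same N changes the length of the vector but not its direction, so the unit vector is identical. The sketch below checks this on toy numbers; the counts and idf values are invented for illustration.

```python
import math

def unit(vec):
    # scale a {word: weight} vector to unit L2 length
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()}

counts = {"net": 4, "teammate": 3, "across": 2}       # toy raw counts
idf = {"net": 1.5, "teammate": 1.0, "across": 0.5}    # toy idf values
total = sum(counts.values())

raw = {w: counts[w] * idf[w] for w in counts}             # count * idf
rel = {w: counts[w] / total * idf[w] for w in counts}     # (count / N) * idf

unit_raw, unit_rel = unit(raw), unit(rel)
print(unit_raw)
print(unit_rel)
```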

  

  The complete code for this article is:

import nltk
import math
import string
from nltk.corpus import stopwords     # stop words
from collections import Counter       # counting
from gensim import corpora, models, matutils

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Function: split the text into sentences, tokenize them, and drop punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens

# Remove the stop words from the raw text
# and build the count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]    # remove stop words
    count = Counter(filtered)
    return count

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())
# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))    # base-2 logarithm
# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# normalize the vector to unit (L2) length
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if not w in stopwords.words('english')]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))
        
"""
Output:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
"""
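  In the final output the two implementations agree on every score but list different words where scores are tied (for example cm, diameter, inches, points and shooting all score 0.19915 in document 2). If a reproducible listing is wanted, one option is to break ties alphabetically. The helper top_k below is a hypothetical sketch, not part of gensim; the scores are the rounded values from the outputs above.

```python
def top_k(scores, k=3):
    # sort by descending score (rounded, so float noise cannot split a tie),
    # then alphabetically by word to make the order deterministic
    ranked = sorted(scores.items(), key=lambda item: (-round(item[1], 5), item[0]))
    return ranked[:k]

scores = {
    "play": 0.29872,
    "inches": 0.19915,
    "cm": 0.19915,
    "diameter": 0.19915,
    "shooting": 0.19915,
    "points": 0.19915,
}
print(top_k(scores))
```

  With this key, both implementations would list the same three words for a document as long as their rounded scores agree.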