Text feature extraction with sklearn

Date: 2019-12-06

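Extracting features from text, that is, representing text as numeric features, is a prerequisite for text classification. To extract text features with sklearn, import the TfidfVectorizer class: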

from sklearn.feature_extraction.text import TfidfVectorizer

I. Extracting text features with sklearn

Extracting text features with sklearn involves two key steps: creating a TF-IDF vectorizer, and converting the raw documents into a document-term matrix.

Create the vectorizer by calling TfidfVectorizer(). The most commonly used parameters are stop_words="english", ngram_range, max_df and min_df; for the other parameters, see the official documentation:

sklearn.feature_extraction.text.TfidfVectorizer(stop_words=None, ngram_range=(1, 1), max_df=1.0, min_df=1, ...)
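
As a rough illustration of how these parameters change the learned vocabulary, the sketch below fits three vectorizers on a small made-up corpus. The docs list and the variable names are assumptions introduced for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# stop_words="english" drops common English words such as "the".
no_stop = TfidfVectorizer(stop_words="english").fit(docs)
print("the" in no_stop.vocabulary_)            # False

# ngram_range=(1, 2) adds two-word sequences alongside single words.
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
print("cat sat" in with_bigrams.vocabulary_)   # True

# min_df=2 keeps terms found in at least 2 documents;
# max_df=0.9 drops terms found in more than 90% of documents ("the" here).
filtered = TfidfVectorizer(min_df=2, max_df=0.9).fit(docs)
print(sorted(filtered.vocabulary_))            # ['cat', 'dog', 'on', 'sat']
```

An integer min_df or max_df is read as an absolute document count, while a float is read as a proportion of the corpus.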

Convert the raw documents into a document-term matrix; the return value is a sparse matrix:

fit_transform(raw_documents, y=None)
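
A small sketch of what fit_transform() returns, and of reusing the fitted vocabulary on unseen text with transform(). The sample documents are made up for this illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents.
train_docs = ["red apple", "green apple", "red grape"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)   # learns the vocabulary, then vectorizes
print(issparse(X), X.shape)                # True (3, 4)

# transform() reuses the learned vocabulary; unknown words are ignored.
X_new = vectorizer.transform(["red banana"])
print(X_new.shape, X_new.nnz)              # (1, 4) 1
```

Calling transform() rather than fit_transform() on new text keeps its columns aligned with the matrix built from the training documents.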

II. Inspecting the document features

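Here is a simple feature-extraction walkthrough. corpus is the corpus, structured as a list of documents in which each item is one document (doc); it contains 5 documents: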

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'Where can I find information about how to become a Surface or Surface Hub Microsoft Authorized Device Reseller (ADR)?',
    'If you are interested in becoming a Surface or Surface Hub Microsoft Authorized Device Reseller',
    'you should contact a Microsoft Surface Authorized Device Distributor and sign up to receive updates on the ADR program.',
    'Microsoft partner website: Contact a Microsoft Surface Authorized Device Distributor',
    'Sign up to receive updates on becoming a Microsoft Surface Hub ADR or installer',
]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)

1. Viewing the text features

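Get the feature list from the raw document list (the corpus). Compared with a plain tokenization of the raw text, the features are more meaningful because English stop words have been removed. For these 5 documents, 18 features are returned: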

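>>> # get_feature_names() was removed in scikit-learn 1.2; on 1.0 and later use get_feature_names_out()
>>> print(vectorizer.get_feature_names_out().tolist())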
['adr', 'authorized', 'contact', 'device', 'distributor', 'hub', 'information', 'installer', 
'interested', 'microsoft', 'partner', 'program', 'receive', 'reseller', 'sign', 'surface', 'updates', 'website']

2. Getting the mapping between terms and feature indices

Each term maps to a feature index. For example, the term information corresponds to feature index 6:

>>> items=vectorizer.vocabulary_.items()
>>> print(items)
dict_items([('information', 6), ('surface', 15), ('hub', 5), ('microsoft', 9), ('authorized', 1), 
('device', 3), ('reseller', 13), ('adr', 0), ('interested', 8), ('contact', 2), ('distributor', 4), 
('sign', 14), ('receive', 12), ('updates', 16), ('program', 11), ('partner', 10), ('website', 17), ('installer', 7)])

Invert the mapping into a Python dict keyed by feature index, so that each index maps back to its term:

>>> feature_dict = {v: k for k, v in vectorizer.vocabulary_.items()}
>>> feature_dict
{6: 'information', 15: 'surface', 5: 'hub', 9: 'microsoft', 1: 'authorized', 3: 'device', 
13: 'reseller', 0: 'adr', 8: 'interested', 2: 'contact', 4: 'distributor', 14: 'sign', 
12: 'receive', 16: 'updates', 11: 'program', 10: 'partner', 17: 'website', 7: 'installer'}
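
The forward mapping in vocabulary_ is what lets you find the matrix column for a given term. A minimal sketch on a small made-up corpus (the documents and variable names are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, used only for illustration.
docs = ["surface hub reseller", "surface distributor", "hub installer"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Feature indices follow the alphabetical order of the terms.
col = vectorizer.vocabulary_["surface"]
print(col)                          # 4

# Read that column: one TF-IDF weight per document.
weights = matrix[:, col].toarray().ravel()
print((weights > 0).tolist())       # [True, True, False]
```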

3. Viewing the document-term matrix

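fit_transform() returns a sparse matrix. Its shape attribute gives the number of rows and columns: here the matrix has 5 rows and 18 columns. Each column represents a feature, each row represents one raw document, and each value is the TF-IDF weight of that feature in that document, ranging from 0 to 1.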

>>> matrix.shape
(5, 18)
>>> print(matrix.todense())
[[0.32228866 0.27111938 0.         0.27111938 0.         0.32228866
  0.48123496 0.         0.         0.22931104 0.         0.
  0.         0.38825733 0.         0.45862207 0.         0.        ]
 [0.         0.28640134 0.         0.28640134 0.         0.34045484
  0.         0.         0.50836033 0.24223642 0.         0.
  0.         0.41014192 0.         0.48447285 0.         0.        ]
 [0.2782744  0.2340932  0.33523388 0.2340932  0.33523388 0.
  0.         0.         0.         0.19799453 0.         0.41551375
  0.33523388 0.         0.33523388 0.19799453 0.33523388 0.        ]
 [0.         0.25015965 0.35824188 0.25015965 0.35824188 0.
  0.         0.         0.         0.42316685 0.44403158 0.
  0.         0.         0.         0.21158343 0.         0.44403158]
 [0.32281764 0.         0.         0.         0.         0.32281764
  0.         0.48202482 0.         0.22968741 0.         0.
  0.38889459 0.         0.38889459 0.22968741 0.38889459 0.        ]]
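
The weights stay between 0 and 1 because TfidfVectorizer, with its default norm='l2', scales each row to unit Euclidean length. The sketch below checks this on a small made-up corpus and picks the highest-weighted term per document; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, used only for illustration.
docs = ["surface hub reseller", "surface distributor", "hub installer"]
vectorizer = TfidfVectorizer()
dense = vectorizer.fit_transform(docs).toarray()

# Each row has Euclidean length 1 under the default norm='l2'.
row_norms = np.linalg.norm(dense, axis=1)
print(np.allclose(row_norms, 1.0))   # True

# Terms ordered by column index, rebuilt from vocabulary_.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Highest-weighted term in each document: the one unique to that document.
top_terms = [terms[row.argmax()] for row in dense]
print(top_terms)                     # ['reseller', 'distributor', 'installer']
```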

III. Two models for feature extraction

There are two approaches to feature extraction: Bag of Words and TF-IDF.

1. The bag-of-words model

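The bag-of-words (BoW) model is the simplest way to extract features from text. BoW converts text into a matrix of word occurrences within documents. The model only considers whether (and how often) a given word occurs in a document, ignoring word order.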

Suppose there are three documents (doc), each one a single line of text.

Doc 1: I love dogs.
Doc 2: I hate dogs and knitting.
Doc 3: Knitting is my hobby and passion.

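From these three documents, build a matrix of documents against tokens and count how many times each word occurs in each document. This matrix is called the document-term matrix (DTM).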
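
As a quick sketch, the DTM for the three documents can be built with CountVectorizer under its default settings. Note that the default tokenizer lowercases the text and drops single-character tokens such as "I"; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and passion.",
]

cv = CountVectorizer()
dtm = cv.fit_transform(docs)

# Columns are the terms in alphabetical order.
print(sorted(cv.vocabulary_))
# ['and', 'dogs', 'hate', 'hobby', 'is', 'knitting', 'love', 'my', 'passion']

# One row per document, one count per term.
print(dtm.toarray())
# [[0 1 0 0 0 0 1 0 0]
#  [1 1 1 0 0 1 0 0 0]
#  [1 0 0 1 1 1 0 1 1]]
```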

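This matrix uses single words (unigrams). Combinations of two or more words can also be used, giving the bi-gram or tri-gram model; collectively these are called n-gram models.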
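
For instance, a bi-gram vocabulary for the same three documents might be extracted as follows. This is a minimal sketch with default tokenization, so the single-character token "I" is dropped before the word pairs are formed.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and passion.",
]

# ngram_range=(2, 2) extracts bigrams only; (1, 2) would keep unigrams too.
bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigram_cv.fit(docs)

print(sorted(bigram_cv.vocabulary_))
# ['and knitting', 'and passion', 'dogs and', 'hate dogs', 'hobby and',
#  'is my', 'knitting is', 'love dogs', 'my hobby']
```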

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
# the three example documents listed above
text_data = ['I love dogs.', 'I hate dogs and knitting.', 'Knitting is my hobby and passion.']
# tokenizer that keeps runs of letters and digits, removing symbols and punctuation from our data
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(text_data)

2. The TF-IDF model

TF is Term Frequency: the number of times each word occurs in each document (a raw count). TF depends on the output of the BoW model.

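IDF is Inverse Document Frequency. It reflects how common a term is across the corpus: the more common a word (that is, the more documents contain it), the lower its IDF; conversely, the rarer the word, the higher its IDF. IDF is the log-scaled ratio of the total number of documents to the number of documents containing the word.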
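
As a concrete check, scikit-learn's default setting (smooth_idf=True) computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. This differs slightly from the textbook ln(n / df(t)). The sketch below recomputes the smoothed value by hand for the three example documents from the bag-of-words section and compares it with the fitted idf_ attribute; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and passion.",
]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1

fitted = TfidfVectorizer().fit(docs)
print(np.allclose(idf_manual, fitted.idf_))   # True

# "and" appears in two documents, "love" in one, so "love" has the higher IDF.
vocab = fitted.vocabulary_
print(fitted.idf_[vocab["and"]] < fitted.idf_[vocab["love"]])   # True
```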

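The TF-IDF (term frequency-inverse document frequency) model combines TF and IDF: TF-IDF = TF * IDF. A word with a high TF-IDF in a document occurs frequently in that document and rarely or never in the other documents, so such words are the document's characteristic terms.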
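
To make the product concrete, the sketch below multiplies raw counts by the fitted IDF values and then scales each row to unit length, which is what TfidfVectorizer does under its default settings (norm='l2', sublinear_tf=False). It uses the three example documents from the bag-of-words section; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I love dogs.",
    "I hate dogs and knitting.",
    "Knitting is my hobby and passion.",
]

term_counts = CountVectorizer().fit_transform(docs).toarray()   # TF: raw counts
vectorizer = TfidfVectorizer()
expected = vectorizer.fit_transform(docs).toarray()

# TF * IDF, then scale each row to unit Euclidean length.
tfidf = term_counts * vectorizer.idf_
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.allclose(tfidf, expected))   # True

# In Doc 1, "love" (unique to it) outweighs "dogs" (shared with Doc 2).
vocab = vectorizer.vocabulary_
print(tfidf[0, vocab["love"]] > tfidf[0, vocab["dogs"]])   # True
```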

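from sklearn.feature_extraction.text import TfidfVectorizer
# doc_list: the three example documents from the bag-of-words section
doc_list = ['I love dogs.', 'I hate dogs and knitting.', 'Knitting is my hobby and passion.']
tf = TfidfVectorizer()
text_tf = tf.fit_transform(doc_list)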

References:

Feature extraction and feature selection in text

sklearn.feature_extraction.text.TfidfVectorizer

python text feature extraction: CountVectorizer, TfidfVectorizer
