Stemmers in NLTK

Stemmers

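In English, one word is often a variant of another, for example happy => happiness; here happy is called the stem of happiness. In information retrieval systems, a common step during term normalization is stemming, that is, stripping the endings from the inflected forms of English words.

This article mainly introduces how to use the stemmers in nltk.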

Porter Stemmer

The most widely used suffix-stripping stemming algorithm, of moderate complexity, is the Porter stemming algorithm, also known as the Porter Stemmer.
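To make "suffix stripping" concrete, here is a minimal sketch in plain Python. It is a toy and not the Porter algorithm: the rule table `SUFFIX_RULES`, the `MIN_STEM_LENGTH` guard, and the function name `toy_stem` are all assumptions made up for illustration. The real Porter Stemmer applies several ordered rule phases with conditions on the shape of the remaining stem.

```python
# Toy suffix-stripping stemmer, for illustration only.
# The rules below are hypothetical and are NOT the Porter algorithm.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # flies -> fli
    ("ss", "ss"),    # caress -> caress (protects the final "ss")
    ("ing", ""),     # meeting -> meet
    ("ed", ""),      # owned -> own
    ("s", ""),       # mules -> mule
]

MIN_STEM_LENGTH = 2  # assumed guard so very short words are left alone


def toy_stem(word):
    """Apply the first matching suffix rule if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= MIN_STEM_LENGTH:
                return stem + replacement
            # The first matching rule decides; keep the word if the stem is too short.
            return word
    return word


if __name__ == "__main__":
    for w in ["caresses", "flies", "meeting", "owned", "sing"]:
        print(w, "->", toy_stem(w))
```

Even this toy shows why the output of a stemmer is not always a dictionary word: "flies" becomes "fli", which is acceptable for retrieval as long as every variant maps to the same stem.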

from nltk.stem.porter import PorterStemmer
# Stem a list of sample words with the Porter stemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied', 'died', 'agreed',
           'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing',
           'itemization', 'sensational', 'traditional', 'reference',
           'colonizer', 'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

'''
output: caress fli die mule deni die agre own humbl size meet
state siez item sensat tradit refer colon plot
'''

Snowball Stemmer

The Snowball stemmer supports multiple languages:

>>> from nltk.stem.snowball import SnowballStemmer
>>> print(" ".join(SnowballStemmer.languages))
danish dutch english finnish french german hungarian italian
norwegian porter portuguese romanian russian spanish swedish

Taking English as an example:

>>> stemmer = SnowballStemmer("english")
>>> print(stemmer.stem("running"))
run

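You can tell it to ignore stop words, in which case stop words are returned unstemmed (this relies on the NLTK stopwords corpus being installed):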

>>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
>>> print(stemmer.stem("having"))
have
>>> print(stemmer2.stem("having"))
having

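Generally speaking, SnowballStemmer("english") is more accurate than PorterStemmer(). In the comparison below, SnowballStemmer("porter") stands in for the original Porter algorithm: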

>>> print(SnowballStemmer("english").stem("generously"))
generous
>>> print(SnowballStemmer("porter").stem("generously"))
gener

Lancaster Stemmer

This is another stemmer; let's go straight to the code.

>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
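
To see how the three stemmers differ on the same input, a small side-by-side comparison can help. This is only a sketch: `compare_stemmers` is a hypothetical helper name, not part of nltk, and it assumes nltk is installed. None of the three stemmers needs extra corpus downloads when used this way.

```python
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer


def compare_stemmers(words):
    """Return {word: {stemmer_name: stem}} for Porter, Snowball and Lancaster.

    compare_stemmers is a hypothetical helper written for this example.
    """
    stemmers = {
        "porter": PorterStemmer(),
        "snowball": SnowballStemmer("english"),
        "lancaster": LancasterStemmer(),
    }
    return {
        word: {name: s.stem(word) for name, s in stemmers.items()}
        for word in words
    }


if __name__ == "__main__":
    results = compare_stemmers(["running", "generously", "maximum"])
    for word, stems in results.items():
        print(word, stems)
```

Running the same word list through all three makes it easier to choose one for a given task, since the stemmers trade off differently between merging too many words and too few.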