NLTK (Natural Language Toolkit) is a powerful natural language processing library. It provides a suite of natural language algorithms, such as tokenization, part-of-speech (POS) tagging, stemming, named entity recognition (NER), and classification.
Installing and importing NLTK
pip install nltk

import nltk
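Note that many NLTK functions depend on corpus data that is downloaded separately from the package itself. A minimal sketch of fetching the resources used later in this article (each call is a no-op if the resource is already present):

import nltk

nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # WordNet data for lemmatization
nltk.download('averaged_perceptron_tagger')  # default POS tagger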
A text is made up of paragraphs, a paragraph is made up of sentences, and a sentence is made up of words. Tokenization is the first step of text analysis: it breaks a paragraph of text into smaller units, each called a token. A token is a word within a sentence, or a sentence within a paragraph. NLTK supports both sentence tokenization and word tokenization.
1. Sentence tokenization
Sentence tokenization splits a paragraph into sentences:
from nltk.tokenize import sent_tokenize

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
The sentence tokenization result:
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.',
'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
2. Word tokenization
Word tokenization splits a sentence into words:
from nltk.tokenize import word_tokenize

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = word_tokenize(text)
print(tokenized_text)
The word tokenization result:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
As the result shows, punctuation marks are also included among the tokens.
To process the tokens, we need to remove punctuation, remove stopwords, and normalize the vocabulary.
1. Removing punctuation
Call the translation below on each token to remove the punctuation it contains. string.punctuation holds all the ASCII punctuation characters, and the translation table replaces each of them in the token with a space.
import string

s = 'abc.'
# Map every punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(s.translate(table))   # 'abc '
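Applied to a whole token list, a minimal sketch would translate every token and then drop the tokens that become blank. (This is one simple approach; note that contraction tokens like "n't" come out imperfectly, as the apostrophe is replaced by an inner space.)

from nltk.tokenize import word_tokenize
import string

tokens = word_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard")
# Replace punctuation inside each token with spaces, then drop empty tokens
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = [t.translate(table).strip() for t in tokens]
cleaned = [t for t in cleaned if t]
print(cleaned)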
2. Removing stopwords
Stopwords are noise words in a text that carry little meaning; common English stopwords include: is, am, are, this, a, an, the. NLTK's corpus includes a stopword list, and the user must filter the stopwords out of the token list.
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
word_tokens = nltk.tokenize.word_tokenize(text.strip())
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)
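For reference, the English stopword list can be inspected directly; a quick sketch (the sample entries in the comment assume the standard NLTK stopword corpus):

from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print(len(stop_words))   # number of English stopwords in this NLTK version
print(stop_words[:10])   # e.g. ['i', 'me', 'my', 'myself', 'we', ...]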
3. Lexical normalization
Lexical normalization converts the various derived forms of a word to its root. NLTK offers two ways to do this: the Porter stemmer and the WordNet lemmatizer.
Lemmatization uses context and part of speech to determine a word's inflected forms and obtains the root, called the lemma, according to the part of speech. Stemming simply reduces a word to its stem.
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, "v"))
print("Stemmed Word:", stem.stem(word))
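The part-of-speech argument matters for the lemmatizer: without it, lemmatize() treats the word as a noun and may leave it unchanged. A small sketch of the difference (the expected outputs in the comments assume the standard WordNet data):

from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("flying"))        # 'flying' -- treated as a noun by default
print(lem.lemmatize("flying", "v"))   # 'fly'    -- treated as a verb
print(lem.lemmatize("mice"))          # 'mouse'  -- irregular noun plural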
The main goal of part-of-speech (POS) tagging is to identify the grammatical group a given word belongs to. POS tagging examines the relationships within a sentence and assigns the corresponding tag to each word.
import nltk

sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)
print(nltk.pos_tag(tokens))
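The tagger returns (word, tag) pairs using the Penn Treebank tag set; for example, proper nouns are tagged NNP and past-tense verbs VBD. NLTK can print the documentation for any tag; a quick sketch (this assumes the 'tagsets' resource has been downloaded):

import nltk

nltk.download('tagsets')         # one-time download of the tag documentation
nltk.help.upenn_tagset('NNP')    # prints the definition of the NNP tag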
(Output omitted.)