NLTK (Natural Language Toolkit) is a powerful natural language processing library. It provides a suite of natural language algorithms, such as tokenization, part-of-speech (POS) tagging, stemming, named entity recognition (NER), and classification.
Installing and importing NLTK
pip install nltk

import nltk
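Note that many NLTK functions depend on corpus data that is downloaded separately from the package itself. A minimal sketch of fetching the resources used later in this article (each call is a no-op if the resource is already present):

import nltk

nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # WordNet data for lemmatization
nltk.download('averaged_perceptron_tagger')  # default POS tagger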
A text is made up of paragraphs, a paragraph is made up of sentences, and a sentence is made up of words. Tokenization is the first step of text analysis: it breaks a paragraph of text into smaller units, each called a token. A token is a word within a sentence, or a sentence within a paragraph. NLTK supports both sentence tokenization and word tokenization.
1. Sentence tokenization
Sentence tokenization splits a paragraph into sentences:
from nltk.tokenize import sent_tokenize

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)
The sentence tokenization result:
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.',
'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
2. Word tokenization
Word tokenization splits a sentence into words:
from nltk.tokenize import word_tokenize

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = word_tokenize(text)
print(tokenized_text)
The word tokenization result:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
As the result shows, punctuation marks are also included among the tokens.
To process the tokens, we need to remove punctuation, remove stopwords, and normalize the vocabulary.
1. Removing punctuation
Call the translation below on each token to remove the punctuation it contains. string.punctuation holds all the ASCII punctuation characters, and the translation table replaces each of them in the token with a space.
import string

s = 'abc.'
# Map every punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(s.translate(table))   # 'abc '
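Applied to a whole token list, a minimal sketch would translate every token and then drop the tokens that become blank. (This is one simple approach; note that contraction tokens like "n't" come out imperfectly, as the apostrophe is replaced by an inner space.)

from nltk.tokenize import word_tokenize
import string

tokens = word_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard")
# Replace punctuation inside each token with spaces, then drop empty tokens
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = [t.translate(table).strip() for t in tokens]
cleaned = [t for t in cleaned if t]
print(cleaned)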
2. Removing stopwords
Stopwords are noise words in a text that carry little meaning; common English stopwords include: is, am, are, this, a, an, the. NLTK's corpus includes a stopword list, and the user must filter the stopwords out of the token list.
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
word_tokens = nltk.tokenize.word_tokenize(text.strip())
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)
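For reference, the English stopword list can be inspected directly; a quick sketch (the sample entries in the comment assume the standard NLTK stopword corpus):

from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print(len(stop_words))   # number of English stopwords in this NLTK version
print(stop_words[:10])   # e.g. ['i', 'me', 'my', 'myself', 'we', ...]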
3. Lexical normalization
Lexical normalization converts the various derived forms of a word to its root. NLTK offers two ways to do this: the Porter stemmer and the WordNet lemmatizer.
Lemmatization uses context and part of speech to determine a word's inflected forms and obtains the root, called the lemma, according to the part of speech. Stemming simply reduces a word to its stem.
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, "v"))
print("Stemmed Word:", stem.stem(word))
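The part-of-speech argument matters for the lemmatizer: without it, lemmatize() treats the word as a noun and may leave it unchanged. A small sketch of the difference (the expected outputs in the comments assume the standard WordNet data):

from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("flying"))        # 'flying' -- treated as a noun by default
print(lem.lemmatize("flying", "v"))   # 'fly'    -- treated as a verb
print(lem.lemmatize("mice"))          # 'mouse'  -- irregular noun plural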
The main goal of part-of-speech (POS) tagging is to identify the grammatical group a given word belongs to. POS tagging examines the relationships within a sentence and assigns the corresponding tag to each word.
import nltk

sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)
print(nltk.pos_tag(tokens))
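The tagger returns (word, tag) pairs using the Penn Treebank tag set; for example, proper nouns are tagged NNP and past-tense verbs VBD. NLTK can print the documentation for any tag; a quick sketch (this assumes the 'tagsets' resource has been downloaded):

import nltk

nltk.download('tagsets')         # one-time download of the tag documentation
nltk.help.upenn_tagset('NNP')    # prints the definition of the NNP tag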
(Output omitted.)