Installing NLTK and basic usage

Date: 2019-12-08


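Several third-party libraries are available for natural language processing in Python:

·NLTK (the Natural Language Toolkit for Python) is used for tasks such as tokenization, lemmatization, stemming, parsing, and POS tagging. It has tools for almost every NLP task.

·spaCy is NLTK's main competitor. The two libraries can be used for the same tasks.

·Scikit-learn provides a large library for machine learning, and also includes tools for text preprocessing.

·Gensim is a toolkit for topic modeling, vector space modeling, and document similarity.

·Pattern serves mainly as a web mining module, so it supports natural language processing (NLP) only as a secondary task.

·Polyglot is another Python toolkit for NLP. It is less popular, but it can also be used for a variety of NLP tasks.

We will start with NLTK.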

1. Installing NLTK

NLTK installs the same way as any other third-party Python library. Run this on the command line: pip install nltk
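A quick way to confirm that the install worked is to ask Python's import system whether it can find the package. The sketch below uses only the standard library; the helper name is_installed is made up for this illustration and is not part of NLTK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can locate the named top-level package."""
    # find_spec returns None when no such top-level package is found.
    return importlib.util.find_spec(package_name) is not None


# Prints True once "pip install nltk" has succeeded in the active environment.
print("nltk installed:", is_installed("nltk"))
```

If this prints False right after installing, pip probably installed into a different interpreter or virtual environment than the one running the script.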

2. It doesn't run?

Once the installation finishes, you might try the following code to tokenize a piece of English text:

import nltk

# Tokenize an English sentence into words and punctuation marks.
text = nltk.word_tokenize("Pierre Vinken , 59 years old , will join as a nonexecutive director on Nov. 29 .")
print(text)

Running it fails with the following error:

...
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\Administrator/nltk_data'
    - 'C:\\Users\\Administrator\\Desktop\\meatwice\\venv\\nltk_data'
    - 'C:\\Users\\Administrator\\Desktop\\meatwice\\venv\\share\\nltk_data'
    - 'C:\\Users\\Administrator\\Desktop\\meatwice\\venv\\lib\\nltk_data'
    - 'C:\\Users\\Administrator\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************

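3. The fix

There is no need to worry: the exception message already tells you what to do.

Open an interactive Python session from the command line and run: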

import nltk
nltk.download()

A window then pops up. Click the Models tab, find punkt, and double-click it to download.
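On a machine without a graphical desktop, the downloader window is not an option. The error message above already names the non-interactive call, nltk.download('punkt'). The sketch below wraps it so the download only happens when the resource is missing. The helper name ensure_nltk_resource is an assumption made for this example; nltk.data.find and nltk.download are the NLTK calls it relies on. Some newer NLTK releases reportedly ask for a differently named resource (for example punkt_tab), so pass whatever name the LookupError prints.

```python
def ensure_nltk_resource(resource_path, package_name):
    """Download an NLTK data package only if it is not already available.

    Returns True if a download was triggered, False if the resource was found.
    """
    # Imported here so the helper can be defined before NLTK is set up.
    import nltk

    try:
        # Raises LookupError when the resource is in none of the search paths.
        nltk.data.find(resource_path)
        return False
    except LookupError:
        nltk.download(package_name)
        return True
```

Calling ensure_nltk_resource("tokenizers/punkt", "punkt") once at the top of a script should then be enough before using word_tokenize.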

Then rerun the Python code from the beginning to tokenize the text:

import nltk

# Same call as before; it succeeds now that punkt is installed.
text = nltk.word_tokenize("Pierre Vinken , 59 years old , will join as a nonexecutive director on Nov. 29 .")
print(text)

This time it runs without raising the error and prints the list of tokens.

4. Basic NLTK usage

The example above was a first taste of NLTK. The following sections look at how to use it in more detail.

4.1 Splitting text into sentences: sent_tokenize()

from nltk.tokenize import sent_tokenize
text=" Welcome readers. I hope you find it interesting. Please do reply."
print(sent_tokenize(text))

The text is split at sentence-ending punctuation, giving one string per sentence.
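To make "splitting at punctuation" concrete, here is a deliberately naive version written with a regular expression. It is not how sent_tokenize works internally (NLTK uses the punkt resource downloaded earlier); the function name naive_sent_split is invented for this sketch. It handles the sample text, but it also breaks after abbreviations such as "Nov.", which is the kind of case a dedicated sentence tokenizer exists to handle.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = " Welcome readers. I hope you find it interesting. Please do reply."
print(naive_sent_split(text))

# The abbreviation "Nov." is wrongly treated as the end of a sentence here.
print(naive_sent_split("He will join on Nov. 29. It is settled."))
```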

4.2 Splitting sentences into words: word_tokenize()

from nltk.tokenize import word_tokenize
text=" Welcome readers. I hope you find it interesting. Please do reply."
print(word_tokenize(text))

The text is split into individual words, with punctuation marks as separate tokens.
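For comparison, Python's built-in str.split() only breaks on whitespace, so punctuation stays glued to the neighbouring word. The sketch below contrasts it with a small regex that separates punctuation into its own tokens. The function naive_word_split is a made-up illustration and far simpler than word_tokenize, which applies additional rules of its own.

```python
import re


def naive_word_split(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = " Welcome readers. I hope you find it interesting. Please do reply."

# Whitespace splitting keeps the period attached: "readers."
print(text.split())

# The regex version emits "readers" and "." as two tokens.
print(naive_word_split(text))
```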

4.3.1 Tokenizing with TreebankWordTokenizer

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Adjacent string literals are joined, so no word is broken across lines.
text = (
    "What is Love? I know this question exists in each human being's mind including myself. "
    "If not it is still waiting to be discovered deeply in your heart. "
    "What do I think of love? For me, I believe love is a priceless diamond, "
    "because a diamond has thousands of reflections, and each reflection represents a meaning of love."
)
print(tokenizer.tokenize(text))

This also splits the text into words.
