Natural Language Processing 2.1: NLTK Text Corpora

Date: 2019-11-17


1. Accessing Text Corpora

NLTK ships with a large number of corpora. A few of them are introduced below:

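(1) Gutenberg Corpus: NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which at the time of writing hosted roughly 36,000 free e-books.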

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Usage: from nltk.corpus import gutenberg

The short program below loops over the Gutenberg fileids listed above and computes a few statistics for each text:

import nltk
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # number of characters (whitespace included)
    num_words = len(gutenberg.words(fileid))  # number of word tokens
    num_sents = len(gutenberg.sents(fileid))  # number of sentences
    # vocabulary size: distinct words after lowercasing
    num_vocab = len({w.lower() for w in gutenberg.words(fileid)})
    # int() truncates each ratio to a whole number
    print(int(num_chars / num_words), int(num_words / num_sents),
          int(num_words / num_vocab), fileid)

The output is:
4 24 26 austen-emma.txt
4 26 16 austen-persuasion.txt
4 28 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 20 12 carroll-alice.txt
4 20 11 chesterton-ball.txt
4 22 11 chesterton-brown.txt
4 18 10 chesterton-thursday.txt
4 20 24 edgeworth-parents.txt
4 25 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 36 12 whitman-leaves.txt

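The output shows three statistics for each text: average word length, average sentence length, and the average number of times each word appears in the text. Note that num_chars counts whitespace as well, so the first figure somewhat overstates the true average word length.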
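
To make the three ratios concrete, here is a minimal sketch that computes them on a tiny hand-tokenized text, so it runs without any corpus download. The function name corpus_stats and the toy sentences are illustrative assumptions, not part of NLTK.

```python
def corpus_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, avg uses per word),
    each truncated with int() as in the Gutenberg loop."""
    vocab = {w.lower() for w in words}  # distinct words after lowercasing
    return (int(len(raw) / len(words)),
            int(len(words) / len(sents)),
            int(len(words) / len(vocab)))

# Hypothetical toy text, tokenized by hand (punctuation counts as a token,
# which is also how NLTK's word lists treat it)
raw = "The cat sat. The cat ran."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]
words = [w for sent in sents for w in sent]

stats = corpus_stats(raw, words, sents)
print(stats)  # (3, 4, 1)
```

Here 25 characters over 8 tokens gives 3, 8 tokens over 2 sentences gives 4, and 8 tokens over 5 distinct lowercased tokens gives 1 after truncation.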

(2) Web and chat text:

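This collection represents less formal language. It includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.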

Import: from nltk.corpus import webtext

import nltk
from nltk.corpus import webtext

for fileid in webtext.fileids():
    # show the first 65 characters of each file
    print(fileid, webtext.raw(fileid)[:65], '...')

The output is:
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

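There is also a corpus of instant-messaging chat sessions, originally collected by the Naval Postgraduate School for research on the automatic detection of Internet predators:

>>> from nltk.corpus import nps_chat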

(3) Brown Corpus:

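The Brown Corpus was the first million-word electronic corpus of English, created at Brown University in 1961. It contains text from about 500 sources, categorized by genre, such as news, editorial, and so on.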

>>> import nltk
>>> from nltk.corpus import brown
>>> print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

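The Brown Corpus is a convenient resource for studying systematic differences between genres. Let's compare how modal verbs are used in different genres. The steps are as follows: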

Step 1: Count the modal verbs in one particular genre.

import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')
# lowercase first so that "Can" and "can" are counted together
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m])

The output is:
can: 94
could: 87
may: 93
might: 38
must: 53
will: 389

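Step 2: Count the modal verbs in every genre of interest. For this we use the conditional frequency distribution that NLTK provides (ConditionalFreqDist).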

# Reuses the imports from step 1; pairs every word with the genre it occurs in
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

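The output is shown below. The news row differs slightly from step 1 (for example, may is 66 rather than 93) because step 2 does not lowercase the words, so capitalized forms such as "May" are not counted here.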

                   can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13
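
To illustrate what a conditional frequency distribution stores, the sketch below rebuilds the idea in plain Python: one counter per condition, filled from (condition, sample) pairs. The helper conditional_freq and the toy word lists are assumptions made for illustration; this is not how NLTK implements ConditionalFreqDist internally.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Count samples separately for each condition, given (condition, sample) pairs."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

# Hypothetical toy data standing in for brown.words(categories=genre)
toy_corpus = {
    "news": ["The", "council", "will", "meet", "and", "may", "vote"],
    "romance": ["She", "could", "not", "say", "what", "she", "could", "do"],
}
pairs = [(genre, word) for genre, words in toy_corpus.items() for word in words]
cfd_sketch = conditional_freq(pairs)

# One row per genre, one column per modal, like tabulate()
for genre in toy_corpus:
    print(genre, [cfd_sketch[genre][m] for m in ("could", "may", "will")])
```

A Counter returns 0 for a word it has never seen, which is why modals that do not occur in a genre simply show up as zero.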

(4) Reuters Corpus

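The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and divided into two sets, "training" and "test". The document with fileid 'test/14826', for example, belongs to the test set. The split exists so that algorithms for automatically detecting a document's topic can be trained on one set and evaluated on the other.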
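
Because the set name is simply the part of the fileid before the slash, the split can be recovered with ordinary string handling. The sketch below assumes that naming pattern; the helper split_by_group is hypothetical, and apart from 'test/14826' the sample IDs are illustrative.

```python
def split_by_group(fileids):
    """Group Reuters-style fileids by the prefix before '/'."""
    groups = {}
    for fileid in fileids:
        group, _, _doc_id = fileid.partition("/")
        groups.setdefault(group, []).append(fileid)
    return groups

# Illustrative sample; only 'test/14826' is taken from the text above
sample_ids = ["test/14826", "test/14828", "training/9865", "training/9880"]
groups = split_by_group(sample_ids)
print(groups["test"])      # ['test/14826', 'test/14828']
print(groups["training"])  # ['training/9865', 'training/9880']
```

With the corpus installed, the list returned by reuters.fileids() (after from nltk.corpus import reuters) could be passed in place of sample_ids.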

(5) Inaugural Address Corpus

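This corpus is a collection of presidential inaugural addresses, one text per address: 55 in the original description, 56 in the listing below, which runs through 2009. The distinctive feature of the collection is its time dimension.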

import nltk
from nltk.corpus import inaugural
print(inaugural.fileids())

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
As you can see, the year of each text appears in its filename. To extract the year from the filename, simply use fileid[:4].
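
A short illustration of that slicing, using four fileids copied from the listing above. The slice fileid[5:-4] for the president's name is an extra assumption based on the same 'YYYY-Name.txt' pattern.

```python
fileids = ['1789-Washington.txt', '1861-Lincoln.txt',
           '1961-Kennedy.txt', '2009-Obama.txt']

# The first four characters are the year
years = [int(fileid[:4]) for fileid in fileids]
# Skip 'YYYY-' at the front and '.txt' at the end to get the name
names = [fileid[5:-4] for fileid in fileids]

print(years)  # [1789, 1861, 1961, 2009]
print(names)  # ['Washington', 'Lincoln', 'Kennedy', 'Obama']
```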

Example: let's look at how words beginning with 'american' and 'citizen' are used over time.

import nltk
from nltk.corpus import inaugural

# Condition = target word, sample = year taken from the fileid
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['american', 'citizen']
    if w.lower().startswith(target))

cfd.plot()

The result is a line plot with one curve per target word and the address years along the x-axis.
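
The counting done by the generator expression above can also be checked without plotting. The sketch below applies the same lowercase-then-startswith test to a few hand-written tokens; the toy_words dictionary is hypothetical data, not text from the corpus.

```python
from collections import Counter

targets = ['american', 'citizen']

# Hypothetical tokens keyed by year; real input would come from inaugural.words(fileid)
toy_words = {
    '1789': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '2009': ['My', 'fellow', 'citizens', 'every', 'American', 'and', 'Americans', 'abroad'],
}

# Same filter as in the ConditionalFreqDist call: lowercase, then prefix match
counts = Counter(
    (target, year)
    for year, words in toy_words.items()
    for w in words
    for target in targets
    if w.lower().startswith(target)
)
print(counts[('american', '2009')])  # 2
print(counts[('citizen', '1789')])   # 1
```

Prefix matching means that inflected forms such as "Americans" and "Citizens" are counted under their target word.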