I. Accessing Text Corpora
A text corpus is a large body of text. It usually consists of many individual texts, but for convenience of processing we concatenate them end to end and treat them as a single text.
1. The Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and then try nltk.corpus.gutenberg.fileids(). For example:
```python
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
'whitman-leaves.txt']
```
The output lists the texts of this corpus that ship with NLTK. We can work with any of them.
1) Counting words. For example:
```python
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
```
2) Concordancing a text. For example:
```python
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
that Emma could not but feel some surprise , and a little displeasure , on he
```
3) Accessing a text's raw characters, words, and sentences. For example:
```python
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)
...     num_sents = len(sents)
...     vocab = set(w.lower() for w in gutenberg.words(fileid))
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt
>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses for my Body let us write, (for we are one,)\nThat should I after return,\nOr, long, long hence, in other spheres,\nThere to some group of mates the chants resuming,\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BOOK I. INSCRIPTIONS]\n\n} One's-Self I Sing\n\nOne's-self I sing, a simple separate person,\nYet utter the word Democratic, the word En-Masse.\n\nOf physiology from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for the Muse, I say\n the Form complete is worthier far,\nThe Female equally with the Male I sing.\n\nOf Life immense in passion, pulse, and power,\nCheerful, for freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n} As I Ponder'd in Silence\n\nAs I ponder'd in silence,\nReturning upon my poems, c"
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'], ['Come', ',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body', 'let', 'us', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That', 'should', 'I', 'after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence', ',', 'in', 'other', 'spheres', ',', 'There', 'to', 'some', 'group', 'of', 'mates', 'the', 'chants', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's', 'soil', ',', 'trees', ',', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever', 'with', 'pleas', "'", 'd', 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever', 'and', 'ever', 'yet', 'the', 'verses', 'owning', '--', 'as', ',', 'first', ',', 'I', 'here', 'and', 'now', 'Signing', 'for', 'Soul', 'and', 'Body', ',', 'set', 'to', 'them', 'my', 'name', ','], ...]
```
Here raw is the raw character content of the text, words is the list of words, and sents is the list of sentences; each sentence is itself stored as a list of words. Besides words(), raw(), and sents(), most NLTK corpus readers provide a variety of other access methods.
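The counts gathered in the loop above can also be combined into per-text statistics. A minimal sketch along the same lines (the three ratios are average word length, average sentence length, and how many times each vocabulary item appears on average):

```python
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
...     # average word length, average sentence length, lexical diversity
...     print(int(num_chars / num_words), int(num_words / num_sents),
...           int(num_words / num_vocab), fileid)
```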
2. Web and Chat Text
Project Gutenberg contains thousands of books; they are relatively formal and represent established literature. Beyond that, NLTK includes many small collections of web text, covering content such as a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal ads, and wine reviews. This part of the collection can be accessed as follows:
```python
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print("%s %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
```
3. The Instant Messaging Chat Sessions Corpus
This corpus was originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. It contains over 10,000 posts divided into 15 files, each holding several hundred posts collected from a chat room for a given date and age group. The filename encodes the date, the chat room, and the number of posts. For example:
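A brief sketch of accessing this corpus (the fileid shown follows the date-ageGroup-postcount naming pattern described above; indexing into a file returns one tokenized post):

```python
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')  # posts from 10/19, 20s chat room
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',',
'i', 'can', 'look', 'in', 'a', 'mirror', '.']
```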
4. The Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 different sources, categorized by genre, such as news, editorial, and so on. It is mainly used to study systematic differences between genres (a kind of linguistic inquiry known as stylistics). We can access the corpus as a list of words or as a list of sentences.
1) Reading by a particular category or file
```python
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
```
2) Comparing the use of modal verbs across genres
```python
>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print("%s:%d" % (m, fdist[m]))
...
can:94
could:87
may:93
might:38
must:53
will:389
```
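The same comparison can be run over several genres at once with a conditional frequency distribution; a minimal sketch (cfd.tabulate() prints a genre-by-modal table of counts):

```python
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)  # one row of counts per genre
```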
5. The Reuters Corpus
The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents are classified into 90 topics and grouped into a training set and a test set; this split makes it convenient to train and test algorithms that automatically detect a document's topic. Unlike the Brown Corpus, the categories in the Reuters Corpus overlap, because a news story often covers multiple topics. We can look up the topics covered by one or more documents, or the documents included in one or more categories. For example:
```python
>>> from nltk.corpus import reuters
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', ...]
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648', 'test/15649', ...]
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]
```
6. The Inaugural Address Corpus
This corpus is actually a collection of 55 texts, one per presidential address. An interesting property of this collection is its time dimension.
```python
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt',
'1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt',
'1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt',
'1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt',
'1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt',
'1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt',
'1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt',
'1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt',
'1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt',
'1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt',
'1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt',
'1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt',
'1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt',
'1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825',
'1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865',
'1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905',
'1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945',
'1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985',
'1989', '1993', '1997', '2001', '2005', '2009']
```
Note that the year of each text appears in its filename. To get the year from a filename, use fileid[:4] to extract the first four characters.
```python
>>> import nltk
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
The example above plots the usage of the words america and citizen over time. Every word in the Inaugural Address Corpus that begins with america or citizen is counted. Each address is counted and plotted separately, so the evolution of these usages over time can be observed. Note that the counts are not normalized by document length.
7. Annotated Text Corpora
Many text corpora contain linguistic annotations: part-of-speech tags, named entities, syntactic structures, semantic roles, and so on. NLTK provides several convenient methods for accessing these corpora, and it ships with data packages containing corpora and corpus samples that are free to download for teaching and research.
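For example, the Brown Corpus reader exposes the corpus's part-of-speech annotation through an additional tagged_words() method, which returns each token as a (word, tag) pair:

```python
>>> from nltk.corpus import brown
>>> brown.tagged_words()[:3]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
```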
8. Corpora in Other Languages
NLTK also includes corpora in many languages. For example, udhr contains the Universal Declaration of Human Rights in over 300 languages. The fileids of this corpus include information about the character encoding used in each file, such as UTF8 or Latin1. We can use a conditional frequency distribution to examine the differences in word lengths across language versions of the Universal Declaration of Human Rights (udhr). For example:
```python
>>> import nltk
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut',
...              'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
```
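Instead of plotting, the same distribution can be tabulated for a subset of conditions; a minimal sketch (with cumulative=True, each column shows how many words have at most the given length):

```python
>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)
```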
9. Basic corpus functions defined in NLTK
| Example | Description |
| --- | --- |
| fileids() | the files of the corpus |
| fileids([categories]) | the files of the corpus corresponding to the given categories |
| categories() | the categories of the corpus |
| categories([fileids]) | the categories of the corpus corresponding to the given files |
| raw() | the raw content of the corpus |
| raw(fileids=[f1, f2, f3]) | the raw content of the specified files |
| raw(categories=[c1, c2]) | the raw content of the specified categories |
| words() | the words of the whole corpus |
| words(fileids=[f1, f2, f3]) | the words of the specified files |
| words(categories=[c1, c2]) | the words of the specified categories |
| sents() | the sentences of the whole corpus |
| sents(fileids=[f1, f2, f3]) | the sentences of the specified files |
| sents(categories=[c1, c2]) | the sentences of the specified categories |
| abspath(fileid) | the location of the given file on disk |
| encoding(fileid) | the encoding of the file (if known) |
| open(fileid) | open a stream for reading the given corpus file |
| root | the path to the root of the locally installed corpus |
| readme() | the contents of the corpus README file |
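A quick check of a few of these methods on the Gutenberg corpus (a sketch only; the paths and encoding printed depend on where and how nltk_data is installed on your machine):

```python
>>> from nltk.corpus import gutenberg
>>> gutenberg.root                         # root of the locally installed corpus
>>> gutenberg.abspath('austen-emma.txt')   # absolute path of one file
>>> gutenberg.encoding('austen-emma.txt')  # character encoding, if known
>>> gutenberg.readme()[:60]                # first characters of the README
```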
10. Loading Your Own Corpus
```python
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r"E:\corpora"   # local directory holding your texts; the stock nltk data lives elsewhere (here under D:\)
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()           # list the files found under corpus_root
['README', 'aaaaaaaaaaa.txt', 'austen-emma.txt', 'austen-persuasion.txt',
'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'luo.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
```

In this listing, aaaaaaaaaaa.txt is a user-created file placed alongside copies of the Gutenberg texts.
Once your own corpus has been loaded successfully, the full set of corpus functions can be used on its contents.
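For instance, the standard reader methods now apply to the custom file (a sketch; 'aaaaaaaaaaa.txt' is the user-created file from the listing above, so the actual output depends on its contents):

```python
>>> wordlists.words('aaaaaaaaaaa.txt')     # the words of the custom file
>>> wordlists.sents('aaaaaaaaaaa.txt')     # its sentences
>>> wordlists.raw('aaaaaaaaaaa.txt')[:50]  # its first 50 characters
```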