3. Understanding Text Syntax and Structure

Date: 2019-11-20


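This part introduces and implements several concepts and techniques for understanding text syntax and structure. These algorithms are very useful in NLP and are usually run after text processing and normalization. The focus is on the following techniques:

Part-of-speech (POS) tagging.
Shallow parsing.
Dependency-based parsing.
Constituency-based parsing.

The intended readers are text analytics practitioners who want to apply the best available techniques and algorithms to practical problems. We therefore cover the best ways to implement and run these techniques with existing libraries such as nltk and spacy. Because many readers may also be interested in the internals and may want to implement parts of the techniques themselves, we show how to do that as well. Keep in mind that the main focus is on studying how the concepts are implemented through practical examples, not on rewriting the methods from scratch.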

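Installing the necessary dependencies

The following dependencies are required:

The nltk library
The spacy library
The pattern library
The Stanford parser
Graphviz and its required libraries.

If you would rather not download nltk's many data packages one by one, you can download all of them with the following code:

    In [94]: import nltk

    In [95]: nltk.download("all", halt_on_error=False)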

To install the pattern library, run:

    $ pip install pattern

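This downloads and installs pattern together with its required dependencies.

For spacy, install the package first and then install its dependencies (the language models) separately. Run the following in a terminal:

    $ pip install spacy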

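Once the installation finishes, use the command:

    $ python -m spacy download en

to download the English language model (about 500 MB) from the terminal; it is stored under the root directory of the spacy package. Recent spaCy releases no longer accept the en shortcut and expect a full model name such as en_core_web_sm. For more details, see https://spacy.io/models/, which also includes usage instructions for spacy.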

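The Stanford parser is a Java-based language parser developed at Stanford University; it helps us parse sentences to understand their underlying structure. We will use the Stanford parser together with nltk to perform both dependency-based and constituency-based parsing. nltk provides an excellent wrapper that lets you use the parser from Python itself, so there is no need to program in Java. The page https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software explains how to download and install the Stanford parser and integrate it with nltk.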

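Graphviz is not strictly required; it is only used to view the dependency parse trees generated by the Stanford parser. You can download and install Graphviz from its official website, http://www.graphviz.org/download. Then install pygraphviz: download the wheel file that matches your system architecture and Python version from https://www.lfd.uci.edu/~gohlke/pythonlibs/#pygraphviz. Next, use the command

    $ pip install pygraphviz-1.3.1.._amd64.whl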

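to install it (this file targets a 64-bit system with Python 2.7.x). Once installed, pygraphviz should work normally. If you run into other problems during installation, try running the following commands:

    $ pip install pydot-ng

    $ pip install graphviz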
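Once the installation steps above are done, a quick way to see which packages are actually importable is a small check like the one below. It is only a sketch: the import names in PACKAGES are assumed to match the packages listed above, and check_dependencies is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Import names assumed for the packages listed above; adjust if yours differ.
PACKAGES = ["nltk", "spacy", "pattern", "pygraphviz"]


def check_dependencies(names=PACKAGES):
    """Return {name: True/False} depending on whether the module can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

The check only confirms that a module can be located; it does not verify that nltk data or a spacy language model has been downloaded.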

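Important machine learning concepts

We will make use of some pre-built taggers and also train our own. To follow the implementation, you need to know the following important concepts from data analysis and machine learning.

Data preparation: usually consists of feature extraction and pre-processing the data before training.
Feature extraction: the process of extracting useful features from raw data to train a machine learning model.
Features: various useful attributes of the data (for personal data, examples could be age, weight, and so on).
Training data: a set of data used to train a model.
Testing/validation data: a set of data on which the trained model is tested and evaluated to see how well it performs.
Model: built from a combination of data/features using a machine learning algorithm, which can be supervised or unsupervised.
Accuracy: how accurately the model predicts (there are other, more detailed evaluation metrics such as precision, recall, and F1-score).

Knowing these terms should be enough to follow the rest of the material.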
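The terms above can be tied together in a few lines of plain Python. The sketch below is purely illustrative: the toy data, the extract_features helper, and the "always predict the most frequent tag" model are all invented for this example and are not taken from nltk.

```python
from collections import Counter

# Toy tagged data (invented for illustration): each item is (word, tag).
data = [("the", "DT"), ("fox", "NN"), ("dog", "NN"), ("runs", "VBZ"),
        ("a", "DT"), ("cat", "NN"), ("quick", "JJ"), ("bird", "NN")]

# Data preparation: hold out the last quarter as test data.
split = int(len(data) * 0.75)
train_data, test_data = data[:split], data[split:]


def extract_features(word):
    """Feature extraction: turn a raw word into a few useful attributes."""
    return {"suffix": word[-1:], "length": len(word)}


def train(samples):
    """'Model': remember the most frequent tag seen in the training data."""
    most_common_tag, _ = Counter(tag for _, tag in samples).most_common(1)[0]
    return lambda word: most_common_tag


def accuracy(model, samples):
    """Fraction of samples whose predicted tag equals the true tag."""
    correct = sum(1 for word, tag in samples if model(word) == tag)
    return correct / len(samples)


model = train(train_data)
print(extract_features("runs"))     # {'suffix': 's', 'length': 4}
print(accuracy(model, test_data))   # 0.5
```

The model here ignores the features entirely, which is exactly what makes it a baseline; a real tagger would learn from them.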

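POS tagging

Parts of speech (POS) are specific lexical categories to which words are assigned based on their syntactic context and role. The main POS are nouns, verbs, adjectives, and adverbs. Classifying words and labeling them with POS tags is called part-of-speech tagging, or POS tagging. POS tags annotate words and describe their part of speech, which is very useful when annotated text is needed in NLP-based applications: you can filter the data by specific parts of speech and use that information for specific analyses, such as narrowing the vocabulary down to nouns and seeing which ones are most prominent, word sense disambiguation, and grammar analysis.

We will use the Penn Treebank notation for POS tagging. More information about the various POS tags and their notation can be found at http://www.cis.uni-muenchen.de/schmid/tools/TreeTagger/data/Penn-Treebank-Tagset.pdf, which contains detailed documentation with examples for each tag. The Penn Treebank project is hosted at the University of Pennsylvania, and the project website http://www.cis.upenn.edu/index.php provides more information. There are various kinds of tags for different purposes: POS tags are assigned to words to mark their part of speech, chunk tags are usually assigned to phrases, and some tags are secondary tags used to describe relations. The table below describes the tags in detail with examples. If you do not want to go through the full Penn Treebank tag documentation, you can use the table as a reference at any time to better understand POS tags and parse trees: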

No.	Tag	Description	Example
1	CC	Coordinating conjunction	and, or
2	CD	Cardinal number	five, one, 2
3	DT	Determiner	a, the
4	EX	Existential there	there were two cars
5	FW	Foreign word	d'oeuvre, mais
6	IN	Preposition/subordinating conjunction	of, in, on, that
7	JJ	Adjective	quick, lazy
8	JJR	Adjective, comparative	quicker, lazier
9	JJS	Adjective, superlative	quickest, laziest
10	LS	List item marker	2)
11	MD	Verb, modal	could, should
12	NN	Noun, singular or mass	fox, dog
13	NNS	Noun, plural	foxes, dogs
14	NNP	Proper noun, singular	John, Alice
15	NNPS	Proper noun, plural	Vikings, Indians, Germans
16	PDT	Predeterminer	both the cats
17	POS	Possessive ending	boss's
18	PRP	Pronoun, personal	me, you
19	PRP$	Pronoun, possessive	our, my, your
20	RB	Adverb	naturally, extremely, hardly
21	RBR	Adverb, comparative	better
22	RBS	Adverb, superlative	best
23	RP	Adverb, particle	about, up
24	SYM	Symbol	%, $
25	TO	Infinitival to	how to, what to do
26	UH	Interjection	oh, gosh, wow
27	VB	Verb, base form	run, give
28	VBD	Verb, past tense	ran, gave
29	VBG	Verb, gerund or present participle	running, giving
30	VBN	Verb, past participle	given
31	VBP	Verb, non-3rd person singular present	I think, I take
32	VBZ	Verb, 3rd person singular present	he thinks, he takes
33	WDT	Wh-determiner	which, whatever
34	WP	Wh-pronoun, personal	who, what
35	WP$	Wh-pronoun, possessive	whose
36	WRB	Wh-adverb	where, when
37	NP	Noun phrase	the brown fox
38	PP	Prepositional phrase	in between, over the dog
39	VP	Verb phrase	was jumping
40	ADJP	Adjective phrase	warm and snug
41	ADVP	Adverb phrase	also
42	SBAR	Subordinating conjunction	whether or not
43	PRT	Particle	up
44	INTJ	Interjection	hello
45	PNP	Prepositional noun phrase	over the dog, as of today
46	-SBJ	Sentence subject	the fox jumped over the dog
47	-OBJ	Sentence object	the fox jumped over the dog

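The table shows the main POS tag set of the Penn Treebank, which is the most widely used POS tag set across text analytics and natural language processing applications. Rows 37 to 45 are chunk tags and rows 46 and 47 are relation tags rather than word-level POS tags.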
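Filtering by part of speech, mentioned earlier as a typical use of POS tags, comes down to a prefix test on the tags in the table. The following is a minimal sketch on hand-written (word, tag) pairs; filter_by_pos is a hypothetical helper introduced only for this illustration.

```python
# Hand-written (word, tag) pairs using Penn Treebank tags; not produced by a tagger.
tagged = [("The", "DT"), ("brown", "JJ"), ("fox", "NN"), ("is", "VBZ"),
          ("jumping", "VBG"), ("over", "IN"), ("the", "DT"),
          ("lazy", "JJ"), ("dogs", "NNS")]


def filter_by_pos(tagged_tokens, prefixes):
    """Keep the words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens
            if tag.startswith(tuple(prefixes))]


nouns = filter_by_pos(tagged, ["NN"])        # matches NN, NNS, NNP, NNPS
adjectives = filter_by_pos(tagged, ["JJ"])   # matches JJ, JJR, JJS

print(nouns)        # ['fox', 'dogs']
print(adjectives)   # ['brown', 'lazy']
```

Matching on the prefix works because the Penn Treebank groups related tags under a shared stem, so "NN" covers singular, plural, and proper nouns alike.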

Recommended POS taggers

Here we discuss some recommended ways of tagging sentences. The first is nltk's recommended pos_tag() function, which is based on the Penn Treebank. The following snippet shows how to get the POS tags of a sentence with nltk:

    In [96]: sentence = 'The brown fox is quick and he is jumping over the lazy dog'

    In [97]: # recommended tagger based on PTB

    In [98]: import nltk

    In [99]: tokens = nltk.word_tokenize(sentence)

    In [100]: tagged_sent = nltk.pos_tag(tokens, tagset='universal')

    In [101]: print(tagged_sent)
    [('The', 'DET'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('is', 'VERB'), ('quick', 'ADJ'), ('and', 'CONJ'), ('he', 'PRON'), ('is', 'VERB'), ('jumping', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]

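The output above shows the POS tag of every word in the sentence. Because the call passes tagset='universal', the tags shown are the coarser universal tags (DET, ADJ, NOUN, and so on) that the Penn Treebank tags in the table map onto, rather than the Penn Treebank tags themselves. You can also use the pattern module to get the POS tags of a sentence with the following code:

    In [102]: from pattern.en import tag

    In [103]: tagged_sent = tag(sentence)

    In [104]: print(tagged_sent)
    [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]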

This output gives tags that strictly follow the Penn Treebank format, identifying adjectives, nouns, and verbs at a finer level of detail.
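The two outputs differ only in granularity: the universal tags printed by nltk earlier are a coarser view of the Penn Treebank tags printed by pattern. The sketch below makes that relationship concrete with a small hand-written mapping. The table PTB_TO_UNIVERSAL is deliberately partial and is an assumption made for illustration; nltk ships its own complete mapping.

```python
# Partial, hand-written mapping for illustration only (not nltk's mapping table).
PTB_TO_UNIVERSAL = {
    "DT": "DET", "JJ": "ADJ", "NN": "NOUN", "NNS": "NOUN",
    "VBZ": "VERB", "VBG": "VERB", "CC": "CONJ", "PRP": "PRON", "IN": "ADP",
}


def to_universal(tagged_tokens, default="X"):
    """Replace each Penn Treebank tag with its coarse universal category."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, default))
            for word, tag in tagged_tokens]


# Penn Treebank tags as printed by pattern for the sample sentence.
ptb_tagged = [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
              ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
              ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
              ('lazy', 'JJ'), ('dog', 'NN')]

print(to_universal(ptb_tagged))
```

For this sentence the result matches the universal tags shown in the nltk output; tags missing from the partial table fall back to X, the universal "other" category.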

Building your own POS taggers

Next we explore some techniques for building our own POS taggers, implemented with classes that nltk provides. To evaluate tagger performance, we use test data from the treebank corpus in nltk, and we use part of the same corpus as training data. First, reading the tagged treebank corpus gives us the data needed to train and evaluate the taggers:

    In [135]: from nltk.corpus import treebank

    In [136]: data = treebank.tagged_sents()

    In [137]: train_data = data[:3500]

    In [138]: test_data = data[3500:]

    In [139]: print(train_data[0])
    [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

    In [140]: tokens = nltk.word_tokenize(sentence)

    In [141]: print(tokens)
    ['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']

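We will use the test data to evaluate the taggers and use the tokens of the sample sentence as input to check how well they work. All the taggers used here come from the nltk.tag package. Every tagger is a subclass of the base class TaggerI and implements a tag() function that takes the list of tokens of a sentence as input and returns the same list of words with their POS tags as output. Besides tagging, there is an evaluate() function for measuring tagger performance: it tags each input test sentence and compares the output with the actual tags of that sentence. We will use this function to test our taggers on test_data.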

First, let's look at DefaultTagger, which inherits from the SequentialBackoffTagger base class and assigns the same user-supplied POS tag to every word. It may look overly simple, but it is a good way to establish a baseline for POS taggers:

    In [142]: from nltk.tag import DefaultTagger

    In [143]: dt = DefaultTagger('NN')

    In [144]: print(dt.evaluate(test_data))
    0.1454158195372253

    In [145]: print(dt.tag(tokens))
    [('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NN'), ('jumping', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]

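The output shows that about 14.5% of the words in the treebank test data are tagged correctly, which is not a good result. As expected, every token in the sample sentence is tagged as a noun, because the tagger was given that single tag.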

Now we use regular expressions and RegexpTagger to try to build a better tagger:

    # regex tagger
    from nltk.tag import RegexpTagger

    # define regex tag patterns
    patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN')                     # nouns (default) ...
    ]
    rt = RegexpTagger(patterns)

    In [161]: print(rt.evaluate(test_data))
    0.24039113176493368

    In [162]: print(rt.tag(tokens))
    [('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NNS'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NNS'), ('jumping', 'VBG'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]
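To see why is ends up tagged NNS in the output above, it helps to spell out the matching rule: patterns are tried in order and the first one that matches decides the tag. The sketch below re-implements that rule with the standard re module. It is a simplified stand-in written for illustration, not nltk's RegexpTagger code, and regex_tag is a hypothetical name.

```python
import re

# Same pattern list as in the snippet above, in the same order.
patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                    # nouns (default)
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]


def regex_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


tokens = ['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is',
          'jumping', 'over', 'the', 'lazy', 'dog']
print(regex_tag(tokens))
```

Here is matches '.*s$' before it can reach the catch-all noun rule, so it is labelled as a plural noun. Suffix rules like these have no notion of context, which is one reason the accuracy stays low.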
         
 
      
 
     
   

The regular-expression tagger reaches about 24% accuracy, and we should be able to do better, so we now train some n-gram taggers. An n-gram is a sequence of n contiguous items from a text or speech sequence. The items can be words, phonemes, letters, characters, or syllables. Shingles are n-grams made up only of words. We will use n-grams of size 1, 2, and 3, known as unigrams, bigrams, and trigrams respectively. UnigramTagger, BigramTagger, and TrigramTagger inherit from the base class NgramTagger, which inherits from ContextTagger, which in turn inherits from SequentialBackoffTagger. We use train_data to train the n-gram taggers on sentence tokens and their POS tags, then evaluate the trained taggers on test_data and look at the tags they produce for the sample sentence.

    ## N gram taggers
    from nltk.tag import UnigramTagger
    from nltk.tag import BigramTagger
    from nltk.tag import TrigramTagger

    ut = UnigramTagger(train_data)
    bt = BigramTagger(train_data)
    tt = TrigramTagger(train_data)

    In [170]: print(ut.evaluate(test_data))
    0.860683512440701

    In [171]: print(ut.tag(tokens))
    [('The', 'DT'), ('brown', None), ('fox', None), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', None), ('dog', None)]

    In [172]: print(bt.evaluate(test_data))
    0.13486300706747992

    In [173]: print(bt.tag(tokens))
    [('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]

    In [174]: print(tt.evaluate(test_data))
    0.08084035240584761

    In [175]: print(tt.tag(tokens))
    [('The', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]

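The output makes clear that UnigramTagger alone achieves 86% accuracy on the test set, which is far better than the previous taggers. The tag None means the tagger could not tag that word because it never saw a similar token in the training data. The bigram and trigram models are far less accurate than the unigram model because the bigrams and trigrams observed in the training data do not necessarily occur in the same way in the test data.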
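The run of None tags produced by the bigram and trigram taggers can be reproduced with a small counting model. The sketch below is a simplified stand-in, assuming that an n-gram tagger looks up the current word together with the previous n-1 predicted tags; the toy training sentences and both helper functions are invented for this illustration.

```python
from collections import Counter, defaultdict

# Toy training sentences (invented): lists of (word, tag) pairs.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("is", "VBZ"), ("quick", "JJ")],
    [("the", "DT"), ("fox", "NN"), ("is", "VBZ"), ("lazy", "JJ")],
]


def train_ngram_tagger(sentences, n):
    """Map (previous n-1 tags, word) to the most frequent tag for that context."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tags = [tag for _, tag in sent]
        for i, (word, tag) in enumerate(sent):
            history = tuple(tags[max(0, i - n + 1):i])
            counts[(history, word)][tag] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def ngram_tag(model, tokens, n):
    """Tag left to right; a context never seen in training yields None."""
    tags = []
    for i, word in enumerate(tokens):
        history = tuple(tags[max(0, i - n + 1):i])
        tags.append(model.get((history, word)))
    return list(zip(tokens, tags))


tokens = ["the", "brown", "fox", "is", "quick"]
unigram = train_ngram_tagger(train_sents, 1)
bigram = train_ngram_tagger(train_sents, 2)

print(ngram_tag(unigram, tokens, 1))
# [('the', 'DT'), ('brown', None), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ')]
print(ngram_tag(bigram, tokens, 2))
# [('the', 'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None)]
```

In the bigram case, once brown receives None the next context becomes (None, "fox"), which was never seen during training, so the failure cascades to the end of the sentence. The unigram model has no tag history and recovers immediately.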

Now we try to combine all the taggers by building a combined tagger from a list of taggers together with a backoff tagger. In essence, we create a chain of taggers in which each tagger falls back to its backoff tagger whenever it cannot tag an input token:

    def combined_tagger(train_data, taggers, backoff=None):
        # each new tagger uses the previously built one as its backoff
        for tagger in taggers:
            backoff = tagger(train_data, backoff=backoff)
        return backoff

    ct = combined_tagger(train_data=train_data,
                         taggers=[UnigramTagger, BigramTagger, TrigramTagger],
                         backoff=rt)

    In [181]: print(ct.evaluate(test_data))
    0.909768612644012

    In [182]: print(ct.tag(tokens))
    [('The', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

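We now get 91% accuracy on the test data, which is very good. The new tagger also manages to tag every token in the sample sentence, even though some of the tags are wrong (for example, brown should be an adjective).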
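The backoff idea itself fits in a few lines. The sketch below is a context-free simplification written for illustration (real n-gram taggers also consult the tag history); the helper names and the tiny lookup table are hypothetical.

```python
def make_lookup_tagger(table):
    """Tagger that knows only the words in `table`; returns None otherwise."""
    return lambda word: table.get(word)


def make_default_tagger(tag):
    """Tagger that always answers with the same tag."""
    return lambda word: tag


def chain(taggers):
    """Try each tagger in order and return the first tag that is not None."""
    def tag_word(word):
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                return tag
        return None
    return tag_word


lookup = make_lookup_tagger({"the": "DT", "is": "VBZ"})
tagger = chain([lookup, make_default_tagger("NN")])

print([(w, tagger(w)) for w in ["the", "brown", "fox", "is"]])
# [('the', 'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'VBZ')]
```

In combined_tagger above, each constructor receives the previously built tagger as its backoff, so the returned object is the TrigramTagger, which falls back to the BigramTagger, then the UnigramTagger, and finally the regular-expression tagger rt. That last step explains why brown and lazy come out as NN.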

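For the final tagger, we use a supervised classification algorithm for training. The ClassifierBasedPOSTagger class lets us train a tagger with the supervised machine learning algorithm passed in the classifier_builder parameter. The class inherits from ClassifierBasedTagger and has a feature_detector() function that forms the core of the training process. This function generates various features from the training data, such as the word, the previous word, the tag, the previous tag, and the case. In fact, you can also build your own feature detector function and pass it to the feature_detector parameter when instantiating a ClassifierBasedPOSTagger object. The classifier used here is NaiveBayesClassifier, which uses Bayes' theorem to build a probabilistic classifier under the assumption that the features are independent. The algorithm itself is beyond the scope of this discussion; to learn more, see https://en.wikipedia.org/wiki/Naive_Bayes_classifier

The following snippet shows how to build a classification-based POS tagger and evaluate it:

    from nltk.classify import NaiveBayesClassifier, MaxentClassifier
    from nltk.tag.sequential import ClassifierBasedPOSTagger

    nbt = ClassifierBasedPOSTagger(train=train_data,
                                   classifier_builder=NaiveBayesClassifier.train)

    In [14]: print(nbt.evaluate(test_data))
    0.9306806079969019

    In [15]: print(nbt.tag(tokens))
    [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'VBG')]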

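With this tagger, accuracy on the test data reaches 93%, the highest of all the taggers. Looking closely at the tags for the sample sentence, almost all of them are correct and sensible, although dog is mis-tagged as VBG. This shows how powerful and effective classifier-based POS taggers can be. You can also try other classifiers, such as MaxentClassifier, and compare their performance with this tagger. There are also several other ways to build or use POS taggers with nltk and other packages. The material above should cover most POS tagging needs.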
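As noted above, a custom feature detector can be passed through the feature_detector parameter. The function below is a hypothetical example of what such a detector might compute. It assumes the (tokens, index, history) calling convention, where history holds the tags already assigned to earlier tokens, and the feature names are invented; it is not the detector that ClassifierBasedPOSTagger uses by default.

```python
def pos_features(tokens, index, history):
    """Hypothetical feature detector: describe tokens[index] for a classifier."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
        "is_digit": word.isdigit(),
        # Context features fall back to a marker at the start of the sentence.
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "prev_tag": history[index - 1] if index > 0 else "<START>",
    }


tokens = ["The", "brown", "fox"]
print(pos_features(tokens, 1, ["DT"]))
```

Each dictionary becomes one training instance for the classifier. Suffix and capitalization features are what let such a tagger make a reasonable guess for words it has never seen.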

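Shallow parsing

Shallow parsing, also known as light parsing or chunking, analyzes the structure of a sentence by breaking it down into its smallest constituents (tokens such as words) and grouping them into higher-level phrases. In shallow parsing, the main focus is on identifying these phrases or chunks rather than digging into the deeper details of the syntax and relations inside each chunk, as seen in the parse trees produced by deep parsing. The main objective of shallow parsing is to obtain semantically meaningful phrases and observe the relations among them.

Next, we start with some recommended, easy-to-use shallow parsers and look at various ways of doing shallow parsing. We also implement our own shallow parsers using techniques such as regular expressions, chunking, chinking, and tag-based training.

Recommended shallow parsers

Here we use the pattern package to create a shallow parser that extracts meaningful chunks from a sentence. The following snippet shows how to run shallow parsing on the sample sentence:

    from pattern.en import parsetree, Chunk
    from nltk.tree import Tree

    sentence = 'The brown fox is quick and he is jumping over the lazy dog'
    tree = parsetree(sentence)

    In [23]: tree
    Out[23]: [Sentence('The/DT/B-NP/O brown/JJ/I-NP/O fox/NN/I-NP/O is/VBZ/B-VP/O quick/JJ/B-ADJP/O and/CC/O/O he/PRP/B-NP/O is/VBZ/B-VP/O jumping/VBG/I-VP/O over/IN/B-PP/B-PNP the/DT/B-NP/I-PNP lazy/JJ/I-NP/I-PNP dog/NN/I-NP/I-PNP')]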

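The output above is the raw shallow-parsed sentence tree for the sample sentence. Comparing it with the earlier POS tag table, you will find that many of the tags are the same. The output also contains some new notation: the prefixes I, O, and B, the popular IOB notation from chunking, which stand for Inside, Outside, and Beginning. A B- prefix before a tag marks the beginning of a chunk, while an I- prefix marks a token inside a chunk. The O tag marks a token that belongs to no chunk. The B- tag is always used for the start of a chunk when the tags that follow are of the same type and there is no O tag between them.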
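The IOB rules just described are enough to rebuild the chunks by hand. The following is a minimal sketch on the word and chunk-tag columns copied from the parser output above; iob_to_chunks is a hypothetical helper and not part of pattern or nltk.

```python
def iob_to_chunks(tagged):
    """Group (word, iob_tag) pairs into (chunk_type, [words]) tuples."""
    chunks, current_type, current_words = [], None, []
    for word, iob in tagged:
        if iob == "O":
            prefix, chunk_type = "O", None
        else:
            prefix, chunk_type = iob.split("-", 1)
        # Close the open chunk on O, on B-, or when the chunk type changes.
        if current_words and (prefix in ("O", "B") or chunk_type != current_type):
            chunks.append((current_type, current_words))
            current_type, current_words = None, []
        if prefix != "O":
            current_type = chunk_type
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


# (word, chunk tag) pairs taken from the Sentence(...) output above.
iob_tagged = [("The", "B-NP"), ("brown", "I-NP"), ("fox", "I-NP"),
              ("is", "B-VP"), ("quick", "B-ADJP"), ("and", "O"),
              ("he", "B-NP"), ("is", "B-VP"), ("jumping", "I-VP"),
              ("over", "B-PP"), ("the", "B-NP"), ("lazy", "I-NP"),
              ("dog", "I-NP")]

for chunk_type, words in iob_to_chunks(iob_tagged):
    print(chunk_type, "->", " ".join(words))
```

The token and, tagged O, closes the preceding ADJP chunk and is itself left out, which is how the seven chunks reported by the parser come about.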

The following snippet shows how to get the chunks in an easy-to-read form:

    In [24]: for sentence_tree in tree:
       ....:     print(sentence_tree.chunks)
       ....:
    [Chunk('The brown fox/NP'), Chunk('is/VP'), Chunk('quick/ADJP'), Chunk('he/NP'), Chunk('is jumping/VP'), Chunk('over/PP'), Chunk('the lazy dog/NP')]

    In [25]: for sentence_tree in tree:
       ....:     for chunk in sentence_tree.chunks:
       ....:         print(chunk.type, '->', [(word.string, word.type)
       ....:                                  for word in chunk.words])
       ....:
    NP -> [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN')]
    VP -> [('is', 'VBZ')]
    ADJP -> [('quick', 'JJ')]
    NP -> [('he', 'PRP')]
    VP -> [('is', 'VBZ'), ('jumping', 'VBG')]
    PP -> [('over', 'IN')]
    NP -> [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

The output above is the shallow parsing result for the sample sentence; it is simple and clear, showing each phrase and its constituents.

We can build some generic functions to parse and visualize shallow-parsed sentence trees more easily and reuse them when analyzing other sentences, as shown in the following code:

    def create_sentence_tree(sentence, lemmatize=False):
        # parse the text and return the tree of its first sentence
        sentence_tree = parsetree(sentence,
                                  relations=True,
                                  lemmata=lemmatize)
        return sentence_tree[0]

    def get_sentence_tree_constituents(sentence_tree):
        return sentence_tree.constituents()

    def process_sentence_tree(sentence_tree):
        tree_constituents = get_sentence_tree_constituents(sentence_tree)
        # chunks become (chunk type, [(word, POS)]); loose words get the type '-'
        processed_tree = [(item.type, [(w.string, w.type) for w in item.words])
                          if type(item) == Chunk
                          else ('-', [(item.string, item.type)])
                          for item in tree_constituents]
        return processed_tree

    def print_sentence_tree(sentence_tree):
        processed_tree = process_sentence_tree(sentence_tree)
        processed_tree = [Tree(item[0], [Tree(x[1], [x[0]]) for x in item[1]])
                          for item in processed_tree]
        tree = Tree('S', processed_tree)
        print(tree)

    def visualize_sentence_tree(sentence_tree):
        processed_tree = process_sentence_tree(sentence_tree)
        processed_tree = [Tree(item[0], [Tree(x[1], [x[0]]) for x in item[1]])
                          for item in processed_tree]
        tree = Tree('S', processed_tree)
        tree.draw()

執行如下代碼段，能夠看出上述函數是如何在例句中發揮做用的：

 
          In [ 
          38 
          ]: t  
          =  
          create_sentence_tree(sentence) 
         
          In [ 
          39 
          ]: t 
         
          Out[ 
          39 
          ]: Sentence( 
          'The/DT/B-NP/O/NP-SBJ-1 brown/JJ/I-NP/O/NP-SBJ-1 fox/NN/I-NP/O/NP-SBJ-1 is/VBZ/B-VP/O/VP-1 quick/JJ/B-ADJP/O/O and/CC/O/O/O he/PRP/B-NP/O/NP-SBJ-2 is/VBZ/B-VP/O/VP-2 jumping/VBG/I-VP/O/VP-2 over/IN/B-PP/B-PNP/O the/DT/B-NP/I-PNP/O lazy/JJ/I-NP/I-PNP/O dog/NN/I-NP/I-PNP/O' 
          )

 
     
      
        
    In [40]: pt = process_sentence_tree(t)

    In [41]: pt
    Out[41]:
    [('NP', [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN')]),
     ('VP', [('is', 'VBZ')]),
     ('ADJP', [('quick', 'JJ')]),
     ('-', [('and', 'CC')]),
     ('NP', [('he', 'PRP')]),
     ('VP', [('is', 'VBZ'), ('jumping', 'VBG')]),
     ('PP', [('over', 'IN')]),
     ('NP', [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')])]
         
 
      
 
     
   

 
    In [42]: print_sentence_tree(t)
    (S
      (NP (DT The) (JJ brown) (NN fox))
      (VP (VBZ is))
      (ADJP (JJ quick))
      (- (CC and))
      (NP (PRP he))
      (VP (VBZ is) (VBG jumping))
      (PP (IN over))
      (NP (DT the) (JJ lazy) (NN dog)))

    In [43]: visualize_sentence_tree(t)

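The output above shows how to create, represent, and visualize a shallow parse tree for the example sentence. For the same sentence, the visualization closely mirrors the printed tree: the lowest level holds the actual token values, the level above holds each token's POS tag, and the level above that holds the chunk phrase labels.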
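The three levels described above can be reproduced without nltk. The following is a minimal sketch in plain Python, assuming the (phrase label, [(token, POS tag), ...]) shape returned by process_sentence_tree; the helper name render_shallow_tree is hypothetical and not part of the original code.

```python
# Chunk list in the (phrase_label, [(token, pos_tag), ...]) shape shown above.
chunks = [
    ('NP', [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN')]),
    ('VP', [('is', 'VBZ')]),
    ('ADJP', [('quick', 'JJ')]),
]


def render_shallow_tree(chunks, root='S'):
    """Render chunks as a bracketed string: root > phrase > POS tag > token."""
    phrases = []
    for phrase_label, tagged_tokens in chunks:
        # Each leaf is written as (POS token), matching the printed tree format.
        leaves = ' '.join('({} {})'.format(pos, token) for token, pos in tagged_tokens)
        phrases.append('({} {})'.format(phrase_label, leaves))
    return '({} {})'.format(root, ' '.join(phrases))


print(render_shallow_tree(chunks))
# (S (NP (DT The) (JJ brown) (NN fox)) (VP (VBZ is)) (ADJP (JJ quick)))
```

Reading the brackets from the inside out gives token, POS tag, phrase label, and finally the sentence root.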

Building your own shallow parser

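This section builds shallow parsers of our own using techniques such as regular expressions and tag-based learners. As with POS tagging earlier, we train the parser on training data where needed and then evaluate it on test data and on the example sentence. nltk ships the treebank corpus with chunk annotations. First, load the corpus and prepare the training and test sets with the following snippet: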

 
    from nltk.corpus import treebank_chunk
    data = treebank_chunk.chunked_sents()
    train_data = data[:4000]
    test_data = data[4000:]

 
    In [31]: print(train_data[7])
    (S
      (NP A/DT Lorillard/NNP spokewoman/NN)
      said/VBD
      ,/,
      ``/``
      (NP This/DT)
      is/VBZ
      (NP an/DT old/JJ story/NN)
      ./.)

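As the output shows, each data point is a sentence annotated with phrase and POS tags, which is what we need to train a shallow parser. We start with regular-expression-based shallow parsing, using the concepts of chunking and chinking. With chunking, we specify patterns that identify what we want to group into chunks in a sentence, for example phrases defined by metadata such as each token's POS tag. Chinking is the reverse: we specify the tokens that must not belong to any chunk, and the chunks are then formed from everything except those tokens. Consider a simple sentence. The RegexpParser class lets us build shallow parsers from regular expressions, which we use to illustrate chunking and chinking for noun phrases, as follows: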

 
    simple_sentence = 'the quick fox jumped over the lazy dog'
    from nltk.chunk import RegexpParser
    from pattern.en import tag
    tagged_simple_sent = tag(simple_sentence)

 
     
      
        
    In [36]: print(tagged_simple_sent)
    [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
         
 
      
 
     
   

 
    chunk_grammar = """
    NP: {<DT>?<JJ>*<NN.*>}
    """
    rc = RegexpParser(chunk_grammar)
    c = rc.parse(tagged_simple_sent)

 
    chink_grammar = """
    NP: {<.*>+} # chunk everything as NP
        }<VBD|IN>+{
    """
    rc = RegexpParser(chink_grammar)
    c = rc.parse(tagged_simple_sent)

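The two parses show that chunking and chinking yield the same result on this toy NP (noun phrase) shallow parser. Remember that chunks are the token sequences that are included in the collection of chunks, while chinks are the tokens or token sequences that are excluded from the chunks.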
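To make the equivalence concrete, here is a minimal sketch in plain Python that applies ordinary regular expressions to a string of POS tags. It only loosely mirrors what a tag-pattern parser does and is not a description of how nltk implements RegexpParser; the helper names tag_string, token_spans, chunk, and chink are hypothetical.

```python
import re

tagged = [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'),
          ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


def tag_string(tagged_tokens):
    """Encode the POS tags as one string, e.g. '<DT><JJ><NN>'."""
    return ''.join('<{}>'.format(pos) for _, pos in tagged_tokens)


def token_spans(pattern, tagged_tokens):
    """Return (start, end) token index pairs for every match of a tag pattern."""
    encoded = tag_string(tagged_tokens)
    spans = []
    for match in re.finditer(pattern, encoded):
        start = encoded.count('<', 0, match.start())  # tokens before the match
        end = start + match.group().count('<')        # tokens inside the match
        spans.append((start, end))
    return spans


def chunk(tagged_tokens):
    # Chunking: describe what belongs inside an NP.
    return token_spans(r'(?:<DT>)?(?:<JJ>)*<NN[^>]*>', tagged_tokens)


def chink(tagged_tokens):
    # Chinking: start from one big chunk and cut out runs of VBD/IN tokens.
    gaps = token_spans(r'(?:<(?:VBD|IN)>)+', tagged_tokens)
    spans, cursor = [], 0
    for start, end in gaps:
        if start > cursor:
            spans.append((cursor, start))
        cursor = end
    if cursor < len(tagged_tokens):
        spans.append((cursor, len(tagged_tokens)))
    return spans


for start, end in chunk(tagged):
    print(' '.join(word for word, _ in tagged[start:end]))
print(chunk(tagged) == chink(tagged))
```

Both functions return the same two token spans for this sentence, one for each noun phrase, which is the behaviour the two grammars above are meant to illustrate.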

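We now build a more general regular-expression-based shallow parser and test its performance on the treebank test data. Internally, several steps take place. First, the Tree structure that nltk uses to represent a parsed sentence is converted to a ChunkString object. Then a RegexpParser object is created from the defined chunking and chinking rules. Finally, the ChunkRule and ChinkRule classes and their objects are used to build the complete shallow parse tree with the required chunks. The following snippet shows the regular-expression-based shallow parser: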

 
     
      
        
    In [46]: tagged_sentence = tag(sentence)

    In [47]: print(tagged_sentence)
    [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
         
 
      
 
     
   

 
    grammar = """
    NP: {<DT>?<JJ>?<NN.*>}
    ADJP: {<JJ>}
    ADVP: {<RB.*>}
    PP: {<IN>}
    VP: {<MD>?<VB.*>+}
    """
    rc = RegexpParser(grammar)
    c = rc.parse(tagged_sentence)

 
    In [51]: print(c)
    (S
      (NP The/DT brown/JJ fox/NN)
      (VP is/VBZ)
      (ADJP quick/JJ)
      and/CC
      he/PRP
      (VP is/VBZ jumping/VBG)
      (PP over/IN)
      (NP the/DT lazy/JJ dog/NN))

 
    In [52]: print(rc.evaluate(test_data))
    ChunkParse score:
        IOB Accuracy:  54.5%%
        Precision:     25.0%%
        Recall:        52.5%%
        F-Measure:     33.9%%

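The parse tree for the example sentence is very similar to the one produced by the earlier parser. On the test data the IOB accuracy is 54.5%, which is a reasonable starting point, although precision is only 25.0%.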
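The precision, recall, and F-measure figures can be related to each other with a few lines of arithmetic. The sketch below assumes that chunks are compared as exact (label, start, end) spans and that the F-measure is the balanced harmonic mean of precision and recall; the function names and the gold and predicted sets are hypothetical illustration data, not taken from the treebank evaluation.

```python
def f_measure(precision, recall):
    """Balanced harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def chunk_scores(gold_chunks, predicted_chunks):
    """Score predicted chunks against gold chunks using exact span matches."""
    gold, predicted = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical chunks as (label, start_token, end_token).
gold = {('NP', 0, 3), ('VP', 3, 4), ('NP', 5, 8)}
predicted = {('NP', 0, 3), ('VP', 3, 5), ('PP', 4, 5), ('NP', 5, 8)}

precision, recall, f1 = chunk_scores(gold, predicted)
print(precision, round(recall, 3), round(f1, 3))

# Recomputing the F-measure from the rounded scores reported above.
print(round(f_measure(25.0, 52.5), 1))  # 33.9
```

Because the reported precision and recall are already rounded, recomputing the F-measure this way is only approximate in general, but here it agrees with the printed 33.9.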

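Recall that annotated text metadata is useful in many ways. Next, we use the chunked and tagged treebank training data to build a shallow parser. Two chunking utility functions are involved: tree2conlltags, which returns a triple of word, POS tag, and chunk tag for each token, and conlltags2tree, which builds a parse tree from those triples. We use these functions later to train the parser. First, let's see how the two functions work. Remember that the chunk tags use the IOB format mentioned earlier: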

 
    from nltk.chunk.util import tree2conlltags, conlltags2tree

 
    In [37]: train_sent = train_data[7]
        ...: print(train_sent)
    (S
      (NP A/DT Lorillard/NNP spokewoman/NN)
      said/VBD
      ,/,
      ``/``
      (NP This/DT)
      is/VBZ
      (NP an/DT old/JJ story/NN)
      ./.)

 
     
      
        
    In [38]: wtc = tree2conlltags(train_sent)
        ...: wtc
    Out[38]:
    [('A', 'DT', 'B-NP'),
     ('Lorillard', 'NNP', 'I-NP'),
     ('spokewoman', 'NN', 'I-NP'),
     ('said', 'VBD', 'O'),
     (',', ',', 'O'),
     ('``', '``', 'O'),
     ('This', 'DT', 'B-NP'),
     ('is', 'VBZ', 'O'),
     ('an', 'DT', 'B-NP'),
     ('old', 'JJ', 'I-NP'),
     ('story', 'NN', 'I-NP'),
     ('.', '.', 'O')]
         
 
      
 
     
   

 
    In [39]: tree = conlltags2tree(wtc)

    In [40]: print(tree)
    (S
      (NP A/DT Lorillard/NNP spokewoman/NN)
      said/VBD
      ,/,
      ``/``
      (NP This/DT)
      is/VBZ
      (NP an/DT old/JJ story/NN)
      ./.)
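The round trip between a chunked sentence and IOB triples can also be sketched in plain Python. This is an illustrative approximation of the idea behind tree2conlltags and conlltags2tree, not their implementation; the list-based sentence representation and the names chunks_to_iob and iob_to_chunks are assumptions made for this example.

```python
def chunks_to_iob(sentence):
    """Flatten a chunked sentence into (word, pos, iob_tag) triples.

    Each item is either (word, pos) for a token outside any chunk, or
    (label, [(word, pos), ...]) for a chunk.
    """
    triples = []
    for item in sentence:
        if isinstance(item[1], list):
            label, tokens = item
            for i, (word, pos) in enumerate(tokens):
                prefix = 'B-' if i == 0 else 'I-'  # B- opens a chunk, I- continues it
                triples.append((word, pos, prefix + label))
        else:
            word, pos = item
            triples.append((word, pos, 'O'))       # O marks tokens outside chunks
    return triples


def iob_to_chunks(triples):
    """Rebuild the chunked sentence from (word, pos, iob_tag) triples."""
    sentence = []
    for word, pos, iob in triples:
        if iob == 'O':
            sentence.append((word, pos))
            continue
        label = iob[2:]
        last = sentence[-1] if sentence else None
        continues = (iob.startswith('I-') and last is not None
                     and isinstance(last[1], list) and last[0] == label)
        if continues:
            last[1].append((word, pos))
        else:
            sentence.append((label, [(word, pos)]))
    return sentence


chunked = [
    ('NP', [('A', 'DT'), ('Lorillard', 'NNP'), ('spokewoman', 'NN')]),
    ('said', 'VBD'),
    ('NP', [('This', 'DT')]),
    ('is', 'VBZ'),
]
triples = chunks_to_iob(chunked)
print(triples)
print(iob_to_chunks(triples) == chunked)
```

The B- prefix is what keeps two adjacent chunks of the same type apart when the triples are converted back.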

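Now that we know how these functions work, we define a function conll_tag_chunks() that extracts the POS and chunk tags from chunk-annotated sentences. We also reuse the combined_tagger() function from the POS tagging section, which trains a combined tagger made of several taggers with backoff, as shown in the following snippet: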

 
    def conll_tag_chunks(chunk_sents):
        tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
        return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

    def combined_tagger(train_data, taggers, backoff=None):
        for tagger in taggers:
            backoff = tagger(train_data, backoff=backoff)
        return backoff

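Next, we define a class NGramTagChunker that takes tagged sentences as training input, gets their WTC triples, that is (word, POS tag, chunk tag), and trains a BigramTagger with a UnigramTagger as its backoff tagger. We also define a parse() function to shallow-parse new sentences: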

 
    from nltk.tag import UnigramTagger, BigramTagger
    from nltk.chunk import ChunkParserI

    class NGramTagChunker(ChunkParserI):

        def __init__(self, train_sentences,
                     tagger_classes=[UnigramTagger, BigramTagger]):
            train_sent_tags = conll_tag_chunks(train_sentences)
            self.chunk_tagger = combined_tagger(train_sent_tags, tagger_classes)

        def parse(self, tagged_sentence):
            if not tagged_sentence:
                return None
            pos_tags = [tag for word, tag in tagged_sentence]
            chunk_pos_tags = self.chunk_tagger.tag(pos_tags)
            chunk_tags = [chunk_tag for (pos_tag, chunk_tag) in chunk_pos_tags]
            wpc_tags = [(word, pos_tag, chunk_tag) for ((word, pos_tag), chunk_tag)
                        in zip(tagged_sentence, chunk_tags)]
            return conlltags2tree(wpc_tags)

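In the class above, the constructor trains the shallow parser with n-gram tagging based on the WTC triples of each sentence. Internally, it takes a list of training sentences as input, each annotated with chunked parse tree metadata. It uses the conll_tag_chunks() function defined earlier to get the list of (POS tag, chunk tag) pairs from the WTC triples of every chunked parse tree. It then uses these pairs to train a Bigram tagger with a Unigram tagger as its backoff tagger, and stores the trained model in self.chunk_tagger. Remember that the tagger_classes parameter lets you train with other n-gram taggers. Once trained, the parse() function can be used to evaluate the tagger on test data and to shallow-parse new sentences. Internally, it takes a POS-tagged sentence as input, separates out the POS tags, and uses the trained self.chunk_tagger to get the IOB chunk tags for the sentence. These are then combined with the original sentence tokens, and the conlltags2tree() function produces the final shallow parse tree.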
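The backoff idea can be shown in isolation with a small plain-Python sketch. It assumes a bigram context made of the previous chunk tag plus the current POS tag, falling back to a per-POS-tag lookup and then to a default tag; the class names and the tiny training set are hypothetical and only approximate what the nltk taggers do.

```python
from collections import Counter, defaultdict


class UnigramChunkTagger:
    """Maps each POS tag to the chunk tag it most often carried in training."""

    def __init__(self, train_sents, default='O'):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for pos, chunk_tag in sent:
                counts[pos][chunk_tag] += 1
        self.table = {pos: c.most_common(1)[0][0] for pos, c in counts.items()}
        self.default = default

    def tag_one(self, pos, previous_chunk_tag):
        return self.table.get(pos, self.default)


class BigramChunkTagger:
    """Uses (previous chunk tag, POS tag) as context and backs off when unseen."""

    def __init__(self, train_sents, backoff):
        counts = defaultdict(Counter)
        for sent in train_sents:
            previous = None
            for pos, chunk_tag in sent:
                counts[(previous, pos)][chunk_tag] += 1
                previous = chunk_tag
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}
        self.backoff = backoff

    def tag_one(self, pos, previous_chunk_tag):
        context = (previous_chunk_tag, pos)
        if context in self.table:
            return self.table[context]
        return self.backoff.tag_one(pos, previous_chunk_tag)  # unseen context

    def tag(self, pos_tags):
        tagged, previous = [], None
        for pos in pos_tags:
            chunk_tag = self.tag_one(pos, previous)
            tagged.append((pos, chunk_tag))
            previous = chunk_tag
        return tagged


# Hypothetical training data: sentences of (POS tag, chunk tag) pairs.
train = [
    [('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('VBZ', 'O'), ('JJ', 'O')],
    [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'O'),
     ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP')],
]
tagger = BigramChunkTagger(train, backoff=UnigramChunkTagger(train))
print(tagger.tag(['DT', 'JJ', 'NN', 'VBZ', 'JJ']))
print(tagger.tag(['NN', 'VBD', 'PRP']))
```

In the first call the bigram context tags the final JJ as O, whereas the unigram table alone would choose I-NP; in the second call the unseen contexts fall through to the unigram table and then to the default tag.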

The following snippet shows the parser in action:

 
    In [43]: ntc = NGramTagChunker(train_data)
        ...: print(ntc.evaluate(test_data))
    ChunkParse score:
        IOB Accuracy:  99.6%%
        Precision:     98.4%%
        Recall:       100.0%%
        F-Measure:     99.2%%

 
    In [45]: tree = ntc.parse(tagged_sentence)
        ...: print(tree)
    (S
      (NP The/DT brown/JJ fox/NN)
      is/VBZ
      (NP quick/JJ)
      and/CC
      (NP he/PRP)
      is/VBZ
      jumping/VBG
      over/IN
      (NP the/DT lazy/JJ dog/NN))

As the output shows, the parser reaches an overall IOB accuracy of 99.6% on the treebank test data.

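Now let's train and evaluate the parser on the conll2000 corpus. It is a larger corpus that contains excerpts from the Wall Street Journal. We train the parser on the first 7,500 sentences and test its performance on the remaining 3,448 sentences: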

 
    from nltk.corpus import conll2000
    wsj_data = conll2000.chunked_sents()
    train_wsj_data = wsj_data[:7500]
    test_wsj_data = wsj_data[7500:]
    print(train_wsj_data[10])

 
    In [46]: print(train_wsj_data[10])
    (S
      (NP He/PRP)
      (VP reckons/VBZ)
      (NP the/DT current/JJ account/NN deficit/NN)
      (VP will/MD narrow/VB)
      (PP to/TO)
      (NP only/RB #/# 1.8/CD billion/CD)
      (PP in/IN)
      (NP September/NNP)
      ./.)

 
    In [47]: tc = NGramTagChunker(train_wsj_data)
        ...: print(tc.evaluate(test_wsj_data))
    ChunkParse score:
        IOB Accuracy:  89.4%%
        Precision:     80.8%%
        Recall:        86.0%%
        F-Measure:     83.3%%

The output shows that the parser's overall IOB accuracy on this corpus is about 89%.

Dependency-based parsing

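In dependency-based parsing, we use a dependency grammar to analyze and infer the structural and semantic relationships between the tokens of a sentence. A dependency grammar lets us annotate a sentence with dependency labels. A dependency label is a one-to-one mapping between tokens that marks the dependency between them. A dependency parse tree is a labeled, directed tree or graph that represents the sentence more precisely. The nodes of the tree are always lexical tokens, and each labeled edge represents the dependency between a head and its dependent (the token that depends on the head). The label on a directed edge indicates the grammatical role of the dependency.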

Recommended dependency parsers

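We use a couple of libraries to generate dependency parse trees and try them on the example sentence. First, we use the spacy library to analyze the example sentence and generate every token together with its dependencies.

The following snippet shows how to get the dependencies of each token in the example sentence: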

 
    sentence = 'The brown fox is quick and he is jumping over the lazy dog'
    from spacy.lang.en import English
    # Note: English() creates a blank pipeline (tokenizer only). The dep_ labels
    # are only filled in by a trained model, e.g. one loaded with spacy.load('en').
    parser = English()
    parsed_sent = parser(sentence)
    dependency_pattern = '{left}<---{word}[{w_type}]--->{right}\n--------'
    for token in parsed_sent:
        print(dependency_pattern.format(word=token.orth_,
                                        w_type=token.dep_,
                                        left=[t.orth_ for t in token.lefts],
                                        right=[t.orth_ for t in token.rights]))

Output:

If the following error occurs:

 
    ...
    ModuleNotFoundError: No module named 'spacy.en'

run:

 
    $ python -m spacy download en
    >>> import spacy
    >>> nlp = spacy.load('en')

 
    import os
    java_path = '/usr/local/jdk/bin/java'
    os.environ['JAVAHOME'] = java_path
    from nltk.parse.stanford import StanfordDependencyParser
    from nltk.parse.corenlp import CoreNLPDependencyParser
    sdp = StanfordDependencyParser(
        path_to_jar='/root/stanford-parser-full-2018-02-27/stanford-parser.jar',
        path_to_models_jar='/root/stanford-parser-full-2018-02-27/stanford-english-corenlp-2018-02-27-models.jar')
    result = list(sdp.raw_parse(sentence))

 
    In [43]: result[0]
    Out[43]: <DependencyGraph with 14 nodes>

 
     
      
        
    In [44]: [item for item in result[0].triples()]
    Out[44]:
    [(('quick', 'JJ'), 'nsubj', ('fox', 'NN')),
     (('fox', 'NN'), 'det', ('The', 'DT')),
     (('fox', 'NN'), 'amod', ('brown', 'JJ')),
     (('quick', 'JJ'), 'cop', ('is', 'VBZ')),
     (('quick', 'JJ'), 'cc', ('and', 'CC')),
     (('quick', 'JJ'), 'conj', ('jumping', 'VBG')),
     (('jumping', 'VBG'), 'nsubj', ('he', 'PRP')),
     (('jumping', 'VBG'), 'aux', ('is', 'VBZ')),
     (('jumping', 'VBG'), 'nmod', ('dog', 'NN')),
     (('dog', 'NN'), 'case', ('over', 'IN')),
     (('dog', 'NN'), 'det', ('the', 'DT')),
     (('dog', 'NN'), 'amod', ('lazy', 'JJ'))]
         
 
      
 
     
   

 
    In [49]: dep_tree = [parse.tree() for parse in result][0]

    In [50]: print(dep_tree)
    (quick (fox The brown) is and (jumping he is (dog over the lazy)))

    In [51]: dep_tree.draw()

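The output above shows how easily a dependency parse tree can be generated for the example sentence, and how the relationships between its tokens can be analyzed. The Stanford parser is powerful and stable, and it integrates well with nltk. Note that graphviz must be installed to draw the graph.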
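The link between the (head, relation, dependent) triples and the bracketed tree can be sketched in plain Python. This is an illustration only: it keys nodes by (word, tag), which is a simplifying assumption that works here because no repeated word acts as a head, whereas a real dependency graph identifies tokens by position. The function names are hypothetical.

```python
# Triples in the (head, relation, dependent) form shown above.
triples = [
    (('quick', 'JJ'), 'nsubj', ('fox', 'NN')),
    (('fox', 'NN'), 'det', ('The', 'DT')),
    (('fox', 'NN'), 'amod', ('brown', 'JJ')),
    (('quick', 'JJ'), 'cop', ('is', 'VBZ')),
    (('quick', 'JJ'), 'cc', ('and', 'CC')),
    (('quick', 'JJ'), 'conj', ('jumping', 'VBG')),
    (('jumping', 'VBG'), 'nsubj', ('he', 'PRP')),
    (('jumping', 'VBG'), 'aux', ('is', 'VBZ')),
    (('jumping', 'VBG'), 'nmod', ('dog', 'NN')),
    (('dog', 'NN'), 'case', ('over', 'IN')),
    (('dog', 'NN'), 'det', ('the', 'DT')),
    (('dog', 'NN'), 'amod', ('lazy', 'JJ')),
]


def build_children(triples):
    """Group dependents under their head, keeping the order of the triples."""
    children = {}
    for head, relation, dependent in triples:
        children.setdefault(head, []).append((relation, dependent))
    return children


def find_root(triples):
    """The root is the head that never appears as a dependent."""
    dependents = {dependent for _, _, dependent in triples}
    return [head for head, _, _ in triples if head not in dependents][0]


def render_tree(node, children):
    """Render the subtree under node as a bracketed string of words."""
    word = node[0]
    if node not in children:
        return word
    parts = [render_tree(dep, children) for _, dep in children[node]]
    return '({} {})'.format(word, ' '.join(parts))


children = build_children(triples)
root = find_root(triples)
print(root)
print(render_tree(root, children))
```

The rendered string lists each head followed by its dependents, which is the same nesting that the printed dependency tree uses.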

Building your own dependency parser

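Building a dependency parser from scratch is not easy: it requires a large amount of data, and checking against grammar production rules alone is not always a good way to evaluate a parser. The following snippet builds a toy dependency parser from a handful of hand-written rules for the example sentence.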

 
    import nltk
    tokens = nltk.word_tokenize(sentence)
    dependency_rules = """
    'fox' -> 'The' | 'brown'
    'quick' -> 'fox' | 'is' | 'and' | 'jumping'
    'jumping' -> 'he' | 'is' | 'dog'
    'dog' -> 'over' | 'the' | 'lazy'
    """
    dependency_grammar = nltk.grammar.DependencyGrammar.fromstring(dependency_rules)

 
    In [62]: print(dependency_grammar)
    Dependency grammar with 12 productions
      'fox' -> 'The'
      'fox' -> 'brown'
      'quick' -> 'fox'
      'quick' -> 'is'
      'quick' -> 'and'
      'quick' -> 'jumping'
      'jumping' -> 'he'
      'jumping' -> 'is'
      'jumping' -> 'dog'
      'dog' -> 'over'
      'dog' -> 'the'
      'dog' -> 'lazy'

 
    dp = nltk.ProjectiveDependencyParser(dependency_grammar)
    res = [item for item in dp.parse(tokens)]
    tree = res[0]

 
    In [64]: print(tree)
    (quick (fox The brown) is and (jumping he is (dog over the lazy)))

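As you can see, this dependency parse tree is the same as the one generated by the Stanford parser. You can also call tree.draw() to visualize the tree and compare it with the previous one. Scaling such a parser is always a challenge, and large projects devote a great deal of effort to building systems based on dependency grammar rules. Examples include the Lexical Functional Grammar (LFG) Pargram project and the Lexicalized Tree Adjoining Grammar XTAG project.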

Constituency-based parsing

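Constituency-based grammars are commonly used to analyze and determine the constituents of a sentence. Another important use of these grammars is to find the internal structure of the constituents and the relationships between them. For each type of phrase there are usually several phrase rules, depending on the kinds of components the phrase can contain, and these rules are used to build parse trees. Refer to the earlier parse tree examples if you need a refresher.

A constituency-based grammar helps us break a sentence down into its constituents. These constituents can then be broken down into finer subdivisions, and the process repeats until every constituent has been reduced to individual tokens or words. These grammars have various production rules; in general, a context-free grammar (CFG) or a phrase structure grammar is sufficient for this.

Once we have a set of grammar rules, we can build a constituency parser that processes input sentences according to these rules and helps us build parse trees. The parser is what brings the grammar to life; it can be seen as the programmatic interpretation of the grammar. There are various types of parsing algorithms, including the following: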

Recursive descent parsing.
Shift-reduce parsing.
Chart parsing.
Bottom-up parsing.
Top-down parsing.
PCFG parsing.

For more details, see: http://www.nltk.org/book/ch08.html

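Later, when we build our own parser, we briefly cover some of these parsers and focus on PCFG parsing. Recursive descent parsing usually follows a top-down approach: it reads tokens from the input sentence and tries to match them against the terminals of the grammar production rules. It always looks ahead one token and advances the input read pointer on every match.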

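Shift-reduce parsing follows a bottom-up approach: it finds a sequence of tokens (words or phrases) that matches the right-hand side of a grammar production rule and replaces it with the symbol on the left-hand side of that rule. This continues until the whole sentence has been reduced to just the items needed to form the parse tree.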
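A greedy version of this procedure fits in a few lines of plain Python. The sketch below assumes a hypothetical three-rule grammar over POS tags and always reduces when it can, with no backtracking, so it is far simpler than a practical shift-reduce parser; all names are illustrative.

```python
# Hypothetical toy grammar over POS tags: (left-hand side, right-hand side).
GRAMMAR = [
    ('S', ('NP', 'VP')),
    ('NP', ('DT', 'NN')),
    ('VP', ('VBD', 'NP')),
]


def label(node):
    """A stack item is either a bare tag or a (label, children) subtree."""
    return node if isinstance(node, str) else node[0]


def try_reduce(stack):
    """Replace the top of the stack with a rule's left side if its right side matches."""
    for lhs, rhs in GRAMMAR:
        n = len(rhs)
        if len(stack) >= n and tuple(label(x) for x in stack[-n:]) == rhs:
            children = stack[-n:]
            del stack[-n:]
            stack.append((lhs, children))
            return True
    return False


def shift_reduce(tags):
    """Greedy shift-reduce: reduce while possible, otherwise shift the next tag."""
    stack, queue = [], list(tags)
    while True:
        while try_reduce(stack):
            pass
        if not queue:
            return stack
        stack.append(queue.pop(0))  # shift


def render(node):
    if isinstance(node, str):
        return node
    return '({} {})'.format(node[0], ' '.join(render(child) for child in node[1]))


result = shift_reduce(['DT', 'NN', 'VBD', 'DT', 'NN'])
print(render(result[0]))
# (S (NP DT NN) (VP VBD (NP DT NN)))
```

A parse succeeds when the stack ends with a single S item; an input such as ['DT', 'VBD'] leaves two unreduced items on the stack instead.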

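Chart parsing uses dynamic programming: it stores intermediate results and reuses them when needed, which yields significant efficiency gains. The chart parser stores partial solutions and looks them up when needed to obtain the complete solution.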
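One classic way to realize this idea is a CYK-style chart, shown here as a minimal plain-Python sketch. It is a swapped-in illustration rather than the chart parser nltk provides, and it assumes a hypothetical grammar of binary rules over POS tags in which each pair of symbols has at most one parent.

```python
# Hypothetical binary rules: (left child, right child) -> parent.
BINARY_RULES = {
    ('NP', 'VP'): 'S',
    ('DT', 'NN'): 'NP',
    ('VBD', 'NP'): 'VP',
}


def cyk_chart(tags):
    """Fill a chart mapping each (start, end) span to the symbols that cover it."""
    n = len(tags)
    chart = {}
    for i, tag in enumerate(tags):
        chart[(i, i + 1)] = {tag}                 # spans of width 1 are the tags
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            cell = set()
            for split in range(start + 1, end):
                # Reuse the stored results for the two shorter sub-spans.
                for left in chart[(start, split)]:
                    for right in chart[(split, end)]:
                        parent = BINARY_RULES.get((left, right))
                        if parent:
                            cell.add(parent)
            chart[(start, end)] = cell
    return chart


tags = ['DT', 'NN', 'VBD', 'DT', 'NN']
chart = cyk_chart(tags)
print(chart[(0, 2)], chart[(2, 5)], chart[(0, len(tags))])
print('S' in chart[(0, len(tags))])
```

Each cell is computed once from shorter spans that are already in the chart, which is where the efficiency gain over repeated re-parsing comes from.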

Recommended constituency parsers

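Here we use nltk and the StanfordParser to generate parse trees. Before running the code to parse the example sentence, first set the Java path; we then display and visualize the parse tree: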

 
    import os
    java_path = '/usr/local/jdk/bin/java'
    os.environ['JAVAHOME'] = java_path
    # This listing instantiates StanfordDependencyParser, so result[0] below is a
    # DependencyGraph; nltk.parse.stanford.StanfordParser is the constituency parser class.
    from nltk.parse.stanford import StanfordDependencyParser
    sdp = StanfordDependencyParser(
        path_to_jar='/root/stanford-parser-full-2018-02-27/stanford-parser.jar',
        path_to_models_jar='/root/stanford-parser-full-2018-02-27/stanford-english-corenlp-2018-02-27-models.jar')
    result = list(sdp.raw_parse(sentence))


 
     
      
        
    In [67]: print(result[0])
    defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x7f91506c17b8>,
                {0: {'address': 0,
                     'ctag': 'TOP',
                     'deps': defaultdict(<class 'list'>, {'root': [5]}),
                     'feats': None,
                     'head': None,
                     'lemma': None,
                     'rel': None,
                     'tag': 'TOP',
                     'word': None},
                 1: {'address': 1,
                     'ctag': 'DT',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 3,
                     'lemma': '_',
                     'rel': 'det',
                     'tag': 'DT',
                     'word': 'The'},
                 2: {'address': 2,
                     'ctag': 'JJ',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 3,
                     'lemma': '_',
                     'rel': 'amod',
                     'tag': 'JJ',
                     'word': 'brown'},
                 3: {'address': 3,
                     'ctag': 'NN',
                     'deps': defaultdict(<class 'list'>, {'det': [1], 'amod': [2]}),
                     'feats': '_',
                     'head': 5,
                     'lemma': '_',
                     'rel': 'nsubj',
                     'tag': 'NN',
                     'word': 'fox'},
         
 
                        
          4 
          : { 
          'address' 
          :  
          4 
          , 
         
 
                            
          'ctag' 
          :  
          'VBZ' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          5 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'cop' 
          , 
         
 
                            
          'tag' 
          :  
          'VBZ' 
          , 
         
 
                            
          'word' 
          :  
          'is' 
          }, 
         
 
                        
          5 
          : { 
          'address' 
          :  
          5 
          , 
         
 
                            
          'ctag' 
          :  
          'JJ' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, 
         
 
                                                
          { 
          'cc' 
          : [ 
          6 
          ], 
         
 
                                                 
          'conj' 
          : [ 
          9 
          ], 
         
 
                                                 
          'cop' 
          : [ 
          4 
          ], 
         
 
                                                 
          'nsubj' 
          : [ 
          3 
          ]}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          0 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'root' 
          , 
         
 
                            
          'tag' 
          :  
          'JJ' 
          , 
         
 
                            
          'word' 
          :  
          'quick' 
          }, 
         
 
                        
          6 
          : { 
          'address' 
          :  
          6 
          , 
         
 
                            
          'ctag' 
          :  
          'CC' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          5 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'cc' 
          , 
         
 
                            
          'tag' 
          :  
          'CC' 
          , 
         
 
                            
          'word' 
          :  
          'and' 
          }, 
         
 
                        
          7 
          : { 
          'address' 
          :  
          7 
          , 
         
 
                            
          'ctag' 
          :  
          'PRP' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          9 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'nsubj' 
          , 
         
 
                            
          'tag' 
          :  
          'PRP' 
          , 
         
 
                            
          'word' 
          :  
          'he' 
          }, 
         
 
                        
          8 
          : { 
          'address' 
          :  
          8 
          , 
         
 
                            
          'ctag' 
          :  
          'VBZ' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          9 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'aux' 
          , 
         
 
                            
          'tag' 
          :  
          'VBZ' 
          , 
         
 
                            
          'word' 
          :  
          'is' 
          }, 
         
 
                        
          9 
          : { 
          'address' 
          :  
          9 
          , 
         
 
                            
          'ctag' 
          :  
          'VBG' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, 
         
 
                                                
          { 
          'aux' 
          : [ 
          8 
          ], 
         
 
                                                 
          'nmod' 
          : [ 
          13 
          ], 
         
 
                                                 
          'nsubj' 
          : [ 
          7 
          ]}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          5 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'conj' 
          , 
         
 
                            
          'tag' 
          :  
          'VBG' 
          , 
         
 
                            
          'word' 
          :  
          'jumping' 
          }, 
         
 
                        
          10 
          : { 
          'address' 
          :  
          10 
          , 
         
 
                             
          'ctag' 
          :  
          'IN' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          13 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'case' 
          , 
         
 
                             
          'tag' 
          :  
          'IN' 
          , 
         
 
                             
          'word' 
          :  
          'over' 
          }, 
         
 
                        
          11 
          : { 
          'address' 
          :  
          11 
          , 
         
 
                             
          'ctag' 
          :  
          'DT' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          13 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'det' 
          , 
         
 
                             
          'tag' 
          :  
          'DT' 
          , 
         
 
                             
          'word' 
          :  
          'the' 
          }, 
         
 
                        
          12 
          : { 
          'address' 
          :  
          12 
          , 
         
 
                             
          'ctag' 
          :  
          'JJ' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          13 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'amod' 
          , 
         
 
                             
          'tag' 
          :  
          'JJ' 
          , 
         
 
                             
          'word' 
          :  
          'lazy' 
          }, 
         
 
                        
          13 
          : { 
          'address' 
          :  
          13 
          , 
         
 
                             
          'ctag' 
          :  
          'NN' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, 
         
 
                                                 
          { 
          'amod' 
          : [ 
          12 
          ], 
         
 
                                                  
          'case' 
          : [ 
          10 
          ], 
         
 
                                                  
          'det' 
          : [ 
          11 
          ]}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          9 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'nmod' 
          , 
         
 
                             
          'tag' 
          :  
          'NN' 
          , 
         
 
                             
          'word' 
          :  
          'dog' 
          }}) 
         
 
      
 
     
   

Building your own constituency parser

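There are many ways to build your own constituency parser, including writing CFG production rules and then building a parser from that grammar. To build your own CFG, you can pass the production rules to the nltk.CFG.fromstring function and then use a ChartParser or a RecursiveDescentParser (both are part of the nltk package).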
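As a minimal sketch of the CFG route described above, the snippet below feeds a handful of production rules to nltk.CFG.fromstring and parses a short sentence with nltk.ChartParser. The grammar, the variable names and the sample tokens are made up for this illustration and are not part of the original article.

```python
import nltk

# A toy grammar written only for this illustration (an assumption,
# not the grammar used elsewhere in the article).
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN | DT JJ NN
VP -> VBZ JJ
DT -> 'the'
JJ -> 'brown' | 'quick'
NN -> 'fox'
VBZ -> 'is'
""")

chart_parser = nltk.ChartParser(toy_grammar)
toy_tokens = "the brown fox is quick".split()

# parse() yields every tree that the grammar licenses for the tokens.
toy_trees = list(chart_parser.parse(toy_tokens))
for tree in toy_trees:
    print(tree)
```

A hand-written grammar like this only covers the words and structures it lists, which is one reason the rest of this section induces a grammar from a treebank instead.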

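We will aim for a constituency parser that scales well and runs efficiently. The problem with regular CFG parsers (such as chart parsers and recursive descent parsers) is that they are easily overwhelmed by the sheer number of possible parses for a sentence, which makes them very slow. This is where weighted grammars such as PCFG (Probabilistic Context-Free Grammar) and probabilistic parsers such as the Viterbi parser prove more effective in practice. A PCFG is a context-free grammar that associates a probability with each production rule. The probability of a parse tree generated from a PCFG is the product of the probabilities of the productions used to build it.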
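The product rule can be illustrated with a few lines of plain Python. The sketch below uses hypothetical rule probabilities and represents a tree as nested tuples of the form (label, child, ...); neither the numbers nor the helper name come from the article.

```python
from math import prod

# Hypothetical rule probabilities, chosen only for this illustration.
# For each left-hand side they sum to 1.
rule_probs = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("PRP",)): 0.4,
    ("VP", ("VBZ", "JJ")): 1.0,
    ("DT", ("the",)): 1.0,
    ("NN", ("fox",)): 0.5,
    ("NN", ("dog",)): 0.5,
    ("PRP", ("he",)): 1.0,
    ("VBZ", ("is",)): 1.0,
    ("JJ", ("quick",)): 0.25,
    ("JJ", ("lazy",)): 0.75,
}


def tree_probability(tree):
    """Multiply the probabilities of every production used in the tree."""
    if isinstance(tree, str):  # a leaf word contributes no production
        return 1.0
    label, *children = tree
    # The right-hand side is the sequence of child labels (or words).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return rule_probs[(label, rhs)] * prod(tree_probability(c) for c in children)


toy_tree = ("S",
            ("NP", ("DT", "the"), ("NN", "fox")),
            ("VP", ("VBZ", "is"), ("JJ", "quick")))
print(tree_probability(toy_tree))
```

For this tree the only factors below 1 are NP -> DT NN (0.6), NN -> 'fox' (0.5) and JJ -> 'quick' (0.25), so the product is 0.6 × 0.5 × 0.25 = 0.075.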

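Below we use nltk's ViterbiParser to train a parser on the treebank corpus, which provides an annotated parse tree for every sentence. This parser is a bottom-up PCFG parser that uses dynamic programming to find the most likely parse at each step. We start building the parser by loading the necessary training data and dependencies: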

 
    import nltk
    from nltk.grammar import Nonterminal
    from nltk.corpus import treebank

    training_set = treebank.parsed_sents()

 
    In [71]: print(training_set[1])
    (S
      (NP-SBJ (NNP Mr.) (NNP Vinken))
      (VP
        (VBZ is)
        (NP-PRD
          (NP (NN chairman))
          (PP
            (IN of)
            (NP
              (NP (NNP Elsevier) (NNP N.V.))
              (, ,)
              (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
      (. .))
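To make the dynamic programming idea behind a Viterbi-style PCFG parser more concrete, here is a rough, self-contained sketch. It is not nltk's ViterbiParser implementation: it assumes a toy grammar in Chomsky normal form with hypothetical probabilities, and it fills a table of best sub-parses for ever longer spans (a CKY-style search).

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form; all rules and probabilities are hypothetical.
binary_rules = {            # (B, C) -> [(A, P(A -> B C)), ...]
    ("NP", "VP"): [("S", 1.0)],
    ("DT", "NN"): [("NP", 1.0)],
    ("VBZ", "JJ"): [("VP", 1.0)],
}
lexical_rules = {           # word -> [(A, P(A -> word)), ...]
    "the": [("DT", 1.0)],
    "fox": [("NN", 0.5)],
    "dog": [("NN", 0.5)],
    "is": [("VBZ", 1.0)],
    "quick": [("JJ", 1.0)],
}


def viterbi_parse(tokens, start="S"):
    """Return (probability, tree) of the most likely parse, or None."""
    n = len(tokens)
    # best[(i, j)][A] = (probability, tree) of the best A spanning tokens[i:j]
    best = defaultdict(dict)
    for i, word in enumerate(tokens):
        for label, p in lexical_rules.get(word, []):
            best[(i, i + 1)][label] = (p, (label, word))
    for width in range(2, n + 1):              # grow spans bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # try every split point
                for left, (pl, tl) in best[(i, k)].items():
                    for right, (pr, tr) in best[(k, j)].items():
                        for parent, p in binary_rules.get((left, right), []):
                            cand = p * pl * pr
                            # keep only the most probable analysis per label
                            if cand > best[(i, j)].get(parent, (0.0, None))[0]:
                                best[(i, j)][parent] = (cand, (parent, tl, tr))
    return best[(0, n)].get(start)


print(viterbi_parse("the fox is quick".split()))
```

Because each table cell keeps only the best analysis per label, the search never enumerates all parses, which is the property that makes probabilistic parsing cheaper than exhaustive CFG parsing. A sentence containing a word outside lexical_rules simply yields None in this sketch.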

Now we extract the rules from the tagged and annotated training sentences and build the production rules of the grammar:

 
    In [72]: treebank_productions = list(
        ...:                         set(production
        ...:                             for sent in training_set
        ...:                             for production in sent.productions()
        ...:                         )
        ...:                     )
        ...:
        ...: treebank_productions[0:10]
        ...:
    Out[72]:
    [NN -> 'literature',
     NN -> 'turnover',
     NP -> DT NNP NNP JJ NN NN,
     VP -> VB NP PP-CLR ADVP-LOC,
     CD -> '2.375',
     S-HLN -> NP-SBJ-1 VP :,
     VBN -> 'Estimated',
     NN -> 'Power',
     NNS -> 'constraints',
     NNP -> 'Wellcome']

 
    # add productions for each word, POS tag
    for word, tag in treebank.tagged_words():
        t = nltk.Tree.fromstring("(" + tag + " " + word + ")")
        for production in t.productions():
            treebank_productions.append(production)

    # build the PCFG based grammar
    treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'),
                                                treebank_productions)
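The induce_pcfg call turns a list of productions into a weighted grammar. According to its documentation, the probability of a production is its count divided by the count of its left-hand side. The sketch below reproduces that relative-frequency estimate in plain Python on hypothetical (lhs, rhs) pairs; the helper name and the data are illustrative assumptions, not nltk code.

```python
from collections import Counter


def estimate_rule_probabilities(productions):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(productions)
    lhs_counts = Counter(lhs for lhs, _ in productions)
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Hypothetical observed productions, written as (lhs, rhs) pairs.
observed = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "JJ", "NN")),
    ("NN", ("fox",)),
    ("NN", ("dog",)),
    ("NN", ("dog",)),
    ("NN", ("dog",)),
]
estimated = estimate_rule_probabilities(observed)
for rule, p in sorted(estimated.items()):
    print(rule, round(p, 3))
```

One consequence worth keeping in mind: the earlier listing collapses the tree productions with set(), so under this kind of estimate each distinct tree production is counted once, while the word-level productions appended afterwards are counted once per occurrence in the corpus.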

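Now that we have the grammar and its weighted productions, we use the following snippet to create our own parser from the grammar and then try it on the sample sentence: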

 
     
      
        
    In [74]: # build the parser
        ...: viterbi_parser = nltk.ViterbiParser(treebank_grammar)
        ...:
        ...: # get sample sentence tokens
        ...: tokens = nltk.word_tokenize(sentence)
        ...:
        ...: # get parse tree for sample sentence
        ...: result = list(viterbi_parser.parse(tokens))
        ...:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-74-78944f4bb64d> in <module>()
          6
          7 # get parse tree for sample sentence
    ----> 8 result = list(viterbi_parser.parse(tokens))

    /usr/local/lib/python3.6/dist-packages/nltk/parse/viterbi.py in parse(self, tokens)
        110
        111         tokens = list(tokens)
    --> 112         self._grammar.check_coverage(tokens)
        113
        114         # The most likely constituent table.  This table specifies the

    /usr/local/lib/python3.6/dist-packages/nltk/grammar.py in check_coverage(self, tokens)
        658             missing = ', '.join('%r' % (w,) for w in missing)
        659             raise ValueError("Grammar does not cover some of the "
    --> 660                              "input words: %r." % missing)
        661
        662     def _calculate_grammar_forms(self):

    ValueError: Grammar does not cover some of the input words: "'brown', 'fox', 'lazy', 'dog'".
         
 
      
 
     
   

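Unfortunately, we get an error when we try to parse the tokens of the sample sentence with the newly built parser. The cause is clear: some words in the sample sentence are not covered by the treebank-based grammar, because they do not occur in our treebank corpus. Since this grammar uses POS tags and phrase tags to build parse trees from the training data, we add the tokens and POS tags of the sample sentence to the grammar and then rebuild the parser: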

 
     
      
        
    In [75]: # get tokens and their POS tags
        ...: from pattern.en import tag as pos_tagger
        ...: tagged_sent = pos_tagger(sentence)
        ...:
        ...: print(tagged_sent)
        ...:
    [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
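Before extending the grammar, it can help to confirm which tokens are actually missing from its lexicon. The following is a minimal plain-Python sketch of such a coverage check. The helper name and the tiny lexicon are hypothetical stand-ins for the grammar's word-level productions, and the real check in nltk is the check_coverage method visible in the traceback above.

```python
def uncovered_words(tokens, lexical_productions):
    """Return the input tokens for which no (tag, word) production exists."""
    known_words = {word for _tag, word in lexical_productions}
    return [tok for tok in tokens if tok not in known_words]


# Hypothetical lexicon standing in for the grammar's word-level productions.
toy_lexicon = [("DT", "The"), ("VBZ", "is"), ("JJ", "quick"), ("CC", "and")]
toy_sentence = ["The", "brown", "fox", "is", "quick"]

missing = uncovered_words(toy_sentence, toy_lexicon)
print(missing)
```

Every word reported this way needs at least one tag-to-word production before the parser will accept the sentence, which is what the next listing adds.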
         
 
      
 
     
   

 
    # extend productions for sample sentence tokens
    for word, tag in tagged_sent:
        t = nltk.Tree.fromstring("(" + tag + " " + word + ")")
        for production in t.productions():
            treebank_productions.append(production)

    # rebuild grammar
    treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'),
                                                treebank_productions)

    # rebuild parser
    viterbi_parser = nltk.ViterbiParser(treebank_grammar)

    # get parse tree for sample sentence
    result = list(viterbi_parser.parse(tokens))


 
     
      
        
    In [77]: print(result[0])
    defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x7f91506c17b8>,
                {0: {'address': 0,
                     'ctag': 'TOP',
                     'deps': defaultdict(<class 'list'>, {'root': [5]}),
                     'feats': None,
                     'head': None,
                     'lemma': None,
                     'rel': None,
                     'tag': 'TOP',
                     'word': None},
                 1: {'address': 1,
                     'ctag': 'DT',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 3,
                     'lemma': '_',
                     'rel': 'det',
                     'tag': 'DT',
                     'word': 'The'},
                 2: {'address': 2,
                     'ctag': 'JJ',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 3,
                     'lemma': '_',
                     'rel': 'amod',
                     'tag': 'JJ',
                     'word': 'brown'},
                 3: {'address': 3,
                     'ctag': 'NN',
                     'deps': defaultdict(<class 'list'>, {'det': [1], 'amod': [2]}),
                     'feats': '_',
                     'head': 5,
                     'lemma': '_',
                     'rel': 'nsubj',
                     'tag': 'NN',
                     'word': 'fox'},
                 4: {'address': 4,
                     'ctag': 'VBZ',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 5,
                     'lemma': '_',
                     'rel': 'cop',
                     'tag': 'VBZ',
                     'word': 'is'},
                 5: {'address': 5,
                     'ctag': 'JJ',
                     'deps': defaultdict(<class 'list'>,
                                         {'cc': [6],
                                          'conj': [9],
                                          'cop': [4],
                                          'nsubj': [3]}),
                     'feats': '_',
                     'head': 0,
                     'lemma': '_',
                     'rel': 'root',
                     'tag': 'JJ',
                     'word': 'quick'},
                 6: {'address': 6,
                     'ctag': 'CC',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 5,
                     'lemma': '_',
                     'rel': 'cc',
                     'tag': 'CC',
                     'word': 'and'},
                 7: {'address': 7,
                     'ctag': 'PRP',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 9,
                     'lemma': '_',
                     'rel': 'nsubj',
                     'tag': 'PRP',
                     'word': 'he'},
                 8: {'address': 8,
                     'ctag': 'VBZ',
                     'deps': defaultdict(<class 'list'>, {}),
                     'feats': '_',
                     'head': 9,
                     'lemma': '_',
                     'rel': 'aux',
                     'tag': 'VBZ',
         
 
                            
          'word' 
          :  
          'is' 
          }, 
         
 
                        
          9 
          : { 
          'address' 
          :  
          9 
          , 
         
 
                            
          'ctag' 
          :  
          'VBG' 
          , 
         
 
                            
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, 
         
 
                                                
          { 
          'aux' 
          : [ 
          8 
          ], 
         
 
                                                 
          'nmod' 
          : [ 
          13 
          ], 
         
 
                                                 
          'nsubj' 
          : [ 
          7 
          ]}), 
         
 
                            
          'feats' 
          :  
          '_' 
          , 
         
 
                            
          'head' 
          :  
          5 
          , 
         
 
                            
          'lemma' 
          :  
          '_' 
          , 
         
 
                            
          'rel' 
          :  
          'conj' 
          , 
         
 
                            
          'tag' 
          :  
          'VBG' 
          , 
         
 
                            
          'word' 
          :  
          'jumping' 
          }, 
         
 
                        
          10 
          : { 
          'address' 
          :  
          10 
          , 
         
 
                             
          'ctag' 
          :  
          'IN' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          13 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'case' 
          , 
         
 
                             
          'tag' 
          :  
          'IN' 
          , 
         
 
                             
          'word' 
          :  
          'over' 
          }, 
         
 
                        
          11 
          : { 
          'address' 
          :  
          11 
          , 
         
 
                             
          'ctag' 
          :  
          'DT' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          13 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'det' 
          , 
         
 
                             
          'tag' 
          :  
          'DT' 
          , 
         
 
                             
          'word' 
          :  
          'the' 
          }, 
         
 
                        
          12 
          : { 
          'address' 
          :  
          12 
          , 
         
 
                             
          'ctag' 
          :  
          'JJ' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, {}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          13 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'amod' 
          , 
         
 
                             
          'tag' 
          :  
          'JJ' 
          , 
         
 
                             
          'word' 
          :  
          'lazy' 
          }, 
         
 
                        
          13 
          : { 
          'address' 
          :  
          13 
          , 
         
 
                             
          'ctag' 
          :  
          'NN' 
          , 
         
 
                             
          'deps' 
          : defaultdict(< 
          class  
          'list' 
          >, 
         
 
                                                 
          { 
          'amod' 
          : [ 
          12 
          ], 
         
 
                                                  
          'case' 
          : [ 
          10 
          ], 
         
 
                                                  
          'det' 
          : [ 
          11 
          ]}), 
         
 
                             
          'feats' 
          :  
          '_' 
          , 
         
 
                             
          'head' 
          :  
          9 
          , 
         
 
                             
          'lemma' 
          :  
          '_' 
          , 
         
 
                             
          'rel' 
          :  
          'nmod' 
          , 
         
 
                             
          'tag' 
          :  
          'NN' 
          , 
         
 
                             
          'word' 
          :  
          'dog' 
          }})
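
The node dictionary printed above is easier to read once it is flattened into (head, relation, dependent) triples. The following is a minimal sketch of that idea, not the parser's own API: the `nodes` mapping is a hand-copied subset of the output above (nodes 5 to 13, keeping only the fields the sketch needs), and `iter_relations` is a hypothetical helper introduced purely for illustration. With a real nltk `DependencyGraph`, the built-in `triples()` method should serve a similar purpose.

```python
# Hand-copied subset of the node dictionary printed above (nodes 5-13 only).
# Fields not needed for this sketch ('ctag', 'deps', 'feats', 'lemma') are omitted.
nodes = {
    5: {"address": 5, "head": 0, "rel": "root", "tag": "JJ", "word": "quick"},
    6: {"address": 6, "head": 5, "rel": "cc", "tag": "CC", "word": "and"},
    7: {"address": 7, "head": 9, "rel": "nsubj", "tag": "PRP", "word": "he"},
    8: {"address": 8, "head": 9, "rel": "aux", "tag": "VBZ", "word": "is"},
    9: {"address": 9, "head": 5, "rel": "conj", "tag": "VBG", "word": "jumping"},
    10: {"address": 10, "head": 13, "rel": "case", "tag": "IN", "word": "over"},
    11: {"address": 11, "head": 13, "rel": "det", "tag": "DT", "word": "the"},
    12: {"address": 12, "head": 13, "rel": "amod", "tag": "JJ", "word": "lazy"},
    13: {"address": 13, "head": 9, "rel": "nmod", "tag": "NN", "word": "dog"},
}


def iter_relations(nodes):
    """Yield (head_word, relation, dependent_word) for each node, in address order.

    Hypothetical helper, not part of nltk. A head address that is missing
    from the mapping (such as 0, the artificial root) is reported as 'ROOT'.
    """
    for address in sorted(nodes):
        node = nodes[address]
        head = nodes.get(node["head"])  # .get() avoids creating entries in a defaultdict
        head_word = head["word"] if head else "ROOT"
        yield head_word, node["rel"], node["word"]


relations = list(iter_relations(nodes))
for head_word, rel, word in relations:
    print(f"{head_word} --{rel}--> {word}")
```

Each node's 'deps' entry is the inverse index of the same information: node 13 ('dog') lists 'amod': [12], 'case': [10] and 'det': [11], which matches nodes 12, 10 and 11 each reporting 'head': 13 with those relations.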