Spear Parser (2): EdgeLexer, the Treebank Token Reader Class

Date: 2020-05-12


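Penn Treebank annotation example

The most basic step in training a syntactic model is extracting rules from a treebank. Rules are built from nonterminals, words, and similar information, so the first step of training is being able to read that information. A tree in the WSJ .mrg annotation style of the Penn Treebank looks like this: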

 
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
  

Clearly, this tree notation is made up of three kinds of symbols: the left parenthesis (, the right parenthesis ), and strings such as S, NP-SBJ, and director.
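To make the three symbol classes concrete, here is a minimal sketch that counts them in one bracketed line. The names SymbolCounts and countSymbols are hypothetical helpers invented for this illustration; they are not part of Spear.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper type for illustration only; not part of Spear.
struct SymbolCounts {
    int leftParens = 0;
    int rightParens = 0;
    int strings = 0;
};

// Count the three symbol classes in one bracketed line.
// A "string" is a maximal run of characters that are neither
// parentheses nor whitespace.
SymbolCounts countSymbols(const std::string& line) {
    SymbolCounts counts;
    bool inString = false;  // true while scanning inside a string symbol
    for (char c : line) {
        const bool space = std::isspace(static_cast<unsigned char>(c)) != 0;
        if (c == '(' || c == ')' || space) {
            inString = false;  // a delimiter ends the current string
            if (c == '(') ++counts.leftParens;
            if (c == ')') ++counts.rightParens;
        } else if (!inString) {
            inString = true;   // first character of a new string symbol
            ++counts.strings;
        }
    }
    return counts;
}
```

For the line (NP (DT the) (NN board) ) this yields three left parentheses, three right parentheses, and five strings. In a well-formed tree the two parenthesis counts always match.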

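The treebank token reader class EdgeLexer

Spear provides a class, EdgeLexer, to read these three kinds of tokens. Because the input is a file, it adds a fourth token that marks the end of input. During model training this class is responsible for reading these four kinds of tokens and, when the token is a string, returning the text that was read.

The declaration of EdgeLexer is shown below.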

 
    class EdgeLexer
    {
    public:
        /* The token types */
        static const int TOKEN_EOF = 0;    /* end of input */
        static const int TOKEN_STRING = 1; /* string */
        static const int TOKEN_LP = 2;     /* left parenthesis */
        static const int TOKEN_RP = 3;     /* right parenthesis */

        EdgeLexer(IStream &);

        /* Core function */
        int lexem(String &);
        int getLineCount() const { return _lineCount; }

    private:
        /** The stream */
        IStream & _stream;
        /** Line count */
        int _lineCount;
        /** Advance over white spaces */
        void skipWhiteSpaces();
        bool isSpace(Char c) const;
    };
  

As the declaration shows, EdgeLexer does three things: it counts lines, tests for whitespace, and reads tokens. Its implementation is shown below.

 
    /** Constructor */
    EdgeLexer::EdgeLexer(IStream & stream)
        : _stream(stream), _lineCount(1) {}

    /**
     * Tests whether a character is whitespace.
     * @c the character to test
     */
    bool EdgeLexer::isSpace(Char c) const
    {
        if (c != W(' ') &&
            c != W('\t') &&
            c != W('\n') &&
            c != W('\r')) {
            return false;
        }
        return true;
    }

    /** Skips whitespace, counting newlines along the way */
    void EdgeLexer::skipWhiteSpaces()
    {
        Char c;
        while ((c = _stream.get()) != EOF && isSpace(c)) {
            if (c == W('\n')) {
                _lineCount++;
            }
        }
        _stream.unget();
    }

    /** Macro: true if c is neither whitespace nor a parenthesis */
    #define STRING_CHAR(c) ( \
        c != W('(') && \
        c != W(')') && \
        ! isSpace(c) \
    )

    /**
     * Reads one token and returns its type; the content is stored in text.
     * For a parenthesis or EOF only the type is returned and text is
     * left untouched.
     * @text receives the string that was read
     * Returns TOKEN_EOF at end of input and TOKEN_STRING for a string.
     */
    int EdgeLexer::lexem(String & text)
    {
        skipWhiteSpaces();
        Char c = _stream.get();
        if (c == EOF) {
            return TOKEN_EOF;
        } else if (c == W('(')) {
            return TOKEN_LP;
        } else if (c == W(')')) {
            return TOKEN_RP;
        } else if (STRING_CHAR(c)) {
            OStringStream buffer;
            buffer << c;
            /* compare with EOF here, not TOKEN_EOF (the token code 0) */
            while ((c = _stream.get()) != EOF && STRING_CHAR(c)) {
                buffer << c;
            }
            text = buffer.str();
            _stream.unget();
            return TOKEN_STRING;
        }
        // should never get here
        return TOKEN_EOF;
    }
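The listing above depends on Spear's own IStream, String, Char, and W definitions, which are not shown in this article. As a rough way to try the same logic in isolation, here is a sketch built only on the standard library. SimpleLexer and collectStrings are hypothetical names made up for this example, and the use of peek() in place of get()/unget() is a substitution of mine, not how Spear does it.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration; Spear's EdgeLexer uses its own
// IStream/String/Char types rather than the std ones used here.
class SimpleLexer {
public:
    enum Token { TOKEN_EOF = 0, TOKEN_STRING = 1, TOKEN_LP = 2, TOKEN_RP = 3 };

    explicit SimpleLexer(std::istream& stream)
        : _stream(stream), _lineCount(1) {}

    // Read one token; text is filled only when TOKEN_STRING is returned.
    Token lexem(std::string& text) {
        skipWhiteSpaces();
        const int c = _stream.get();  // int keeps EOF distinct from any char
        if (c == std::char_traits<char>::eof()) return TOKEN_EOF;
        if (c == '(') return TOKEN_LP;
        if (c == ')') return TOKEN_RP;
        text.assign(1, static_cast<char>(c));
        while (isStringChar(_stream.peek())) {
            text += static_cast<char>(_stream.get());
        }
        return TOKEN_STRING;
    }

    int getLineCount() const { return _lineCount; }

private:
    static bool isSpace(int c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }
    static bool isStringChar(int c) {
        return c != std::char_traits<char>::eof() &&
               c != '(' && c != ')' && !isSpace(c);
    }
    // Consume whitespace, counting newlines as it goes.
    void skipWhiteSpaces() {
        while (isSpace(_stream.peek())) {
            if (_stream.get() == '\n') ++_lineCount;
        }
    }

    std::istream& _stream;
    int _lineCount;
};

// Convenience wrapper: lex an in-memory string, keep only the string tokens.
std::vector<std::string> collectStrings(const std::string& input) {
    std::istringstream in(input);
    SimpleLexer lexer(in);
    std::vector<std::string> out;
    std::string text;
    SimpleLexer::Token t;
    while ((t = lexer.lexem(text)) != SimpleLexer::TOKEN_EOF) {
        if (t == SimpleLexer::TOKEN_STRING) out.push_back(text);
    }
    return out;
}
```

Looking ahead with peek() means no character ever has to be put back, and holding the result of get() in an int keeps the end-of-input value separate from every real character. Whether Spear's Char type already guarantees that separation is not something the article states.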
  

EdgeLexer usage example

Read a treebank file and print all of its nonterminals and words.

 
    #include <cstdio>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>
    // plus the Spear header that declares EdgeLexer

    using namespace std;

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            printf("[Usage]:%s [treebank]\n", argv[0]);
            exit(0);
        }
        ifstream is(argv[1]);
        EdgeLexer lex(is);
        string text;
        int l;
        while ((l = lex.lexem(text)) != EdgeLexer::TOKEN_EOF) {
            if (l == EdgeLexer::TOKEN_STRING) { // print the string
                cout << " " << text;
                cout << endl;
            }
        }
        return 0;
    }
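The program above prints nonterminals and words mixed together. One simple way to tell them apart, sketched below, relies on an assumption about the bracketing: a string that directly follows a left parenthesis is a node label (a nonterminal or a part-of-speech tag), and any other string is a word. TreeSymbols and splitLabelsAndWords are hypothetical names for this example and are not part of Spear; the sketch scans a string directly so that it stays self-contained.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for illustration only; not part of Spear.
struct TreeSymbols {
    std::vector<std::string> labels;  // strings directly after a '('
    std::vector<std::string> words;   // every other string
};

// Assumption: in this bracketing, the first string after '(' is a label.
TreeSymbols splitLabelsAndWords(const std::string& tree) {
    TreeSymbols result;
    bool afterLeftParen = false;  // was the last token a '('?
    std::size_t i = 0;
    while (i < tree.size()) {
        const char c = tree[i];
        if (c == '(') {
            afterLeftParen = true;
            ++i;
        } else if (c == ')') {
            afterLeftParen = false;
            ++i;
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // whitespace only separates tokens
        } else {
            const std::size_t start = i;
            while (i < tree.size() && tree[i] != '(' && tree[i] != ')' &&
                   !std::isspace(static_cast<unsigned char>(tree[i]))) {
                ++i;
            }
            const std::string text = tree.substr(start, i - start);
            if (afterLeftParen) {
                result.labels.push_back(text);
            } else {
                result.words.push_back(text);
            }
            afterLeftParen = false;
        }
    }
    return result;
}
```

On (NP (DT the) (NN board) ) this puts NP, DT, and NN in labels and the, board in words. The same flag could be kept across calls to lexem by remembering whether the previous token was TOKEN_LP.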
  
