Penn Treebank annotation example
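The most basic step in training a syntactic model is extracting rules from a treebank. Rules are built from nonterminal symbols, words, and similar information, so the first step of training is being able to read that information out of the trees. A tree in the Penn Treebank WSJ .mrg annotation style looks like this: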
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
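Clearly, the markup of such a tree consists of three kinds of symbols: the left parenthesis (, the right parenthesis ), and strings such as S, NP-SBJ, and director.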
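To make the three symbol kinds concrete, here is a minimal sketch that tallies them in a bracketed string. It uses only the C++ standard library; the names SymbolCounts and countSymbols are made up for this illustration and are not part of Spear.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper type (not part of Spear): one counter per symbol kind.
struct SymbolCounts {
    int leftParens = 0;
    int rightParens = 0;
    int strings = 0;
};

// Counts left parentheses, right parentheses, and strings in a bracketed tree.
SymbolCounts countSymbols(const std::string& tree) {
    SymbolCounts counts;
    bool inString = false;  // true while scanning inside a label or a word
    for (char ch : tree) {
        const bool space = std::isspace(static_cast<unsigned char>(ch)) != 0;
        if (ch == '(' || ch == ')' || space) {
            inString = false;  // any delimiter ends the current string
            if (ch == '(') ++counts.leftParens;
            if (ch == ')') ++counts.rightParens;
        } else if (!inString) {
            inString = true;   // first character of a new string
            ++counts.strings;
        }
    }
    return counts;
}
```

For the fragment (NP (DT the) (NN board) ) this gives 3 left parentheses, 3 right parentheses, and 5 strings (NP, DT, the, NN, board).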
The treebank token reader class EdgeLexer
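Spear provides a class, EdgeLexer, to read these three kinds of token, and adds a fourth, end-of-file token to handle the end of the input file. During model training this class is responsible for reading these four kinds of token, and when the token is a string it also returns the text that was read.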
EdgeLexer is declared as follows.
class EdgeLexer
{
public:
    /* Token types */
    static const int TOKEN_EOF = 0;     /* end of input */
    static const int TOKEN_STRING = 1;  /* string */
    static const int TOKEN_LP = 2;      /* left parenthesis */
    static const int TOKEN_RP = 3;      /* right parenthesis */

    EdgeLexer(IStream &);

    /* Core function */
    int lexem(String &);

    int getLineCount() const { return _lineCount; }

private:
    /** The stream */
    IStream & _stream;

    /** Line count */
    int _lineCount;

    /** Advance over white spaces */
    void skipWhiteSpaces();

    bool isSpace(Char c) const;
};
As the declaration shows, EdgeLexer does three things: counting lines, testing for whitespace, and reading tokens. Its implementation is shown below.
/** Constructor */
EdgeLexer::EdgeLexer(IStream & stream)
    : _stream(stream), _lineCount(1) {}

/**
 * Tests whether a character is whitespace.
 * @param c the character to test
 */
bool EdgeLexer::isSpace(Char c) const
{
    if(c != W(' ') &&
       c != W('\t') &&
       c != W('\n') &&
       c != W('\r')){
        return false;
    }
    return true;
}

/** Skips whitespace, counting newlines along the way */
void EdgeLexer::skipWhiteSpaces()
{
    Char c;
    while((c = _stream.get()) != EOF && isSpace(c)){
        if(c == W('\n')){
            _lineCount ++;
        }
    }
    _stream.unget();
}

/** Macro: true if c is neither whitespace nor a parenthesis */
#define STRING_CHAR(c) ( \
    (c) != W('(') && \
    (c) != W(')') && \
    ! isSpace(c) \
    )

/**
 * Reads one token and returns its type; a string token's content is stored in text.
 * For a parenthesis or EOF only the type is returned and text is left unchanged.
 * @param text receives the string that was read
 * @return TOKEN_EOF at end of input, TOKEN_STRING for a string,
 *         TOKEN_LP or TOKEN_RP for a parenthesis
 */
int EdgeLexer::lexem(String & text)
{
    skipWhiteSpaces();
    Char c = _stream.get();
    if(c == EOF){
        return TOKEN_EOF;
    }
    else if(c == W('(')){
        return TOKEN_LP;
    }
    else if(c == W(')')){
        return TOKEN_RP;
    }
    else if(STRING_CHAR(c)){
        OStringStream buffer;
        buffer << c;
        /* compare against the stream's EOF marker, not the token code TOKEN_EOF (0) */
        while((c = _stream.get()) != EOF && STRING_CHAR(c)){
            buffer << c;
        }
        text = buffer.str();
        _stream.unget();
        return TOKEN_STRING;
    }

    // should never get here
    return TOKEN_EOF;
}
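The listing above depends on Spear's own IStream, String, Char, and W definitions, which are not shown here. As a rough, self-contained approximation, the same logic can be sketched over std::istream and std::string. The class name SimpleTreeLexer and the helper collectStrings are hypothetical, and the sketch uses peek() in place of get()/unget(), which is a choice made for this example rather than how Spear does it.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch only: same four token kinds as EdgeLexer, built on std::istream.
class SimpleTreeLexer {
public:
    enum Token { TOKEN_EOF = 0, TOKEN_STRING = 1, TOKEN_LP = 2, TOKEN_RP = 3 };

    explicit SimpleTreeLexer(std::istream& in) : in_(in), lineCount_(1) {}

    // Reads the next token; text is filled only for TOKEN_STRING.
    Token next(std::string& text) {
        skipWhiteSpaces();
        const int c = in_.get();  // int, so EOF differs from every character
        if (c == std::char_traits<char>::eof()) return TOKEN_EOF;
        if (c == '(') return TOKEN_LP;
        if (c == ')') return TOKEN_RP;
        text.assign(1, static_cast<char>(c));
        while (isStringChar(in_.peek())) {
            text.push_back(static_cast<char>(in_.get()));
        }
        return TOKEN_STRING;
    }

    int lineCount() const { return lineCount_; }

private:
    static bool isSpace(int c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }
    static bool isStringChar(int c) {
        return c != std::char_traits<char>::eof() &&
               c != '(' && c != ')' && !isSpace(c);
    }
    // Consumes whitespace and counts the newlines it passes.
    void skipWhiteSpaces() {
        while (isSpace(in_.peek())) {
            if (in_.get() == '\n') ++lineCount_;
        }
    }

    std::istream& in_;
    int lineCount_;
};

// Usage sketch: collect every string token from an in-memory tree.
std::vector<std::string> collectStrings(const std::string& input) {
    std::istringstream in(input);
    SimpleTreeLexer lexer(in);
    std::vector<std::string> out;
    std::string text;
    SimpleTreeLexer::Token tok;
    while ((tok = lexer.next(text)) != SimpleTreeLexer::TOKEN_EOF) {
        if (tok == SimpleTreeLexer::TOKEN_STRING) out.push_back(text);
    }
    return out;
}
```

Reading the character into an int keeps the end-of-file value distinct from every real character, so a file that ends in the middle of a string still terminates the loop cleanly.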
EdgeLexer usage example
Read a treebank file and print all nonterminals and words.
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
// The Spear header that declares EdgeLexer must also be included.
using namespace std;

int main(int argc, char **argv)
{
    if(argc != 2){
        printf("[Usage]:%s [treebank]\n", argv[0]);
        exit(0);
    }
    ifstream is(argv[1]);
    EdgeLexer lex(is);
    string text;
    int l;
    while((l = lex.lexem(text)) != EdgeLexer::TOKEN_EOF){
        if(l == EdgeLexer::TOKEN_STRING){  // print each string token on its own line
            cout << " " << text;
            cout << endl;
        }
    }
    return 0;
}
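The loop above prints every string token without telling labels apart from words. In the bracketed format, the string that directly follows a left parenthesis is a label (a nonterminal or part-of-speech tag), and any other string is a word. A minimal sketch of that distinction, using only the standard library rather than EdgeLexer, with the hypothetical names TreeSymbols and splitSymbols:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical result type (not part of Spear).
struct TreeSymbols {
    std::vector<std::string> labels;  // strings right after '(': S, NP-SBJ, NNP, ...
    std::vector<std::string> words;   // all other strings: Pierre, Vinken, ...
};

// Splits the strings of a bracketed tree into labels and words.
TreeSymbols splitSymbols(const std::string& tree) {
    TreeSymbols result;
    bool expectLabel = false;  // set by '(' and cleared by the next string or ')'
    std::string current;
    // The appended space guarantees that the last string is flushed.
    for (char ch : tree + " ") {
        const bool delimiter =
            ch == '(' || ch == ')' ||
            std::isspace(static_cast<unsigned char>(ch)) != 0;
        if (!delimiter) {
            current.push_back(ch);
            continue;
        }
        if (!current.empty()) {
            (expectLabel ? result.labels : result.words).push_back(current);
            expectLabel = false;
            current.clear();
        }
        if (ch == '(') {
            expectLabel = true;
        } else if (ch == ')') {
            expectLabel = false;
        }
    }
    return result;
}
```

For (NP (DT the) (NN board) ) the labels are NP, DT, NN and the words are the, board. Note that in (, ,) the first comma is the tag and the second is the word, so the position of a string, not its spelling, decides which list it lands in.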