Go Translation: Lexical Analysis and Parsing, Part Two

Date: 2019-12-06

Author: Adam Presley | URL: adampresley.github.io/2015/05/12/…html

Translator's Preface

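This article walks through the lexer implementation in detail. If you get stuck while reading, read it alongside the source code; the code snippets here are only meant to convey the approach. Parsing is covered in the next article.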

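I recently took a quick look at the Go source. Under src/go there are several packages, and token, scanner, and parser appear to be the core of Go's own lexing and parsing implementation. Open the token directory and you will find that its source has a lot in common with what the previous part introduced.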

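I am juggling several tasks at the moment, so I can't publish updates as quickly as I would like. Besides this series, I have listed a few other related articles on lexing below. If you read English comfortably, feel free to go through them yourself.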

A look at Go lexer/scanner packages
Rob Pike's Functional Way
Handwritten Parser & Lexers In Go

The translation follows:

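In the first article of this series (original English version), I introduced some basic concepts of lexing and parsing and the basic makeup of an INI file. We then created some of the structs and constants that will help us implement the INI parser.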

This article digs into the details of lexical analysis.

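Lexical analysis (lexing) is the process of turning input text into a sequence of tokens. A token is a smaller unit than the text as a whole; only when tokens are combined do they produce something meaningful, such as a program or a configuration file.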

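For the INI files in this series, the tokens are the left bracket, right bracket, section name, key, value, and equal sign. Combine them in the right order and you have an INI file. The lexer's job is to read the contents of the INI file, analyze them to create tokens, and send those tokens to the parser over a channel.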
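To make the token list concrete, here is a small sketch of the token stream a lexer should produce for the input "[core]\nname=demo\n". The TokenType and Token definitions are simplified stand-ins for the lexertoken package from Part One, and the constant names are assumptions chosen for illustration.

```go
package main

import "fmt"

// TokenType and Token are simplified stand-ins for the lexertoken
// package from Part One; the names are assumptions for illustration.
type TokenType int

const (
	TokenError TokenType = iota
	TokenEOF
	TokenLeftBracket
	TokenRightBracket
	TokenEqualSign
	TokenSection
	TokenKey
	TokenValue
)

type Token struct {
	Type  TokenType
	Value string
}

// ExpectedTokens lists, by hand, the tokens a lexer should produce for
// the input "[core]\nname=demo\n". No lexing happens here.
func ExpectedTokens() []Token {
	return []Token{
		{TokenLeftBracket, "["},
		{TokenSection, "core"},
		{TokenRightBracket, "]"},
		{TokenKey, "name"},
		{TokenEqualSign, "="},
		{TokenValue, "demo"},
		{TokenEOF, ""},
	}
}

func main() {
	for _, tok := range ExpectedTokens() {
		fmt.Printf("%d %q\n", tok.Type, tok.Value)
	}
}
```

Note that whitespace and newlines do not appear as tokens of their own; they only separate the tokens that matter.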

The Lexer

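To turn text into tokens we need to keep track of a few things: the text itself, the position currently being analyzed, and the start and end positions of the token currently being analyzed.

Once a token has been analyzed, we also need to send it to the parser, which we can do over a channel.

We also need a function that tracks the lexer's state. In his talk, Rob Pike describes using functions to track the lexer's current state and the state it expects next. Put simply, each function handles one token and returns the state function that will produce the next expected token. From here on I will simply call these state functions.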

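Let's take an example.

A section in an INI file has three parts: a left bracket, a section name, and a right bracket. The first function emits a left-bracket token and returns the state function for the section name. That function handles the section name and returns the state function for the right bracket. The overall order is: left bracket -> section name -> right bracket.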
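The chain described above can be sketched in a few lines. This is only an illustration of the state-function pattern: the machine type, the function names, and the use of nil to stop are all hypothetical, and the functions record their names instead of producing real tokens.

```go
package main

import "fmt"

// machine is a hypothetical stand-in for the lexer; it only records
// which state functions ran, in order.
type machine struct {
	trace []string
}

// stateFn handles one token and returns the state expected next.
type stateFn func(*machine) stateFn

func leftBracket(m *machine) stateFn {
	m.trace = append(m.trace, "left bracket")
	return sectionName
}

func sectionName(m *machine) stateFn {
	m.trace = append(m.trace, "section name")
	return rightBracket
}

func rightBracket(m *machine) stateFn {
	m.trace = append(m.trace, "right bracket")
	return nil // nil ends the run in this sketch
}

// run drives the machine until a state function returns nil.
func run(m *machine, start stateFn) {
	for state := start; state != nil; {
		state = state(m)
	}
}

func main() {
	m := &machine{}
	run(m, leftBracket)
	fmt.Println(m.trace)
}
```

The point of the pattern is that the "what comes next" decision lives in the return value, so no separate state enum or switch statement is needed.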

Seeing is believing, so let's look at the lexer's structure:

Lexer.go

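type Lexer struct {
  Name   string
  Input  string                // input text
  Tokens chan lexertoken.Token // channel used to send tokens to the parser
  State  LexFn                 // the state function described above

  Start int // start position of the current token; its end is Start + len(token)
  Pos   int // current position in the input; once a token's end is identified, Pos is that token's end position
  Width int // not described in this article; presumably the width of the last rune read
}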

LexFn.go

type LexFn func(*Lexer) LexFn // Lexer state function: returns the function that lexes the next expected token.

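In the previous article we defined the Token struct. LexFn is the type of the lexer state functions that process tokens.

Now let's give our lexer a few more capabilities. The Lexer exists to process text, so to get at the next token we add methods for reading runes, skipping whitespace, and a few other useful things. They are mostly simple text-handling methods.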

// These methods need the "unicode" package imported.

/* Puts a token onto the token channel. The value of this token is read from the input based on the current lexer position. */
func (this *Lexer) Emit(tokenType lexertoken.TokenType) {
    this.Tokens <- lexertoken.Token{Type: tokenType, Value: this.Input[this.Start:this.Pos]}
    this.Start = this.Pos // the next token begins where this one ended
}

/* Increment the position */
func (this *Lexer) Inc() {
    this.Pos++
    // Pos is a byte offset (Emit slices Input with it), so compare against the byte length.
    if this.Pos >= len(this.Input) {
        this.Emit(lexertoken.TOKEN_EOF)
    }
}

/* Return a slice of the input from the current lexer position to the end of the input string. */
func (this *Lexer) InputToEnd() string {
    return this.Input[this.Pos:]
}

/* Skips whitespace until we get something meaningful */
func (this *Lexer) SkipWhiteSpace() {
    for {
        ch := this.Next()

        // Check for EOF first: EOF is not whitespace, so a check placed after the one below is never reached.
        if ch == lexertoken.EOF {
            this.Emit(lexertoken.TOKEN_EOF)
            break
        }

        if !unicode.IsSpace(ch) {
            this.Dec() // step back so the non-space character is read again
            break
        }
    }
}
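The helpers above call Next, Dec, and IsEOF, which this article does not show. The following is a minimal sketch of what they might look like, not the author's code. It assumes Pos is a byte offset into Input, Width holds the byte width of the last rune read, and EOF is the rune 0.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// EOF is an assumed sentinel rune standing in for lexertoken.EOF.
const EOF rune = 0

// Lexer is trimmed to the fields these helpers need.
type Lexer struct {
	Input string
	Pos   int // byte offset of the next rune to read
	Width int // byte width of the rune most recently read
}

// Next returns the rune at Pos and advances past it.
// At the end of the input it returns EOF and does not move.
func (l *Lexer) Next() rune {
	if l.Pos >= len(l.Input) {
		l.Width = 0
		return EOF
	}
	r, w := utf8.DecodeRuneInString(l.Input[l.Pos:])
	l.Width = w
	l.Pos += w
	return r
}

// Dec steps back over the rune most recently returned by Next.
func (l *Lexer) Dec() {
	l.Pos -= l.Width
}

// IsEOF reports whether every byte of Input has been consumed.
func (l *Lexer) IsEOF() bool {
	return l.Pos >= len(l.Input)
}

func main() {
	l := &Lexer{Input: "é=1"}
	r := l.Next()
	fmt.Printf("%c width=%d pos=%d\n", r, l.Width, l.Pos)
	l.Dec()
	fmt.Println("after Dec, pos =", l.Pos)
}
```

Tracking the width is what lets Dec undo a multi-byte rune correctly; a plain Pos-- would land in the middle of a character for non-ASCII input.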

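The key thing to understand is how tokens are read and emitted. It involves a few steps:

First, keep reading characters until a complete token has been identified. For example, the state function for a section name can only confirm the name once it reaches the right bracket. Next, send the token and its type to the parser over the channel. Finally, decide which state function is expected next and return it.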
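The first two steps can be sketched as follows. This is a simplified illustration, not the article's code: Token uses a plain string for its type, and scanSectionName is a hypothetical helper that returns a bool instead of a state function.

```go
package main

import "fmt"

// Token is a simplified stand-in for lexertoken.Token; Type is a plain
// string here to keep the sketch short.
type Token struct {
	Type  string
	Value string
}

// Lexer is trimmed to the fields Emit needs.
type Lexer struct {
	Input  string
	Start  int // where the pending token begins
	Pos    int // current read position
	Tokens chan Token
}

// Emit sends Input[Start:Pos] as a token and moves Start up to Pos.
func (l *Lexer) Emit(tokenType string) {
	l.Tokens <- Token{Type: tokenType, Value: l.Input[l.Start:l.Pos]}
	l.Start = l.Pos
}

// scanSectionName advances Pos to the next ']' and emits the text it
// passed over. It reports false if the input ends first.
func scanSectionName(l *Lexer) bool {
	for l.Pos < len(l.Input) {
		if l.Input[l.Pos] == ']' {
			l.Emit("SECTION")
			return true
		}
		l.Pos++
	}
	return false
}

func main() {
	// Start and Pos sit just past the '[' of "[core]".
	l := &Lexer{Input: "[core]", Start: 1, Pos: 1, Tokens: make(chan Token, 1)}
	if scanSectionName(l) {
		fmt.Printf("%+v\n", <-l.Tokens)
	}
}
```

After Emit, Start and Pos are equal again, so the next state function begins with an empty pending token.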

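Let's first define a start function. It is also the entry point the parser (next article) uses. It initializes a Lexer and gives it its first state function.

What might the first expected token be? A special symbol, or a keyword?

In our case the first state function gets the generic name LexBegin, because an INI file may begin with a section, or it may have no section at all and start straight away with key/value pairs. LexBegin handles that decision.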

/* Start a new lexer with a given input string. This returns the lexer instance; its Tokens channel is the stream to read in order to parse the input and perform processing. */
func BeginLexing(name, input string) *lexer.Lexer {
    l := &lexer.Lexer{
        Name: name,
        Input: input,
        State: lexer.LexBegin,
        Tokens: make(chan lexertoken.Token, 3),
    }

    return l
}
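BeginLexing only builds the lexer; something still has to run the state functions. One way to do that without a goroutine is sketched below: return a buffered token if there is one, otherwise run the current state function once and try again. The NextToken method and the toy lexChar state are hypothetical names for this sketch, and the channel buffer must be large enough to hold every token a single state call emits.

```go
package main

import "fmt"

type Token struct {
	Type  string
	Value string
}

type LexFn func(*Lexer) LexFn

type Lexer struct {
	Input  string
	Start  int
	Pos    int
	State  LexFn
	Tokens chan Token
}

func (l *Lexer) Emit(tokenType string) {
	l.Tokens <- Token{Type: tokenType, Value: l.Input[l.Start:l.Pos]}
	l.Start = l.Pos
}

// lexChar is a toy state function: one token per byte, then EOF.
func lexChar(l *Lexer) LexFn {
	if l.Pos >= len(l.Input) {
		l.Emit("EOF")
		return nil
	}
	l.Pos++
	l.Emit("CHAR")
	return lexChar
}

// NextToken (hypothetical) returns a buffered token if there is one;
// otherwise it runs the current state function once and tries again.
func (l *Lexer) NextToken() Token {
	for {
		select {
		case tok := <-l.Tokens:
			return tok
		default:
			if l.State == nil {
				// Lexing has finished and the buffer is drained.
				return Token{Type: "EOF"}
			}
			l.State = l.State(l)
		}
	}
}

func main() {
	l := &Lexer{Input: "ab", State: lexChar, Tokens: make(chan Token, 3)}
	for {
		tok := l.NextToken()
		fmt.Printf("%s %q\n", tok.Type, tok.Value)
		if tok.Type == "EOF" {
			break
		}
	}
}
```

The alternative from Rob Pike's talk is to run the state loop in its own goroutine and let the parser simply receive from the channel; the pull-style loop here trades that concurrency for simpler control flow.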

Getting Started

The first state function is LexBegin.

/* This lexer function starts everything off. It determines if we are beginning with a key/value assignment or a section. */
func LexBegin(lexer *Lexer) LexFn {
    lexer.SkipWhiteSpace() // spelled as the method is defined on Lexer
    if strings.HasPrefix(lexer.InputToEnd(), lexertoken.LEFT_BRACKET) {
        return LexLeftBracket
    }
    return LexKey
}

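As you can see, it first skips all whitespace, which carries no meaning in an INI file. Then it checks whether the first character is a left bracket. If so, it returns LexLeftBracket; otherwise the input must start with a key, so it returns the LexKey state function.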
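The decision LexBegin makes can be tried out in isolation. In this sketch, nextStateName is a hypothetical helper that returns the name of the state function rather than the function itself, and it trims whitespace with strings.TrimLeftFunc instead of the lexer's own method.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// nextStateName mirrors the decision LexBegin makes: skip leading
// whitespace, then pick the section path or the key path.
func nextStateName(input string) string {
	rest := strings.TrimLeftFunc(input, unicode.IsSpace)
	if strings.HasPrefix(rest, "[") {
		return "LexLeftBracket"
	}
	return "LexKey"
}

func main() {
	fmt.Println(nextStateName("  \n[core]"))
	fmt.Println(nextStateName("name=demo"))
}
```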

Section

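Now for the section-handling logic.

A section name in an INI file is wrapped in left and right brackets, and key/value pairs can be grouped under a section. In LexBegin, if a left bracket is found, LexLeftBracket is returned.

The code for LexLeftBracket is: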

/* This lexer function emits a TOKEN_LEFT_BRACKET then returns the lexer for a section header. */
func LexLeftBracket(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.LEFT_BRACKET)
    lexer.Emit(lexertoken.TOKEN_LEFT_BRACKET)
    return LexSection
}

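The code is simple. It advances the lexer position by the length of the bracket (1) and then sends TOKEN_LEFT_BRACKET on the channel.

In this case the token's content is not very meaningful. Once Emit finishes, the start position is set to the lexer's current position, which sets things up for the next token. Finally, it returns LexSection, the state function that handles the section name.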

/* This lexer function emits a TOKEN_SECTION with the name of an INI file section header. */
func LexSection(lexer *Lexer) LexFn {
    for {
        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_MISSING_RIGHT_BRACKET)
        }

        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.RIGHT_BRACKET) {
            lexer.Emit(lexertoken.TOKEN_SECTION)
            return LexRightBracket
        }

        lexer.Inc()
    }
}

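The logic is a little more involved, but the basic idea is the same.

The function loops over the characters until it meets RIGHT_BRACKET, the right bracket, which is the only way to know where the section name ends. If it reaches EOF first, the INI file is malformed, so we report an error and send it to the parser over the channel. Otherwise it keeps looping until it finds the right bracket and then sends TOKEN_SECTION with the corresponding text.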
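Errorf is used above but not shown in this article. The sketch below assumes it sends an error token on the channel and returns nil so that lexing stops; that shape follows Rob Pike's design and is an assumption here, as are the string token types and the lexFrom helper.

```go
package main

import (
	"fmt"
	"strings"
)

type Token struct {
	Type  string
	Value string
}

type LexFn func(*Lexer) LexFn

type Lexer struct {
	Input  string
	Start  int
	Pos    int
	Tokens chan Token
}

func (l *Lexer) Emit(tokenType string) {
	l.Tokens <- Token{Type: tokenType, Value: l.Input[l.Start:l.Pos]}
	l.Start = l.Pos
}

// Errorf (assumed shape) sends an error token and returns nil to stop lexing.
func (l *Lexer) Errorf(format string, args ...interface{}) LexFn {
	l.Tokens <- Token{Type: "ERROR", Value: fmt.Sprintf(format, args...)}
	return nil
}

// lexSection scans up to the closing bracket, or reports an error if
// the input ends before one is found.
func lexSection(l *Lexer) LexFn {
	for {
		if l.Pos >= len(l.Input) {
			return l.Errorf("missing right bracket after %q", l.Input[l.Start:])
		}
		if strings.HasPrefix(l.Input[l.Pos:], "]") {
			l.Emit("SECTION")
			return lexRightBracket
		}
		l.Pos++
	}
}

func lexRightBracket(l *Lexer) LexFn {
	l.Pos += len("]")
	l.Emit("RIGHT_BRACKET")
	return nil // the article's version returns LexBegin here
}

// lexFrom runs lexSection once on input, starting just after a leading
// "[", and returns the single token that call produced.
func lexFrom(input string) Token {
	l := &Lexer{Input: input, Start: 1, Pos: 1, Tokens: make(chan Token, 1)}
	lexSection(l)
	return <-l.Tokens
}

func main() {
	fmt.Printf("%+v\n", lexFrom("[core]"))
	fmt.Printf("%+v\n", lexFrom("[core"))
}
```

Sending the error as an ordinary token means the parser needs only one code path for reading from the lexer; it checks the token type to find out that something went wrong.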

The state function LexSection returns is LexRightBracket. Its logic resembles LexLeftBracket, except that it returns LexBegin, because a section may be empty or may contain key/value pairs.

/* This lexer function emits a TOKEN_RIGHT_BRACKET then returns the lexer for a begin. */
func LexRightBracket(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.RIGHT_BRACKET)
    lexer.Emit(lexertoken.TOKEN_RIGHT_BRACKET)
    return LexBegin
}

Key/Value

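Next comes key/value handling. The form is very simple: key=value.

The key comes first. As in LexSection, the function loops until it meets the equal sign, which marks a complete key. It then calls Emit to send the key and returns the state function LexEqualSign.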

/* This lexer function emits a TOKEN_KEY with the name of a key that will be assigned a value */
func LexKey(lexer *Lexer) LexFn {
    for {
        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.EQUAL_SIGN) {
            lexer.Emit(lexertoken.TOKEN_KEY)
            return LexEqualSign
        }

        lexer.Inc()
        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_UNEXPECTED_EOF)
        }
    }
}

The equal sign is very simple to handle, much like the brackets. It sends a TOKEN_EQUAL_SIGN token straight to the parser and returns LexValue.

/* This lexer function emits a TOKEN_EQUAL_SIGN then returns the lexer for value. */
func LexEqualSign(lexer *Lexer) LexFn {
    lexer.Pos += len(lexertoken.EQUAL_SIGN)
    lexer.Emit(lexertoken.TOKEN_EQUAL_SIGN) // Emit takes the token type, not the "=" literal

    return LexValue
}

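The last state function is LexValue, which handles the value part of a key/value pair. It recognizes a complete value when it meets a newline. It returns LexBegin so the next round of analysis can continue.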

/* This lexer function emits a TOKEN_VALUE with the value to be assigned to a key. */
func LexValue(lexer *Lexer) LexFn {
    for {
        if strings.HasPrefix(lexer.InputToEnd(), lexertoken.NEWLINE) {
            lexer.Emit(lexertoken.TOKEN_VALUE)
            return LexBegin
        }

        lexer.Inc()

        if lexer.IsEOF() {
            return lexer.Errorf(errors.LEXER_ERROR_UNEXPECTED_EOF)
        }
    }
}
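The key, equal-sign, and value states can be exercised together on a single line of input. The sketch below is a simplification under stated assumptions: tokens are collected in a slice instead of a channel, token types are plain strings, scanTo and LexPair are hypothetical helpers, and the value state stops instead of returning to LexBegin.

```go
package main

import (
	"fmt"
	"strings"
)

type Token struct {
	Type  string
	Value string
}

type LexFn func(*Lexer) LexFn

// Lexer collects tokens in a slice instead of a channel to keep the
// sketch short.
type Lexer struct {
	Input  string
	Start  int
	Pos    int
	Tokens []Token
}

func (l *Lexer) emit(tokenType string) {
	l.Tokens = append(l.Tokens, Token{tokenType, l.Input[l.Start:l.Pos]})
	l.Start = l.Pos
}

// scanTo advances Pos to the next occurrence of delim and reports
// whether one was found.
func (l *Lexer) scanTo(delim string) bool {
	i := strings.Index(l.Input[l.Pos:], delim)
	if i < 0 {
		return false
	}
	l.Pos += i
	return true
}

func lexKey(l *Lexer) LexFn {
	if !l.scanTo("=") {
		l.Tokens = append(l.Tokens, Token{"ERROR", "unexpected EOF"})
		return nil
	}
	l.emit("KEY")
	return lexEqualSign
}

func lexEqualSign(l *Lexer) LexFn {
	l.Pos += len("=")
	l.emit("EQUAL_SIGN")
	return lexValue
}

func lexValue(l *Lexer) LexFn {
	if !l.scanTo("\n") {
		l.Tokens = append(l.Tokens, Token{"ERROR", "unexpected EOF"})
		return nil
	}
	l.emit("VALUE")
	return nil // the article's LexValue returns LexBegin to continue
}

// LexPair runs the key, equal-sign, and value states over one line.
func LexPair(input string) []Token {
	l := &Lexer{Input: input}
	for state := LexFn(lexKey); state != nil; {
		state = state(l)
	}
	return l.Tokens
}

func main() {
	for _, tok := range LexPair("name=demo\n") {
		fmt.Printf("%s %q\n", tok.Type, tok.Value)
	}
}
```

As with the LexValue shown above, a value is only recognized when a newline follows it, so in this sketch a final line with no trailing newline ends in an error token rather than a value.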

What's Next

In Part 3, the last article in this series, we will show how to build a basic parser that turns the tokens from the lexer into the structured data we want.
