實現一個簡單的解釋器（4）

時間 2020-03-03

標籤實現一個簡單解釋器简体版

原文原文鏈接

譯自：https://ruslanspivak.com/lsbasi-part4/
（已得到做者受權，個別語句翻譯的不到位，我會將原句跟在後邊做爲參考）python

你是在被動地學習這些文章中的材料仍是在積極地實踐它？但願你一直在積極練習。git

孔子曾經說過：github

「聞之我也野。」
正則表達式

「視之我也饒。」
express

「行之我也明。」
編程

在上一篇文章中，咱們學習瞭如何解析（識別）和解釋具備任意數量的加或減運算的算術表達式，例如"7 - 3 + 2 - 1"，還了解了語法圖以及如何用它來表示(specify)編程語言的語法。編程語言

今天，你將學習如何解析和解釋具備任意數量的乘法和除法運算的算術表達式，例如"7 * 4 / 2 * 3"。本文中的除法將是整數除法，所以，若是表達式爲"9 / 4"，則答案將是整數：2。ide

今天，我還將談論普遍用於表示某一編程語言語法的標記方法，稱爲上下文無關文法或BNF（簡稱文法）（Backus-Naur形式）。在本文中，我將不使用純BNF表示法，而使用修改後的EBNF表示法。函數

咱們爲何要使用文法，這是幾個緣由：工具

一、文法以簡潔的方式指定編程語言的語法，與語法圖不一樣，文法很是緊湊，在之後的文章中會愈來愈多地使用文法。

二、文法能夠做爲出色的文檔。

三、即便你是從頭開始手動編寫解析器，文法也是一個不錯的起點。一般，你能夠按照一組簡單的規則將文法轉換爲代碼。

四、有一套稱爲解析器生成器的工具，能夠接受文法做爲輸入，並根據該文法自動生成解析器，我將在本系列的後面部分討論這些工具。

這是一個描述算術表達式的文法，例如"7 * 4 / 2 * 3"（這只是該文法能夠生成的衆多表達式之一）：

文法由一系列規則(rules)組成，也稱爲productions，咱們的文法有兩個規則：

一個規則由稱爲head或left-hand side的非終結符(non-terminal)開始，而後跟上一個冒號(colon)，最後由稱爲body或right-hand side的終結符(terminal)和/或非終結符(non-terminal)組成：(A rule consists of a non-terminal, called the head or left-hand side of the production, a colon, and a sequence of terminals and/or non-terminals, called the body or right-hand side of the production)
（這裏我翻譯的很繞，直接看圖會更清楚一點）

在上面顯示的文法中，諸如MUL，DIV和INTEGER之類的標記稱爲終結符(terminals)，而諸如expr和factor之類的變量稱爲非終結符(non-terminals)，非終結符一般由一系列終結符和/或非終結符組成：(In the grammar I showed above, tokens like MUL, DIV, and INTEGER are called terminals and variables like expr and factor are called non-terminals. Non-terminals usually consist of a sequence of terminals and/or non-terminals)

第一條規則左側的非終結符稱爲開始符(start symbol)，在咱們的文法中，開始符是expr：

你能夠將規則expr解釋爲：「expr能夠只是一個因數(factor)，也能夠可選地跟上乘法或除法運算符，而後再乘以另外一個因數，以後又可選地跟上乘法或除法運算符，而後再乘以另外一個因數，固然以後也能夠繼續循環下去」(An expr can be a factor optionally followed by a multiplication or division operator followed by another factor, which in turn is optionally followed by a multiplication or division operator followed by another factor and so on and so forth)

是什麼因數(factor)？在本文中，它只是一個整數。

讓咱們快速瀏覽一下文法中使用的符號及其含義：

一、"|"，表示或者，所以(MUL | DIV)表示MUL或DIV。

二、"(…)"，表示(MUL | DIV)中的終結符和/或非終結符的一個組(grouping)。

三、"(…)*"，表示該組能夠出現零次或屢次。

若是你過去使用過正則表達式，那麼這些符號你應該很是熟悉。

文法經過解釋如何造成句子來定義語言(A grammar defines a language by explaining what sentences it can form)，那麼咱們如何使用文法來推導出算術表達式呢？有這幾個步驟：首先從起始符expr開始，而後用該非終止符的規則主體重複替換非終止符，直到生成僅包含終止符的句子爲止，這樣咱們就經過文法來造成了語言。(first you begin with the start symbol expr and then repeatedly replace a non-terminal by the body of a rule for that non-terminal until you have generated a sentence consisting solely of terminals. Those sentences form a language defined by the grammar)

若是文法不能導出某個算術表達式，則表示它不支持該表達式，那麼解析器在嘗試識別該表達式時將生成錯誤。

這裏有幾個例子，你能夠看看來加深理解：
一、這是推導一個整數"3"的步驟：

二、這是推導"3 * 7"的步驟：

三、這是推導"3 * 7 / 2"的步驟：

確實這部分包含不少理論！

當我第一次閱讀文法和相關術語時會感受像這樣：

我能夠向你保證，我絕對不是這樣的：

我花了一些時間來熟悉符號(notation)，理解它如何工做方式以及它與解析器和詞法分析器之間的關係，我必須告訴你，因爲它在編譯器相關文章中獲得了普遍的使用，從長遠來看，你必定會碰到它，因此我推薦你早點認識它:)

如今，讓咱們將該文法映射到代碼。

這是咱們將文法轉換爲源代碼的一些指導方法(guideline)，遵循它們能夠將文法從字面上轉換爲有效的解析器：

一、語法中定義的每一個規則R能夠成爲具備相同名稱的函數(method)，而且對該規則的引用將成爲方法調用：R()，函數的主體中的語句流也依照與一樣的指導方法。(Each rule, R, defined in the grammar, becomes a method with the same name, and references to that rule become a method call: R(). The body of the method follows the flow of the body of the rule using the very same guidelines.)

二、(a1 | a2 | aN)轉換爲if-elif-else語句。(Alternatives (a1 | a2 | aN) become an if-elif-else statement.)

三、(…)轉換while語句，能夠循環零次或屢次。(An optional grouping (…) becomes a while statement that can loop over zero or more times.)

四、對Token的引用T轉換爲對eat函數的調用：eat(T)，也就是若是T的類型與當前的Token類型一致的話，eat函數將消耗掉T，而後從詞法分析器獲取一個新的Token並將其賦值給current_token這個變量。(Each token reference T becomes a call to the method eat: eat(T). The way the eat method works is that it consumes the token T if it matches the current lookahead token, then it gets a new token from the lexer and assigns that token to the current_token internal variable.)

準則總結以下所示：

讓咱們按照上述指導方法將文法轉換爲代碼。

咱們的語法有兩個規則：一個expr規則和一個因數規則。讓咱們從因數規則（生產）開始。根據準則，須要建立一個稱爲factor的函數（準則1），該函數需調用一次eat函數來消耗類型爲INTEGER的Token（準則4）：

def factor(self):
    self.eat(INTEGER)

很容易就能夠寫出來。

繼續！

規則expr轉換爲expr函數（準則1），規則的主體從對factor的引用開始，咱們轉換爲factor()函數調用，可選的組(…)*轉換爲while循環，而(MUL | DIV)轉換爲if-elif-else語句，將這些組合在一塊兒，咱們獲得如下expr函數：

def expr(self):
    self.factor()

    while self.current_token.type in (MUL, DIV):
        token = self.current_token
        if token.type == MUL:
            self.eat(MUL)
            self.factor()
        elif token.type == DIV:
            self.eat(DIV)
            self.factor()

花一些時間看看我如何將語法映射到源代碼，確保你瞭解。

爲了方便起見，我將上面的代碼放入了parser.py文件中，該文件包含一個詞法分析器和一個不帶解釋器的分析器，你能夠直接從GitHub下載文件並使用，它是交互式的，你能夠在其中輸入表達式並查看根據文法構建的解析器是否能夠識別表達式。

這是我在計算機上的運行效果：

$ python parser.py
calc> 3
calc> 3 * 7
calc> 3 * 7 / 2
calc> 3 *
Traceback (most recent call last):
  File "parser.py", line 155, in <module>
    main()
  File "parser.py", line 151, in main
    parser.parse()
  File "parser.py", line 136, in parse
    self.expr()
  File "parser.py", line 130, in expr
    self.factor()
  File "parser.py", line 114, in factor
    self.eat(INTEGER)
  File "parser.py", line 107, in eat
    self.error()
  File "parser.py", line 97, in error
    raise Exception('Invalid syntax')
Exception: Invalid syntax

試試看！

咱們再來看看相同expr規則的語法圖：

是時候開始編寫新的算術表達式解釋器的源代碼了，如下是一個計算器的代碼，該計算器能夠處理包含任意數量整數乘法和除法運算的算術表達式，還能夠看到咱們將詞法分析器重構爲一個單獨的Lexer類，並更新了Interpreter類以將Lexer實例做爲參數：

# Token types
#
# EOF (end-of-file) token is used to indicate that
# there is no more input left for lexical analysis
INTEGER, MUL, DIV, EOF = 'INTEGER', 'MUL', 'DIV', 'EOF'


class Token(object):
    def __init__(self, type, value):
        # token type: INTEGER, MUL, DIV, or EOF
        self.type = type
        # token value: non-negative integer value, '*', '/', or None
        self.value = value

    def __str__(self):
        """String representation of the class instance.

        Examples:
            Token(INTEGER, 3)
            Token(MUL, '*')
        """
        return 'Token({type}, {value})'.format(
            type=self.type,
            value=repr(self.value)
        )

    def __repr__(self):
        return self.__str__()


class Lexer(object):
    def __init__(self, text):
        # client string input, e.g. "3 * 5", "12 / 3 * 4", etc
        self.text = text
        # self.pos is an index into self.text
        self.pos = 0
        self.current_char = self.text[self.pos]

    def error(self):
        raise Exception('Invalid character')

    def advance(self):
        """Advance the `pos` pointer and set the `current_char` variable."""
        self.pos += 1
        if self.pos > len(self.text) - 1:
            self.current_char = None  # Indicates end of input
        else:
            self.current_char = self.text[self.pos]

    def skip_whitespace(self):
        while self.current_char is not None and self.current_char.isspace():
            self.advance()

    def integer(self):
        """Return a (multidigit) integer consumed from the input."""
        result = ''
        while self.current_char is not None and self.current_char.isdigit():
            result += self.current_char
            self.advance()
        return int(result)

    def get_next_token(self):
        """Lexical analyzer (also known as scanner or tokenizer)

        This method is responsible for breaking a sentence
        apart into tokens. One token at a time.
        """
        while self.current_char is not None:

            if self.current_char.isspace():
                self.skip_whitespace()
                continue

            if self.current_char.isdigit():
                return Token(INTEGER, self.integer())

            if self.current_char == '*':
                self.advance()
                return Token(MUL, '*')

            if self.current_char == '/':
                self.advance()
                return Token(DIV, '/')

            self.error()

        return Token(EOF, None)


class Interpreter(object):
    def __init__(self, lexer):
        self.lexer = lexer
        # set current token to the first token taken from the input
        self.current_token = self.lexer.get_next_token()

    def error(self):
        raise Exception('Invalid syntax')

    def eat(self, token_type):
        # compare the current token type with the passed token
        # type and if they match then "eat" the current token
        # and assign the next token to the self.current_token,
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.lexer.get_next_token()
        else:
            self.error()

    def factor(self):
        """Return an INTEGER token value.

        factor : INTEGER
        """
        token = self.current_token
        self.eat(INTEGER)
        return token.value

    def expr(self):
        """Arithmetic expression parser / interpreter.

        expr   : factor ((MUL | DIV) factor)*
        factor : INTEGER
        """
        result = self.factor()

        while self.current_token.type in (MUL, DIV):
            token = self.current_token
            if token.type == MUL:
                self.eat(MUL)
                result = result * self.factor()
            elif token.type == DIV:
                self.eat(DIV)
                result = result / self.factor()

        return result


def main():
    while True:
        try:
            # To run under Python3 replace 'raw_input' call
            # with 'input'
            text = raw_input('calc> ')
        except EOFError:
            break
        if not text:
            continue
        lexer = Lexer(text)
        interpreter = Interpreter(lexer)
        result = interpreter.expr()
        print(result)


if __name__ == '__main__':
    main()

將以上代碼保存到calc4.py文件中，或直接從GitHub下載，嘗試一下，看看它是否正確。

這是我在筆記本電腦上運行效果：

$ python calc4.py
calc> 7 * 4 / 2
14
calc> 7 * 4 / 2 * 3
42
calc> 10 * 4  * 2 * 3 / 8
30

這是今天的練習：

一、編寫描述包含任意數量的+，-，* 或/運算符的算術表達式的文法。使用此文法，應該可以推導出諸如"2 + 7 * 4"，"7 - 8 / 4"，"14 + 2 * 3 - 6 / 2"之類的表達式，依此類推。

二、使用上一問編寫的文法，實現一個解釋器，該解釋器能夠計算包含任意數量的+，-，* 或/運算符的算術表達式。解釋器應該可以處理"2 + 7 * 4"，"7 - 8 / 4"，"14 + 2 * 3 - 6 / 2"等表達式。

最後再來複習回憶一下，記住今天文章的文法，回答如下問題，並根據須要參考如下圖片：

一、什麼是無上下文文法？

二、本文中文法有多少條規則(rule)/產生式(production)？

三、什麼是終結符(terminal)？（識別圖中的全部終結符）

四、什麼是非終結符？（標識圖片中的全部非終結符）

五、什麼是規則的頭部(head)？（識別圖片中的全部頭部/左側(left-hand sides)）

六、什麼是規則的主體(body)？（標識圖中的全部主體/右側(right-hand sides)）

七、什麼是文法的開始符(start symbol)？

嘿，你一直閱讀到最後！這篇文章包含了至關多的理論知識，因此我爲你的完成感到自豪。

下次我會再寫一篇新文章，敬請期待，不要忘記練習，它們對你有好處。

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。