Implementing a Lexical Analyzer in Java

Date: 2019-12-20



Last updated: 2019/12/19

Demo screenshot

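Project Introduction

A lexical analyzer is an essential component of a compiler. It helps produce an intermediate representation that is used when translating a programming language into machine language. This repository therefore presents a new lexical analyzer that recognizes tokens and reports errors accurately and efficiently. Its purpose is to help readers deepen their understanding of lexical analysis.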

Using the Software

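My lexical analyzer is easy to use even for users with no computer-related background. First, the user clicks "Choose file" to select the target file to read.

After selecting a file, the user clicks the Start button and the program splits the document into a sequence of tokens; the result is shown in Figure 2. If the document contains a lexical error, the program reports it as in Figure 2(a). After reviewing the output, a user who does not want to analyze another document can click the Finish button to close the current window, as shown in Figure 2(b).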

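Software Design and Architecture

This section has two subsections: my design philosophy and the corresponding architecture. The theoretical basis of the design comes from Lecture 7 of SCC.312.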

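Design Philosophy

Because a finite state machine (FMS) accepts only symbols that follow certain grammar rules, I wrote my own grammar rules, which the FMS implements to recognize symbols and catch unexpected identifiers. Table 1 lists the detailed grammar rules. In practice, I split these rules into smaller ones so that the FMS can recognize each of them.

As shown in Figure 3, my FMS has nine main branches, each representing one type of input stream. My lexical analyzer strictly follows Java's grammar definition, and the process diagram gives users an intuitive view of Java's general grammar rules.

So far the discussion has stayed at the level of theory. However, this FMS is a non-deterministic finite state machine (NFMS), which makes it hard to implement directly. The next part therefore introduces my architecture, which handles this problem by splitting it into several simple Java modules.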

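Software Architecture

As we know, the basic components of a lexical analyzer are a recognizer and a translator. My architecture is based on this simple structure. The recognizer is responsible for capturing all legal symbols, such as keywords and identifiers. During parsing, it reports an error if any character cannot be accepted under my grammar rules. The translator stores the type and value of each symbol and assigns it a unique type code as needed. This structure is shown in Figure 4.

In addition, because the NFMS covers many functions, I decided to implement them as separate single-purpose functions. I also wrote separate modules for the recognizer, the finite state machine and the translator to keep their responsibilities distinct. I created the UML diagram in Figure 5 with BlueJ; it makes the relationships between the roles easier to understand.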

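To make the lexical analysis process easier to follow, I drew the flowchart in Figure 6. The order of the checks (comment → string → keyword or identifier → number → legal symbol) is the key to successful parsing.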
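
The check order can be illustrated with a small sketch. It is not the project's code: the class name CheckOrderDemo, the shortened keyword list and the returned labels are assumptions made for the example, and it classifies one already-separated, non-empty lexeme instead of scanning a file.

```java
import java.util.Set;

class CheckOrderDemo {
    // Hypothetical, shortened keyword list used only for this illustration.
    private static final Set<String> KEYWORDS = Set.of("int", "if", "else", "while", "return");

    /** Classifies one non-empty lexeme by trying the checks in the order given above. */
    static String classify(String lexeme) {
        // 1. Comments first: otherwise "/" would be taken for a division symbol.
        if (lexeme.startsWith("//") || lexeme.startsWith("/*")) return "COMMENT";
        // 2. Strings next: their content must not be inspected by later checks.
        if (lexeme.startsWith("\"")) return "STRING";
        // 3. Keyword or identifier.
        char first = lexeme.charAt(0);
        if (Character.isLetter(first) || first == '_' || first == '$') {
            return KEYWORDS.contains(lexeme) ? "KEYWORD" : "IDENTIFIER";
        }
        // 4. Number.
        if (Character.isDigit(first)) return "NUMBER";
        // 5. Anything left is treated as a symbol here.
        return "SYMBOL";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"// note", "\"text\"", "while", "_count", "42", "/"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Moving the symbol check to the top would classify "// note" as a symbol, which is why the comment check runs first.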

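Beyond ordinary lexical recognition, I added some features that belong to a syntax analyzer. For example, my lexical analyzer tracks every kind of bracket and comment delimiter in order to check general programming-language syntax. If the user makes a simple syntax mistake of this kind, the analyzer can detect it quickly and issue a relevant warning.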
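
One common way to track brackets is a stack. The sketch below is an assumption about how such a check can be written, not the project's own tracker; the class name BracketTrackerDemo and the return convention are hypothetical. It deliberately ignores brackets that appear inside strings or comments.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketTrackerDemo {
    /**
     * Returns -1 when all brackets are balanced, the index of the first
     * mismatched closing bracket, or text.length() when an opening
     * bracket is never closed.
     */
    static int firstBracketError(String text) {
        Deque<Character> open = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(' || c == '[' || c == '{') {
                open.push(c);
            } else if (c == ')' || c == ']' || c == '}') {
                if (open.isEmpty()) return i;          // closing bracket with no opener
                char opener = open.pop();
                boolean matches = (opener == '(' && c == ')')
                        || (opener == '[' && c == ']')
                        || (opener == '{' && c == '}');
                if (!matches) return i;                // wrong kind of closing bracket
            }
        }
        return open.isEmpty() ? -1 : text.length();
    }

    public static void main(String[] args) {
        System.out.println(firstBracketError("{ a[0] = (1 + 2); }"));
        System.out.println(firstBracketError("(]"));
    }
}
```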

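To mimic how a computer processes a basic language, I chose to recognize symbols character by character. The main technique in my lexical analyzer is lookahead. Normally the program looks ahead only one step, but some recognition routines need three steps or more.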
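
As a minimal illustration of one-step lookahead, the sketch below decides what a '/' means by peeking at the next character. The class name LookaheadDemo and the returned labels are hypothetical and do not come from the project.

```java
class LookaheadDemo {
    /** Classifies the character at col, peeking one character ahead when it is '/'. */
    static String classifySlash(String line, int col) {
        if (line.charAt(col) != '/') return "OTHER";
        int next = col + 1;
        // Only peek if the next position is still inside the line.
        if (next < line.length()) {
            char ahead = line.charAt(next);
            if (ahead == '/') return "LINE_COMMENT";
            if (ahead == '*') return "BLOCK_COMMENT_START";
            if (ahead == '=') return "DIVIDE_ASSIGN";
        }
        return "DIVIDE";
    }

    public static void main(String[] args) {
        System.out.println(classifySlash("a / b", 2));
        System.out.println(classifySlash("// note", 0));
    }
}
```

The bounds check before the peek matters: a '/' as the last character of a line has nothing to look ahead to.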

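Software Functions

The software's functionality is divided into separate functions. Here I present the core algorithms.

Recognizer

In my lexical analyzer, the recognizer's main job is to read the file content from the buffer and pass each line to the FMS character by character, as shown below.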

/**
     * This method is used to read the line character by character.
     * @param line The content of the file.
     * @param row The current row.
     * @return The parsing status.
     */
    private int checkEverySingleWord(String line, int row){
        for (col = 0; col < line.length(); col ++){
            if ((col = machine.changeState(line, col, row, sb)) == ERROR)
                return ERROR;
            sb.setLength(0);
        }
        return SUCCESS;
    }

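The recognizer also acts as a simple syntax checker. As mentioned earlier, it tracks brackets and comment delimiters to find unmatched ones. The code below shows how it works and which error it reports in each case. "skip" means that an error has already occurred, so no further analysis is needed.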

/**
     * This method is used to read the buffer line by line.
     */
    private void checkEverySingleLine() {

        boolean skip = false;
        try {
            while (((line = br.readLine()) != null)) {
                if (checkEverySingleWord(line, row) == ERROR) {
                    skip = true;
                    break;
                }
                row ++;
            }
            br.close();
            if (!skip) {
                /* If there is anything redundant, report an error. */
                if (ct.getCommentState() > 0){
                    ErrorReport.unclosedComtError(ct.getUnclosedRowPos(), ct.getUnclosedColPos(), tArea);
                    ErrorReport.parsingError(--row, col, tArea);
                } else if (ct.getCommentState() < 0){
                    ErrorReport.illegalStartError(ct.getUnclosedRowPos(), ct.getUnclosedColPos(), tArea);
                } else if (strTracker.hasRedundantQuote()){
                    ErrorReport.unclosedStrError(strTracker.getUnclosedRowPos(), strTracker.getUnclosedColPos(), tArea);
                } else if (!st.hasRedundantBrackets()){
                    tArea.append("Successfully parsing!\n");
                }
            }
            // Set back to default state.
            finishBtn.setEnabled(true);
            openBtn.setEnabled(true);
        } catch (IOException e) {
            ErrorReport.ioError(tArea);
        }
    }

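Finite State Machine

The finite state machine does most of the recognition work in my lexical analyzer. The program first checks the comment state. If the matching closing comment delimiter has not yet been seen, the current line is considered to be inside a comment; in other words, no further checks are needed. This step is fairly involved, however, because several extreme error cases have to be considered. The code is shown below.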

/**
     * This method is used to record the number of open or close comment symbol and then change to another state.
     * @param line The content of the file.
     * @param col The current column.
     * @param sb The object of StringBuilder class.
     * @return The next state of FMS.
     */
    private int isComment(String line, int col, StringBuilder sb) {

        char c = line.charAt(col);
        // Look ahead one step.
        int lookForward = col + 1;
        boolean skip = false;

        // If the string is "*/".
        if (c == '*' && lookForward < line.length() && line.charAt(lookForward) == '/') {
            skip = true;
            ct.setCommentState(-1);
            if (ct.getCommentState() == -1) ct.updateUnclosedPosition(row, col);
        } else if (c == '/' && lookForward < line.length() && (line.charAt(lookForward) == '*')) {
            // The current string is "/*".
            skip = true;
            ct.setCommentState(1);
            if (ct.getCommentState() == 1) ct.updateUnclosedPosition(row, col);
        }
        // Skip the col we have checked.
        col = (skip)? lookForward : col;
        if (ct.getCommentState() == 0) {
            // If the string is "//", just ignore rest of the line.
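            // The bounds check avoids an exception when '/' is the last character of the line.
            if (c == '/' && lookForward < line.length() && line.charAt(lookForward) == '/') return line.length() - 1;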
            // Go to the next state of FMS.
            return isString(c, line, col, sb);
        } else  {
            /* Complex situation, only return the column of the last index of the target symbol. */
            int finalPos = 0;
            int endPos = line.indexOf("*/", col) + 1;
            int startPos = line.indexOf("/**", col) + 2;
            if (startPos != 1) {
                finalPos = Math.max(startPos, endPos);
                if(startPos > endPos) ct.setCommentState(1);
                return finalPos;
            } else if ((startPos = line.indexOf("/*", col) + 1) != 0){
                if(startPos > endPos) ct.setCommentState(1);
                finalPos = Math.max(startPos, endPos);
                return finalPos;
            } else if (endPos != 0){
                ct.setCommentState(-1);
                return endPos;
            } else {
                // Finish, go to next line.
                return line.length() - 1;
            }
        }
    }
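
The idea of carrying the comment state from one line to the next can be shown in a much simpler form. The sketch below is not the method above: the class name CommentStateDemo is hypothetical, it uses a boolean instead of the counter kept by the project's comment tracker, and it assumes that comment delimiters never occur inside string literals.

```java
class CommentStateDemo {
    /**
     * Scans one line and returns whether the scanner is still inside a
     * block comment once the line ends.
     * @param line   the current line
     * @param inside whether the previous line ended inside a block comment
     */
    static boolean insideAfter(String line, boolean inside) {
        int i = 0;
        while (i < line.length()) {
            if (inside) {
                int end = line.indexOf("*/", i);
                if (end < 0) return true;              // comment continues on the next line
                inside = false;
                i = end + 2;
            } else {
                int start = line.indexOf("/*", i);
                int lineComment = line.indexOf("//", i);
                if (start < 0) return false;           // no block comment opens on this line
                // A line comment that starts earlier hides the rest of the line.
                if (lineComment >= 0 && lineComment < start) return false;
                inside = true;
                i = start + 2;
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        boolean inside = insideAfter("int a; /* start", false);
        inside = insideAfter("still */ int b;", inside);
        System.out.println(inside);
    }
}
```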

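To decide whether the current string is a keyword or an identifier, the analyzer looks ahead N steps. Because Java allows identifiers to start with _ or $, my lexical analyzer follows the same rule. The code is shown below.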

/**
     * This method is used to check whether the current symbol is keyword or identifier.
     * @param c The current character.
     * @param line The content of the file.
     * @param col The current column.
     * @param sb The object of StringBuilder class.
     * @return The next state of FMS.
     */
    private int isKeywordOrIdentifier(char c, String line, int col, StringBuilder sb){
        /* Java allows the identifier with prefix of "_" or "$" */
        if(isLetter(c) || c == '_' || c == '$'){
            sb.append(c);
            col ++;

            while (col <line.length() && (c = line.charAt(col)) != ' '){
                if(st.isSpecialSymbol(c)) {
                    col--;
                    break;
                }
                sb.append(c);
                col ++;
            }
            String word = sb.toString();
            if(isPreservedWord(word)) translator.addToken(word.toUpperCase() + "_TOKEN", word, tArea);
            else if (!isLargerThan32Byte(word, col)) {
                translator.addToken("IDENTIFIER_TOKEN", word, tArea);
            } else {
                ErrorReport.illegalDefinedSizeError(row, col, tArea);
                return REPORT_ERROR;
            }
            return col;
        }
        return isUnsignedNumber(c, line, col, sb);
    }
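
For comparison, the same scan can be written with the JDK's own identifier predicates. This is an alternative sketch and not the project's method: the class name IdentifierScanDemo and the shortened keyword set are assumptions, and Character.isJavaIdentifierStart and Character.isJavaIdentifierPart accept more characters than the check above.

```java
import java.util.Set;

class IdentifierScanDemo {
    // Hypothetical, shortened keyword list used only for this illustration.
    private static final Set<String> KEYWORDS = Set.of("class", "int", "if", "else", "while", "return");

    /** Returns the index just past the identifier that starts at col, or col if none starts there. */
    static int scanEnd(String line, int col) {
        if (col >= line.length() || !Character.isJavaIdentifierStart(line.charAt(col))) return col;
        int end = col + 1;
        // Keep looking ahead while the next character can still belong to the identifier.
        while (end < line.length() && Character.isJavaIdentifierPart(line.charAt(end))) end++;
        return end;
    }

    /** Mirrors the naming used above: keywords become WORD_TOKEN, everything else IDENTIFIER_TOKEN. */
    static String tokenType(String word) {
        return KEYWORDS.contains(word) ? word.toUpperCase() + "_TOKEN" : "IDENTIFIER_TOKEN";
    }

    public static void main(String[] args) {
        String line = "int $x=1;";
        String word = line.substring(4, scanEnd(line, 4));
        System.out.println(word + " -> " + tokenType(word));
    }
}
```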

We should also note that a user-defined identifier may be at most 32 bytes long. The corresponding function is shown below.

/**
     * This method is used to check whether the length of user defined identifier's name exceeds the 32 Bytes.
     * @param identifier The name of user defined identifier.
     * @param col The current column.
     * @return A boolean result.
     */
    private boolean isLargerThan32Byte(String identifier, int col){
        try {
            if (identifier.getBytes("utf-8").length > 32) return true;
        } catch (UnsupportedEncodingException e) {
            ErrorReport.unsupportedEncodingError(row, col, tArea);
            return true;
        }
        return false;
    }
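
Because the limit is counted in bytes and not in characters, identifiers containing non-ASCII characters reach it sooner. The sketch below illustrates this; the class name IdentifierLengthDemo and the constant MAX_BYTES are hypothetical. It uses the String.getBytes(Charset) overload, which does not throw a checked exception.

```java
import java.nio.charset.StandardCharsets;

class IdentifierLengthDemo {
    // The 32-byte limit stated in the text above.
    static final int MAX_BYTES = 32;

    /** Returns true when the UTF-8 encoding of the identifier is longer than MAX_BYTES. */
    static boolean exceedsLimit(String identifier) {
        return identifier.getBytes(StandardCharsets.UTF_8).length > MAX_BYTES;
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(32);        // 32 characters, 32 bytes
        String cjk = "\u4e2d".repeat(11);     // 11 characters, 3 bytes each in UTF-8
        System.out.println(exceedsLimit(ascii));
        System.out.println(exceedsLimit(cjk));
    }
}
```

An identifier of 11 CJK characters therefore exceeds the limit although it is far shorter than 32 characters.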

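A more complex N-step lookahead is used to check whether the input is a string or a character literal. Extracting a string literal is easy, but recognizing a char literal is much harder. Readers who overlook the details may think the pattern only covers a single character such as 'a'. That is wrong: the char type is the most complex one because it can contain escape sequences (for example '\u0024', '\000' and '\b'). This requires my lexical analyzer to look ahead up to three steps. The complete recognition process is shown below.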

/**
     * This method is used to check whether the current symbol is start of string or character.
     * @param c The current character.
     * @param line The content of the file.
     * @param col The current column.
     * @param sb The object of StringBuilder class.
     * @return The next state of FMS.
     */
    private int isString(char c, String line, int col, StringBuilder sb){
        int lookForwardOneStep = col + 1;
        int lookForwardTwoSteps = col + 2;
        if(c == '\"') {
            /* This line maybe contains a string. Mark it. */
            if (strTracker.hasRedundantQuote()) {
                strTracker.addStringToToken(translator, tArea);
                strTracker.setStrState();
            } else {
                strTracker.clearBuilder();
                strTracker.setStrState();
                strTracker.updateUnclosedPosition(row, col);
            }
        } else if (c == '\'' && lookForwardOneStep < line.length() && line.charAt(lookForwardOneStep) == '\\'){
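            /* This may be an escape character. */
            if (col + 3 >= line.length()) {
                /* The line ends before the literal can be closed. */
                ErrorReport.unclosedCharError(row, col, tArea);
                return REPORT_ERROR;
            }
            char lookFwdChar = line.charAt(lookForwardTwoSteps);
            char lookFwdNextChar = line.charAt(col + 3);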
            char[] list = {'\"','\'','\\','r','n','f','t','b'};
            for (char item : list) {
                if (item == lookFwdChar && lookFwdNextChar == '\'') {
                    translator.addToken("CHAR_TOKEN", "\\" + item, tArea);
                    return col + 3;
                }
            }

            if (lookFwdChar == 'u') col = col + 3; // Need to look ahead 3 steps.
            else col = col + 2;

            boolean isUnicode = (lookFwdChar == 'u');
            builder.setLength(0);
            while (col < line.length() && (c = line.charAt(col)) != '\''){
                /* Unicode escapes take hexadecimal digits, octal escapes only 0-7. */
                if (Character.digit(c, isUnicode ? 16 : 8) == -1) {
                    /* This is definitely not a character. */
                    ErrorReport.unclosedCharError(row, col, tArea);
                    builder.setLength(0);
                    return REPORT_ERROR;
                }
                builder.append(c);
                col++;
            }

            /* Check whether the digits form a valid Unicode or octal escape. */
            String digits = builder.toString();
            boolean closed = col < line.length();
            if (closed && isUnicode && digits.length() == 4) {
                translator.addToken("CHAR_TOKEN", "\\u" + digits, tArea);
                return col;
            } else if (closed && !isUnicode && !digits.isEmpty() && digits.length() <= 3
                    && Integer.parseInt(digits, 8) <= 0377) {
                translator.addToken("CHAR_TOKEN", "\\" + digits, tArea);
                return col;
            } else {
                ErrorReport.unclosedCharError(row, col, tArea);
                return REPORT_ERROR;
            }

        } else if (c == '\'' && lookForwardTwoSteps < line.length() && line.charAt(lookForwardTwoSteps) == '\'') {
            /* This is definitely a character. */
            translator.addToken("CHAR_TOKEN", Character.toString(line.charAt(lookForwardOneStep)), tArea);
            return col + 2;
        } else if (c == '\'') {
            /* This is definitely character syntax error. */
            ErrorReport.unclosedCharError(row, col, tArea);
            return REPORT_ERROR;
        } else if (strTracker.hasRedundantQuote()) {
            /* Must belong to a string. */
            strTracker.appendChar(c);
            return col;
        }
        // Go to the next state of FMS.
        return isKeywordOrIdentifier(c, line, col, sb);
    }
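
The set of accepted char literals can also be written down as a regular expression, which is useful for cross-checking a hand-written scanner. The sketch below is an alternative technique, not the lookahead method above; the class name CharLiteralDemo is hypothetical, and the pattern covers only the forms named in the text (a single character, the simple escapes, octal escapes up to 377 and four-digit Unicode escapes).

```java
import java.util.regex.Pattern;

class CharLiteralDemo {
    // Alternatives, in order: one plain character; a simple escape such as a
    // backslash followed by n; an octal escape of one to three digits (at most 377);
    // a Unicode escape with exactly four hexadecimal digits.
    private static final Pattern CHAR_LITERAL = Pattern.compile(
            "'(?:[^'\\\\\\r\\n]"
            + "|\\\\[btnfr\"'\\\\]"
            + "|\\\\[0-3]?[0-7]{1,2}"
            + "|\\\\u[0-9a-fA-F]{4})'");

    /** Returns true when the whole text is one well-formed char literal. */
    static boolean isCharLiteral(String text) {
        return CHAR_LITERAL.matcher(text).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"'a'", "'\\b'", "'\\000'", "'ab'", "'\\400'"}) {
            System.out.println(s + " -> " + isCharLiteral(s));
        }
    }
}
```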

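Translator

The translator collects tokens from the FMS and generates a unique ID for each one. I wrote a Token class dedicated to representing tokens, and then created a class named Index that extends Token so that each token gets a unique ID. Finally, all tokens are appended in order to an ArrayList, which tracks the tokens and their positions for later syntax analysis. The relationships between these data structures are shown in Figure 14.

The basic operations of the Translator class are shown below.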

public class Translator {
    // Variables declaration
    private int id;
    private ArrayList<Index> orders = new ArrayList<Index>();

    /* Initialization. */
    public Translator () {
        id = 0;
    }

    /**
     * This method is used to add a token into its ArrayList.
     * @param type The type of the token.
     * @param value The value of the token.
     * @param tArea The object of JTextArea class.
     */
    public void addToken(String type, String value, JTextArea tArea){
        /* If the current token exists, do not create a new object. */

        if (!isExist(value, tArea)) {
            Index index = new Index(type, value, ++id);
            tArea.append("< " + type + ", " + value + ", " + id + " >" + "\n");
            orders.add(index);
        }

    }

    /**
     * This method is used to check whether the current token exists or not.
     * @param value The value of the token.
     * @param tArea The object of JTextArea class.
     * @return A boolean checking result.
     */
    private boolean isExist(String value, JTextArea tArea){
        for (Index index : orders) {
            if (index.equals(value)) {
                tArea.append(index.getInfo());
                orders.add(index);
                return true;
            }
        }
        return false;
    }
}
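
The same bookkeeping (one ID per distinct value, plus an ordered list of every occurrence) can be sketched with a map lookup in place of the linear search. This is an illustrative variant, not the project's Translator: the class name TokenTableDemo and its methods are hypothetical, plain strings stand in for the Index objects, and, like isExist above, it keys tokens by value only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenTableDemo {
    private final Map<String, Integer> idsByValue = new HashMap<>();
    private final List<String> occurrences = new ArrayList<>();
    private int nextId = 0;

    /** Records one occurrence of a token and returns its ID, reusing the ID of an earlier equal value. */
    int addToken(String type, String value) {
        Integer id = idsByValue.get(value);
        if (id == null) {
            id = ++nextId;                 // first time this value is seen
            idsByValue.put(value, id);
        }
        occurrences.add("< " + type + ", " + value + ", " + id + " >");
        return id;
    }

    /** All occurrences in the order they were added. */
    List<String> occurrences() {
        return occurrences;
    }

    public static void main(String[] args) {
        TokenTableDemo table = new TokenTableDemo();
        table.addToken("INT_TOKEN", "int");
        table.addToken("IDENTIFIER_TOKEN", "x");
        table.addToken("IDENTIFIER_TOKEN", "x");
        table.occurrences().forEach(System.out::println);
    }
}
```

The map makes each lookup constant time on average, whereas the loop in isExist grows with the number of tokens already stored.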