從零寫一個編譯器（五）：語法分析之自動機的缺陷和改進

時間 2019-11-09

標籤一個編譯器語法分析自動機缺陷改進简体版

原文原文鏈接

項目的完整代碼在 C2j-Compilerjava

前言

在上一篇，已經成功的構建了有限狀態自動機，可是這個自動機還存在兩個問題：node

沒法處理shift/reduce矛盾
狀態節點太多，致使自動機過大，效率較低

這一節就要解決這兩個問題git

shift/reduce矛盾

看上一節那個例子的一個節點github

e -> t .
t -> t . * f

這時候經過狀態節點0輸入t跳轉到這個節點，可是這時候狀態機沒法分清是根據推導式1作reduce仍是根據推導式2作shift操做，這種狀況就稱之爲shift / reduce矛盾。算法

SLR(1)語法

在以前的LL(1)語法分析過程當中，有一個FOLLOW set，也就是指的是，對某個非終結符，根據語法推導表達式構建出的全部能夠跟在該非終結符後面的終結符集合，咱們稱做該非終結符的FOLLOW set.數據結構

以前的博文目錄閉包

FOLLOW(s) = {EOI}
FOLLOW(e) = {EOI, },+}
FOLLOW(t) = {EOI, }, + , * } 
FOLLOW(f) = {EOI, }, +, * }

也就是說若是當前的輸入字符屬於e的FOLLOW SET，那麼就能夠根據第一個推導式作reduce操做app

若是構建的狀態機，出現reduce / shift矛盾的節點均可以根據上面的原則處理的話，那麼這種語法，咱們稱之爲SLR(1)語法。ide

LR(1)語法

可是若是當前的輸入字符，既屬於第一個推導式的FLLOW SET，又是第二個推導式 . 右邊的符號，這樣shift /reduce矛盾就難以解決了。學習

當咱們根據一個輸入符號來判斷是否能夠進行reduce操做時，只須要判斷在咱們作完了reduce操做後，當前的輸入符號是否可以合法的跟在reduce後的非終結符的後面，也就是隻要收集只要該符號可以被reduce到退回它的節點的全部路徑的能跟在後面的終結符

這種能合法的跟在某個非終結符後面的符號集合，咱們稱之爲look ahead set, 它是FOLLOW set的子集。

在給出LookAhead Set的算法前要先明確兩個個概念：

First Set

對一個給定的非終結符，經過一系列語法推導後，能出如今推導最左端的全部終結符的集合，統稱爲該非終結符的FIRST SET

nullable

若是一個非終結符，它能夠推導出空集，那麼這樣的非終結符咱們稱之爲nullable的非終結符

nullable在以前SyntaxProductionInit裏的初始化時已經賦值了

First Set的構建

在前面的陳述後，爲了可以解決shift/reduce矛盾，就須要一個lookAhead Set，固然在構建LookAhead Set前，就須要先有First Set

First Set構建算法

若是A是一個終結符，那麼FIRST(A)={A}
對於如下形式的語法推導:
s -> A a
s是非終結符，A是終結符，a 是零個或多個終結符或非終結符的組合，那麼A屬於FIRST(s).
對於推導表達式：
s -> b a
s和b是非終結符，並且b不是nullable的，那麼first(s) = first(b)
對於推導表達式:
s -> a1 a2 … an b
若是a1, a2 … an 是nullable 的非終結符，b是非終結符但不是nullable的，或者b是終結符，那麼
first(s) 是 first(a1)… first(an) 以及first(b)的集合。

FirstSetBuilder類

First Set構建都在FirstSetBuilder類裏實現

這些就是用代碼將上面的邏輯實現而已

這時候以前在SyntaxProductionInit初始化用到的symbolMap、symbolArray兩個數據結構終於派上用場了

public void buildFirstSets() {
    while (runFirstSetPass) {
        runFirstSetPass = false;

        Iterator<Symbols> it = symbolArray.iterator();
        while (it.hasNext()) {
            Symbols symbol = it.next();
            addSymbolFirstSet(symbol);
        }
    }

    ConsoleDebugColor.outlnPurple("First sets :");
    debugPrintAllFirstSet();
    ConsoleDebugColor.outlnPurple("First sets end");

}

private void addSymbolFirstSet(Symbols symbol) {
    if (Token.isTerminal(symbol.value)) {
        if (!symbol.firstSet.contains(symbol.value)) {
            symbol.firstSet.add(symbol.value);
        }

        return ;
    }

    ArrayList<int[]> productions = symbol.productions;
    for (int[] rightSize : productions) {
        if (rightSize.length == 0) {
            continue;
        }

        if (Token.isTerminal(rightSize[0]) && !symbol.firstSet.contains(rightSize[0])) {
            runFirstSetPass = true;
            symbol.firstSet.add(rightSize[0]);
        } else if (!Token.isTerminal(rightSize[0])) {
            int pos = 0;
            Symbols curSymbol;
            do {
                curSymbol = symbolMap.get(rightSize[pos]);
                if (!symbol.firstSet.containsAll(curSymbol.firstSet)) {
                    runFirstSetPass = true;

                    for (int j = 0; j < curSymbol.firstSet.size(); j++) {
                        if (!symbol.firstSet.contains(curSymbol.firstSet.get(j))) {
                            symbol.firstSet.add(curSymbol.firstSet.get(j));
                        }
                    }
                }
                pos++;
            } while (pos < rightSize.length && curSymbol.isNullable);
        }
    }
}

LookAhead Set的算法

[S -> a .r B, C]
r -> r1

r是一個非終結符，a, B是0個或多個終結符或非終結符的集合。

在自動機進入r -> r1所在的節點時，若是採起的是reduce操做，那麼自動機的節點將會退回[S -> a .r B, C]這個推導式所在的節點，因此要正確的進行reduce操做就要保證當前的輸入字符，必須屬於FIRST(B)

因此推導式2的look ahead集合就是FIRST(B),若是B是空，那麼2的look ahead集合就等於C, 若是B是nullable的，那麼推導式2的look ahead集合就是FIRST(B) ∪ C

computeFirstSetOfBetaAndc

計算LookAhead set在每個production的方法裏

public ArrayList<Integer> computeFirstSetOfBetaAndc() {
    ArrayList<Integer> set = new ArrayList<>();
    for (int i = dotPos + 1; i < right.size(); i++) {
        set.add(right.get(i));
    }

    ProductionManager manager = ProductionManager.getInstance();
    ArrayList<Integer> firstSet = new ArrayList<>();

    if (set.size() > 0) {
        for (int i = 0; i < set.size(); i++) {
            ArrayList<Integer> lookAhead = manager.getFirstSetBuilder().getFirstSet(set.get(i));

            for (int s : lookAhead) {
                if (!firstSet.contains(s)) {
                    firstSet.add(s);
                }
            }

            if (!manager.getFirstSetBuilder().isSymbolNullable(set.get(i))) {
                break;
            }

            if (i == lookAhead.size() - 1) {
                //beta is composed by nulleable terms
                firstSet.addAll(this.lookAhead);
            }
        }
    } else {
        firstSet.addAll(lookAhead);
    }

    return firstSet;
}

居然計算了Lookahead Set，那麼在計算閉包時，每一個節點裏的推導式都要加上LookAhead Set以便以後求語法分析表

private void makeClosure() {
    ConsoleDebugColor.outlnPurple("==== state begin make closure sets ====");

    Stack<Production> productionStack = new Stack<>();
    for (Production production : productions) {
        productionStack.push(production);
    }

    while (!productionStack.isEmpty()) {
        Production production = productionStack.pop();

        ConsoleDebugColor.outlnPurple("production on top of stack is : ");
        production.debugPrint();
        production.debugPrintBeta();

        if (Token.isTerminal(production.getDotSymbol())) {
            ConsoleDebugColor.outlnPurple("Symbol after dot is not non-terminal, ignore and process next item");
            continue;
        }

        int symbol = production.getDotSymbol();
        ArrayList<Production> closures = productionManager.getProduction(symbol);
        ArrayList<Integer> lookAhead = production.computeFirstSetOfBetaAndc();

        Iterator<Production> it = closures.iterator();
        while (it.hasNext()) {
            Production oldProduct = it.next();
            Production newProduct = oldProduct.cloneSelf();

            newProduct.addLookAheadSet(lookAhead);
            if (!closureSet.contains(newProduct)) { 
                closureSet.add(newProduct);
                productionStack.push(newProduct); 
                removeRedundantProduction(newProduct);
            } else {
                ConsoleDebugColor.outlnPurple("the production is already exist!");
            }
        }
    }

    debugPrintClosure();
    ConsoleDebugColor.outlnPurple("==== make closure sets end ====");
}

removeRedundantProduction是處理冗餘的產生式，好比

1. [t -> . t * f, {* EOI}]
2. [t -> .t  *  f {EOI}]

這樣就能夠認爲產生式1能夠覆蓋產生式2

private void removeRedundantProduction(Production product) {
    boolean removeHappended = true;

    while (removeHappended) {
        removeHappended = false;

        Iterator it = closureSet.iterator();
        while (it.hasNext()) {
            Production item = (Production) it.next();
            if (product.isCover(item)) {
                removeHappended = true;
                closureSet.remove(item);
                break;
            }
        }
    }
}

有限狀態自動機的壓縮

到如今咱們已經算出了LookAhead Set，已經能夠正確的計算語法分析表了，可是還有一個問題就是，如今的自動機節點過多，很是影響效率，因此下面的任務就是壓縮有限狀態自動機

在咱們以前構造的LR(1)有限自動機裏，若是根據C語言的推導式，應該會產生600多個狀態節點，可是是由於以前在構造狀態節點時，若是相同的推導式可是它的lookAhead Sets不同，就認爲這是兩個不同的產生式。

下面是對狀態節點的equals的重寫

@Override
public boolean equals(Object obj) {
    return checkProductionEqual(obj, false);
}

public boolean checkProductionEqual(Object obj, boolean isPartial) {
    ProductionsStateNode node = (ProductionsStateNode) obj;

    if (node.productions.size() != this.productions.size()) {
        return false;
    }

    int equalCount = 0;

    for (int i = 0; i < node.productions.size(); i++) {
        for (int j = 0; j < this.productions.size(); j++) {
            if (!isPartial) {
                if (node.productions.get(i).equals(this.productions.get(j))) {
                    equalCount++;
                    break;
                }
            } else {
                if (node.productions.get(i).productionEquals(this.productions.get(j))) {
                    equalCount++;
                    break;
                }
            }
        }
    }

    return equalCount == node.productions.size();
}

因此對這些推導式相同可是LookAhead Sets不一樣的節點，就能夠進行合併，以達到壓縮節點數量的目的

合併類似的節點最好的地方，天然就是在添加節點和節點之間的跳轉關係的時候了

public void addTransition(ProductionsStateNode from, ProductionsStateNode to, int on) {
    /* Compress the finite state machine nodes */
    if (isTransitionTableCompressed) {
        from = getAndMergeSimilarStates(from);
        to = getAndMergeSimilarStates(to);
    }

    HashMap<Integer, ProductionsStateNode> map = transitionMap.get(from);
    if (map == null) {
        map = new HashMap<>();
    }

    map.put(on, to);
    transitionMap.put(from, map);
}

getAndMergeSimilarStates的邏輯也很簡單，遍歷當前的全部節點，找出類似，把編號大的合併到小的節點上

private ProductionsStateNode getAndMergeSimilarStates(ProductionsStateNode node) {
    Iterator<ProductionsStateNode> it = stateList.iterator();
    ProductionsStateNode currentNode = null, returnNode = node;

    while (it.hasNext()) {
        currentNode = it.next();

        if (!currentNode.equals(node) && currentNode.checkProductionEqual(node, true)) {
            if (currentNode.stateNum < node.stateNum) {
                currentNode.stateMerge(node);
                returnNode = currentNode;
            } else {
                node.stateMerge(currentNode);
                returnNode = node;
            }
            break;
        }
    }

    if (!compressedStateList.contains(returnNode)) {
        compressedStateList.add(returnNode);
    }

    return returnNode;

}

public void stateMerge(ProductionsStateNode node) {
    if (!this.productions.contains(node.productions)) {
        for (int i = 0; i < node.productions.size(); i++) {
            if (!this.productions.contains(node.productions.get(i)) && !mergedProduction.contains(node.productions.get(i))
            ) {
                mergedProduction.add(node.productions.get(i));
            }
        }
    }
}