TypeScript Source Code Explained in Detail (2): Lexical Analysis 1 - Character Handling

Date: 2020-01-19

The code studied in this article is located in tsc/src/compiler/scanner.ts.

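Characters

Any source code is made up of many characters. These can be letters, digits, spaces, symbols, Chinese characters, and so on.

Every character has a code value. For example, the character "a" has the code value 97, and the character "林" has the code value 26519.

Which code value a character corresponds to is determined by an encoding table. The values shown above come from Unicode, the globally unified encoding table. Unless stated otherwise, all code values in this article are Unicode values.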

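In general, code values are ordered: "a" is 97, "b" is 98, and "c" is 99. Chinese characters are ordered mainly by radical and then by stroke count. When strings are sorted, the default comparison is likewise based on the code value of each character.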

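To test whether a character is a lowercase English letter, it is enough to check whether its code value lies between the code values of "a" and "z". Uppercase letters are tested the same way against "A" and "Z".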

In JavaScript, "a".charCodeAt(0) returns the code value of the character "a", and String.fromCharCode(97) returns the character for a given code value.
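As a small illustration of the points above, the following sketch round-trips a character through its code value and applies the range test to ASCII letters. The helper names isLowerAsciiLetter and isAsciiLetter are made up for this example and are not part of the TypeScript compiler.

```typescript
// Hypothetical helpers for illustration; not part of the TypeScript compiler.

// Code values of the range boundaries, obtained the same way as in the text.
const lowerA = "a".charCodeAt(0); // 97
const lowerZ = "z".charCodeAt(0); // 122
const upperA = "A".charCodeAt(0); // 65
const upperZ = "Z".charCodeAt(0); // 90

/** True if `ch` is the code value of a lowercase ASCII letter. */
function isLowerAsciiLetter(ch: number): boolean {
    return ch >= lowerA && ch <= lowerZ;
}

/** True if `ch` is the code value of any ASCII letter, upper or lower case. */
function isAsciiLetter(ch: number): boolean {
    return isLowerAsciiLetter(ch) || (ch >= upperA && ch <= upperZ);
}

// Round trip: character -> code value -> character.
const codeOfLin = "林".charCodeAt(0);             // 26519
const backToLin = String.fromCharCode(codeOfLin); // "林"
```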

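The CharacterCodes enum

If the code contains a bare 99, it may not be obvious what the number means, but CharacterCodes.c is understood immediately. Giving each code value a name through an enum makes the code easier to read, and nobody has to memorize the actual code value of each character. The CharacterCodes enum is located in tsc/src/compiler/types.ts; its source is as follows: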

/* @internal */
export const enum CharacterCodes {
    _0 = 0x30,
    _1 = 0x31,
    // ... (omitted)
    _9 = 0x39,

    a = 0x61,
    b = 0x62,
    // ... (omitted)
    z = 0x7A,

    A = 0x41,
    // ... (omitted)
    Z = 0x5a,

    ampersand = 0x26,             // &
    asterisk = 0x2A,              // *
    // ... (omitted)
}

Character tests

To test whether a character is a digit, just check whether its code value lies between the code values of "0" and "9":

function isDigit(ch: number): boolean {  // the parameter ch is a code value
    return ch >= CharacterCodes._0 && ch <= CharacterCodes._9;
}

Other characters can be tested in the same way, for example line breaks:

export function isLineBreak(ch: number): boolean {
    // ES5 7.3:
    // The ECMAScript line terminator characters are listed in Table 3.
    //     Table 3: Line Terminator Characters
    //     Code Unit Value     Name                    Formal Name
    //     \u000A              Line Feed               <LF>
    //     \u000D              Carriage Return         <CR>
    //     \u2028              Line separator          <LS>
    //     \u2029              Paragraph separator     <PS>
    // Only the characters in Table 3 are treated as line terminators. Other new line or line
    // breaking characters are treated as white space but not as line terminators.

    return ch === CharacterCodes.lineFeed ||
        ch === CharacterCodes.carriageReturn ||
        ch === CharacterCodes.lineSeparator ||
        ch === CharacterCodes.paragraphSeparator;
}

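According to the ES specification there are four line terminators. Although we normally use only the first two, the last two are also needed for some languages.

Testing for whitespace: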

export function isWhiteSpaceLike(ch: number): boolean {
    return isWhiteSpaceSingleLine(ch) || isLineBreak(ch);
}

/** Does not include line breaks. For that, see isWhiteSpaceLike. */
export function isWhiteSpaceSingleLine(ch: number): boolean {
    // Note: nextLine is in the Zs space, and should be considered to be a whitespace.
    // It is explicitly not a line-break as it isn't in the exact set specified by EcmaScript.
    return ch === CharacterCodes.space ||
        ch === CharacterCodes.tab ||
        ch === CharacterCodes.verticalTab ||
        ch === CharacterCodes.formFeed ||
        ch === CharacterCodes.nonBreakingSpace ||
        ch === CharacterCodes.nextLine ||
        ch === CharacterCodes.ogham ||
        ch >= CharacterCodes.enQuad && ch <= CharacterCodes.zeroWidthSpace ||
        ch === CharacterCodes.narrowNoBreakSpace ||
        ch === CharacterCodes.mathematicalSpace ||
        ch === CharacterCodes.ideographicSpace ||
        ch === CharacterCodes.byteOrderMark;
}

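Some places need to treat line breaks as whitespace and others do not, so TypeScript splits the test into two functions: one that includes line breaks and one that does not.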
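To see why both variants are useful, here is a minimal sketch of a skipper that can either stop at a line break or run through it. The predicates below cover only a handful of the characters listed above, and skipWhile, isLineBreakSketch, isWhiteSpaceSingleLineSketch, and isWhiteSpaceLikeSketch are hypothetical names, not compiler APIs.

```typescript
// Simplified stand-ins for the compiler's predicates (only a subset of the characters).
function isLineBreakSketch(ch: number): boolean {
    // line feed, carriage return, line separator, paragraph separator
    return ch === 0x0A || ch === 0x0D || ch === 0x2028 || ch === 0x2029;
}

function isWhiteSpaceSingleLineSketch(ch: number): boolean {
    // space, tab, no-break space, next line (U+0085)
    return ch === 0x20 || ch === 0x09 || ch === 0xA0 || ch === 0x85;
}

function isWhiteSpaceLikeSketch(ch: number): boolean {
    return isWhiteSpaceSingleLineSketch(ch) || isLineBreakSketch(ch);
}

/** Advances from `pos` while `test` accepts the code value at the current index. */
function skipWhile(text: string, pos: number, test: (ch: number) => boolean): number {
    while (pos < text.length && test(text.charCodeAt(pos))) {
        pos++;
    }
    return pos;
}

const sample = "  \t\n  x";
// Stops at the "\n" (index 3): useful when the end of the line matters.
const sameLineEnd = skipWhile(sample, 0, isWhiteSpaceSingleLineSketch);
// Runs through the line break and stops at "x" (index 6).
const anyWhitespaceEnd = skipWhile(sample, 0, isWhiteSpaceLikeSketch);
```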

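Testing identifiers

An identifier is what is commonly called a variable name. As we all know, variable names in JS cannot be chosen arbitrarily; there are rules, for example a name cannot start with a digit.

The ES specification states explicitly which characters may appear in an identifier, and which of those may appear in an identifier but not at its start. TypeScript implements isUnicodeIdentifierStart and isUnicodeIdentifierPart to test the two cases respectively.

There is no simple rule for which characters may appear in an identifier. They are listed individually by the ES specification and the Unicode data it refers to, and the list is very long. The simplest implementation is to record by hand whether each character is allowed in an identifier and then look it up in a table.

However, there are a great many characters, and recording each one separately would take a lot of space, so TypeScript uses a small algorithm to reduce memory use. It relies on one observation: the characters allowed in identifiers generally form contiguous ranges (for example "a" to "z").

Recording only the start and end of each range takes less memory than recording every character in the range.

All start and end positions are stored in a single array. Counting from 1, the odd positions (indices 0, 2, 4, ... in code; the blue segments in the original figure) mark the start of each range, and the even positions (indices 1, 3, 5, ...; the green segments) mark the end of each range.

When checking whether a character is allowed in an identifier, a binary search quickly determines whether it falls inside one of the recorded ranges.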

const unicodeESNextIdentifierStart = [65, 90, 97, 122, 170, /* ...(omitted) */ 194560, 195101];
const unicodeESNextIdentifierPart = [48, 57, 65, /* ...(omitted) */ 917999];

function lookupInUnicodeMap(code: number, map: readonly number[]): boolean {
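    // Most characters in source code are ASCII; the fast path for those lives in the callers
    // (isIdentifierStart / isIdentifierPart). This check only rejects codes below the first range.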
    // Bail out quickly if it couldn't possibly be in the map.
    if (code < map[0]) {
        return false;
    }

    // What follows is a standard binary search; it is worth reviewing if it is unfamiliar.
    // Perform binary search in one of the Unicode range maps
    let lo = 0;
    let hi: number = map.length;
    let mid: number;

    while (lo + 1 < hi) {
        mid = lo + (hi - lo) / 2;
        // mid has to be even to catch a range's beginning
        mid -= mid % 2;
        if (map[mid] <= code && code <= map[mid + 1]) {
            return true;
        }
        if (code < map[mid]) {
            hi = mid;
        } else {
            lo = mid + 2;
        }
    }

    return false;
}
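The range encoding and the search can be tried out on a toy table. The following is a sketch under assumptions: toyRanges is an invented four-range table (not one of the compiler's Unicode tables), and inRanges is a re-implementation written for this example that uses the same even-index trick.

```typescript
// Invented table for illustration: [start0, end0, start1, end1, ...], sorted, inclusive.
// It encodes 0-9, A-Z, a-z and the CJK block U+4E00..U+9FFF in 8 numbers.
const toyRanges: readonly number[] = [48, 57, 65, 90, 97, 122, 0x4E00, 0x9FFF];

/** Binary search over (start, end) pairs; `lo` and `hi` always stay even. */
function inRanges(code: number, ranges: readonly number[]): boolean {
    if (ranges.length === 0 || code < ranges[0]) {
        return false;
    }
    let lo = 0;
    let hi = ranges.length;
    while (lo < hi) {
        // Pick the middle pair and round down to its start (an even index).
        let mid = lo + Math.floor((hi - lo) / 2);
        mid -= mid % 2;
        if (ranges[mid] <= code && code <= ranges[mid + 1]) {
            return true;
        }
        if (code < ranges[mid]) {
            hi = mid;       // continue in the pairs before `mid`
        } else {
            lo = mid + 2;   // continue in the pairs after `mid`
        }
    }
    return false;
}
```

Here eight numbers stand in for more than twenty thousand individually listed characters, and a lookup touches at most a few pairs.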

With that, the two functions isUnicodeIdentifierStart and isUnicodeIdentifierPart are easy to follow:

/* @internal */ export function isUnicodeIdentifierStart(code: number, languageVersion: ScriptTarget | undefined) {
    return languageVersion! >= ScriptTarget.ES2015 ?
        lookupInUnicodeMap(code, unicodeESNextIdentifierStart) :
        languageVersion! === ScriptTarget.ES5 ? lookupInUnicodeMap(code, unicodeES5IdentifierStart) :
        lookupInUnicodeMap(code, unicodeES3IdentifierStart);
}

function isUnicodeIdentifierPart(code: number, languageVersion: ScriptTarget | undefined) {
    return languageVersion! >= ScriptTarget.ES2015 ?
        lookupInUnicodeMap(code, unicodeESNextIdentifierPart) :
        languageVersion! === ScriptTarget.ES5 ? lookupInUnicodeMap(code, unicodeES5IdentifierPart) :
        lookupInUnicodeMap(code, unicodeES3IdentifierPart);
}

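TypeScript supports different versions of ES, and the definition of an identifier differs slightly between versions of the ES specification, so TypeScript keeps a separate table for each version.

Building on these two functions (isIdentifierStart and isIdentifierPart in the code below wrap them with a fast path for ASCII characters), it is possible to test whether a string is a valid identifier: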

/* @internal */
export function isIdentifierText(name: string, languageVersion: ScriptTarget | undefined): boolean {
    let ch = codePointAt(name, 0);
    if (!isIdentifierStart(ch, languageVersion)) {
        return false;
    }

    for (let i = charSize(ch); i < name.length; i += charSize(ch)) {
        if (!isIdentifierPart(ch = codePointAt(name, i), languageVersion)) {
            return false;
        }
    }

    return true;
}
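The loop above steps by charSize(ch) rather than by 1 because a character outside the Basic Multilingual Plane occupies two UTF-16 code units in a JavaScript string. The sketch below re-creates that walk under stated assumptions: codePointAtSketch and charSizeSketch are hand-written stand-ins for the helpers used above, and the start/part predicates accept only a tiny invented subset of the real tables.

```typescript
/** Reads the code point starting at index `i`, combining a surrogate pair if present. */
function codePointAtSketch(text: string, i: number): number {
    const first = text.charCodeAt(i);
    if (first >= 0xD800 && first <= 0xDBFF && i + 1 < text.length) {
        const second = text.charCodeAt(i + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) {
            return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
        }
    }
    return first;
}

/** Number of UTF-16 code units a code point occupies. */
function charSizeSketch(ch: number): number {
    return ch >= 0x10000 ? 2 : 1;
}

// Invented, heavily reduced rules: ASCII letters, "$", "_" and two CJK ranges may start
// an identifier; digits are additionally allowed after the first character.
function isStartSketch(ch: number): boolean {
    return (ch >= 65 && ch <= 90) || (ch >= 97 && ch <= 122) || ch === 36 || ch === 95 ||
        (ch >= 0x4E00 && ch <= 0x9FFF) || (ch >= 0x20000 && ch <= 0x2A6DF);
}

function isPartSketch(ch: number): boolean {
    return isStartSketch(ch) || (ch >= 48 && ch <= 57);
}

function isIdentifierTextSketch(text: string): boolean {
    if (text.length === 0) {
        return false;
    }
    let ch = codePointAtSketch(text, 0);
    if (!isStartSketch(ch)) {
        return false;
    }
    // Advance by 2 after a surrogate pair so the low surrogate is never read on its own.
    for (let i = charSizeSketch(ch); i < text.length; i += charSizeSketch(ch)) {
        ch = codePointAtSketch(text, i);
        if (!isPartSketch(ch)) {
            return false;
        }
    }
    return true;
}
```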

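Line/column numbers and indices

If the source code is treated as a string, every character has an index into that string. The same character can also be described as being at a certain line and column of the source.

Given a string index, the line and column can be found by scanning for the line breaks that precede the index. Conversely, a line and column can be converted back into the corresponding string index.

When the compiler finds an error in the source, it has to report it to the user and point to its exact location. Normally it reports the line and column (if it reported an index, you would have to count characters yourself...). To know these locations at reporting time, the compiler has to record source positions from the lexical scanning stage onward. So does the compiler store line/column numbers or indices?

Some compilers store line and column numbers, because those are what the user needs in the end. But that requires two fields. If the fields are handled separately, every place that touches a location needs two lines of code; if they are merged into one object, the JavaScript engine ends up with a large number of reference objects, which hurts performance. TypeScript therefore stores the index, and converts it to a line and column only when an error is displayed.

TypeScript uses the term Position for the index and LineAndCharacter for the line/column pair. The position, the line, and the character all count from 0, so line = 0 means the first line.

Why LineAndCharacter rather than LineAndColumn? Mainly to distinguish it from LineColumn in VSCode. In most cases the two are identical, except when tabs are used for indentation: a TAB is always one character, but it may span 2, 4, or 8 columns (depending on the user's configuration). TypeScript does not care about the width of a TAB; treating it uniformly as one character is much simpler, so a different name was chosen to avoid confusion with VSCode's line and column.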
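The difference between a character offset and a visual column can be made concrete with a small sketch. visualColumn and its tabSize parameter are hypothetical (the compiler does no such conversion, as noted above); the function assumes tab stops every tabSize columns, which is how editors commonly render tabs.

```typescript
/**
 * Converts a 0-based character offset within one line into a 0-based visual column,
 * assuming a tab advances to the next multiple of `tabSize`.
 */
function visualColumn(lineText: string, character: number, tabSize: number): number {
    let column = 0;
    for (let i = 0; i < character && i < lineText.length; i++) {
        if (lineText.charCodeAt(i) === 0x09) {
            column += tabSize - (column % tabSize); // jump to the next tab stop
        } else {
            column++;
        }
    }
    return column;
}

const tabbedLine = "\tlet x = 1;";
const characterOfX = tabbedLine.indexOf("x"); // 5: the tab counts as one character
const columnOfX4 = visualColumn(tabbedLine, characterOfX, 4); // 8 with 4-column tabs
const columnOfX8 = visualColumn(tabbedLine, characterOfX, 8); // 12 with 8-column tabs
```

The character offset stays at 5 regardless of the editor settings, which is what makes it a stable value to store.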

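Computing the line and column from an index naively requires walking all characters before it. To speed this up, TypeScript makes a small optimization: it caches the index of the first character of every line and then finds the line with a binary search (binary search again?).

First, compute the table of first-character indices for every line: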

/* @internal */
export function computeLineStarts(text: string): number[] {
    const result: number[] = new Array();
    let pos = 0;
    let lineStart = 0;
    while (pos < text.length) {
        const ch = text.charCodeAt(pos);
        pos++;
        switch (ch) {
            case CharacterCodes.carriageReturn:
                if (text.charCodeAt(pos) === CharacterCodes.lineFeed) {
                    pos++;
                }
                // falls through
            case CharacterCodes.lineFeed:
                result.push(lineStart);
                lineStart = pos;
                break;
            default:
                if (ch > CharacterCodes.maxAsciiCharacter && isLineBreak(ch)) {
                    result.push(lineStart);
                    lineStart = pos;
                }
                break;
        }
    }
    result.push(lineStart);
    return result;
}

Then search the table to find the line and character:

/* @internal */
/**
 * We assume the first line starts at position 0 and 'position' is non-negative.
 */
export function computeLineAndCharacterOfPosition(lineStarts: readonly number[], position: number): LineAndCharacter {
    let lineNumber = binarySearch(lineStarts, position, identity, compareValues);
    if (lineNumber < 0) {
        // If the actual position was not found,
        // the binary search returns the bitwise complement (~) of the index of the next line start
        // e.g. if the line starts at [5, 10, 23, 80] and the position requested was 20
        // then the search will return ~2, which is -3.
        //
        // We want the index of the previous line start, so we subtract 1.
        // Review the bitwise NOT operator (~) if this is confusing.
        lineNumber = ~lineNumber - 1;
        Debug.assert(lineNumber !== -1, "position cannot precede the beginning of the file");
    }
    return {
        line: lineNumber,
        character: position - lineStarts[lineNumber]
    };
}
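To make the index arithmetic concrete, here is a self-contained sketch of the same idea. It is a simplification under stated assumptions: lineStartsSketch only recognises "\n" and "\r\n", and lineAndCharacterSketch uses a hand-written binary search instead of the compiler's binarySearch helper. Both names are invented for this example.

```typescript
/** Start index of every line; handles "\n" and "\r\n" only (a simplification). */
function lineStartsSketch(text: string): number[] {
    const starts: number[] = [0];
    for (let pos = 0; pos < text.length; pos++) {
        if (text.charCodeAt(pos) === 0x0A) {
            starts.push(pos + 1); // the next line begins right after the line feed
        }
    }
    return starts;
}

/** Finds the last line start that is <= position (assumed non-negative), then subtracts it. */
function lineAndCharacterSketch(starts: readonly number[], position: number): { line: number; character: number } {
    let lo = 0;
    let hi = starts.length - 1;
    // Invariant: starts[lo] <= position, and every line after `hi` starts beyond position.
    while (lo < hi) {
        const mid = lo + Math.ceil((hi - lo) / 2);
        if (starts[mid] <= position) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return { line: lo, character: position - starts[lo] };
}

const sourceSketch = "let a;\nlet b;\r\nlet c;";
const startsSketch = lineStartsSketch(sourceSketch); // [0, 7, 15]
const whereIsB = lineAndCharacterSketch(startsSketch, sourceSketch.indexOf("b")); // { line: 1, character: 4 }
```

The table is built once per file, so each later lookup costs a logarithmic number of comparisons instead of a scan from the start of the text.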

The same table can also be used to convert a line and character back into an index:

/* @internal */
export function computePositionOfLineAndCharacter(lineStarts: readonly number[], line: number, character: number, debugText?: string, allowEdits?: true): number {
    if (line < 0 || line >= lineStarts.length) {
        if (allowEdits) {
            // Clamp line to nearest allowable value
            line = line < 0 ? 0 : line >= lineStarts.length ? lineStarts.length - 1 : line;
        }
        else {
            Debug.fail(`Bad line number. Line: ${line}, lineStarts.length: ${lineStarts.length} , line map is correct? ${debugText !== undefined ? arraysEqual(lineStarts, computeLineStarts(debugText)) : "unknown"}`);
        }
    }

    const res = lineStarts[line] + character;
    if (allowEdits) {
        // Clamp to nearest allowable values to allow the underlying to be edited without crashing (accuracy is lost, instead)
        // TODO: Somehow track edits between file as it was during the creation of sourcemap we have and the current file and
        // apply them to the computed position to improve accuracy
        return res > lineStarts[line + 1] ? lineStarts[line + 1] : typeof debugText === "string" && res > debugText.length ? debugText.length : res;
    }
    if (line < lineStarts.length - 1) {
        Debug.assert(res < lineStarts[line + 1]);
    }
    else if (debugText !== undefined) {
        Debug.assert(res <= debugText.length); // Allow single character overflow for trailing newline
    }
    return res;
}
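Stripped of the clamping and assertions, the reverse direction is a single addition once the line-start table exists. The sketch below uses a hand-written table for a three-line text and a hypothetical positionOfSketch that throws on an out-of-range line instead of using the compiler's Debug helpers.

```typescript
/** Converts a 0-based (line, character) pair back into a string index. */
function positionOfSketch(starts: readonly number[], line: number, character: number): number {
    if (line < 0 || line >= starts.length) {
        throw new Error("Bad line number: " + line);
    }
    return starts[line] + character;
}

// Line starts of "let a;\nlet b;\nlet c;" written out by hand.
const handWrittenStarts: readonly number[] = [0, 7, 14];
const positionOfB = positionOfSketch(handWrittenStarts, 1, 4); // 11, the index of "b"
```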