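The code examined in this section is in tsc/src/compiler/scanner.ts.
Which code value each character maps to is determined by an encoding table. The code values shown above come from Unicode, the universal encoding table; unless stated otherwise, every code value in this article is a Unicode value.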
In JavaScript, "a".charCodeAt(0) returns the code value of the character "a", and String.fromCharCode(97) returns the character for a given code value.
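A quick check of the two calls above may help. One detail worth knowing: charCodeAt returns a UTF-16 code unit, so a character outside the Basic Multilingual Plane shows up as two units, while the standard String.prototype.codePointAt returns the whole code point. The variable names below are illustrative only.

```typescript
// Round trip between a character and its code value.
const codeOfA: number = "a".charCodeAt(0);        // 97 (0x61)
const charOf97: string = String.fromCharCode(97); // "a"

// U+1F600 is stored as the surrogate pair D83D DE00.
const emoji = "\uD83D\uDE00";
const firstUnit: number = emoji.charCodeAt(0);                  // 0xD83D, the high surrogate
const fullCodePoint: number | undefined = emoji.codePointAt(0); // 0x1F600
const emojiLength: number = emoji.length;                       // 2 code units
```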
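If the code simply said 99, you might not know what the number means; written as CharacterCodes.c, it is immediately clear. An enum gives each code value a name, which helps readers and spares us from memorizing the actual code value of every character. The CharacterCodes enum is defined in tsc/src/compiler/types.ts; its source is as follows: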
/* @internal */
export const enum CharacterCodes {
    _0 = 0x30,
    _1 = 0x31,
    // ...(omitted)
    _9 = 0x39,

    a = 0x61,
    b = 0x62,
    // ...(omitted)
    z = 0x7A,

    A = 0x41,
    // ...(omitted)
    Z = 0x5a,

    ampersand = 0x26, // &
    asterisk = 0x2A,  // *
    // ...(omitted)
}
function isDigit(ch: number): boolean { // the parameter ch is a code value
    return ch >= CharacterCodes._0 && ch <= CharacterCodes._9;
}
export function isLineBreak(ch: number): boolean {
    // ES5 7.3:
    // The ECMAScript line terminator characters are listed in Table 3.
    //     Table 3: Line Terminator Characters
    //     Code Unit Value     Name                    Formal Name
    //     \u000A              Line Feed               <LF>
    //     \u000D              Carriage Return         <CR>
    //     \u2028              Line separator          <LS>
    //     \u2029              Paragraph separator     <PS>
    // Only the characters in Table 3 are treated as line terminators. Other new line or line
    // breaking characters are treated as white space but not as line terminators.

    return ch === CharacterCodes.lineFeed ||
        ch === CharacterCodes.carriageReturn ||
        ch === CharacterCodes.lineSeparator ||
        ch === CharacterCodes.paragraphSeparator;
}
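According to the ES specification there are four line terminators. In everyday use we usually meet only the first two, but some languages need the last two as well.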
export function isWhiteSpaceLike(ch: number): boolean {
    return isWhiteSpaceSingleLine(ch) || isLineBreak(ch);
}

/** Does not include line breaks. For that, see isWhiteSpaceLike. */
export function isWhiteSpaceSingleLine(ch: number): boolean {
    // Note: nextLine is in the Zs space, and should be considered to be a whitespace.
    // It is explicitly not a line-break as it isn't in the exact set specified by EcmaScript.
    return ch === CharacterCodes.space ||
        ch === CharacterCodes.tab ||
        ch === CharacterCodes.verticalTab ||
        ch === CharacterCodes.formFeed ||
        ch === CharacterCodes.nonBreakingSpace ||
        ch === CharacterCodes.nextLine ||
        ch === CharacterCodes.ogham ||
        ch >= CharacterCodes.enQuad && ch <= CharacterCodes.zeroWidthSpace ||
        ch === CharacterCodes.narrowNoBreakSpace ||
        ch === CharacterCodes.mathematicalSpace ||
        ch === CharacterCodes.ideographicSpace ||
        ch === CharacterCodes.byteOrderMark;
}
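Some places need to treat line breaks as white space and others do not, so TypeScript splits the check into two functions: one that includes line breaks and one that does not.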
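To see the split in action, here is a reduced sketch of the two checks. The object Ch and the functions ending in Sketch are hypothetical stand-ins written for this example; they cover only a handful of the characters that the real functions accept.

```typescript
// Hypothetical subset of CharacterCodes, for illustration only.
const Ch = {
    tab: 0x09, lineFeed: 0x0a, carriageReturn: 0x0d, space: 0x20,
    nextLine: 0x85, lineSeparator: 0x2028, paragraphSeparator: 0x2029,
} as const;

// The four ES line terminators.
function isLineBreakSketch(ch: number): boolean {
    return ch === Ch.lineFeed || ch === Ch.carriageReturn
        || ch === Ch.lineSeparator || ch === Ch.paragraphSeparator;
}

// Simplified: the real isWhiteSpaceSingleLine accepts many more characters.
function isWhiteSpaceSingleLineSketch(ch: number): boolean {
    return ch === Ch.space || ch === Ch.tab || ch === Ch.nextLine;
}

function isWhiteSpaceLikeSketch(ch: number): boolean {
    return isWhiteSpaceSingleLineSketch(ch) || isLineBreakSketch(ch);
}

// U+0085 (NEL) counts as white space but not as a line break.
const nelIsSpace = isWhiteSpaceSingleLineSketch(Ch.nextLine); // true
const nelIsBreak = isLineBreakSketch(Ch.nextLine);            // false

// A line feed is excluded from the single-line check but included in the "like" check.
const lfIsSingleLineSpace = isWhiteSpaceSingleLineSketch(Ch.lineFeed); // false
const lfIsSpaceLike = isWhiteSpaceLikeSketch(Ch.lineFeed);             // true
```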
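An identifier is what is commonly called a variable name. As we all know, variable names in JS cannot be chosen arbitrarily; there are rules, for example a name cannot start with a digit.
The ES specification states explicitly which characters may be used in an identifier, and which of those may be used but not as the first character. TypeScript implements isUnicodeIdentifierStart and isUnicodeIdentifierPart to test each case.
There is no simple rule for which characters are allowed in identifiers; the specification designates them one by one, and the list is long. The simplest implementation is to record by hand whether each character is allowed in an identifier and then look it up in a table.
However, there are many characters, and recording each one separately would take a lot of space, so TypeScript uses a small algorithm to reduce memory use. It relies on one observation: the characters allowed in identifiers generally form contiguous runs (for example "a" through "z"), so the table only needs to store the first and last code value of each run.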
const unicodeESNextIdentifierStart = [65, 90, 97, 122, 170, /* ...(omitted) */ 194560, 195101]
const unicodeESNextIdentifierPart = [48, 57, 65, /* ...(omitted) */ 917999]

function lookupInUnicodeMap(code: number, map: readonly number[]): boolean {
    // Fast path: a code value below the first range start cannot be in the table,
    // so reject it without searching.
    // Bail out quickly if it couldn't possibly be in the map.
    if (code < map[0]) {
        return false;
    }

    // What follows is a standard binary search; readers unfamiliar with it should review it first.
    // Perform binary search in one of the Unicode range maps
    let lo = 0;
    let hi: number = map.length;
    let mid: number;

    while (lo + 1 < hi) {
        mid = lo + (hi - lo) / 2;
        // mid has to be even to catch a range's beginning
        mid -= mid % 2;
        if (map[mid] <= code && code <= map[mid + 1]) {
            return true;
        }

        if (code < map[mid]) {
            hi = mid;
        }
        else {
            lo = mid + 2;
        }
    }

    return false;
}
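A small worked example may make the flat range table easier to picture. The table tinyMap and the function inRanges are hypothetical, written only for this illustration; the search follows the same idea as lookupInUnicodeMap but uses Math.floor for the midpoint to keep the arithmetic explicit.

```typescript
// Hypothetical miniature table: [start, end] pairs stored flat.
// Three ranges ("0"-"9", "A"-"Z", "a"-"z") take 6 numbers instead of 62 entries.
const tinyMap: readonly number[] = [48, 57, 65, 90, 97, 122];

function inRanges(code: number, map: readonly number[]): boolean {
    // Anything below the first range start cannot match.
    if (code < map[0]) {
        return false;
    }
    let lo = 0;
    let hi = map.length;
    while (lo + 1 < hi) {
        // Integer midpoint, rounded down to an even index so it lands on a range start.
        let mid = lo + Math.floor((hi - lo) / 2);
        mid -= mid % 2;
        if (map[mid] <= code && code <= map[mid + 1]) {
            return true;
        }
        if (code < map[mid]) {
            hi = mid;      // continue in the ranges before mid
        } else {
            lo = mid + 2;  // continue in the ranges after mid
        }
    }
    return false;
}

const lowerA = inRanges(97, tinyMap);     // true:  "a" starts the third range
const digit0 = inRanges(48, tinyMap);     // true:  "0" starts the first range
const underscore = inRanges(95, tinyMap); // false: "_" falls between "Z" and "a"
const space = inRanges(32, tinyMap);      // false: below the first range
```

For "a" the search first probes the pair at index 2 (65 to 90), moves right, then finds the pair at index 4 (97 to 122).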
With that in place, the two functions isUnicodeIdentifierStart and isUnicodeIdentifierPart are easy to follow:
/* @internal */
export function isUnicodeIdentifierStart(code: number, languageVersion: ScriptTarget | undefined) {
    return languageVersion! >= ScriptTarget.ES2015 ?
        lookupInUnicodeMap(code, unicodeESNextIdentifierStart) :
        languageVersion! === ScriptTarget.ES5 ? lookupInUnicodeMap(code, unicodeES5IdentifierStart) :
            lookupInUnicodeMap(code, unicodeES3IdentifierStart);
}

function isUnicodeIdentifierPart(code: number, languageVersion: ScriptTarget | undefined) {
    return languageVersion! >= ScriptTarget.ES2015 ?
        lookupInUnicodeMap(code, unicodeESNextIdentifierPart) :
        languageVersion! === ScriptTarget.ES5 ? lookupInUnicodeMap(code, unicodeES5IdentifierPart) :
            lookupInUnicodeMap(code, unicodeES3IdentifierPart);
}
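Because TypeScript supports several versions of ES, and the different versions of the ES specification define identifiers slightly differently, TypeScript keeps a separate table for each version.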
/* @internal */
export function isIdentifierText(name: string, languageVersion: ScriptTarget | undefined): boolean {
    let ch = codePointAt(name, 0);
    if (!isIdentifierStart(ch, languageVersion)) {
        return false;
    }

    for (let i = charSize(ch); i < name.length; i += charSize(ch)) {
        if (!isIdentifierPart(ch = codePointAt(name, i), languageVersion)) {
            return false;
        }
    }

    return true;
}
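The loop in isIdentifierText advances by charSize(ch) rather than by 1. The sketch below shows why that matters, assuming charSize returns the number of UTF-16 code units a code point occupies (its source is not shown here). The helpers codeUnitLength and codePointsOf are hypothetical and use the standard String.prototype.codePointAt instead of TypeScript's internal codePointAt.

```typescript
// Hypothetical helper: code points at or above 0x10000 need a surrogate pair.
function codeUnitLength(codePoint: number): number {
    return codePoint >= 0x10000 ? 2 : 1;
}

// Collect the code points of a string, stepping over each surrogate pair as one character.
function codePointsOf(text: string): number[] {
    const result: number[] = [];
    let i = 0;
    while (i < text.length) {
        const cp = text.codePointAt(i)!; // defined because i < text.length
        result.push(cp);
        i += codeUnitLength(cp);
    }
    return result;
}

// "a", then U+1D465 (stored as D835 DC65), then "b".
const sample = "a\uD835\uDC65b";
const points = codePointsOf(sample); // [0x61, 0x1D465, 0x62]
const units = sample.length;         // 4 code units for 3 characters
```

Stepping by 1 would visit the low surrogate 0xDC65 as if it were a character of its own.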
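Some compilers store line and column numbers, because those are what the user ultimately needs. But line and column require two fields. If they are handled separately, every place that deals with a location needs two lines of code; if they are merged into one object, the JavaScript engine ends up with a large number of reference objects, which hurts performance. TypeScript therefore stores an index instead, and converts it to a line and column only when an error has to be reported.
TypeScript uses the term Position for the index and the term LineAndCharacter for the line and column numbers. All three values (position, line and character) count from 0, so line = 0 means the first line.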
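Why LineAndCharacter rather than LineAndColumn? Mainly to distinguish it from LineColumn in VSCode. In most cases LineAndCharacter and LineAndColumn are the same, except when a tab (TAB) is used for indentation: one TAB is always one character, but it may span 2, 4 or 8 columns depending on user configuration. TypeScript does not care about the TAB character in particular; treating it uniformly as one character keeps things much simpler. So, to avoid confusion with VSCode's line and column, a different name is used.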
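The difference between a character offset and a visual column can be sketched as follows. The function visualColumn is hypothetical and is not part of TypeScript; it assumes tab stops at every multiple of tabSize, which is how editors commonly render tabs.

```typescript
// Hypothetical helper: converts a 0-based character offset within a line
// into a 0-based visual column, given the editor's tab size.
function visualColumn(lineText: string, character: number, tabSize: number): number {
    let column = 0;
    for (let i = 0; i < character && i < lineText.length; i++) {
        if (lineText.charCodeAt(i) === 0x09) {
            // A tab advances to the next tab stop.
            column += tabSize - (column % tabSize);
        } else {
            column++;
        }
    }
    return column;
}

// In "\tfoo" the letter "f" is always character 1, whatever the tab size.
const columnWithTab4 = visualColumn("\tfoo", 1, 4); // 4
const columnWithTab8 = visualColumn("\tfoo", 1, 8); // 8
const columnNoTabs = visualColumn("foo", 1, 4);     // 1, same as the character offset
```

The character offset stays stable across editor settings, while the column depends on them.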
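Computing line and column from an index means walking over every character before that index. To speed this up, TypeScript makes a small optimization: it caches the index of the first character of each line and then uses binary search to find the matching line (binary search again).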
/* @internal */
export function computeLineStarts(text: string): number[] {
    const result: number[] = new Array();
    let pos = 0;
    let lineStart = 0;
    while (pos < text.length) {
        const ch = text.charCodeAt(pos);
        pos++;
        switch (ch) {
            case CharacterCodes.carriageReturn:
                if (text.charCodeAt(pos) === CharacterCodes.lineFeed) {
                    pos++;
                }
            // falls through
            case CharacterCodes.lineFeed:
                result.push(lineStart);
                lineStart = pos;
                break;
            default:
                if (ch > CharacterCodes.maxAsciiCharacter && isLineBreak(ch)) {
                    result.push(lineStart);
                    lineStart = pos;
                }
                break;
        }
    }
    result.push(lineStart);
    return result;
}
/* @internal */
/**
 * We assume the first line starts at position 0 and 'position' is non-negative.
 */
export function computeLineAndCharacterOfPosition(lineStarts: readonly number[], position: number): LineAndCharacter {
    let lineNumber = binarySearch(lineStarts, position, identity, compareValues);
    if (lineNumber < 0) {
        // If the actual position was not found,
        // the binary search returns the 2's-complement of the next line start
        // e.g. if the line starts at [5, 10, 23, 80] and the position requested was 20
        // then the search will return -2.
        //
        // We want the index of the previous line start, so we subtract 1.
        // Review 2's-complement if this is confusing.
        lineNumber = ~lineNumber - 1;
        Debug.assert(lineNumber !== -1, "position cannot precede the beginning of the file");
    }
    return {
        line: lineNumber,
        character: position - lineStarts[lineNumber]
    };
}
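The whole idea (cache the line starts once, then binary-search them) fits in a short self-contained sketch. The names lineStartsOf, lineAndCharacterAt and LineAndCharacterSketch are hypothetical. Unlike computeLineStarts, this version treats only "\n", "\r" and "\r\n" as line breaks, and it searches directly for the last line start that does not exceed the position, which sidesteps the bitwise-complement convention used above.

```typescript
interface LineAndCharacterSketch {
    line: number;      // 0-based line number
    character: number; // 0-based offset within the line
}

// Simplified line-start cache: handles "\n", "\r" and "\r\n" only.
function lineStartsOf(text: string): number[] {
    const starts: number[] = [0];
    for (let pos = 0; pos < text.length; pos++) {
        const ch = text.charCodeAt(pos);
        if (ch === 0x0d && text.charCodeAt(pos + 1) === 0x0a) {
            pos++; // treat "\r\n" as a single line break
        }
        if (ch === 0x0d || ch === 0x0a) {
            starts.push(pos + 1);
        }
    }
    return starts;
}

// Binary search for the last line start <= position.
function lineAndCharacterAt(starts: readonly number[], position: number): LineAndCharacterSketch {
    let lo = 0;
    let hi = starts.length - 1;
    while (lo < hi) {
        const mid = lo + Math.ceil((hi - lo) / 2); // upper middle, so the range always shrinks
        if (starts[mid] <= position) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return { line: lo, character: position - starts[lo] };
}

const text = "ab\ncde\nf";
const starts = lineStartsOf(text);           // [0, 3, 7]
const atE = lineAndCharacterAt(starts, 5);   // { line: 1, character: 2 }, the "e"
const atF = lineAndCharacterAt(starts, 7);   // { line: 2, character: 0 }, the "f"
```

The line starts are computed once per file, so each lookup costs a binary search over the number of lines rather than a scan over the text.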
/* @internal */
export function computePositionOfLineAndCharacter(lineStarts: readonly number[], line: number, character: number, debugText?: string, allowEdits?: true): number {
    if (line < 0 || line >= lineStarts.length) {
        if (allowEdits) {
            // Clamp line to nearest allowable value
            line = line < 0 ? 0 : line >= lineStarts.length ? lineStarts.length - 1 : line;
        }
        else {
            Debug.fail(`Bad line number. Line: ${line}, lineStarts.length: ${lineStarts.length} , line map is correct? ${debugText !== undefined ? arraysEqual(lineStarts, computeLineStarts(debugText)) : "unknown"}`);
        }
    }

    const res = lineStarts[line] + character;
    if (allowEdits) {
        // Clamp to nearest allowable values to allow the underlying to be edited without crashing (accuracy is lost, instead)
        // TODO: Somehow track edits between file as it was during the creation of sourcemap we have and the current file and
        // apply them to the computed position to improve accuracy
        return res > lineStarts[line + 1] ? lineStarts[line + 1] : typeof debugText === "string" && res > debugText.length ? debugText.length : res;
    }
    if (line < lineStarts.length - 1) {
        Debug.assert(res < lineStarts[line + 1]);
    }
    else if (debugText !== undefined) {
        Debug.assert(res <= debugText.length); // Allow single character overflow for trailing newline
    }
    return res;
}
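This section introduced some standalone functions in the scanner, all of which are called by the lexical scanner. Understanding these concepts on their own first goes a long way toward fully understanding lexical scanning.
The next section covers the implementation of lexical scanning (that is, the remaining functions in scanner.ts). [Updated 2020-1-18]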