[Java Character Sequences] Pattern

Date: 2019-11-12


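Introduction

Pattern is the compiled representation of a regular expression and a powerful tool for working with character sequences.

A compiled Pattern is a tree (the branches correspond to '|' in the expression) that in most cases degenerates into a linked list. The basic element of the tree (or list) is the Node, which has many subclasses to support the different kinds of matching.

Example 1

We start with the simplest possible example and step into the source code.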

    public static void example() {
        String regex = "EXAMPLE";
        String text = "HERE IS A SIMPLE EXAMPLE";
        Pattern pattern = Pattern.compile(regex, Pattern.LITERAL);
        Matcher matcher = pattern.matcher(text);
        matcher.find();
    }

This example searches the text for a substring.
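
To make the result of the search visible, the sketch below runs the same lookup and prints where the match starts and ends. The class name FindLiteralDemo and the helper findStart are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindLiteralDemo {
    // Hypothetical helper: start index of the first literal occurrence, or -1 if absent.
    static int findStart(String literal, String text) {
        Matcher matcher = Pattern.compile(literal, Pattern.LITERAL).matcher(text);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String text = "HERE IS A SIMPLE EXAMPLE";
        Matcher matcher = Pattern.compile("EXAMPLE", Pattern.LITERAL).matcher(text);
        if (matcher.find()) {
            // start() is inclusive, end() is exclusive
            System.out.println(matcher.start() + ".." + matcher.end() + " " + matcher.group());
        }
        // With LITERAL, the dot is an ordinary character rather than a wildcard.
        System.out.println(findStart("a.c", "abc a.c"));
    }
}
```

With Pattern.LITERAL the regex metacharacters lose their meaning, which is why the second call skips "abc" and only finds the literal "a.c".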

Pattern.compile(String regex)

    public static Pattern compile(String regex) {
        return new Pattern(regex, 0);
    }

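This method returns a Pattern object by calling the constructor. The example uses the two-argument overload, compile(String regex, int flags), which passes the given flags to the same constructor instead of 0.

Constructor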

    private Pattern(String p, int f) {
        pattern = p;
        flags = f;

        // UNICODE_CHARACTER_CLASS implies UNICODE_CASE
        if ((flags & UNICODE_CHARACTER_CLASS) != 0)
            flags |= UNICODE_CASE;

        capturingGroupCount = 1; // group 0 (the whole match) always exists
        localCount = 0;

        if (pattern.length() > 0) {
            compile();
        } else {
            // empty pattern: nothing to compile
            root = new Start(lastAccept);
            matchRoot = lastAccept;
        }
    }

The constructor in turn calls compile().

compile()

    private void compile() {
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize(); // normalize the pattern
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

        // Store the pattern's code points in an int array; the 2 extra slots mark the end
        temp = new int[patternLength + 2];

        hasSupplementary = false;
        int c, count = 0;
        for (int x = 0; x < patternLength; x += Character.charCount(c)) {
            c = normalizedPattern.codePointAt(x);
            if (isSupplementary(c)) { // supplementary character or unpaired surrogate?
                hasSupplementary = true;
            }
            temp[count++] = c; // store in the array
        }

        patternLength = count; // now the number of code points

        if (!has(LITERAL))
            RemoveQEQuoting(); // handle \Q...\E quoting

        buffer = new int[32]; // allocate temporary objects
        groupNodes = new GroupHead[10]; // groups
        namedGroups = null;

        if (has(LITERAL)) { // literal text; the example takes this branch
            matchRoot = newSlice(temp, patternLength, hasSupplementary); // Slice node
            matchRoot.next = lastAccept;
        } else {
            matchRoot = expr(lastAccept); // parse the expression recursively
            if (patternLength != cursor) { // error handling
                if (peek() == ')') {
                    throw error("Unmatched closing ')'");
                } else {
                    throw error("Unexpected internal error");
                }
            }
        }

        // For a literal slice, try to return a BnM node (Boyer-Moore, an efficient substring search algorithm)
        if (matchRoot instanceof Slice) {
            root = BnM.optimize(matchRoot);
            if (root == matchRoot) {
                // Start and LastNode (lastAccept) are the generic first and last nodes
                root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
            }
        } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
            // Begin and First are node types too (Begin anchors at the start of input); not discussed here
            root = matchRoot;
        } else {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
        // clean up
        temp = null;
        buffer = null;
        groupNodes = null;
        patternLength = 0;
        compiled = true;
    }

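First the expression is normalized (only when CANON_EQ is set and LITERAL is not).
The code points of the pattern are then stored temporarily in an int array. A code point is the number assigned to a character in a character set, starting from 0; ASCII and Unicode are the common character sets.
A node of the appropriate type is built (a Slice for literal text, otherwise whatever expr() produces).
On root versus matchRoot: root is the entry point for searching from any position in the given text, while matchRoot is the entry point for matching at one fixed starting position, as a whole-input match does (from the first character to the last).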
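
The code-point step can be tried in isolation. The sketch below repeats the decoding loop from compile() on an ordinary String; the class CodePointDemo and its methods are illustrative, and isSupplementary is only an assumed equivalent of Pattern's private helper of the same name.

```java
import java.util.Arrays;

final class CodePointDemo {
    // Same loop shape as in compile(): advance by one or two chars per code point.
    static int[] toCodePoints(String s) {
        int[] temp = new int[s.length()];
        int count = 0;
        int c;
        for (int x = 0; x < s.length(); x += Character.charCount(c)) {
            c = s.codePointAt(x);
            temp[count++] = c;
        }
        return Arrays.copyOf(temp, count);
    }

    // Assumption: supplementary code point, or a surrogate that was not paired up.
    static boolean isSupplementary(int c) {
        return c >= Character.MIN_SUPPLEMENTARY_CODE_POINT || Character.isSurrogate((char) c);
    }

    static boolean hasSupplementary(String s) {
        for (int c : toCodePoints(s)) {
            if (isSupplementary(c)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (a surrogate pair), 'b'
        System.out.println(s.length() + " chars, " + toCodePoints(s).length + " code points");
        System.out.println(hasSupplementary(s) + " " + hasSupplementary("abc"));
    }
}
```

This is why patternLength is reassigned after the loop: the number of code points can be smaller than the number of chars.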

We look first at the branch where the regular expression is plain text, as in the example.

newSlice(int[] buf, int count, boolean hasSupplementary)

    private Node newSlice(int[] buf, int count, boolean hasSupplementary) {
        int[] tmp = new int[count];
        if (has(CASE_INSENSITIVE)) {
            if (has(UNICODE_CASE)) {
                for (int i = 0; i < count; i++) {
                    tmp[i] = Character.toLowerCase(Character.toUpperCase(buf[i]));
                }
                return hasSupplementary ? new SliceUS(tmp) : new SliceU(tmp);
            }
            for (int i = 0; i < count; i++) {
                tmp[i] = ASCII.toLower(buf[i]);
            }
            return hasSupplementary ? new SliceIS(tmp) : new SliceI(tmp);
        }
        for (int i = 0; i < count; i++) {
            tmp[i] = buf[i];
        }
        return hasSupplementary ? new SliceS(tmp) : new Slice(tmp);
    }

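This method mostly handles special cases such as case insensitivity. Look at the last statement: depending on hasSupplementary it creates either a SliceS or a Slice. Only Slice is of interest here.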
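
The case-handling branches of newSlice can be observed through the public flags. The following sketch (LiteralCaseDemo and found are illustrative names) compares plain, ASCII-only and Unicode-aware case-insensitive literal matching.

```java
import java.util.regex.Pattern;

final class LiteralCaseDemo {
    // Hypothetical helper: does the literal occur in the text under the given extra flags?
    static boolean found(String literal, String text, int extraFlags) {
        return Pattern.compile(literal, Pattern.LITERAL | extraFlags).matcher(text).find();
    }

    public static void main(String[] args) {
        // Case-sensitive by default.
        System.out.println(found("example", "A SIMPLE EXAMPLE", 0));
        // CASE_INSENSITIVE alone folds US-ASCII letters only.
        System.out.println(found("example", "A SIMPLE EXAMPLE", Pattern.CASE_INSENSITIVE));
        System.out.println(found("\u00e4", "\u00c4", Pattern.CASE_INSENSITIVE));
        // Adding UNICODE_CASE extends folding to non-ASCII letters.
        System.out.println(found("\u00e4", "\u00c4",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
    }
}
```

The third call fails because U+00E4 and U+00C4 lie outside US-ASCII; the fourth succeeds once UNICODE_CASE is set.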

Data structure: Slice

    static final class Slice extends SliceNode {
        Slice(int[] buf) {
            super(buf);
        }

        boolean match(Matcher matcher, int i, CharSequence seq) {
            int[] buf = buffer;
            int len = buf.length;
            // Compare from the first character; return false if the input is too short
            // or a character differs, otherwise call match on the next node
            for (int j = 0; j < len; j++) {
                if ((i + j) >= matcher.to) {
                    matcher.hitEnd = true;
                    return false;
                }
                if (buf[j] != seq.charAt(i + j))
                    return false;
            }
            return next.match(matcher, i + len, seq);
        }
    }

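This class extends SliceNode and mainly implements match, which checks whether the text at the given position equals the pattern's literal, comparing character by character from the start.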
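
The delegation to next is the part that turns individual nodes into a chain. Here is a stripped-down model of it; MiniNode, MiniSlice and MiniAccept are invented stand-ins rather than JDK classes, and the Matcher state is reduced to a single end index.

```java
final class MiniChainDemo {
    // Simplified stand-in for Pattern.Node: match at index i, then hand over to next.
    static abstract class MiniNode {
        MiniNode next;
        abstract boolean match(CharSequence seq, int i, int to);
    }

    // Plays the role of Slice: compares a literal from left to right.
    static final class MiniSlice extends MiniNode {
        final int[] buffer;

        MiniSlice(String literal) {
            buffer = literal.codePoints().toArray();
        }

        @Override
        boolean match(CharSequence seq, int i, int to) {
            for (int j = 0; j < buffer.length; j++) {
                if (i + j >= to) return false;                    // input too short
                if (buffer[j] != seq.charAt(i + j)) return false; // character differs
            }
            return next.match(seq, i + buffer.length, to);        // continue down the chain
        }
    }

    // Plays the role of the accepting node at the end of the chain.
    static final class MiniAccept extends MiniNode {
        @Override
        boolean match(CharSequence seq, int i, int to) {
            return true;
        }
    }

    static boolean matchesAt(String literal, String text, int i) {
        MiniSlice slice = new MiniSlice(literal);
        slice.next = new MiniAccept();
        return slice.match(text, i, text.length());
    }

    public static void main(String[] args) {
        String text = "HERE IS A SIMPLE EXAMPLE";
        System.out.println(matchesAt("EXAMPLE", text, 17));
        System.out.println(matchesAt("EXAMPLE", text, 0));
    }
}
```

Like Slice, this model compares char by char and therefore only suits patterns without supplementary characters.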

SliceNode

    static class SliceNode extends Node {
        int[] buffer;
        SliceNode(int[] buf) {
            buffer = buf;
        }
        boolean study(TreeInfo info) {
            info.minLength += buffer.length;
            info.maxLength += buffer.length;
            return next.study(info);
        }
    }

The base class of all Slice nodes; it extends Node. Its main method, study, adds the slice length to both the minimum and the maximum length in TreeInfo.
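
As a rough model of what study computes, the sketch below walks a chain of fixed-length nodes and sums their lengths. Info, StudyNode and LiteralNode are invented for illustration and carry far less state than the real TreeInfo; the idea is presumably that a search can give up once fewer than minLength characters remain.

```java
final class StudyDemo {
    // Simplified stand-in for Pattern.TreeInfo.
    static final class Info {
        int minLength;
        int maxLength;
    }

    // Default behaviour: contribute nothing and pass the request along.
    static class StudyNode {
        StudyNode next;

        void study(Info info) {
            if (next != null) next.study(info);
        }
    }

    // A literal of fixed length adds that length to both bounds, as SliceNode does.
    static final class LiteralNode extends StudyNode {
        final int length;

        LiteralNode(int length) {
            this.length = length;
        }

        @Override
        void study(Info info) {
            info.minLength += length;
            info.maxLength += length;
            super.study(info);
        }
    }

    // Builds a chain of literal nodes with the given lengths and studies it.
    static Info studyChain(int... lengths) {
        StudyNode head = new StudyNode();
        StudyNode tail = head;
        for (int len : lengths) {
            tail.next = new LiteralNode(len);
            tail = tail.next;
        }
        Info info = new Info();
        head.study(info);
        return info;
    }

    public static void main(String[] args) {
        Info info = studyChain(2, 3);
        System.out.println(info.minLength + " " + info.maxLength);
    }
}
```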

Node

    static class Node extends Object {
        Node next;

        Node() {
            next = Pattern.accept;
        }

        boolean match(Matcher matcher, int i, CharSequence seq) {
            matcher.last = i;
            matcher.groups[0] = matcher.first; // group 0 by default (slots 0 and 1 of groups)
            matcher.groups[1] = matcher.last;
            return true;
        }

        boolean study(TreeInfo info) { // suitable for zero-length assertions
            if (next != null) {
                return next.study(info);
            } else {
                return info.deterministic;
            }
        }
    }

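The top-level node. Its match method always returns true; subclasses are expected to override it.

For Matcher.group(int group), the call chain is: getSubSequence(groups[group * 2], groups[group * 2 + 1]) ---> CharSequence#subSequence(int start, int end).

Each pair of adjacent elements in groups holds the start and end index of one group.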
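
The groups array itself is internal to Matcher, but the same flat layout can be rebuilt from the public start(int) and end(int) methods. The sketch below does that; GroupsLayoutDemo and its methods are illustrative names, and it assumes every group took part in the match (an unmatched group reports -1).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupsLayoutDemo {
    // Slot 2*g holds the start of group g, slot 2*g + 1 holds its (exclusive) end.
    static int[] groupsOf(Matcher m) {
        int[] groups = new int[(m.groupCount() + 1) * 2];
        for (int g = 0; g <= m.groupCount(); g++) {
            groups[g * 2] = m.start(g);
            groups[g * 2 + 1] = m.end(g);
        }
        return groups;
    }

    // Same lookup as the call chain above: two adjacent slots delimit one group.
    static CharSequence group(CharSequence text, int[] groups, int g) {
        return text.subSequence(groups[g * 2], groups[g * 2 + 1]);
    }

    static int[] firstMatchGroups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) throw new IllegalStateException("no match");
        return groupsOf(m);
    }

    public static void main(String[] args) {
        String text = "tel 12-345";
        int[] groups = firstMatchGroups("(\\d+)-(\\d+)", text);
        System.out.println(Arrays.toString(groups));
        System.out.println(group(text, groups, 2));
    }
}
```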

Back in compile(), the next step is the call to BnM.optimize(matchRoot).

BnM

Extends Node.

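    static class BnM extends Node {}

Fields

        int[] buffer; // the pattern as an array of code points
        // Bad-character table of length 128. Pattern characters are written in order
        // (from index 0), each into slot (code point & 0x7F); the stored value is the
        // character's last index in the pattern + 1, and 0 means the slot was never written
        int[] lastOcc;
        // Good-suffix table, same length as the pattern; optoSft[j] is the shift to
        // apply when the mismatch happens at pattern index j
        int[] optoSft;

Constructor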

        BnM(int[] src, int[] lastOcc, int[] optoSft, Node next) {
            this.buffer = src;
            this.lastOcc = lastOcc;
            this.optoSft = optoSft;
            this.next = next;
        }

optimize(Node node)

        static Node optimize(Node node) {
            if (!(node instanceof Slice)) {
                return node;
            }

            int[] src = ((Slice) node).buffer;
            int patternLength = src.length;
            if (patternLength < 4) {
                return node;
            }
            int i, j, k; // k is unused
            int[] lastOcc = new int[128];
            int[] optoSft = new int[patternLength];
            // Build the bad-character table. If two different characters share a slot,
            // the earlier one inherits the later one's larger value, so the computed
            // shift gets smaller and no match can be skipped. This keeps the table at
            // 128 entries, a worthwhile trade of time for space that still covers all of ASCII.
            for (i = 0; i < patternLength; i++) {
                lastOcc[src[i] & 0x7F] = i + 1;
            }
            NEXT: for (i = patternLength; i > 0; i--) { // build the good-suffix table
                // Scan from the end and handle every suffix; a suffix shorter than the
                // matched one only counts if it also occurs at the head of the pattern
                for (j = patternLength - 1; j >= i; j--) {
                    if (src[j] == src[j - i]) {
                        optoSft[j - 1] = i;
                    } else {
                        continue NEXT;
                    }
                }
                while (j > 0) { // fill the remaining slots
                    optoSft[--j] = i;
                }
            }
            optoSft[patternLength - 1] = 1;
            if (node instanceof SliceS)
                return new BnMS(src, lastOcc, optoSft, node.next);
            return new BnM(src, lastOcc, optoSft, node.next);
        }

This is the preprocessing step: it builds the bad-character array and the good-suffix array.
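
To see what the two arrays contain for the pattern of Example 1, the table-building loops can be copied into a standalone class. BnMTablesDemo is an illustrative name; the loops follow optimize() above, minus the Slice checks, and assume a non-empty pattern.

```java
import java.util.Arrays;

final class BnMTablesDemo {
    // Bad-character table: last index + 1 of each character, keyed by the low 7 bits.
    static int[] lastOcc(int[] src) {
        int[] lastOcc = new int[128];
        for (int i = 0; i < src.length; i++) {
            lastOcc[src[i] & 0x7F] = i + 1;
        }
        return lastOcc;
    }

    // Good-suffix table: optoSft[j] is the shift used after a mismatch at index j.
    static int[] optoSft(int[] src) {
        int n = src.length; // assumed to be at least 1
        int[] optoSft = new int[n];
        NEXT:
        for (int i = n; i > 0; i--) {
            int j;
            for (j = n - 1; j >= i; j--) {
                if (src[j] == src[j - i]) {
                    optoSft[j - 1] = i;
                } else {
                    continue NEXT;
                }
            }
            while (j > 0) {
                optoSft[--j] = i;
            }
        }
        optoSft[n - 1] = 1;
        return optoSft;
    }

    public static void main(String[] args) {
        int[] src = "EXAMPLE".chars().toArray();
        int[] occ = lastOcc(src);
        System.out.println("E=" + occ['E'] + " P=" + occ['P'] + " S=" + occ['S']);
        System.out.println(Arrays.toString(optoSft(src)));
    }
}
```

For "EXAMPLE" the good-suffix table comes out as [6, 6, 6, 6, 6, 6, 1]: only the trailing "E" recurs at the head of the pattern, so every non-trivial good suffix allows a shift of 6.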

        boolean match(Matcher matcher, int i, CharSequence seq) {
            int[] src = buffer;
            int patternLength = src.length;
            int last = matcher.to - patternLength;

            NEXT: while (i <= last) {
                for (int j = patternLength - 1; j >= 0; j--) { // compare characters from right to left
                    int ch = seq.charAt(i + j);
                    if (ch != src[j]) {
                        // shift by the larger of the bad-character and good-suffix values
                        i += Math.max(j + 1 - lastOcc[ch & 0x7F], optoSft[j]);
                        continue NEXT;
                    }
                }
                matcher.first = i;
                boolean ret = next.match(matcher, i + patternLength, seq);
                if (ret) {
                    matcher.first = i;
                    // group 0 by default (two indexes delimit a span, so two slots suffice)
                    matcher.groups[0] = matcher.first;
                    matcher.groups[1] = matcher.last;
                    return true;
                }
                i++;
            }
            matcher.hitEnd = true;
            return false;
        }

The substring comparison follows the Boyer-Moore algorithm.
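
The search loop can be run outside the JDK to watch the shifts it takes. The sketch below combines the table construction from optimize() with the loop from match(), drops the Matcher bookkeeping, and records each shift in a list; BnMSearchDemo, indexIn and shifts are illustrative names, and a non-empty pattern is assumed.

```java
import java.util.ArrayList;
import java.util.List;

final class BnMSearchDemo {
    final int[] src;
    final int[] lastOcc = new int[128];
    final int[] optoSft;
    final List<Integer> shifts = new ArrayList<>(); // recorded for illustration only

    BnMSearchDemo(String pattern) {
        src = pattern.chars().toArray();
        int n = src.length;
        optoSft = new int[n];
        for (int i = 0; i < n; i++) {
            lastOcc[src[i] & 0x7F] = i + 1;
        }
        NEXT:
        for (int i = n; i > 0; i--) {
            int j;
            for (j = n - 1; j >= i; j--) {
                if (src[j] == src[j - i]) {
                    optoSft[j - 1] = i;
                } else {
                    continue NEXT;
                }
            }
            while (j > 0) {
                optoSft[--j] = i;
            }
        }
        optoSft[n - 1] = 1;
    }

    // Returns the index of the first occurrence in seq, or -1.
    int indexIn(CharSequence seq) {
        shifts.clear();
        int n = src.length;
        int last = seq.length() - n;
        int i = 0;
        NEXT:
        while (i <= last) {
            for (int j = n - 1; j >= 0; j--) { // right to left
                int ch = seq.charAt(i + j);
                if (ch != src[j]) {
                    int shift = Math.max(j + 1 - lastOcc[ch & 0x7F], optoSft[j]);
                    shifts.add(shift);
                    i += shift;
                    continue NEXT;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        BnMSearchDemo demo = new BnMSearchDemo("EXAMPLE");
        System.out.println(demo.indexIn("HERE IS A SIMPLE EXAMPLE") + " " + demo.shifts);
    }
}
```

On the text of Example 1 the pattern is aligned at positions 0, 7, 9, 15 and 17, that is, shifts of 7, 2, 6 and 2. The shift of 6 is the one place where the good-suffix value beats the bad-character value (3).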

study

        boolean study(TreeInfo info) {
            info.minLength += buffer.length;
            info.maxValid = false;
            return next.study(info);
        }

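The Boyer-Moore algorithm

See this for reference.

The defining feature of the algorithm is that it compares from right to left, which lets the pattern advance by more than one character at a time. Two rules determine the shift, the bad-character rule and the good-suffix rule, and the larger of the two values is used.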

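Bad character

Comparison starts at the rightmost character of the pattern, against the text character at the same position. If they are equal, comparison continues to the left until either the whole pattern has been compared (a match) or an unequal character is found. That unequal character (the one in the text) is called the bad character. The shift depends on whether the pattern contains the bad character and, if so, where:

shift = position of the bad character (the pattern index of the mismatch) - its last occurrence in the pattern

If the bad character does not occur in the pattern, its last occurrence is taken to be -1.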
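
The formula translates almost directly into code. In the sketch below, BadCharShiftDemo and badCharShift are illustrative names; String.lastIndexOf conveniently returns -1 for an absent character, which matches the convention just stated.

```java
final class BadCharShiftDemo {
    // shift = (pattern index of the mismatch) - (last index of the bad character, or -1)
    static int badCharShift(String pattern, int mismatchIndex, char badChar) {
        return mismatchIndex - pattern.lastIndexOf(badChar);
    }

    public static void main(String[] args) {
        // 'S' does not occur in the pattern: jump past it.
        System.out.println(badCharShift("EXAMPLE", 6, 'S'));
        // 'P' occurs at index 4: line that 'P' up with the bad character.
        System.out.println(badCharShift("EXAMPLE", 6, 'P'));
        // The last 'E' lies to the right of the mismatch, so the value is negative.
        System.out.println(badCharShift("EXAMPLE", 2, 'E'));
    }
}
```

The last call shows why this rule cannot be used alone: the value can be zero or negative, and taking the maximum with the good-suffix shift keeps the pattern moving forward. In BnM the same quantity appears as j + 1 - lastOcc[ch & 0x7F], because lastOcc stores the last index plus one.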

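Good suffix

During the right-to-left comparison, the run of characters that matched is called the good suffix. Every suffix of the longest good suffix is also a good suffix, but a shorter one only counts if it also occurs at the head of the pattern. The formula is:

shift = position of the good suffix - its previous occurrence in the pattern

The position of a good suffix is that of its last character. If the good suffix does not occur again, its previous occurrence is taken to be -1.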

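Analysis

Both rules serve the same purpose: to advance by the largest possible step, so that matching is fast, without affecting correctness.

The bad-character rule is easy to understand. If the pattern does not contain the bad character, the pattern moves to just past it, which is a shift of the full pattern length when the mismatch is at the last position and is the largest shift possible. Any smaller shift would leave some pattern character aligned with the bad character again, and the comparison would fail again.

If the pattern does contain the bad character, the matching character in the pattern must be aligned with it; aligned with any other character, the comparison would fail again. If the pattern contains the character more than once, the rightmost occurrence is aligned with the bad character, so that no possible match is skipped and, if that alignment fails too, the pattern simply keeps moving right without ever backtracking to the left.

The good-suffix rule is just as simple. If no part of the good suffix occurs at the head of the pattern, the pattern can move by its full length; if it does, it is enough to align the two occurrences of the good suffix.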
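
The good-suffix shift can also be computed naively, straight from the alignment argument: try shifts 1, 2, ... and take the first one where the shifted pattern agrees with the good suffix wherever the two overlap. GoodSuffixShiftDemo is an illustrative, unoptimized sketch of that idea, not the table-based code used by BnM.

```java
final class GoodSuffixShiftDemo {
    // mismatchIndex is the pattern index where the comparison failed;
    // the good suffix is everything to the right of it.
    static int goodSuffixShift(String p, int mismatchIndex) {
        int m = p.length();
        if (mismatchIndex == m - 1) {
            return 1; // nothing matched yet, so there is no good suffix
        }
        for (int s = 1; s < m; s++) {
            boolean ok = true;
            for (int k = mismatchIndex + 1; k < m && ok; k++) {
                // After shifting by s, p[k - s] lands under the text matched by p[k].
                if (k - s >= 0 && p.charAt(k - s) != p.charAt(k)) {
                    ok = false;
                }
            }
            if (ok) {
                return s;
            }
        }
        return m; // no part of the good suffix recurs: move the full length
    }

    public static void main(String[] args) {
        // Good suffix "MPLE": only its last "E" recurs, at the head of the pattern.
        System.out.println(goodSuffixShift("EXAMPLE", 2));
        // Good suffix "AB" recurs at the head of "ABCAB".
        System.out.println(goodSuffixShift("ABCAB", 2));
    }
}
```

For "EXAMPLE" with the mismatch at index 2 the result is 6, which is the position of the good suffix (6) minus the previous occurrence of its last character (0).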

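Node chains

matches()

matchRoot -> Slice -> LastNode -> Node

The Slice and Node nodes were covered above. The Slice node compares from the first character and returns false if the input is too short or a character differs; otherwise it calls match on the next node, which here is LastNode.

The match method of Node always returns true.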

LastNode

    static class LastNode extends Node {
        boolean match(Matcher matcher, int i, CharSequence seq) {
            // With acceptMode ENDANCHOR a full match is required, so i must be
            // the end of the input (one past the last character)
            if (matcher.acceptMode == Matcher.ENDANCHOR && i != matcher.to)
                return false;
            matcher.last = i;
            matcher.groups[0] = matcher.first;
            matcher.groups[1] = matcher.last;
            return true;
        }
    }

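This is a generic node that performs the final check on the result. Note the acceptMode field, which distinguishes a full match from a partial match.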
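
acceptMode is internal, but its effect shows in the public Matcher methods: matches() must consume the whole input, whereas lookingAt() and find() accept a match that stops earlier. A small sketch, with AcceptModeDemo and matcher as illustrative names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AcceptModeDemo {
    // Hypothetical helper: a fresh matcher for a literal pattern.
    static Matcher matcher(String literal, String text) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(text);
    }

    public static void main(String[] args) {
        // Whole input equals the pattern: a full match.
        System.out.println(matcher("EXAMPLE", "EXAMPLE").matches());
        // Trailing text remains, so the end check rejects the match.
        System.out.println(matcher("EXAMPLE", "EXAMPLE HERE").matches());
        // lookingAt() starts at the beginning but does not require reaching the end.
        System.out.println(matcher("EXAMPLE", "EXAMPLE HERE").lookingAt());
        // find() may start anywhere; lookingAt() may not.
        System.out.println(matcher("EXAMPLE", "AN EXAMPLE").lookingAt());
        System.out.println(matcher("EXAMPLE", "AN EXAMPLE").find());
    }
}
```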

find()

root -> BnM -> LastNode -> Node

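As the BnM node shows, the match may start at any valid position, which is simply a substring search. Since acceptMode is not ENDANCHOR, LastNode does not need to check that i is at the end of the input.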
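
Because find() can start anywhere, calling it repeatedly walks through all occurrences, each search resuming where the previous match ended. The sketch below collects the start positions; FindAllDemo and starts are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    // Start index of every non-overlapping occurrence of the literal, left to right.
    static List<Integer> starts(String literal, String text) {
        Matcher m = Pattern.compile(literal, Pattern.LITERAL).matcher(text);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(starts("ABCD", "xxABCDyyABCD"));
        // Matches do not overlap: the second search resumes after the first match.
        System.out.println(starts("AAAA", "AAAAAA"));
    }
}
```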

All of these nodes were shown above.

Example 2

    public static void example() {
        String regex = "\\d+";
        String text = "0123456789";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        matcher.find();
    }