Translation of the documentation for pregexp.scm, a portable regular expression library for Scheme

Date: 2019-12-20



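pregexp.scm is used as the built-in regular expression engine by quite a few Scheme implementations. The regexp engine in Racket, for example, grew out of it, and even the documentation is much the same, so most of this article applies to Racket as well. Commendably, pregexp does not rely on syntax or features specific to any one implementation, so it is highly portable and runs on almost every implementation with only minor changes. That said, pregexp was written a long time ago, and the Racket implementation may include performance improvements or bug fixes that pregexp lacks.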

1. Introduction

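A regular expression (regexp) is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, the text string. The text string is treated as raw text, not as a pattern.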

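Most characters in a regexp match occurrences of themselves in the text string. Thus, "abc" matches any string that contains the three consecutive characters a, b, c.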

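In a regexp pattern, some characters act as metacharacters and some character sequences act as metasequences; that is, they stand for something other than their literal selves. For example, in the regexp "a.c", the characters a and c stand for themselves, but the metacharacter . matches any character (except newline). Therefore "a.c" matches any three-character sequence that starts with a and ends with c, such as "abc", "aac", "afc", "a*c", and so on.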

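To match a literal ., escape it by putting a backslash \ in front of it. The backslash is itself a metacharacter: it matches nothing, but turns the metacharacter that follows it into an ordinary character. For example, "a\\.c" matches "a.c". The backslash is doubled because inside a Scheme string the backslash is already the escape character, so a Scheme string needs two backslashes to contain one, just as in C. Another example of a metasequence is \t, a readable way of writing the tab character.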
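The difference between the metacharacter . and the escaped \. can be checked directly. This is a minimal sketch: it assumes pregexp.scm sits in the current directory so that the load call works, and the sample text strings are made up for illustration.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; An unescaped dot matches any character except newline.
(pregexp-match "a.c" "a*c")    ; => ("a*c")

;; An escaped dot matches only a literal dot. The backslash is doubled
;; because the Scheme string reader consumes one level of escaping.
(pregexp-match "a\\.c" "a.c")  ; => ("a.c")
(pregexp-match "a\\.c" "a*c")  ; => #f
```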

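We call a regexp written as a string a U-regexp, where U can be read as Unix-style or universal, because this notation is universally familiar. Our implementation uses an intermediate tree-like representation called an S-regexp, where S can stand for Scheme, symbolic, or S-expression. S-regexps are more verbose and harder to read, but they are easy for Scheme's recursive procedures to process.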

2. Regexp procedures

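pregexp.scm provides the following procedures: pregexp, pregexp-match-positions, pregexp-match, pregexp-split, pregexp-replace, pregexp-replace*, and pregexp-quote. Every name introduced by pregexp.scm has the prefix pregexp, so these names are unlikely to clash with other names in Scheme, including those of any regexp procedures the implementation itself provides.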

2.1 pregexp

pregexp takes a regexp pattern written as a string (a U-regexp) and returns an S-regexp.

(pregexp "c.r")
=> (:sub (:or (:seq #\c :any #\r)))

2.2 pregexp-match-positions

The procedure pregexp-match-positions takes a regexp and a text string. It returns a match if the regexp matches (some part of) the text string, and #f otherwise.

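The regexp may be either a Unix-style regexp string or a tree-like S-regexp. Internally, pregexp-match-positions first compiles a string regexp into an S-regexp and then performs the match. If a regexp is likely to be used several times, it is wise to convert it to an S-regexp explicitly with pregexp and keep the result in a variable, which saves the cost of recompiling it each time.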
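The advice about precompiling can be sketched as follows. The variable name needle-re and the second text string are made up for illustration, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Compile the U-regexp once and keep the resulting S-regexp.
(define needle-re (pregexp "needle"))

;; The compiled regexp can be passed wherever a regexp string is accepted.
(pregexp-match-positions needle-re "hay needle stack")  ; => ((4 . 10))
(pregexp-match needle-re "nothing to find here")        ; => #f
```

Each call reuses the S-regexp, so the pattern string is parsed only once.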

pregexp-match-positions returns #f if the match fails, or a list of index pairs if it succeeds.

(pregexp-match-positions "brain" "bird")
=> #f

(pregexp-match-positions "needle" "hay needle stack")
=> ((4 . 10))

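In the second example, the integers 4 and 10 delimit the matched substring: 4 is the starting (inclusive) index and 10 is the ending (exclusive) index, which is the usual convention for string indices.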

(substring "hay needle stack" 4 10)
=> "needle"

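Here the list returned by pregexp-match-positions contains a single index pair, which gives the position of the matched substring within the whole text string. When we discuss subpatterns later, we will see how a single match operation can yield a list of submatches.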

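pregexp-match-positions takes optional third and fourth arguments that specify the indices of the text string within which the matching should take place.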

(pregexp-match-positions "needle"
  "his hay needle stack -- my hay needle stack -- her hay needle stack"
  24 43)
=> ((31 . 37))

Note that the returned indices are still reckoned relative to the full text string.

2.3 pregexp-match

pregexp-match is called like pregexp-match-positions, but instead of index pairs it returns the matching substrings.

(pregexp-match "brain" "bird")
=> #f

(pregexp-match "needle" "hay needle stack")
=> ("needle")

pregexp-match also takes optional third and fourth arguments, with the same meaning as for pregexp-match-positions.
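As a small sketch of the optional range arguments, the call below reuses the text string and indices from the pregexp-match-positions example and restricts matching to positions 24 through 43. The load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Only the second "needle" lies inside the range 24..43.
(pregexp-match "needle"
  "his hay needle stack -- my hay needle stack -- her hay needle stack"
  24 43)
; => ("needle")

;; A range that contains no complete "needle" yields #f.
(pregexp-match "needle" "hay needle stack" 0 6)
; => #f
```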

2.4 pregexp-split

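The procedure pregexp-split takes two arguments, a regexp and a text string, and returns a list of substrings of the text string, with the substrings that match the regexp acting as delimiters.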

(pregexp-split ":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")
=> ("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")

(pregexp-split " " "pea soup")
=> ("pea" "soup")

If the first argument is the empty string, the result is a list of all the single-character substrings:

(pregexp-split "" "smithereens")
=> ("s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s")

To treat a run of one or more spaces as the delimiter, take care to use the regexp " +", not " *":

(pregexp-split " +" "split pea     soup")
=> ("split" "pea" "soup")

(pregexp-split " *" "split pea     soup")
=> ("s" "p" "l" "i" "t" "p" "e" "a" "s" "o" "u" "p")

2.5 pregexp-replace

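The procedure pregexp-replace replaces the matched portion of the text string with another string. The first argument is the regexp, the second the text string, and the third the insert string.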

(pregexp-replace "te" "liberte" "ty")
=> "liberty"

If there is no match, the text string is returned unchanged (identical in the sense of eq?, that is, the very same object).
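The no-match case can be sketched like this. The variable name text and the pattern "xyz" are made up for illustration, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

(define text "liberte")

;; "xyz" does not occur in the text, so nothing is replaced.
(pregexp-replace "xyz" text "ty")            ; => "liberte"

;; The result is the original string object, not a copy.
(eq? (pregexp-replace "xyz" text "ty") text) ; => #t
```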

2.6 pregexp-replace*

pregexp-replace* replaces all matches in the text string, not just the first:

(pregexp-replace* "te" "liberte egalite fraternite" "ty")
=> "liberty egality fratyrnity"

As with pregexp-replace, if there is no match, the original text string is returned unchanged.

2.7 pregexp-quote

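pregexp-quote takes an arbitrary string and returns a U-regexp (a string) that matches it exactly. In particular, characters in the input string that could serve as regexp metacharacters are escaped with a backslash, so that they safely match only themselves.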

(pregexp-quote "cons")
=> "cons"

(pregexp-quote "list?")
=> "list\\?"

pregexp-quote is useful when building a composite regexp from a mix of regexp strings and verbatim strings.
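A composite regexp of this kind might be built as in the sketch below. The helper name exact-re is hypothetical, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Hypothetical helper: a U-regexp that matches `verbatim` and nothing else.
;; The anchors are regexp syntax; the middle part is quoted literal text.
(define (exact-re verbatim)
  (string-append "^" (pregexp-quote verbatim) "$"))

(exact-re "list?")                        ; => "^list\\?$"
(pregexp-match (exact-re "list?") "list?") ; => ("list?")

;; Without quoting, ? acts as a quantifier on the preceding t,
;; so the literal question mark in the text is never matched.
(pregexp-match "^list?$" "list?")          ; => #f
```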

3. The regexp pattern language

This section gives a complete description of the regexp pattern language recognized by pregexp.

3.1 Basic assertions

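The assertions ^ and $ identify the beginning and the end of the text string respectively. They ensure that the regexp adjoining them matches at the beginning or the end of the text string. For example: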

(pregexp-match-positions "^contact" "first contact")
=> #f

The match fails because 'contact' does not occur at the beginning of the text string.

(pregexp-match-positions "laugh$" "laugh laugh laugh laugh")
=> ((18 . 23))

The regexp matches the last 'laugh'.

The metasequence \b asserts that a word boundary exists.

(pregexp-match-positions "yack\\b" "yackety yack")
=> ((8 . 12))

The 'yack' in 'yackety' does not end at a word boundary, so it is not matched. The second 'yack' does, and so it is matched.

The metasequence \B has the opposite effect: it asserts that a word boundary does not exist.

(pregexp-match-positions "an\\B" "an analysis")
=> ((3 . 5))

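To spell this out: the first 'an' is followed by a space, so it is not matched, whereas the 'an' at the start of 'analysis' is immediately followed by 'alysis' with no boundary in between, so it is matched.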

3.2 Characters and character classes

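Typically, a character in the regexp matches the same character in the text string. Sometimes it is necessary or convenient to use a regexp metasequence to refer to a single character. Thus the metasequences \n, \r, \t, and \. match the newline, return, tab, and period characters respectively.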

The metacharacter . matches any character other than newline.

(pregexp-match "p.t" "pet")
=> ("pet")

It also matches 'pat', 'pit', 'pot', 'put', and 'p8t', but not 'pfffft'.

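A character class matches any one character from a set of characters. A typical format for this is the bracketed character class [...], which matches any one character from the non-empty sequence of characters enclosed within the brackets. Thus "p[aeiou]t" matches 'pat', 'pet', 'pit', 'pot', and 'put', and nothing else.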

Inside the brackets, a hyphen - between two characters specifies the ASCII range between those characters. For example, "ta[b-dgn-p]" matches 'tab', 'tac', 'tad', 'tag', 'tan', 'tao', and 'tap'.

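An initial caret ^ after the left bracket inverts the set specified by the rest of the contents; that is, it specifies the set of characters other than those identified in the brackets. For example, "do[^g]" matches every three-character sequence starting with 'do' except 'dog'.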
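Ranges and negated classes can be tried out with a few calls. This is a minimal sketch with made-up text strings; the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; The class [b-dgn-p] contains b, c, d, g, n, o and p.
(pregexp-match "ta[b-dgn-p]" "tan")  ; => ("tan")
(pregexp-match "ta[b-dgn-p]" "tax")  ; => #f

;; [^g] matches any character except g, so "dog" is skipped
;; and the first match found is "dot".
(pregexp-match "do[^g]" "dog dot")   ; => ("dot")
```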

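Note that the metacharacter ^ inside brackets means something quite different from what it means outside. Most other metacharacters (., *, +, ?, etc.) cease to be metacharacters inside brackets, although you may still escape them for peace of mind. The hyphen - is a metacharacter only when it is inside brackets and is neither the first nor the last character there.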

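Bracketed character classes cannot contain other bracketed character classes (although they can contain certain other kinds of character classes, as shown below). Thus a left bracket [ inside a bracketed character class is not a metacharacter and can stand for itself. For example, "[a[b]" matches 'a', '[', and 'b'.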

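Furthermore, since an empty bracketed character class is disallowed, a right bracket ] immediately following the opening left bracket is not treated as a metacharacter either. For example, "[]ab]" matches ']', 'a', and 'b'.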
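The bracket rules above can be sketched with single-character text strings, which make it obvious which character was matched. The load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; A left bracket inside a class stands for itself.
(pregexp-match "[a[b]" "[")   ; => ("[")

;; A right bracket directly after the opening bracket is literal too.
(pregexp-match "[]ab]" "]")   ; => ("]")

;; A hyphen in last position is an ordinary character, not a range.
(pregexp-match "[a-]" "-")    ; => ("-")
```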

3.2.1 Some frequently used character classes

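Some standard character classes can be conveniently written as metasequences instead of explicit bracketed expressions. \d matches a digit ([0-9]); \s matches a whitespace character; and \w matches a character that could be part of a "word". Following regexp custom, we take "word" characters to be [A-Za-z0-9_], that is, the letters, digits, and underscore that can appear in a C identifier. This may be too restrictive compared with what a Scheme programmer would consider a word, since Lisp and Scheme allow a far wider range of characters in identifiers.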

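The upper-case versions of these metasequences stand for the inversions of the corresponding character classes: \D matches a non-digit, \S a non-whitespace character, and \W a non-"word" character.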

Remember to double the backslash when putting these metasequences in a Scheme string:

(pregexp-match "\\d\\d"
  "0 dear, 1 have 2 read catch 22 before 9")
=> ("22")

These character classes can be used inside a bracketed expression. For example, "[a-z\\d]" matches a lower-case letter or a digit.
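The inverted classes and the use of a metasequence inside brackets can be sketched as follows. The text strings are made up for illustration, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; \D+ matches the first run of non-digits.
(pregexp-match "\\D+" "123abc456")            ; => ("abc")

;; \S+ matches the first run of non-whitespace characters.
(pregexp-match "\\S+" "  hello world")        ; => ("hello")

;; \d inside brackets: lower-case letters or digits.
;; Matching is case-sensitive, so "ABC" is skipped.
(pregexp-match "[a-z\\d]+" "ABC abc123 DEF")  ; => ("abc123")
```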

3.2.2 POSIX character classes

A POSIX character class is a special metasequence of the form [:...:] that can be used only inside a bracketed expression. The POSIX classes supported are:

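[:alnum:]       ;; letters and digits
[:alpha:]       ;; letters
[:algor:]       ;; the letters 'c', 'h', 'a' and 'd'
[:ascii:]       ;; 7-bit ASCII characters
[:blank:]       ;; whitespace with width, i.e. space and tab (line breaks are not included)
[:cntrl:]       ;; control characters, i.e. those with ASCII code less than 32
[:digit:]       ;; digits, same as \d
[:graph:]       ;; visible characters, i.e. those that use ink
[:lower:]       ;; lower-case letters
[:print:]       ;; visible characters plus whitespace with width
[:space:]       ;; whitespace, same as \s
[:upper:]       ;; upper-case letters
[:word:]        ;; letters, digits, and underscore, same as \w
[:xdigit:]      ;; hexadecimal digits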

For example, the regexp "[[:alpha:]_]" matches a letter or an underscore:

(pregexp-match "[[:alpha:]_]" "--x--")
=> ("x")

(pregexp-match "[[:alpha:]_]" "--_--")
=> ("_")

(pregexp-match "[[:alpha:]_]" "--:--")
=> #f

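The POSIX class notation is valid only inside a bracketed expression. When it is not, as in "[:alpha:]", it is not read as the letter class. By the rules given earlier, it is an ordinary bracketed class that matches only the characters ':', 'a', 'l', 'p', and 'h'.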

(pregexp-match "[:alpha:]" "--a--")
=> ("a")

(pregexp-match "[:alpha:]" "--_--")
=> #f

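Placing a caret ^ immediately after the [: gives the inversion of a POSIX character class. Thus [:^alpha:] is the class containing all characters except the letters.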
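An inverted POSIX class still has to sit inside a bracketed expression. A minimal sketch, with a made-up text string and assuming pregexp.scm is in the current directory:

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; [[:^alpha:]]+ matches the first run of non-letters.
(pregexp-match "[[:^alpha:]]+" "abc123def")  ; => ("123")

;; For comparison, the non-inverted class matches the first run of letters.
(pregexp-match "[[:alpha:]]+" "abc123def")   ; => ("abc")
```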

3.3 Quantifiers

The quantifiers *, +, and ? match, respectively, zero or more, one or more, and zero or one instances of the preceding subpattern.

(pregexp-match-positions "c[ad]*r" "cadaddadddr")
=> ((0 . 11))
(pregexp-match-positions "c[ad]*r" "cr")
=> ((0 . 2))

(pregexp-match-positions "c[ad]+r" "cadaddadddr")
=> ((0 . 11))
(pregexp-match-positions "c[ad]+r" "cr")
=> #f

(pregexp-match-positions "c[ad]?r" "cadaddadddr")
=> #f
(pregexp-match-positions "c[ad]?r" "cr")
=> ((0 . 2))
(pregexp-match-positions "c[ad]?r" "car")
=> ((0 . 3))

3.3.1 Numeric quantifiers

You can use braces to specify much finer-tuned quantification than is possible with *, +, and ?.

The quantifier {m} matches exactly m instances of the preceding subpattern. m must be a nonnegative integer.

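The quantifier {m,n} matches at least m and at most n instances. m and n must be nonnegative integers with m <= n. Either or both may be omitted, in which case m defaults to 0 and n to infinity.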

Clearly, + and ? are abbreviations for {1,} and {0,1} respectively, and * abbreviates {,}, which is the same as {0,}.

(pregexp-match "[aeiou]{3}" "vacuous")
=> ("uou")

(pregexp-match "[aeiou]{3}" "evolve")
=> #f

(pregexp-match "[aeiou]{2,3}" "evolve")
=> #f

(pregexp-match "[aeiou]{2,3}" "zeugma")
=> ("eu")

3.3.2 Non-greedy quantifiers

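The quantifiers described above are greedy; that is, they match the maximal number of instances that still leads to an overall match.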

(pregexp-match "<.*>" "<tag1> <tag2> <tag3>")
=> ("<tag1> <tag2> <tag3>")

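To make these quantifiers non-greedy, append a ? to them. Non-greedy quantifiers match the minimal number of instances needed for an overall match.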

(pregexp-match "<.*?>" "<tag1> <tag2> <tag3>")
=> ("<tag1>")

The non-greedy quantifiers are, respectively, *?, +?, ??, {m}?, and {m,n}?. Note the two different uses of the metacharacter ?.
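A few of the other non-greedy forms, shown next to their greedy counterparts. This is a minimal sketch with made-up text strings; the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; Greedy + takes every a; non-greedy +? settles for one.
(pregexp-match "a+" "aaa")          ; => ("aaa")
(pregexp-match "a+?" "aaa")         ; => ("a")

;; {2,3} prefers three instances; {2,3}? prefers two.
(pregexp-match "a{2,3}" "aaaa")     ; => ("aaa")
(pregexp-match "a{2,3}?" "aaaa")    ; => ("aa")

;; +? still has to match at least one character between the brackets.
(pregexp-match "<.+?>" "<tag1> <tag2>")  ; => ("<tag1>")
```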

3.4 Clusters

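A cluster is an expression enclosed in parentheses, (...). It identifies the enclosed subpattern as a single regexp entity, and it causes the matcher to capture the submatch: the portion of the text string that matches the subpattern is appended to the overall match. Roughly speaking, the overall match is what you would get by pretending the parentheses were absent (this description is not accurate when a quantifier follows a subpattern). After the overall match, each parenthesized subpattern contributes its own match, and these submatches are appended to the result after the overall match.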

(pregexp-match "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1" "1970")
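The same clusters can be used with pregexp-match-positions, which then returns one index pair for the overall match followed by one per submatch. A minimal sketch, assuming pregexp.scm is in the current directory:

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; The first pair covers the whole match; the rest cover the three clusters.
(pregexp-match-positions "([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
; => ((0 . 11) (0 . 3) (4 . 5) (7 . 11))
```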

Clustering also causes a following quantifier to treat the entire enclosed subpattern as a single entity.

(pregexp-match "(poo )*" "poo poo platter")
=> ("poo poo " "poo ")

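The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.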

(pregexp-match "([a-z ]+;)*" "lather; rinse; repeat;")
=> ("lather; rinse; repeat;" " repeat;")

Here the quantified subpattern matches three times, but only the last submatch is returned.

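It is also possible for a quantified subpattern to fail to match even though the overall pattern matches. In that case, the failing submatch is represented by #f.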

(define date-re
  ;match `month year' or `month day, year'.
  ;subpattern matches day, if present
  (pregexp "([a-z]+) +([0-9]+,)? *([0-9]+)"))

(pregexp-match date-re "jan 1, 1970")
=> ("jan 1, 1970" "jan" "1," "1970")

(pregexp-match date-re "jan 1970")
=> ("jan 1970" "jan" #f "1970")

3.4.1 Backreferences

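Submatches can be used in the insert string argument of the procedures pregexp-replace and pregexp-replace*. The insert string can use \n as a backreference to the nth submatch, that is, the substring that matched the nth subpattern. \0 refers to the entire match and can also be written \&.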

(pregexp-replace "_(.+?)_"
  "the _nina_, the _pinta_, and the _santa maria_"
  "*\\1*")
=> "the *nina*, the _pinta_, and the _santa maria_"

(pregexp-replace* "_(.+?)_"
  "the _nina_, the _pinta_, and the _santa maria_"
  "*\\1*")
=> "the *nina*, the *pinta*, and the *santa maria*"

;recall: \S stands for non-whitespace character

(pregexp-replace "(\\S+) (\\S+) (\\S+)"
  "eat to live"
  "\\3 \\2 \\1")
=> "live to eat"

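Use \\ in the insert string to specify a literal backslash. In addition, \$ stands for the empty string and is useful for separating a backreference \n from a digit that immediately follows it.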
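The \$ separator might be used as in the sketch below. The text string is made up for illustration, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; The goal is to append a literal 0 to the first number.
;; Written as "\\10" the insert string would be read as backreference 10,
;; so \$ is placed between the backreference \1 and the digit.
(pregexp-replace "(\\d+)" "42 apples" "\\1\\$0")
; => "420 apples"
```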

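Backreferences can also be used within the regexp pattern itself to refer back to a subpattern that has already matched. \n stands for an exact repeat of the nth submatch.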

(pregexp-match "([a-z]+) and \\1"
  "billions and billions")
=> ("billions and billions" "billions")

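Note that a backreference is not simply a repeat of the earlier subpattern. Rather, it is a repeat of the particular substring already matched by that subpattern.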

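In the example above, the backreference can match only 'billions'. It will not match 'millions', even though the subpattern it refers back to, ([a-z]+), would have had no trouble doing so.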

(pregexp-match "([a-z]+) and \\1"
  "billions and millions")
=> #f

The following corrects doubled words:

(pregexp-replace* "(\\S+) \\1"
  "now is the the time for all good men to to come to the aid of of the party"
  "\\1")
=> "now is the time for all good men to come to the aid of the party"

The following marks all immediately repeating patterns in a number string:

(pregexp-replace* "(\\d+)\\1"
  "123340983242432420980980234"
  "{\\1,\\1}")
=> "12{3,3}40983{24,24}3242{098,098}0234"

3.4.2 Non-capturing clusters

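It is often necessary to specify a cluster (typically for quantification) without triggering the capture of submatch information. Such clusters are called non-capturing. In this case, use (?: instead of ( as the cluster opener. In the following example, the non-capturing cluster eliminates the "directory" portion of a given pathname, and the capturing cluster identifies the file name.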

(pregexp-match "^(?:[a-z]*/)*([a-z]+)$"
  "/usr/local/bin/mzscheme")
=> ("/usr/local/bin/mzscheme" "mzscheme")

3.4.3 Cloisters

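The location between the ? and the : of a non-capturing cluster is called a cloister. You can put modifiers there that cause the enclosed subpattern to be treated specially. The modifier i makes the subpattern match case-insensitively: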

(pregexp-match "(?i:hearth)" "HeartH")
=> ("HeartH")

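The modifier x makes the subpattern match space-insensitively; that is, spaces and comments within the subpattern are ignored. Comments begin with a semicolon and extend to the end of the line. To include a literal space or semicolon in a space-insensitive subpattern, escape it with a backslash.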

(pregexp-match "(?x: a   lot)" "alot")
=> ("alot")

(pregexp-match "(?x: a  \\  lot)" "a lot")
=> ("a lot")

(pregexp-match "(?x:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "a man; a plan; a canal")
=> ("a man; a plan; a canal")

The global variable *pregexp-comment-char* holds the comment character (#\;). For Perl-style comments, set it as follows:

(set! *pregexp-comment-char* #\#)

You can put more than one modifier in the cloister:

(pregexp-match "(?ix:
   a \\ man  \\; \\   ; ignore
   a \\ plan \\; \\   ; me
   a \\ canal         ; completely
   )"
 "A Man; a Plan; a Canal")
=> ("A Man; a Plan; a Canal")

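A minus sign - before a modifier inverts its meaning. Thus you can use -i and -x in a subcluster to overturn the insensitivities caused by an enclosing cluster.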

(pregexp-match "(?i:the (?-i:TeX)book)"
  "The TeXbook")
=> ("The TeXbook")

This regexp allows any casing for 'the' and 'book' but insists that 'TeX' not be differently cased.

3.5 Alternation

You can specify a list of alternate subpatterns by separating them by |. The | separates subpatterns in the nearest enclosing cluster (or in the entire pattern string if there are no enclosing parens).

(pregexp-match "f(ee|i|o|um)" "a small, final fee")
=> ("fi" "i")

(pregexp-replace* "([yi])s(e[sdr]?|ing|ation)"
   "it is energising to analyse an organisation
   pulsing with noisy organisms"
   "\\1z\\2")
=> "it is energizing to analyze an organization
   pulsing with noisy organisms"

Note again that if you wish to use clustering merely to specify a list of alternate subpatterns, but do not want the submatch, use (?: instead of (.

(pregexp-match "f(?:ee|i|o|um)" "fun for all")
=> ("fo")

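An important point about alternation is that the leftmost matching alternate is always picked, regardless of its length. Thus, if one alternate is a prefix of a later alternate, the latter may never get a chance to match.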

(pregexp-match "call|call-with-current-continuation"
  "call-with-current-continuation")
=> ("call")

So, to give the longer alternate a chance to match, place it before the shorter one:

(pregexp-match "call-with-current-continuation|call"
  "call-with-current-continuation")
=> ("call-with-current-continuation")

In any case, an overall match for the entire regexp is always preferred to an overall nonmatch. In the following, the longer alternate still wins, because its preferred shorter prefix fails to yield an overall match.

(pregexp-match "(?:call|call-with-current-continuation) constrained"
  "call-with-current-continuation constrained")
=> ("call-with-current-continuation constrained")

3.6 Backtracking

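We have already seen that greedy quantifiers match the maximal number of times, but the overriding priority is that the overall match succeed. Consider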

(pregexp-match "a*a" "aaaa")

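The regexp consists of two subregexps, a* followed by a. Even though * is a greedy quantifier, the subregexp a* cannot be allowed to match all four a's in "aaaa". It may match only the first three, leaving the last a for the second subregexp. This ensures that the full regexp matches successfully.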

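The regexp matcher accomplishes this through a process called backtracking. The matcher tentatively allows the greedy quantifier to match all four a's, but when it becomes clear that this would make the overall match fail, it backtracks to a less greedy match of three a's. If even this fails, as in the call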

(pregexp-match "a*aa" "aaaa")

the matcher backtracks even further. Overall failure is conceded only when all possible backtracking has been tried without success.

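Backtracking is not restricted to greedy quantifiers. Non-greedy quantifiers match as few instances as possible and progressively backtrack to more and more instances in order to attain an overall match. There is backtracking in alternation too: when an alternate on the left leads to an overall failure, the alternates to its right are tried.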
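Wrapping the two subregexps in clusters makes the effect of backtracking visible in the submatches. This is a minimal sketch; the clustered patterns are variations on the examples above, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; The greedy a* gives back one a so that the final a can match.
(pregexp-match "(a*)(a)" "aaaa")     ; => ("aaaa" "aaa" "a")

;; With two trailing a's required, a* has to give back two.
(pregexp-match "(a*)(aa)" "aaaa")    ; => ("aaaa" "aa" "aa")

;; The non-greedy a*? starts with nothing and grows until $ can match.
(pregexp-match "^(a*?)(a)$" "aaaa")  ; => ("aaaa" "aaa" "a")
```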

3.6.1 Disabling backtracking

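Sometimes it is more efficient to disable backtracking. For example, we may wish to commit to a choice, or we may know that trying alternatives is fruitless. A non-backtracking regexp is enclosed in (?>...).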

(pregexp-match "(?>a+)." "aaaa")
=> #f

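In this call, the subregexp ?>a+ greedily matches all four a's and is denied the opportunity to backtrack, so the overall match fails. The effect of this regexp is therefore to match one or more a's followed by something that is definitely not an a.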
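For contrast, here is the same pattern with backtracking allowed, and the non-backtracking pattern on a text string where it can succeed. This is a minimal sketch; the text string "aaaab" is made up for illustration, and the load call assumes pregexp.scm is in the current directory.

```scheme
;; Assumption: pregexp.scm is in the current directory.
(load "pregexp.scm")

;; With backtracking, a+ gives back one a so that . can match it.
(pregexp-match "a+." "aaaa")       ; => ("aaaa")

;; Without backtracking, the match succeeds only if a non-a follows the a's.
(pregexp-match "(?>a+)." "aaaab")  ; => ("aaaab")
```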

3.7 Looking ahead and behind

You can have assertions in your pattern that look ahead or behind to ensure that a subpattern does or does not occur. These "look around" assertions are specified by putting the subpattern checked for in a cluster whose leading characters are: ?= (for positive lookahead), ?! (negative lookahead), ?<= (positive lookbehind), ?<! (negative lookbehind). Note that the subpattern in the assertion does not generate a match in the final result. It merely allows or disallows the rest of the match.

3.7.1 Lookahead

Positive lookahead (?=) peeks ahead to ensure that its subpattern could match.

(pregexp-match-positions "grey(?=hound)"
  "i left my grey socks at the greyhound")
=> ((28 . 32))

The regexp "grey(?=hound)" matches grey, but only if it is followed by hound. Thus, the first grey in the text string is not matched.

Negative lookahead (?!) peeks ahead to ensure that its subpattern could not possibly match.

(pregexp-match-positions "grey(?!hound)"
  "the gray greyhound ate the grey socks")
=> ((27 . 31))

The regexp "grey(?!hound)" matches grey, but only if it is not followed by hound. Thus the grey just before socks is matched.

3.7.2 Lookbehind

Positive lookbehind (?<=) checks that its subpattern could match immediately to the left of the current position in the text string.

(pregexp-match-positions "(?<=grey)hound"
  "the hound in the picture is not a greyhound")
=> ((38 . 43))

The regexp (?<=grey)hound matches hound, but only if it is preceded by grey.

Negative lookbehind (?<!) checks that its subpattern could not possibly match immediately to the left.

(pregexp-match-positions "(?<!grey)hound"
  "the greyhound in the picture is not a hound")
=> ((38 . 43))

The regexp (?<!grey)hound matches hound, but only if it is not preceded by grey.

Lookaheads and lookbehinds can be convenient when they are not confusing.
