boost sp 2: regex syntax specification (regular expressions)

Date: 2019-11-12


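This section covers the regular expression syntax used by the boost.regex library. It is a programmer's guide; the actual syntax is determined by the options the regular expression is compiled with. (Translator's note: that is, the flag argument of the regex class constructor.)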

Literals

All characters are literals except the following:

".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\".

To use one of these characters as a literal, precede it with "\". A literal is a character that matches itself, or matches the result of traits_type::translate(), where traits_type is the traits template parameter of the basic_regex class.
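The escaping rule above can be sketched as a small helper. This is only an illustration: escape_literal and matches_literally are hypothetical names, and std::regex is used as a stand-in for boost::regex so that the sketch builds with the standard library alone.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: prefix every special character listed above with '\'
// so that the whole string is treated as literal text.
std::string escape_literal(const std::string& text) {
    static const std::string specials = ".|*?+(){}[]^$\\";
    std::string out;
    for (char c : text) {
        if (specials.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// True when `text` is exactly the literal string `needle`.
// Assumption: std::regex (default grammar) stands in for boost::regex.
bool matches_literally(const std::string& needle, const std::string& text) {
    return std::regex_match(text, std::regex(escape_literal(needle)));
}

// Usage:
//   assert(matches_literally("a.b", "a.b"));
//   assert(!matches_literally("a.b", "axb"));  // the escaped dot is no longer a wildcard
```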

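Wildcard: the dot "."

The dot "." matches any single character. When the match_not_dot_null option is passed to the matching algorithm, the dot does not match the null character. When the match_not_dot_newline option is passed to the matching algorithm, the dot does not match a newline character.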
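A minimal sketch of the wildcard, using std::regex in place of boost::regex (an assumption made so the example needs no external library). In std::regex's default grammar the dot never matches a newline, which corresponds to the behavior described above for match_not_dot_newline; the Boost flags themselves are not exercised here. dot_match is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when `text` is "a", then exactly one arbitrary character, then "c".
// Assumption: std::regex stands in for boost::regex.
bool dot_match(const std::string& text) {
    static const std::regex re("a.c");
    return std::regex_match(text, re);
}

// Usage:
//   assert(dot_match("abc"));
//   assert(!dot_match("ac"));    // the dot must consume one character
//   assert(!dot_match("abbc"));  // and only one
```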

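Repeats

A repeat is an expression (translator's note: a regular expression) that is repeated an arbitrary number of times.

An expression followed by "*" can be repeated any number of times, including zero.

An expression followed by "+" can be repeated any number of times, but at least once.

If the expression is compiled with the regex_constants::bk_plus_qm flag (translator's note: a flag argument of the regex class constructor), then "+" is an ordinary character (that is, a literal "+") and "\+" denotes one or more repeats.

An expression followed by "?" can be repeated zero or one times.

If the expression is compiled with the regex_constants::bk_plus_qm flag, then "?" is an ordinary character and "\?" denotes zero or one repeats.

To specify the minimum and maximum number of repeats explicitly, use the bounds operator "{}": "a{2}" is the letter "a" repeated exactly twice, "a{2,4}" is "a" repeated 2 to 4 times, and "a{2,}" is "a" repeated at least twice with no upper limit. Note that there must be no white space inside the {}, and that there is no upper limit on the values of the lower and upper bounds.

If the expression is compiled with the regex_constants::bk_braces flag, then "{" and "}" are ordinary characters and "\{" and "\}" are the bounds operators.

Every repeat applies to the shortest possible previous sub-expression: a single character, a character set, or a sub-expression grouped with "()", for example.

Examples:

"ba*" matches "b", "ba", "baaa" and so on.

"ba+" matches "ba", "baaaa" and so on, but not "b".

"ba?" matches "b" or "ba".

"ba{2,4}" matches "baa", "baaa" and "baaaa".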
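The four examples above can be checked with a short sketch. std::regex is used here as a stand-in for boost::regex (an assumption so that the code builds with the standard library alone); full_match is a hypothetical helper name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the whole of `text` matches `pattern`.
// Assumption: std::regex stands in for boost::regex.
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Usage:
//   assert(full_match("ba*", "b"));          // zero repeats are allowed
//   assert(!full_match("ba+", "b"));         // "+" needs at least one "a"
//   assert(full_match("ba{2,4}", "baaa"));   // within the bounds
//   assert(!full_match("ba{2,4}", "baaaaa")); // above the upper bound
```

Note that in each pattern the repeat applies only to the "a", the shortest possible previous sub-expression, and not to "ba".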

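Non-greedy repeats

Whenever the "extended" regular expression syntax is in use (the default), a repeat can be made non-greedy by appending "?" to it. A non-greedy repeat matches the shortest possible string.

For example, to match a pair of HTML tags, use:

"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>"

Here $1 contains the text between the tags, and that text is the shortest string that matches.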
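The difference between the greedy and non-greedy forms shows up as soon as the input contains two tag pairs. The sketch below substitutes the tag name "b" for tagname and uses std::regex as a stand-in for boost::regex; both choices, and the names first_capture, kLazy and kGreedy, are assumptions of this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The tag pattern from the text with "b" as the tag name, in its
// non-greedy form and, for comparison, with a greedy (.*) group.
const std::string kLazy   = R"(<\s*b[^>]*>(.*?)<\s*/b\s*>)";
const std::string kGreedy = R"(<\s*b[^>]*>(.*)<\s*/b\s*>)";

// Returns $1 of the first match of `pattern` in `text`, or "" if none.
// Assumption: std::regex stands in for boost::regex.
std::string first_capture(const std::string& pattern, const std::string& text) {
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)) && m.size() > 1) {
        return m[1].str();
    }
    return "";
}

// Usage, with text = "<b>one</b> and <b>two</b>":
//   first_capture(kLazy, text)   == "one"
//   first_capture(kGreedy, text) == "one</b> and <b>two"
```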

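Parenthesis

Parentheses serve two purposes: to group items together into a sub-expression, and to mark what generated the match. For example, the expression "(ab)*" matches all of the string "ababab". The matching algorithms regex_match and regex_search each take a match_results object that reports what caused the match; on return, match_results contains the match for the whole expression and for each sub-expression. In the example above, match_results[1] contains a pair of iterators denoting the final "ab" of the match. Sub-expressions may also match the null string. If a sub-expression takes no part in a match, for example when it belongs to an alternative that was not taken, then both iterators returned for it point to the end of the input string and its matched member is false. Sub-expressions are indexed from left to right starting from 1; sub-expression 0 is the whole expression. (Translator's note: "expression" and "sub-expression" above always refer to regular expressions.)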
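The reporting described above can be sketched as follows. std::regex and std::smatch stand in for boost::regex and match_results (an assumption so that the code builds with the standard library alone); GroupInfo and group_after_match are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// What one sub-expression reported after a successful full match.
struct GroupInfo {
    bool matched;            // false if the group took no part in the match
    std::string text;        // text the group matched
    std::ptrdiff_t position; // offset of that text in the input, or -1
};

// Matches the whole of `text` against `pattern` and reports group `index`
// (0 is the whole expression, 1 the first marked sub-expression, ...).
// Assumption: std::regex stands in for boost::regex.
GroupInfo group_after_match(const std::string& pattern,
                            const std::string& text, std::size_t index) {
    std::smatch m;
    if (!std::regex_match(text, m, std::regex(pattern)) || index >= m.size()) {
        return {false, "", -1};
    }
    if (!m[index].matched) return {false, "", -1};
    return {true, m[index].str(), m.position(index)};
}

// Usage:
//   group_after_match("(ab)*", "ababab", 1)  -> text "ab", position 4 (the final "ab")
//   group_after_match("(a)|(b)", "b", 1)     -> matched == false (alternative not taken)
```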

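Non-Marking Parenthesis

Sometimes you need to group items into a sub-expression without generating a marked sub-expression (translator's note: the sub-expressions reported in match_results are the marked ones). In that case use the non-marking parenthesis (?:expression). For example, the following expression creates no marked sub-expressions:

"(?:abc)*"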
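One way to see the effect is to count the marked sub-expressions of a compiled pattern. The sketch uses std::regex::mark_count() as a stand-in for the Boost equivalent (an assumption of this example); marked_groups is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Number of marked sub-expressions in `pattern`.
// Assumption: std::regex stands in for boost::regex.
unsigned marked_groups(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Usage:
//   assert(marked_groups("(abc)*") == 1);
//   assert(marked_groups("(?:abc)*") == 0);  // grouping only, nothing is marked
```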

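Forward Lookahead Asserts

There are two forms: one for positive forward lookahead asserts and one for negative lookahead asserts:

"(?=abc)" matches zero characters, and only if they are followed by the expression "abc".

"(?!abc)" matches zero characters, and only if they are not followed by the expression "abc".

(Translator's note: an assert consumes no input. For example, "(?=abc)abcdef" matches "abcdef": the leading "(?=abc)" does not consume "abc", it only checks that the text starts with "abc", so "abc" still has to be written out after it.)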
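That an assert consumes nothing can be seen from the length of the match. The sketch below uses std::regex as a stand-in for boost::regex (an assumption so that it builds with the standard library alone); first_match_length is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Length of the first match of `pattern` in `text`, or -1 if there is none.
// Assumption: std::regex stands in for boost::regex.
std::ptrdiff_t first_match_length(const std::string& pattern,
                                  const std::string& text) {
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern))) return -1;
    return m.length(0);
}

// Usage:
//   first_match_length("(?=abc)", "abcdef")       == 0   // succeeds, consumes nothing
//   first_match_length("(?=abc)abcdef", "abcdef") == 6   // "abc" is matched by what follows
//   first_match_length("^(?!abc)[a-z]+", "abcdef") == -1 // negative assert rejects "abc..."
```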

Independent sub-expressions

"(?>expression)" matches "expression" as an independent atom: the algorithm will not backtrack into it if a failure occurs later in the expression.

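Alternatives

Alternatives occur when the expression can match either one sub-expression or another. Alternatives are separated by "|", by "\|" if the regex_constants::bk_vbar flag is set, or by a newline character if the regex_constants::newline_alt flag is set. Each alternative is the largest possible previous sub-expression, which is the opposite of how the repetition operators behave.

Examples:

"a(b|c)" matches "ab" or "ac".

"abc|def" matches "abc" or "def".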
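Because each alternative extends as far as possible, "abc|def" chooses between "abc" and "def" rather than between "c" and "d". A short sketch of this, using std::regex as a stand-in for boost::regex (an assumption of this example) and a hypothetical helper matches_whole:

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the whole of `text` matches `pattern`.
// Assumption: std::regex stands in for boost::regex.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Usage:
//   assert(matches_whole("a(b|c)", "ac"));
//   assert(!matches_whole("abc|def", "abdef"));   // not read as ab(c|d)ef
//   assert(matches_whole("ab(c|d)ef", "abdef"));  // parentheses narrow the choice
```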

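Sets

A set is a set of characters that can match any single character that is a member of the set. Sets are delimited by "[" and "]" and can contain literals, character ranges, character classes, collating elements and equivalence classes. A set declaration that starts with "^" contains the complement of the elements that follow.

Examples:

Character literals:

"[abc]" matches "a", "b" or "c".

"[^abc]" matches any character other than "a", "b" or "c".

Character ranges:

"[a-z]" matches any character in the range "a" to "z".

"[^A-Z]" matches any character outside the range "A" to "Z".

Note that character ranges are locale dependent if the regex_constants::collate flag is set: they match any character that collates between the endpoints of the range, and they follow ASCII rules only when the default "C" locale is in effect. For example, if the library is compiled with the Win32 localization model, [a-z] matches the ASCII characters a-z and also 'A', 'B' and so on, but not 'Z', which collates just after 'z'. Locale-specific behavior is disabled by default, and ranges then collate according to character code.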

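Character classes are denoted with the syntax "[:classname:]" inside a set declaration; for example, "[[:space:]]" is the set of all whitespace characters. Character classes are only available if the regex_constants::char_classes flag is set. The available character classes are:

alnum    Any alphanumeric character.
alpha    Any alphabetical character a-z and A-Z. Other characters may also be included depending on the locale.
blank    Any blank character, either a space or a tab.
cntrl    Any control character.
digit    Any digit 0-9.
graph    Any graphical character.
lower    Any lower case character a-z. Other characters may also be included depending on the locale.
print    Any printable character.
punct    Any punctuation character.
space    Any whitespace character.
upper    Any upper case character A-Z. Other characters may also be included depending on the locale.
xdigit   Any hexadecimal digit character: 0-9, a-f and A-F.
word     Any word character: all alphanumeric characters plus the underscore.
Unicode  Any character whose code is greater than 255; applies to wide character traits only.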
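Sets and character classes can be tried out by counting how many characters of a string fall inside a set. The sketch uses std::regex as a stand-in for boost::regex (an assumption so that it builds with the standard library alone) and only POSIX class names that std::regex also accepts; the Boost-specific "word" and "Unicode" classes are not exercised. count_in_set is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts the characters of `text` that are members of the set `set_pattern`,
// for example "[abc]" or "[[:digit:]]".
// Assumption: std::regex stands in for boost::regex.
std::size_t count_in_set(const std::string& set_pattern, const std::string& text) {
    const std::regex re(set_pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}

// Usage:
//   count_in_set("[^abc]", "abcd")        == 1  // only "d" is outside the set
//   count_in_set("[[:digit:]]", "a1 b22") == 3
```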

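When the regex_constants::escape_in_lists flag is set, you can use shortcuts for some of the character classes:

\w in place of [:word:]

\s in place of [:space:]

\d in place of [:digit:]

\l in place of [:lower:]

\u in place of [:upper:]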

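Collating elements take the general form [.tagname.] inside a set declaration, where tagname is either a single character or the name of a collating element. For example, [[.a.]] is equivalent to [a], and [[.comma.]] is equivalent to [,]. The library supports all the standard POSIX collating element names and, in addition, the following names: "ae", "ch", "ll", "ss", "nj", "dz", "lj", each in lower case, upper case or title case. Multi-character collating elements can make a set match more than one character; for example, [[.ae.]] matches two characters, whereas [^[.ae.]] matches only one character.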

Equivalence classes take the general form [=tagname=] inside a set declaration, where tagname is either a single character or the name of a collating element, and match any character that is a member of the same primary equivalence class as the collating element [.tagname.]. An equivalence class is a set of characters that collate the same; a primary equivalence class is a set of characters whose primary sort keys are all the same. (For example, strings are typically collated by character, then by accent, and then by case; the primary sort key then relates to the character, the secondary to the accentuation, and the tertiary to the case.) If there is no equivalence class corresponding to tagname, then [=tagname=] is exactly the same as [.tagname.]. Unfortunately there is no locale-independent method of obtaining the primary sort key for a character, except under Win32. For other operating systems the library will "guess" the primary sort key from the full sort key (obtained from strxfrm), so equivalence classes are probably best considered broken under any operating system other than Win32.

To include a literal "-" in a set declaration, make it the first character after the opening "[" or "[^", the endpoint of a range, or a collating element, or, if the regex_constants::escape_in_lists flag is set, precede it with an escape character as in "[\-]". To include a literal "[", "]" or "^" in a set, make it the endpoint of a range or a collating element, or precede it with an escape character if the regex_constants::escape_in_lists flag is set.

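Line anchors

An anchor matches the null string at the start or end of a line: "^" matches the null string at the start of a line, and "$" matches the null string at the end of a line.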

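Back references

A back reference is a reference to a previous sub-expression that has already been matched. The reference is to what the sub-expression matched, not to the sub-expression itself. A back reference consists of the escape character "\" followed by a digit "1" to "9": "\1" refers to the first sub-expression, "\2" to the second, and so on. For example, the expression "(.*)\1" matches any string that consists of the same text twice, such as "abcabc" or "xyzxyz". A back reference to a sub-expression that did not participate in any match matches the null string. NB: this is different from some other regular expression matchers. Back references are only available if the expression is compiled with the regex_constants::bk_refs flag set.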
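The "(.*)\1" example can be sketched as a check for doubled strings. std::regex is used as a stand-in for boost::regex (an assumption so that the code builds with the standard library alone); is_doubled is a hypothetical name. The case of a back reference to a sub-expression that took no part in the match is deliberately left out, because matchers differ on it, as noted above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when `text` is some string written twice in a row, e.g. "abcabc".
// Assumption: std::regex stands in for boost::regex.
bool is_doubled(const std::string& text) {
    static const std::regex re(R"((.*)\1)");
    return std::regex_match(text, re);
}

// Usage:
//   assert(is_doubled("abcabc"));
//   assert(!is_doubled("abcabd"));  // second half differs from what group 1 matched
```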

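Characters by code

This is an extension to the algorithm that is not available in other libraries. It consists of the escape character followed by the digit "0" followed by the octal character code. For example, "\023" represents the character whose octal code is 23. Where ambiguity could occur, use parentheses to break the expression up: "\0103" represents the character whose code is 103, while "(\010)3" represents the character 10 followed by "3". To match a character by its hexadecimal code, use \x followed by a string of hexadecimal digits, optionally enclosed in {}, for example \xf0 or \x{aff}; note that the latter is a Unicode character.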
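The arithmetic behind these codes can be sketched as below, following the \0dd row of the escape table in this document, which defines the digits as octal. std::regex stands in for boost::regex (an assumption); it accepts only the two-digit \xHH form, so the Boost-specific \0dd and \x{...} forms are converted by hand rather than matched. code_from_octal and matches_hex_code are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Interprets the digits that follow "\0" as an octal character code.
int code_from_octal(const std::string& digits) {
    return std::stoi(digits, nullptr, 8);
}

// True when `text` is the single character with the given two-digit hex code.
// Assumption: std::regex stands in for boost::regex.
bool matches_hex_code(const std::string& two_hex_digits, const std::string& text) {
    return std::regex_match(text, std::regex("\\x" + two_hex_digits));
}

// Usage:
//   code_from_octal("23")  == 19   // "\023" is character 19 in decimal
//   code_from_octal("103") == 67   // "\0103" is character 67, the letter 'C'
//   matches_hex_code("41", "A")    // 0x41 is 'A'
```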

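Word operators

The following operators are provided for compatibility with the GNU regular expression library.

"\w" matches any character that is a member of the "word" character class; equivalent to "[[:word:]]".

"\W" matches any character that is not a member of the "word" character class; equivalent to "[^[:word:]]".

"\<" matches the null string at the start of a word.

"\>" matches the null string at the end of a word.

"\b" matches the null string at either the start or the end of a word.

"\B" matches the null string within a word.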

The start of the sequence passed to the matching algorithms is considered to be a potential start of a word unless the flag match_not_bow is set. The end of the sequence passed to the matching algorithms is considered to be a potential end of a word unless the flag match_not_eow is set.
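A short sketch of "\b" and "\B", counting matches in a sentence. std::regex is used as a stand-in for boost::regex (an assumption of this example); its default grammar has "\b" and "\B" but not the GNU-style "\<" and "\>", so those two are not exercised. count_matches is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts the non-overlapping matches of `pattern` in `text`.
// Assumption: std::regex stands in for boost::regex.
std::size_t count_matches(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}

// Usage, with text = "concat cat catalog":
//   count_matches("cat", text)         == 3
//   count_matches(R"(\bcat\b)", text)  == 1  // only the standalone word
//   count_matches(R"(\bcat)", text)    == 2  // "cat" and "catalog"
//   count_matches(R"(\Bcat)", text)    == 1  // the "cat" inside "concat"
```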

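Buffer operators

The following operators are provided for compatibility with the GNU and Perl regular expression libraries:

"\`" matches the start of a buffer.

"\A" matches the start of the buffer.

"\'" matches the end of a buffer.

"\z" matches the end of a buffer.

"\Z" matches the end of a buffer, or possibly one or more newline characters followed by the end of the buffer.

A buffer is the whole sequence passed to the matching algorithm, unless the match_not_bob or match_not_eob flag is set.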

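Escape operator

The escape character "\" has several meanings.

Inside a set declaration, the escape character is an ordinary character unless the regex_constants::escape_in_lists flag is set, in which case whatever follows the escape is a literal character regardless of its normal meaning.

The escape operator may introduce another operator, for example a back reference or a word operator.

The escape operator may make the following character a literal: for example, "\*" represents a literal "*" rather than the repeat operator.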

Single character escape sequences

The following escape sequences are aliases for single characters:

Escape sequence   Character code   Meaning
\a                0x07             Bell character.
\f                0x0C             Form feed.
\n                0x0A             Newline character.
\r                0x0D             Carriage return.
\t                0x09             Tab character.
\v                0x0B             Vertical tab.
\e                0x1B             ASCII Escape character.
\0dd              0dd              An octal character code, where dd is one or more octal digits.
\xXX              0xXX             A hexadecimal character code, where XX is one or more hexadecimal digits.
\x{XX}            0xXX             A hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character.
\cZ               z-@              An ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for '@'.

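Miscellaneous escape sequences:

The following are provided mostly for Perl compatibility, but note that the meanings of \l, \L, \u and \U differ:

\w  Equivalent to [[:word:]].

\W  Equivalent to [^[:word:]].

\s  Equivalent to [[:space:]].

\S  Equivalent to [^[:space:]].

\d  Equivalent to [[:digit:]].

\D  Equivalent to [^[:digit:]].

\l  Equivalent to [[:lower:]].

\L  Equivalent to [^[:lower:]].

\u  Equivalent to [[:upper:]].

\U  Equivalent to [^[:upper:]].

\C  Any single character, equivalent to '.'.

\X  Matches any Unicode combining character sequence, for example "a\x 0301" (a letter with an acute accent).

\Q  The begin quote operator: everything that follows is treated as a literal character until a \E end quote operator is found.

\E  The end quote operator; terminates a sequence begun with \Q.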

What gets matched?

When the expression is compiled as a Perl-compatible regex, the matching algorithms perform a depth-first search on the state machine and report the first match found.

When the expression is compiled as a POSIX-compatible regex, the matching algorithms match the first possible matching string. If more than one string starting at a given location can match, the longest possible string is matched, unless the flag match_any is set, in which case the first match encountered is returned. Use of the match_any option can reduce the time taken to find the match, but it is only useful if the user is less concerned about what matched; for example, it would not be suitable for search and replace operations. In cases where there are multiple possible matches all starting at the same location, and all of the same length, the match chosen is the one with the longest first sub-expression; if that is the same for two or more matches, then the second sub-expression is examined, and so on.

The following examples illustrate the main differences between Perl and POSIX regular expression matching rules:

Expression           Text              POSIX leftmost longest match        ECMAScript depth first search match
a|ab                 xaby              "ab"                                "a"
.*([[:alnum:]]+).*   " abc def xyz "   $0 = " abc def xyz ", $1 = "abc"    $0 = " abc def xyz ", $1 = "z"
.*(a|xayy)           zzxayyzz          "zzxayy"                            "zzxa"

These differences between Perl matching rules and POSIX matching rules mean that the two regular expression syntaxes differ not only in the features offered, but also in the form that the state machine takes and/or the algorithms used to traverse the state machine.
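The depth-first column of the table above can be reproduced with a short sketch. It uses std::regex with its default ECMAScript grammar as a stand-in for a Perl-compatible boost::regex (an assumption so that the code builds with the standard library alone); the POSIX leftmost-longest column is not exercised here. first_match is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// First match of `pattern` in `text` under depth-first matching rules.
// Returns {$0, $1}; $1 is "" when the pattern has no sub-expression.
// Assumption: std::regex (ECMAScript grammar) stands in for boost::regex.
std::pair<std::string, std::string> first_match(const std::string& pattern,
                                                const std::string& text) {
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern))) return {"", ""};
    return {m[0].str(), m.size() > 1 ? m[1].str() : std::string()};
}

// Usage:
//   first_match("a|ab", "xaby").first            == "a"     // first alternative wins
//   first_match(".*(a|xayy)", "zzxayyzz").first  == "zzxa"
//   first_match(".*([[:alnum:]]+).*", " abc def xyz ").second == "z"
```

In each case the greedy ".*" is tried at its longest first and then shortened until the rest of the expression can match, which is why $1 ends up as the last possible candidate rather than the longest one.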