- WeChat official account: 鄭爾多斯
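In Nginx, directives such as location, server_name, and rewrite make heavy use of regular expressions. Regular expressions enable very powerful features, but they also make the source code much harder to read. This article focuses on how Nginx uses regular expressions, so that we can understand Nginx's behavior more thoroughly.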
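Nginx uses PCRE-style regular expressions and wraps several commonly used functions of the PCRE library. Studying these functions is enough to understand how regular expressions work in Nginx.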
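A regular expression must be compiled before it is used. Compilation produces a data structure, which is then used both for matching and for querying information about the pattern. PCRE provides two compile functions, pcre_compile() and pcre_compile2(). They do much the same thing, and Nginx uses the former, so we analyze pcre_compile().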
pcre *pcre_compile(
const char *pattern,
int options,
const char **errptr,
int *erroffset,
const unsigned char *tableptr
);
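Parameters:
- pattern: the regular expression to compile.
- options: options used during compilation. Nginx only uses PCRE_CASELESS, which makes matching case-insensitive.
- errptr: receives a message describing the error encountered during compilation. If this argument is NULL, pcre_compile() does not compile anything and returns NULL immediately.
- erroffset: receives the offset within pattern of the character at which the error occurred.
- tableptr: a pointer to a set of character tables built by pcre_maketables(). The documentation allows NULL, in which case the built-in default tables are used. Nginx passes NULL, so this argument can be ignored here.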
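Return value:
The function returns a pcre * pointer to the compiled pattern. Information about the pattern can be queried through this pointer, and the same pointer is passed to pcre_exec() to perform the match.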
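The structure returned by compilation gives access to a lot of information about the pattern, such as details of its capture groups. The following function retrieves that information.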
int pcre_fullinfo(
const pcre *code,
const pcre_extra *extra,
int what,
void *where
);
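Parameters:
- code: the structure returned by pcre_compile() above.
- extra: the structure returned by pcre_study(), or NULL if the pattern was not studied.
- what: which piece of information to retrieve.
- where: where the result is stored.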
Return value:
Returns 0 on success.
Nginx uses this function to obtain the following information:
- PCRE_INFO_CAPTURECOUNT: the total number of capturing subpatterns, both named and unnamed.
- PCRE_INFO_NAMECOUNT: the number of named subpatterns only, excluding unnamed ones.
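One point needs explaining here. PCRE allows both named capture groups and unnamed capture groups (groups identified by number). A name is just another way to identify a group: a named capture group also receives a group number. PCRE provides functions that fetch the content of a capture group directly by name, such as pcre_get_named_substring().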
The capture can also be retrieved by converting the name to a number first. PCRE maintains a name-to-number map for this purpose, described by three items:
- PCRE_INFO_NAMECOUNT
- PCRE_INFO_NAMEENTRYSIZE
- PCRE_INFO_NAMETABLE
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT returns the number of entries (that is, the number of named capture groups), and PCRE_INFO_NAMEENTRYSIZE returns the size of each entry. In both cases the last argument of pcre_fullinfo() is a pointer to an int. The entry size depends on the length of the longest name.
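PCRE_INFO_NAMETABLE returns a pointer (a char pointer) to the first entry of the map. The first two bytes of each entry hold the group number that corresponds to the named capture group, and the rest of the entry is the group's name, terminated by '\0'. The entries are ordered alphabetically by name.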
Here is an example from the official PCRE documentation:
When PCRE_DUPNAMES is set, duplicate names are in order of their parentheses numbers. For example, consider the following pattern (assume PCRE_EXTENDED is set, so white space - including newlines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
There are four named subpatterns, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shows in hexadecimal, and undefined bytes shown as ??:
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
When writing code to extract data from named subpatterns using the name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
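The name-to-number lookup described above can be sketched in plain C without linking against PCRE. The sketch below hard-codes the table from the documentation example; the helper name lookup_group_number and the array example_table are hypothetical, and the bytes the documentation shows as ?? are filled with 0 as an assumption. In real code the table pointer, entry count, and entry size would come from pcre_fullinfo().

```c
#include <assert.h>
#include <string.h>

/* Name table from the PCRE documentation example: 4 entries of 8 bytes.
 * Bytes shown as ?? in the documentation are set to 0 here (assumption). */
static const unsigned char example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};

/* Hypothetical helper: return the group number stored for `name`,
 * or -1 if the table has no entry with that name. */
static int lookup_group_number(const unsigned char *table, int name_count,
                               int entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size > 2);

    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* Bytes 0..1: group number, most significant byte first.
         * Bytes 2.. : the name, terminated by '\0'. */
        if (strcmp((const char *)(entry + 2), name) == 0) {
            return (entry[0] << 8) | entry[1];
        }
    }
    return -1;
}
```

For instance, lookup_group_number(example_table, 4, 8, "month") yields 4, the group number of the month subpattern. A linear scan is used for simplicity; since the entries are sorted by name, a binary search would also work.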
Here is an example found online. I can no longer find the link to the original article:
/* Build: gcc pcre_test.c -o pcre_test -L /usr/lib64/ -lpcre */
#include <stdio.h>
#include <pcre.h>

int main(void)
{
    pcre *re;
    const char *errstr;
    int erroff;
    int captures = 0, named_captures = 0, name_size = 0;
    const char *name;   /* first entry of the name table */
    const char *p;
    const char *data = "(?<date> (?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d) )";
    int n, i, j;

    printf("%s \n", data);

    /* Compile the pattern; errstr and erroff describe any failure. */
    re = pcre_compile(data, PCRE_CASELESS, &errstr, &erroff, NULL);
    if (re == NULL) {
        printf("compile pcre failed at offset %d: %s\n", erroff, errstr);
        return 1;
    }

    /* Total number of capture groups, named and unnamed. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_CAPTURECOUNT, &captures);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_CAPTURECOUNT failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf(" captures %d \n", captures);

    /* Number of named capture groups. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMECOUNT, &named_captures);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMECOUNT failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf("named_captures %d \n", named_captures);

    /* Size in bytes of each entry in the name table. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMEENTRYSIZE, &name_size);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMEENTRYSIZE failed %d \n", n);
        pcre_free(re);
        return 1;
    }
    printf("name_size %d \n", name_size);

    /* Pointer to the first entry of the name table. */
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMETABLE, &name);
    if (n < 0) {
        printf("pcre_fullinfo PCRE_INFO_NAMETABLE failed %d \n", n);
        pcre_free(re);
        return 1;
    }

    /* Each entry: two bytes of group number, then the '\0'-terminated name. */
    p = name;
    for (j = 0; j < named_captures; j++) {
        for (i = 0; i < 2; i++) {
            printf("%x ", (unsigned)(unsigned char)p[i]);
        }
        printf("%s \n", &p[2]);
        p += name_size;
    }

    pcre_free(re);
    return 0;
}
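The output shows the following:
- There are 5 capture groups.
- There are 4 named capture groups.
- The entry size is 8. The month entry is the longest: 2 bytes for the group number, 5 bytes for the name, and 1 byte for the terminating '\0'.
- Named groups are numbered in the same sequence as unnamed groups, in the order of their opening parentheses.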
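The entry size of 8 can be reproduced with a few lines of arithmetic. This is only a sketch of the rule for the 8-bit library (2 bytes of group number, the longest name, and one '\0'); the helper name name_entry_size is hypothetical and is not part of PCRE, which reports the value through PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: expected size of one name-table entry, computed as
 * 2 bytes for the group number + the longest name + 1 byte for '\0'. */
static int name_entry_size(const char *const *names, int count)
{
    size_t longest = 0;

    assert(names != NULL || count == 0);

    for (int i = 0; i < count; i++) {
        size_t len = strlen(names[i]);
        if (len > longest) {
            longest = len;
        }
    }
    return (int)(2 + longest + 1);
}
```

With the four names date, year, month, and day, the longest name is month (5 characters), so the result is 2 + 5 + 1 = 8, matching the program's output.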
So far we have covered compilation and retrieving information about a pattern. What remains is the most important part: matching.
int pcre_exec(
const pcre *code,
const pcre_extra *extra,
const char *subject,
int length,
int startoffset,
int options,
int *ovector,
int ovecsize
);
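Parameters:
- code: the return value of the compile function.
- extra: the return value of pcre_study(); may be NULL.
- subject: the string to match against.
- length: the length of subject.
- startoffset: the offset in subject at which matching starts.
- options: matching options.
- ovector: an array of integers that receives the match results.
- ovecsize: the number of elements in ovector; it should be a multiple of 3.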
Below are some passages from the PCRE documentation about this function, each followed by my notes on it:
How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.
In other words, a pattern usually matches some portion of the subject. In addition, if the pattern contains capture groups, each group may pick out its own substring of the subject.
Captured substrings are returned to the caller via a vector of integer offsets whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.
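The ovector argument of pcre_exec() receives a series of integer offsets, and the content of each capture group can be recovered from them. ovecsize gives the number of elements in ovector (not its size in bytes) and should be a multiple of three.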
The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third of the vector is used as workspace by pcre_exec()while matching capturing subpatterns, and is not available for passing back information. The length passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.
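The first two-thirds of ovector hold the captured substrings (the ones referred to as $1, $2, and so on), with two integers per substring. The remaining third is workspace that pcre_exec() uses while matching capture groups, and it carries no results back to the caller.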
When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of a pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. The first pair, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.
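After a successful match, the results are stored as pairs of integers, starting at the first element of ovector and using at most its first two-thirds. In each pair, the first integer is the offset in subject of the first character of the substring, and the second is the offset of the first character after the end of the substring. The first pair, ovector[0] and ovector[1], identifies the part of subject matched by the whole pattern. The next pair describes the first capture group, and so on. The return value of pcre_exec() is one more than the highest-numbered pair that was set. For example, if two substrings were captured, the return value is 3. If the pattern has no capture groups, a successful match returns 1.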
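To make the pair layout concrete, here is a minimal sketch that copies one captured substring out of the subject using an offset vector. It does not call PCRE: the offsets in the usage note are written by hand for a hypothetical match of (\w+)=(\w+) against "key=value", and the helper name copy_group is an assumption, not a PCRE function.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy capture group `group` (0 = whole match) into buf.
 * `rc` is the value returned by pcre_exec(). Returns the substring length,
 * or -1 if the group was not set or buf is too small. */
static int copy_group(const char *subject, const int *ovector, int rc,
                      int group, char *buf, size_t buflen)
{
    assert(subject != NULL && ovector != NULL && buf != NULL);

    if (group < 0 || group >= rc) {
        return -1;                      /* beyond the highest pair that was set */
    }

    int start = ovector[2 * group];     /* offset of the first character */
    int end = ovector[2 * group + 1];   /* offset just past the last character */
    if (start < 0 || end < start) {
        return -1;                      /* group did not take part in the match */
    }

    size_t len = (size_t)(end - start);
    if (len + 1 > buflen) {
        return -1;                      /* no room for the text plus '\0' */
    }

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

With the subject "key=value", an offset vector beginning {0, 9, 0, 3, 4, 9}, and a return value of 3, group 0 is the whole string, group 1 is "key", and group 2 is "value". PCRE itself offers pcre_copy_substring() for the same job; the hand-written version only shows what the offsets mean.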
If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.
If a capture group matches more than once, only the substring from its last match is reported.
If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. In particular, if the substring offsets are not of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector.
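If ovector is too small to hold all the captured substrings, PCRE fills as much of it as it can (at most the first two-thirds) and pcre_exec() returns 0. In particular, if the capture offsets are not needed, ovector can be NULL and ovecsize can be 0. However, if the pattern contains back references and ovector is too small to remember the related substrings, PCRE has to obtain extra memory during matching, so supplying an ovector is usually advisable.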
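A caller therefore has to treat a return value of 0 specially. The sketch below shows one way to turn the return value into the number of offset pairs that can safely be read; the helper name filled_pairs is hypothetical, and the rule it encodes (a zero return means ovecsize / 3 pairs were filled) follows from the two-thirds layout described above.

```c
#include <assert.h>

/* Hypothetical helper: number of offset pairs in ovector that hold results,
 * given the pcre_exec() return value `rc` and the vector size `ovecsize`. */
static int filled_pairs(int rc, int ovecsize)
{
    assert(ovecsize >= 0);

    if (rc < 0) {
        return 0;               /* no match, or an error */
    }
    if (rc == 0) {
        /* The vector was too small. Two-thirds of it hold results and each
         * result takes two integers, so ovecsize / 3 pairs were filled. */
        return ovecsize / 3;
    }
    return rc;                  /* highest pair that was set, plus one */
}
```

For example, with a vector of 30 integers a return value of 0 means 10 pairs are available, while a return value of 3 means only the first 3 pairs were set.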
The pcre_info() function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.
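pcre_info() reports how many capture groups a compiled pattern contains (in practice pcre_fullinfo() is used nowadays). If the pattern has n capture groups, ovector needs at least (n + 1) * 3 elements to hold those n substrings together with the offsets of the substring matched by the whole pattern.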
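The sizing rule is simple enough to express directly. The following sketch computes the minimum vector size from a capture count such as the one returned for PCRE_INFO_CAPTURECOUNT; the helper name ovector_size is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: smallest ovector size (in ints) that can hold the
 * whole-match pair plus `capture_count` capture pairs, including the final
 * third that pcre_exec() uses as workspace. */
static int ovector_size(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}
```

For the date pattern used earlier, which has 5 capture groups, this gives (5 + 1) * 3 = 18 integers: 12 for the six offset pairs and 6 for workspace.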
It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.
For example, matching "abc" against (a|(z))(bc) makes pcre_exec() return 4. Groups 1 and 3 capture a substring, but group 2 does not, so both values in the offset pair for group 2 are set to -1.
Offset values that correspond to unused subpatterns at the end of the expression are also set to -1. For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The return from the function is 2, because the highest used capturing subpattern number is 1. However, you can refer to the offsets for the second and third capturing subpatterns if you wish (assuming the vector is large enough, of course).
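Both examples from the documentation can be checked with a small test for unset groups. The sketch below works on hand-written offset vectors that mirror the two examples (it does not call PCRE), and the helper name group_is_set is hypothetical. It relies on the documented behavior that unused groups get -1 in both slots, provided the vector is large enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1 if capture group `group` holds a substring, else 0.
 * `ovecsize` is the number of ints in ovector; only the first ovecsize / 3
 * pairs can carry results. */
static int group_is_set(const int *ovector, int ovecsize, int group)
{
    assert(ovector != NULL && ovecsize >= 0);

    if (group < 0 || group >= ovecsize / 3) {
        return 0;                       /* no pair exists for this group */
    }
    /* Unused groups have -1 in both slots of their pair. */
    return ovector[2 * group] != -1 && ovector[2 * group + 1] != -1;
}
```

For "abc" against (a|(z))(bc), a 12-element vector starts {0, 3, 0, 1, -1, -1, 1, 3}: groups 1 and 3 are set and group 2 is not. For "abc" against (abc)(x(yz)?)?, it starts {0, 3, 0, 3, -1, -1, -1, -1}: only group 1 is set, which is why the return value there is 2.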
PCRE library documentation: http://regexkit.sourceforge.net/Documentation/pcre/pcreapi.html#SEC1
Microsoft's documentation on regular expression usage: https://docs.microsoft.com/zh-cn/dotnet/standard/base-types/anchors-in-regular-expressions