Nginx Regular Expressions (Part 1)

Date: 2019-11-17


WeChat public account: 鄭爾多斯

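Preface

Nginx directives such as location, server_name, and rewrite make heavy use of regular expressions. Regular expressions enable very powerful features, but they also make the source code considerably harder to read. This article concentrates on the regular expressions used in Nginx so that we can understand Nginx's behavior more thoroughly.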

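Background

Regular expressions in Nginx use the PCRE format, and Nginx wraps several commonly used functions of the PCRE library. Once we understand those functions, the regular expressions in Nginx become easy to follow.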

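Compiling a regular expression

Before a regular expression can be used it must first be compiled. Compilation produces a data structure, and that structure is then used both for matching and for querying information about the pattern.
PCRE has two compile functions, pcre_compile() and pcre_compile2(), which do much the same thing. Nginx uses the former, so we analyze pcre_compile() here.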

pcre *pcre_compile(
     const char *pattern, 
     int options, 
     const char **errptr, 
     int *erroffset, 
     const unsigned char *tableptr
);

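Parameters:
pattern: the regular expression to compile.
options: options that control compilation. The Nginx code discussed here uses only PCRE_CASELESS, which makes matching case-insensitive.
errptr: receives a pointer to a textual error message if compilation fails. If errptr itself is NULL, pcre_compile() stops and returns NULL immediately.
erroffset: receives the offset within pattern of the character at which the error was detected. It must not be NULL.
tableptr: a pointer to character tables built by pcre_maketables(). Passing NULL selects PCRE's built-in default tables, which is what Nginx does, so this parameter can be ignored here.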

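Return value:
The function returns a pcre * pointer to the compiled pattern, or NULL on failure. The compiled pattern can be queried for information about the compilation, and it is also passed to pcre_exec() to perform the actual matching.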

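Retrieving compile information

The structure returned by compilation carries a good deal of information about the pattern, such as details of its capture groups. The following function retrieves that information.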

int pcre_fullinfo(
      const pcre *code, 
      const pcre_extra *extra, 
      int what, 
      void *where
);

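Parameters:
code: the compiled pattern returned by pcre_compile().
extra: the structure returned by pcre_study(), or NULL if the pattern was not studied.
what: which piece of information to retrieve.
where: points to the variable that receives the result.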

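Return value:
The function returns 0 on success and a negative error code otherwise.
Nginx uses this function to retrieve the following information:

PCRE_INFO_CAPTURECOUNT: the total number of capturing subpatterns, counting both named and unnamed capture groups.

PCRE_INFO_NAMECOUNT: the number of named subpatterns only; unnamed subpatterns are not counted.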

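One point needs explaining here. PCRE allows both named capture groups and anonymous capture groups (groups identified by number). A name is just another way of identifying a group: every named capture group also receives a number. PCRE provides functions that fetch the contents of a capture group directly by name, for example pcre_get_named_substring().
The same information can also be obtained in two steps:

1. Convert the name of the capture group into its number.
2. Use that number to fetch the group's information.
This involves a name-to-number conversion. PCRE maintains a name-to-number map for each compiled pattern, and we can use it to perform the conversion. The map is described by three properties: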

PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE

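The map consists of a number of fixed-size records. PCRE_INFO_NAMECOUNT gives the number of records in the map (that is, the number of named capture groups), and PCRE_INFO_NAMEENTRYSIZE gives the size of each record. In both cases the last argument of pcre_fullinfo() points to an int. The record size is determined by the longest group name, or in the words of the PCRE documentation: "The entry size depends on the length of the longest name."

PCRE_INFO_NAMETABLE returns a pointer to the first record of the map (a pointer to char). The first two bytes of each record hold the number of the named capture group, most significant byte first, and the rest of the record holds the group's name, terminated by '\0'. The records are sorted alphabetically by name.

Here is an example from the official PCRE documentation: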

When PCRE_DUPNAMES is set, duplicate names are in order of their parentheses numbers. For example, consider the following pattern (assume PCRE_EXTENDED is set, so white space - including newlines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
There are four named subpatterns, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shows in hexadecimal, and undefined bytes shown as ??:
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
When writing code to extract data from named subpatterns using the name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
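The name-to-number conversion described above can be sketched in plain C. This is only an illustration of the table layout, not PCRE's or Nginx's own code: name_to_number and doc_table are hypothetical names, the table is typed in by hand from the documentation example (with the undefined "??" bytes set to zero), and the 8-bit library's byte layout is assumed.

```c
#include <stddef.h>
#include <string.h>

/*
 * Look up a named group in a PCRE-style name table and return its group
 * number, or -1 if the name is not present.
 *
 * table      : first entry (what PCRE_INFO_NAMETABLE yields)
 * count      : number of entries (PCRE_INFO_NAMECOUNT)
 * entry_size : bytes per entry (PCRE_INFO_NAMEENTRYSIZE)
 */
int
name_to_number(const unsigned char *table, int count, int entry_size,
               const char *name)
{
    int i;

    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t) i * (size_t) entry_size;

        /* Bytes 0-1: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];

        /* Bytes 2 onward: the NUL-terminated group name. */
        if (strcmp((const char *) entry + 2, name) == 0) {
            return number;
        }
    }

    return -1;
}

/* The four 8-byte entries from the documentation example;
 * the undefined "??" padding bytes are filled with zeros here. */
const unsigned char doc_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0,
};
```

Calling name_to_number(doc_table, 4, 8, "month") walks the table and yields 4. In real code the PCRE library function pcre_get_stringnumber() performs this lookup on a compiled pattern, so a hand-written loop like this is mainly useful for understanding the layout.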

Example

Here is an example found online; I can no longer locate a link to the original source:

//gcc pcre_test.c -o pcre_test -L /usr/lib64/ -lpcre
#include <stdio.h>
#include <pcre.h>

int main()
{
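    pcre *re;                    /* compiled pattern */
    const char *errstr;          /* error message set by pcre_compile() */
    int erroff;                  /* offset of the error within the pattern */
    int captures = 0, named_captures = 0, name_size = 0;
    char *name;                  /* first entry of the name table */
    /* PCRE_EXTENDED is not set, so the spaces in the pattern are literal;
       that is harmless here because the pattern is compiled but never matched. */
    const char *data = "(?<date> (?<year>(\\d\\d)?\\d\\d) - (?<month>\\d\\d) - (?<day>\\d\\d) )";
    int n, i;
    const char *p;
    p = data;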
    printf("%s \n", p);
    re = pcre_compile(data, PCRE_CASELESS, &errstr, &erroff, NULL);
    if(NULL == re)
    {
        printf("compile pcre failed\n");
        return 0;
    }
    n = pcre_fullinfo(re, NULL, PCRE_INFO_CAPTURECOUNT, &captures);
    if(n < 0)
    {
        printf("pcre_fullinfo PCRE_INFO_CAPTURECOUNT failed %d \n", n);
        return 0;
    }
    printf(" captures %d \n", captures);
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMECOUNT, &named_captures);
    if(n < 0)
    {
        printf("pcre_fullinfo PCRE_INFO_NAMECOUNT failed %d \n", n);
        return 0;
    }
    printf("named_captures %d \n", named_captures);
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMEENTRYSIZE, &name_size);
    if(n < 0)
    {
        printf("pcre_fullinfo PCRE_INFO_NAMEENTRYSIZE failed %d \n", n);
        return 0;
    }
    printf("name_size %d \n", name_size);
    n = pcre_fullinfo(re, NULL, PCRE_INFO_NAMETABLE, &name);
    if(n < 0)
    {
        printf("pcre_fullinfo PCRE_INFO_NAMETABLE failed %d \n", n);
        return 0;
    }
    p =name;
    int j;
    for(j = 0; j < named_captures; j++)
    {
        for(i = 0; i <2; i++)
        {
            printf("%x ", p[i]);
        }
        printf("%s \n", &p[2]);
        p += name_size;
    }
    pcre_free(re);               /* release the compiled pattern */
    return 0;
}

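The program's output shows the following:

There are 5 capture groups in total.
4 of them are named capture groups.
Each record is 8 bytes long. The month record is the longest: 2 bytes for the group number, 5 bytes for the name, and 1 byte for the terminating '\0' give 8.
Every named capture group is also assigned a number, and named and unnamed subpatterns share a single numbering sequence, ordered by the position of their opening parentheses.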
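The arithmetic behind the 8-byte record size can be written out as a small helper. This is a sketch under the assumption of the 8-bit library layout described in this article (two bytes of group number, the name, one terminating NUL); name_entry_size and example_names are hypothetical names, not part of PCRE.

```c
#include <stddef.h>
#include <string.h>

/*
 * Size in bytes of one name-table record, assuming the 8-bit layout:
 * 2 bytes of group number + the longest name + 1 terminating NUL.
 */
size_t
name_entry_size(const char *const *names, size_t count)
{
    size_t i, longest = 0;

    for (i = 0; i < count; i++) {
        size_t len = strlen(names[i]);

        if (len > longest) {
            longest = len;
        }
    }

    return 2 + longest + 1;
}

/* The four group names used in the example pattern. */
const char *const example_names[] = { "date", "year", "month", "day" };
```

For the four names above the longest is month (5 characters), so name_entry_size(example_names, 4) gives 2 + 5 + 1 = 8, which agrees with the record size reported for the example pattern.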

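Matching

So far we have covered compiling a pattern and querying information about it. What remains is the most important part: matching.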

int pcre_exec(
    const pcre *code, 
    const pcre_extra *extra,
    const char *subject, 
    int length, 
    int startoffset, 
    int options, 
    int *ovector, 
    int ovecsize
);

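Parameters:
code: the compiled pattern returned by the compile function.
extra: the result of pcre_study(); may be NULL.
subject: the string to be matched.
length: the length of subject.
startoffset: the offset in subject at which matching starts.
options: matching options.
ovector: an array of ints that receives the offsets of the match and of the captured substrings.
ovecsize: the number of elements in ovector; it should be a multiple of 3.
Below are excerpts from the PCRE documentation for this function, several of them followed by my own rendering of the passage: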

How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.

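In general, a pattern matches a particular portion of the subject. In addition, parts of the pattern may pick out further substrings of the subject: if the pattern contains capture groups, each group may match its own portion of the subject.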

Captured substrings are returned to the caller via a vector of integer offsets whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.

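The ovector argument of pcre_exec() receives a series of integer offsets, and from those offsets the contents of the capture groups can be recovered. The number of elements in ovector is passed in ovecsize. It is a count of ints, not a size in bytes, and it should be a multiple of three.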

The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third of the vector is used as workspace by pcre_exec()while matching capturing subpatterns, and is not available for passing back information. The length passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.

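The first two-thirds of ovector holds the captured substrings (the ones later referred to as $1, $2, and so on), with each substring using two integers. The remaining third is workspace that pcre_exec() uses while matching capture groups; it cannot be used to pass results back. If ovecsize is not a multiple of three, it is rounded down.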

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of a pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. The first pair, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.

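After a successful match, results are stored as pairs of integers starting at the beginning of ovector and using at most its first two-thirds. In each pair, the first element is the offset in subject of the first character of the substring, and the second element is the offset of the character just after the end of the substring. The first pair, ovector[0] and ovector[1], identifies the portion of subject matched by the entire pattern. The next pair describes the first capture group, and so on. The return value of pcre_exec() is the number of the highest-numbered pair that was set, plus one. For example, if two capture groups were set, the return value is 3. If the pattern contains no capture groups at all, a successful match returns 1, because only the first pair was set.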

If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.

If a capture group matches more than once, the offsets returned describe the last substring it matched.

If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. In particular, if the substring offsets are not of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector.

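If ovector is too small to hold all the capture offsets, PCRE fills in as much as it can (at most two-thirds of the array) and pcre_exec() returns 0. In particular, if we are not interested in the captures, we can pass NULL for ovector and 0 for ovecsize. If the pattern contains back references, however, PCRE then has to obtain extra memory of its own during matching, so it is usually advisable to supply an ovector.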
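The return-value rules above (a positive count on success, 0 when the vector was too small, a negative code otherwise) can be condensed into one helper. This is a minimal sketch, not how Nginx handles the result; pairs_filled is a hypothetical name, and it treats every negative value, including PCRE_ERROR_NOMATCH (-1), the same way.

```c
/*
 * Translate a pcre_exec() return value into the number of offset pairs
 * in ovector that hold valid data, or -1 when there was no match or an
 * error occurred.
 */
int
pairs_filled(int rc, int ovecsize)
{
    if (rc > 0) {
        return rc;              /* highest-numbered pair set, plus one */
    }

    if (rc == 0) {
        /* Matched, but the vector was too small: only the pairs that
         * fit into the first two-thirds were filled in. */
        return ovecsize / 3;
    }

    return -1;                  /* PCRE_ERROR_NOMATCH (-1) or another error */
}
```

With an ovecsize of 6 and a return value of 0, for instance, only 6 / 3 = 2 pairs (the whole match and the first group) are usable, so a caller would loop over pairs 0 and 1 and ignore the rest.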

The pcre_info() function can be used to find out how many capturing subpatterns there are in a compiled pattern. The smallest size for ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3.

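The pcre_info() function can be used to find out how many capture groups a compiled pattern contains (in practice pcre_fullinfo() is what gets used nowadays). To hold n captured substrings in addition to the offsets of the substring matched by the whole pattern, ovecsize must be at least (n + 1) * 3.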
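The sizing rule can be expressed as two one-line helpers. This is a sketch of the arithmetic only; min_ovecsize and usable_pairs are hypothetical names and do not exist in PCRE or Nginx.

```c
/* Smallest ovecsize that can report the whole match plus ncaptures groups. */
int
min_ovecsize(int ncaptures)
{
    return (ncaptures + 1) * 3;
}

/*
 * Number of offset pairs pcre_exec() can report for a given ovecsize:
 * the size is rounded down to a multiple of 3, and only the first
 * two-thirds of the array (two ints per pair) carries results.
 */
int
usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For the example pattern with 5 capture groups, min_ovecsize(5) gives 18 ints: 12 for the six offset pairs (the whole match plus five groups) and 6 of workspace. Going the other way, an ovecsize of 10 is rounded down to 9 and leaves room for only 3 pairs.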

It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this happens, both values in the offset pairs corresponding to unused subpatterns are set to -1.

For example, matching "abc" against (a|(z))(bc) makes pcre_exec() return 4. Capture groups 1 and 3 are set, but group 2 is not, so both offsets in the pair that corresponds to group 2 are set to -1.

Offset values that correspond to unused subpatterns at the end of the expression are also set to -1. For example, if the string "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The return from the function is 2, because the highest used capturing subpattern number is 1. However, you can refer to the offsets for the second and third capturing subpatterns if you wish (assuming the vector is large enough, of course).
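The offset-pair layout described in this section can be exercised without calling PCRE at all. In the sketch below, copy_capture and abc_ovector are hypothetical names, and the offsets are written out by hand to mirror the "abc" against (a|(z))(bc) example from the documentation, where the return value is 4.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy capture group n (0 = the whole match) into buf as a C string.
 * Returns the substring length, or -1 if the group was not set, lies
 * beyond the pairs filled in by the match, or does not fit in buf.
 *
 * rc is the positive return value of a successful pcre_exec() call.
 */
int
copy_capture(const char *subject, const int *ovector, int rc, int n,
             char *buf, size_t bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc) {
        return -1;               /* pair n was not filled in */
    }

    start = ovector[2 * n];      /* offset of the first character */
    end = ovector[2 * n + 1];    /* offset just past the last character */

    if (start < 0 || end < start) {
        return -1;               /* unset group: both offsets are -1 */
    }

    len = end - start;
    if ((size_t) len + 1 > bufsize) {
        return -1;               /* no room for the text plus the NUL */
    }

    memcpy(buf, subject + start, (size_t) len);
    buf[len] = '\0';
    return len;
}

/* Offsets for "abc" matched against (a|(z))(bc): whole match, group 1,
 * unset group 2, group 3. */
const int abc_ovector[8] = { 0, 3,   0, 1,   -1, -1,   1, 3 };
```

With these offsets, group 0 copies out "abc", group 1 copies "a", group 3 copies "bc", and group 2 reports -1 because its pair is unset. The PCRE library ships pcre_copy_substring() for the same job; the hand-written version is here only to make the pair arithmetic visible.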

References

PCRE API documentation: http://regexkit.sourceforge.net/Documentation/pcre/pcreapi.html#SEC1
Microsoft's documentation on regular expression usage: https://docs.microsoft.com/zh-cn/dotnet/standard/base-types/anchors-in-regular-expressions