Matching and Replacing with Regular Expressions

Date: 2019-12-06

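tbox wraps three regex libraries (pcre, pcre2, and posix) behind a single, cross-platform interface. As long as xmake automatically detects any one of them at configure time, the regex module is available; pcre2 is usually preferred.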

If you would rather not depend on third-party libraries, you can switch to the posix regex implementation by disabling the pcre libraries with xmake f --pcre=false --pcre2=false.

First, the simplest example, which matches a single substring:

    // run a simple match; the second argument is the match mode, 0 selects the default
    tb_vector_ref_t results = tb_regex_match_done_simple("(\\w+)\\s+?(\\w+)", 0, "hello world");
    if (results)
    {
        // walk the match results
        tb_for_all_if (tb_regex_match_ref_t, entry, results, entry)
        {
            // print the start offset, length, and content of each matched substring
            tb_trace_i("[%lu, %lu]: %s", entry->start, entry->size, entry->cstr);
        }

        // free the match results
        tb_vector_exit(results);
    }

The output is as follows:

[0, 11]: hello world 
[0, 5]: hello
[6, 5]: world

The first result covers the entire matched substring; the two that follow are the group matches captured by the () in the pattern.

If you do not want to iterate and only want the result of the first group, you can do this:

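    // run a simple match; the second argument is the match mode, 0 selects the default
    tb_vector_ref_t results = tb_regex_match_done_simple("(\\w+)\\s+?(\\w+)", 0, "hello world");
    if (results)
    {
        // the first group result is the substring at index 1 (index 0 is the whole match)
        if (tb_vector_size(results) > 1)
        {
            tb_regex_match_ref_t entry = (tb_regex_match_ref_t)tb_iterator_item(results, 1);
            if (entry)
            {
                // print the start offset, length, and content of the matched substring
                tb_trace_i("[%lu, %lu]: %s", entry->start, entry->size, entry->cstr);
            }
        }

        // free the match results, even when there was no group to print
        tb_vector_exit(results);
    }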

The calls above only match the first substring in the text. To match globally, do this:

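    // text to search (an example value; the original snippet leaves `content` undefined)
    tb_char_t const* content = "hello world";

    // create a regex object with the default match mode
    tb_regex_ref_t regex = tb_regex_init("\\w+", 0);
    if (regex)
    {
        // match every substring in a loop, resuming right after the previous match
        tb_long_t       start = 0;
        tb_size_t       length = 0;
        tb_vector_ref_t results = tb_null;
        while ((start = tb_regex_match_cstr(regex, content, start + length, &length, &results)) >= 0 && results)
        {
            // start offset and length of the whole match (not of a group)
            tb_trace_i("[%lu, %lu]: ", start, length);

            // show every group of this match; the first entry is the whole match
            tb_for_all_if (tb_regex_match_ref_t, entry, results, entry)
            {
                // trace
                tb_trace_i("    [%lu, %lu]: %s", entry->start, entry->size, entry->cstr);
            }
        }

        // destroy the regex object
        tb_regex_exit(regex);
    }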

Creating a regex object with tb_regex_init is optimized for operations that match frequently, because the regular expression is compiled once in advance.

If you only need to match a single substring, tb_regex_match_done_simple is enough; its interface is simpler to use.

Note that the match and replace results produced through a tb_regex_init object do not need to be freed manually; calling tb_regex_exit releases them automatically.

The match mode passed so far was always 0, which selects the default matching rules. tbox currently supports the following modes:

    TB_REGEX_MODE_NONE              = 0     //!< default match mode
,   TB_REGEX_MODE_CASELESS          = 1     //!< case-insensitive matching
,   TB_REGEX_MODE_MULTILINE         = 2     //!< ^ and $ match at line breaks, enabling multi-line matching
,   TB_REGEX_MODE_GLOBAL            = 4     //!< perform a global replace

Replacing substrings is even more convenient. To replace a single occurrence, all you need is:

    // perform a single replacement
    tb_char_t const* results = tb_regex_replace_done_simple("\\w+", 0, "hello world", "hi");
    if (results)
    {
        // trace
        tb_trace_i(": %s", results);

        // free the result string
        tb_free(results);
    }

The output is as follows:

hi world

To replace every occurrence (a global replace), just change the match mode:

    // set the TB_REGEX_MODE_GLOBAL mode to replace every occurrence
    tb_char_t const* results = tb_regex_replace_done_simple("\\w+", TB_REGEX_MODE_GLOBAL, "hello world", "hi");
    if (results)
    {
        // trace
        tb_trace_i(": %s", results);

        // free the result string
        tb_free(results);
    }

The output is as follows: