Understanding C++11 Regular Expressions (2)

Date: 2019-12-11


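　　Today (2016/3/19) I was lucky enough to attend a C++ meetup in Shanghai, where I met Hou Jie (侯捷), the Taiwanese C++ teacher I had long dreamed of meeting, and I was thrilled. Mr. Hou's depth of understanding of C++ is astonishing. His teaching is rigorous and his unhurried way of explaining things is a pleasure to listen to. A photo of Mr. Hou is attached at the end.

　　Let us continue with C++11 regular expressions where the previous part left off. This part starts from the questions left open there and draws on some excellent blog posts found online.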

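Main text

　C++11 supports six regular expression grammars:

　ECMAScript,

　basic (POSIX Basic Regular Expressions),

　extended (POSIX Extended Regular Expressions),

　awk (POSIX awk),

　grep (POSIX grep),

　egrep (POSIX grep -E). Of these, ECMAScript is the most powerful, and it is also the default.

First, we introduce some basic types that are common to the whole regular expression library.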

1 basic_regex

//basic_regex: a class template that holds a regular expression. The library provides two typedefs for it:
a)    typedef basic_regex<char> regex;
b)    typedef basic_regex<wchar_t> wregex;

2 match_results:

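　This class holds the sequences that matched a given regular expression. When the empty() member returns true, or the size() member returns 0, no match was found.
Otherwise, when empty() returns false and size() returns a value >= 1, a match occurred.

//match_results has the following typedefs: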
a)    typedef match_results<const char*> cmatch;
b)    typedef match_results<const wchar_t*> wcmatch;
c)    typedef match_results<string::const_iterator> smatch;
d)    typedef match_results<wstring::const_iterator> wsmatch;

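match[0]: the whole matched sequence;
match[1]: the first matched subsequence;
match[2]: the second matched subsequence, and so on.

Note in particular: for match[1], match[2], match[3] to work as described above, each piece you want to extract must be written as a capture group, that is, enclosed in parentheses such as (.*). For example:

regex reg("<(.*)>(.*)</(\\1)>");

If you remove the parentheses, none of this works. Here \1 refers to the first capture group, and \2 would refer to the second.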

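3 sub_match: this class template represents the sequence matched by one marked subexpression (capture group). The match is stored as a pair of iterators that delimit the matched range in the searched text. The library provides the following typedefs: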

a)    typedef sub_match<const char*>              csub_match;
b)    typedef sub_match<const wchar_t*>           wcsub_match;
c)    typedef sub_match<string::const_iterator>   ssub_match;
d)    typedef sub_match<wstring::const_iterator>  wssub_match;

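4 Iterators: regex iterators walk over all the matches of a regular expression in a sequence, where the sequence to search is given as an iterator range.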

1. regex_iterator:
a)typedef regex_iterator<const char*>               cregex_iterator;
b)typedef regex_iterator<const wchar_t*>            wcregex_iterator;
c)typedef regex_iterator<string::const_iterator>    sregex_iterator;
d)typedef regex_iterator<wstring::const_iterator>   wsregex_iterator;

2. regex_token_iterator:
a) typedef regex_token_iterator<const char*>             cregex_token_iterator;
b) typedef regex_token_iterator<const wchar_t*>          wcregex_token_iterator;
c) typedef regex_token_iterator<string::const_iterator>  sregex_token_iterator;
d) typedef regex_token_iterator<wstring::const_iterator> wsregex_token_iterator;

Some test code:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
	// First example: search once and inspect the match_results
	string data = "XML tag: <tag-name>the value</tag-name>.";
	cout << "data:            " << data << endl << endl;
	smatch m;
	bool found = regex_search(data, m, regex("<(.*)>(.*)</(\\1)>"));
	cout << "m.empty()   " << boolalpha << m.empty() << endl;
	cout << "m.size()    " << m.size() << endl;
	if (found)
	{
		cout << "m.str()              " << m.str() << endl;
		cout << "m.length()           " << m.length() << endl;
		cout << "m.position()         " << m.position() << endl;
		cout << "m.prefix().str()     " << m.prefix().str() << endl;
		cout << "m.suffix().str()     " << m.suffix().str() << endl;
	}
	// Three equivalent ways to look at each submatch by index
	for (size_t i = 0; i < m.size(); i++)
	{
		cout << "m[" << i << "].str():" << m[i].str() << endl;
		cout << "m.str(" << i << "):" << m.str(i) << endl;
		cout << "m.position(" << i << "):" << m.position(i) << endl;
	}
	cout << "match: " << endl;
	for (auto pos = m.begin(); pos != m.end(); pos++)
	{
		cout << " " << *pos << " ";
		cout << "(length: " << pos->length() << ")" << endl;
	}
	// Second example: find every match by resuming after the previous one
	data = "<person>\n"
		" <first>Nico</first>\n"
		" <last>Josuttis</last>\n"
		"</person>\n";
	auto pos = data.cbegin();
	auto end = data.cend();
	regex reg("<(.*)>(.*)</(\\1)>");

	for (; regex_search(pos, end, m, reg); pos = m.suffix().first)
	{
		cout << "match: " << m.str() << endl;
		cout << "tag: " << m.str(1) << endl;
		cout << "value: " << m.str(2) << endl;
	}
}

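　　The strings we are searching for have the form <tag>value</tag>, where the closing tag must repeat the name of the opening tag.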

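　　Note also that the second example puts newline characters into the string. In the ECMAScript grammar `.` does not match a newline, so a greedy (.*) cannot run past the end of a line. Without the newlines, <first> and <last> could not be picked out separately, and only <person> would be found.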

Regex Iterator: access the matches through an iterator, or through for_each()

sregex_iterator pos(data.cbegin(), data.cend(), reg);
sregex_iterator end;   // a default-constructed iterator marks the end

for (; pos != end; ++pos)
{
      cout << "match: " << pos->str() << endl;
      cout << "tag: " << pos->str(1) << endl;
      cout << "value: " << pos->str(2) << endl;
}

// for_each() needs its own begin iterator; the lambda receives each match as m
sregex_iterator beg(data.cbegin(), data.cend(), reg);
for_each(beg, end, [](const smatch &m) {
      cout << "match: " << m.str() << endl;
      cout << "tag: " << m.str(1) << endl;
      cout << "value: " << m.str(2) << endl;
});

Regex Token Iterator

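1. regex reg("<(.*)>(.*)</(\\1)>");

sregex_token_iterator pos(data.cbegin(), data.cend(), reg, {0, 2}); // 0 selects the whole match and 2 selects the second group; -1 selects the subsequences between matches.

2. regex sep("[[:space:]]*[;,.][[:space:]]*"); // split on ; , and .

sregex_token_iterator pos(data.cbegin(), data.cend(), sep, -1); // yields every subsequence between separators. (For details, see the book on the C++11 standard library, which covers the usage thoroughly.)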

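Summary:

The first two parts mainly introduced two functions, regex_search and regex_match. Because a single call to either function reports at most one match and does not walk through every place where the pattern occurs, we introduced iterators and capture groups. The next part covers replacement, the regex constants, and the six grammars mentioned at the beginning, and finishes with a few simple examples for practice. The standard library book explains all of this very clearly; if anything is unclear, I recommend buying the C++11 standard library book translated by Hou Jie.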