Regex Cheat Sheet (walker)

Date: 2020-01-09


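The examples in this post use Python 3 by default, with either the standard-library re module or the third-party regex library. walker's guess: with Python 3's Unicode strings, \s in the re module matches \f\n\r\t\v plus the half-width and full-width spaces, 7 characters in total. According to the re documentation, \s on str patterns actually matches every Unicode whitespace character, so the real set is larger (it also includes the no-break space, for example); with the re.ASCII flag it is limited to [ \t\n\r\f\v].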
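
The guess about \s can be checked empirically. The sketch below scans every code point and collects the ones \s matches, once for a plain str pattern and once with re.ASCII. The variable names are illustrative, and the exact Unicode set depends on the Unicode data shipped with the Python build, so no count is claimed here.

```python
import re
import sys

# Collect every code point that \s matches when the pattern is a str.
ws_pattern = re.compile(r'\s')
matched = [chr(cp) for cp in range(sys.maxunicode + 1) if ws_pattern.match(chr(cp))]

# With re.ASCII, \s is restricted to the ASCII whitespace characters.
ascii_pattern = re.compile(r'\s', re.ASCII)
ascii_matched = [chr(cp) for cp in range(128) if ascii_pattern.match(chr(cp))]

print(len(matched), [hex(ord(ch)) for ch in matched])
print(len(ascii_matched), [hex(ord(ch)) for ch in ascii_matched])
```

Running it shows whether characters such as the full-width space (U+3000) or the no-break space (U+00A0) are included on a given interpreter.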

Regular expression documentation

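30-Minute Introduction to Regular Expressions (正則表達式30分鐘入門教程)
Another good introductory tutorial
Demystifying Regular Expressions (揭開正則表達式的神祕面紗): walker finds this article's explanation of Multiline especially clear.
[Screenshot from that article explaining Multiline]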
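
As a quick illustration of what Multiline mode changes, here is a minimal sketch using Python's re.M flag. The sample string is made up for the demonstration.

```python
import re

sample = "first line\nsecond line\nthird line"

# Without re.M, ^ anchors only at the very start of the string.
whole = re.findall(r'^\w+', sample)

# With re.M (re.MULTILINE), ^ and $ also anchor at every line boundary.
per_line = re.findall(r'^\w+', sample, re.M)
last_words = re.findall(r'\w+$', sample, re.M)

print(whole)       # ['first']
print(per_line)    # ['first', 'second', 'third']
print(last_words)  # ['line', 'line', 'line']
```

Most of the line-oriented recipes later in this post rely on this flag.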

Extract the double quotes and the content between them

Using re.findall:

import re

text = '''abc"def"ghi'''
re.findall(r'"[^"]+"', text)
# Result
['"def"']

Using re.search:

>>> text = '''abc"def"ghi'''
>>> re.search(r'"([^"]+)"', text).group(0)
'"def"'

Extract the content between double quotes

Using re.findall:

text = '''abc"def"ghi'''
re.findall(r'"([^"]+)"', text)
# Result
['def']

Using re.search:

>>> text = '''abc"def"ghi'''
>>> re.search(r'"([^"]+)"', text).group(1)
'def'

Lookaround: (?<=pattern), (?=pattern)

text = '''abc"def"ghi'''
re.findall(r'(?<=")[^"]+(?=")', text)
# Result
['def']
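
The capturing and lookaround versions agree on the sample text, but they are not interchangeable in general: a lookaround only asserts the quote without consuming it, so the same quote can close one match and open the next. A small sketch with a made-up input shows the difference.

```python
import re

text = '"a"b"c"'

# Capturing version: each match consumes both of its quotes.
captured = re.findall(r'"([^"]+)"', text)

# Lookaround version: the quotes are only asserted, never consumed.
looked = re.findall(r'(?<=")[^"]+(?=")', text)

print(captured)  # ['a', 'c']
print(looked)    # ['a', 'b', 'c']
```

Which result is wanted depends on whether the quotes are meant to pair up.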

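Find lines that start with certain strings

# For example, find the lines that start with +++, --- or index
# Method 1: match line by line (lst is a list of lines)
for i in lst:
    if re.match(r"(---|\+\+\+|index).*", i):
        print(i)
# Method 2: match everything in one pass (content is the whole text)
re.findall(r'^(?:\+\+\+|---|index).*$', content, re.M)
# Method 2, shorter; note that [-+]{3} also accepts mixed runs such as +-+
re.findall(r'^(?:[-\+]{3}|index).*$', content, re.M)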
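
To try both methods end to end, here is a self-contained sketch. The diff-like sample text is hypothetical and stands in for the lst and content variables used in the snippet.

```python
import re

# Hypothetical sample input resembling a unified diff.
content = """index 83db48f..bf269f4 100644
--- a/demo.txt
+++ b/demo.txt
@@ -1 +1 @@
-old line
+new line"""

header_re = re.compile(r'^(?:\+\+\+|---|index).*$', re.M)

# Method 1: test each line separately; re.match anchors at the start of the line.
by_line = [line for line in content.splitlines()
           if re.match(r'(?:---|\+\+\+|index)', line)]

# Method 2: one pass over the whole text in multiline mode.
at_once = header_re.findall(content)

print(by_line)
print(at_once)
```

Both lists should contain the same three header lines, since the lines beginning with a single - or + do not qualify.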

Contains / does not contain

(Reference: Using regular expressions to exclude specific strings)

Text content

>>> print(text)
www.sina.com.cn
www.educ.org
www.hao.cc
www.baidu.com
www.123.com

sina.com.cn
educ.org
hao.cc
baidu.com
123.com

Match lines that start with www

>>> re.findall(r'^www.*$', text, re.M)
['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.baidu.com', 'www.123.com']

Match lines that do not start with www

>>> re.findall(r'^(?!www).*$', text, re.M)
['', 'sina.com.cn', 'educ.org', 'hao.cc', 'baidu.com', '123.com']

Match lines that end with cn

>>> re.findall(r'^.*?cn$', text, re.M)
['www.sina.com.cn', 'sina.com.cn']

Match lines that do not end with com

>>> re.findall(r'^.*?(?<!com)$', text, re.M)
['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', '', 'sina.com.cn', 'educ.org', 'hao.cc']

Match lines that contain com

>>> re.findall(r'^.*?com.*?$', text, re.M)
['www.sina.com.cn', 'www.baidu.com', 'www.123.com', 'sina.com.cn', 'baidu.com', '123.com']

Match lines that do not contain com

>>> re.findall(r'^(?!.*com).*$', text, re.M)
['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']

>>> re.findall(r'^(?:(?!com).)*?$', text, re.M)
['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']
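
The contains / does-not-contain patterns can be wrapped into small helpers. This is only a sketch: the function names lines_with and lines_without are made up here, and re.escape is used so that the search word is treated literally.

```python
import re

def lines_without(word, text):
    """Return the lines of text that do not contain word (taken literally)."""
    pattern = r'^(?!.*{}).*$'.format(re.escape(word))
    return re.findall(pattern, text, re.M)

def lines_with(word, text):
    """Return the lines of text that contain word (taken literally)."""
    pattern = r'^.*{}.*$'.format(re.escape(word))
    return re.findall(pattern, text, re.M)

hosts = "www.sina.com.cn\nwww.educ.org\nhao.cc\nbaidu.com"
print(lines_without('com', hosts))  # ['www.educ.org', 'hao.cc']
print(lines_with('.cc', hosts))     # ['hao.cc']
```

Without re.escape, the dot in '.cc' would match any character rather than a literal dot.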

Match everything, keep only part

Use a capturing group to keep the first level of a URL, i.e. drop the path levels after the host.

# Method 1
>>> strr = 'http://www.baidu.com/abc/d.html'
>>> re.findall(r'(http://.+?)/.*', strr)
['http://www.baidu.com']

# Method 2
>>> re.sub(r'(http://.+?)/.*', r'\1', strr)
'http://www.baidu.com'
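
Note that the pattern (http://.+?)/.* needs a slash after the host, so a bare 'http://www.baidu.com' does not match at all, and https URLs are not covered. A slightly more tolerant variant might look like the sketch below; site_root is a hypothetical helper name, not something from the original post.

```python
import re

def site_root(url):
    """Return the scheme and host of url, or None if it is not an http(s) URL."""
    m = re.match(r'https?://[^/]+', url)
    return m.group(0) if m else None

print(site_root('http://www.baidu.com/abc/d.html'))  # http://www.baidu.com
print(site_root('https://www.baidu.com'))            # https://www.baidu.com
print(site_root('ftp://example.com/file'))           # None
```

Matching only the part to keep, rather than the whole string, removes the need for a group or a substitution.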

Two examples that help in understanding regex groups

# Example 1
>>> strr = 'A/B/C'
>>> re.sub(r'(.)/(.)/(.)', r'xx', strr)
'xx'
>>> re.sub(r'(.)/(.)/(.)', r'\1xx', strr)
'Axx'
>>> re.sub(r'(.)/(.)/(.)', r'\2xx', strr)
'Bxx'
>>> re.sub(r'(.)/(.)/(.)', r'\3xx', strr)
'Cxx'

# Example 2
>>> text = 'AA,BB:222'
>>> re.search(r'(.+),(.+):(\d+)', text).group(0)
'AA,BB:222'
>>> re.search(r'(.+),(.+):(\d+)', text).group(1)
'AA'
>>> re.search(r'(.+),(.+):(\d+)', text).group(2)
'BB'
>>> re.search(r'(.+),(.+):(\d+)', text).group(3)
'222'
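
Groups can also be read all at once or by name, which avoids repeating the search for each index. The sketch below reuses the second example; the group names first, second and number are chosen here purely for illustration.

```python
import re

text = 'AA,BB:222'
m = re.search(r'(?P<first>.+),(?P<second>.+):(?P<number>\d+)', text)

print(m.groups())         # ('AA', 'BB', '222')
print(m.group('second'))  # BB
print(m.group(1, 3))      # ('AA', '222')
print(m.groupdict())      # {'first': 'AA', 'second': 'BB', 'number': '222'}
```

Named groups are still numbered, so group(1) and group('first') return the same text.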

Extract the divs that contain the string hello

>>> content
'<div id="abc"><div id="hello1"><div id="def"><div id="hello2"><div id="hij">'
>>> 
>>> p = r'<div((?!div).)+hello.+?>'
>>> re.search(p, content).group()
'<div id="hello1">'
>>> re.findall(p, content)
['"', '"']
>>> for iter in re.finditer(p, content):
    print(iter.group())

<div id="hello1">
<div id="hello2">
>>> 
>>> p = r'<div[^>]+hello.+?>'
>>> re.search(p, content).group()
'<div id="hello1">'
>>> re.findall(p, content)
['<div id="hello1">', '<div id="hello2">']
>>> for iter in re.finditer(p, content):
    print(iter.group())

<div id="hello1">
<div id="hello2">
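
The odd ['"', '"'] result from re.findall with the first pattern comes from its capturing group: when a pattern contains a group, findall returns the group rather than the whole match, and a repeated group keeps only its last repetition. A minimal sketch of the fix is to make the group non-capturing with (?:...).

```python
import re

content = ('<div id="abc"><div id="hello1"><div id="def">'
           '<div id="hello2"><div id="hij">')

# Capturing group: findall returns only the group's last repetition.
capturing = re.findall(r'<div((?!div).)+hello.+?>', content)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'<div(?:(?!div).)+hello.+?>', content)

print(capturing)      # ['"', '"']
print(non_capturing)  # ['<div id="hello1">', '<div id="hello2">']
```

re.search and re.finditer are unaffected because group() with no argument always returns the whole match.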

If the tool you are using supports positive lookahead and allows capturing parentheses inside that lookahead, you can emulate atomic grouping and possessive quantifiers.
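
The emulation is usually written as (?=(pattern))\1: the lookahead captures what pattern would match, and the backreference then consumes exactly that text, so the engine cannot give characters back later. The following sketch shows the effect with Python's re module. (Python 3.11 and later also support (?>...) and possessive quantifiers directly.)

```python
import re

subject = '1234'

# Ordinary greedy quantifier: \d+ gives back one digit so the final \d can match.
greedy = re.search(r'\d+\d', subject)

# Emulated atomic group: the lookahead captures the longest run of digits,
# \1 consumes exactly that run, and the engine cannot backtrack into it.
atomic = re.search(r'(?=(\d+))\1\d', subject)

print(greedy.group(0))  # 1234
print(atomic)           # None
```

The second search fails at every start position because the captured run always extends to the end of the digits, leaving nothing for the trailing \d.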

Thousands separators

Python

>>> format(23456789, ',')
'23,456,789'
# Using positive lookbehind and positive lookahead
>>> re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', '2345678')
'2,345,678'
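
The pattern with $ assumes the whole string is one number. For numbers embedded in running text, one possible variation replaces $ with the negative lookahead (?!\d). The helper name add_commas is made up for this sketch, and decimal fractions are not handled (the digits after a decimal point would be grouped as well).

```python
import re

def add_commas(text):
    """Insert thousands separators into every run of digits in text."""
    # A comma goes wherever a digit precedes and a multiple of three digits follows.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+(?!\d))', ',', text)

print(add_commas('total 1234567 and 89'))  # total 1,234,567 and 89
print(add_commas('2345678'))               # 2,345,678
```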

JavaScript

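// Uses positive lookahead only, so it also works in engines without lookbehind (lookbehind was only added to JavaScript in ES2018)
// The result is "23,456,789"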
"23456789".replace(/(\d)(?=(?:\d{3})+$)/g, "$1,")

Single-level nested parentheses (balanced groups)

>>> import re
>>> line = r'蓋層(汽油) 塔里木盆地(學科: 蓋層(油氣) 學科: 評價) 塔里木盆地'
>>> re.findall(r'\([^()]*(\([^()]*\)[^()]*)*\)', line)
['', '(油氣) 學科: 評價']
>>> re.findall(r'\([^()]*(?:\([^()]*\)[^()]*)*\)', line)
['(汽油)', '(學科: 蓋層(油氣) 學科: 評價)']
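
The pattern shown here handles exactly one level of nesting, and the standard re module has no recursion or balancing-group construct for deeper levels. When arbitrary depth is needed, one alternative is to drop the regex and count parentheses in a plain loop. This is a different technique from the one in the post, and top_level_groups is a hypothetical helper name.

```python
def top_level_groups(text):
    """Return every outermost (...) span in text, whatever the nesting depth."""
    groups = []
    depth = 0
    start = None
    for pos, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = pos          # remember where an outermost group opens
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:           # the outermost group just closed
                groups.append(text[start:pos + 1])
    return groups

print(top_level_groups('a(b(c(d))e) f(g)'))  # ['(b(c(d))e)', '(g)']
```

Unmatched closing parentheses are ignored, and an unclosed group at the end of the text is simply not reported.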

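Match Chinese characters (Han script)

>>> import regex  # third-party library, installed separately from the standard re module
>>> regex.findall(r'\p{Han}', '孔子/現代價值/Theory of "Knowing"')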
['孔', '子', '現', '代', '價', '值']
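
The standard re module does not support \p{Han}. If installing regex is not an option, a common approximation is an explicit code point range. The sketch below assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is needed, so characters in the extension blocks are not matched.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is of interest.
han_basic = re.compile(r'[\u4e00-\u9fff]')

found = han_basic.findall('孔子/現代價值/Theory of "Knowing"')
print(found)  # ['孔', '子', '現', '代', '價', '值']
```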

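An interesting combination of regex and lambda

import re

dic = {'user': 'walker', 'domain': '163.com'}
rule = r'%user%@%domain%'
# Replace every %name% placeholder with the value stored under name in dic
email = re.sub('%[^%]*%', lambda matchobj: dic[matchobj.group(0).strip('%')], rule)
print('email: %s' % email)      # email: walker@163.com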
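
The lambda version raises KeyError when the template mentions a name that is not in the dictionary. A slightly more defensive sketch is shown below; render and substitute are names invented here, and leaving unknown placeholders untouched is just one possible policy.

```python
import re

def render(template, values):
    """Replace each %name% in template with values[name]; keep unknown names as-is."""
    def substitute(match):
        key = match.group(1)                        # text between the % signs
        return str(values.get(key, match.group(0)))
    return re.sub(r'%([^%]+)%', substitute, template)

values = {'user': 'walker', 'domain': '163.com'}
print(render('%user%@%domain%', values))  # walker@163.com
print(render('%user%@%host%', values))    # walker@%host%
```

Capturing the name with a group replaces the strip('%') call in the lambda version.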