Python regular expressions (re)

Date: 2019-12-14

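The article "Python正則表達式指南" (a guide to Python regular expressions) is very good and recommended reading.
This post is simply my own notes from learning re; the environment is Python 3.5.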

1. Basic re syntax

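^ matches the start of the string.
$ matches the end of the string.
\b matches a word boundary.
\d matches a digit.
\D matches any non-digit character.
x? matches an optional x, in other words zero or one x. The ? also has a second meaning: placed after a quantifier, it makes that quantifier lazy.
x* matches zero or more x.
x+ matches one or more x.
x{n,m} matches between n and m x: at least n and no more than m.
x{n} matches exactly n x.
(a|b|c) matches exactly one of a, b, or c.
(x) is a group; it remembers the string it matched.
\w matches a letter, digit, or underscore.
. matches any character except a newline (unless the re.DOTALL flag is set).
\s matches a whitespace character (space, tab, newline, and so on).
[] denotes a character set; [0-9a-zA-Z_] matches one digit, letter, or underscore.
[^] As the first character inside square brackets, ^ has a special meaning: negation. [^abc] means "any character except a, b, or c".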

>>> import re
>>> re.search('[^aeiou]y$', 'vacancy')
<_sre.SRE_Match object; span=(5, 7), match='cy'>
>>> re.search('[^aeiou]y$', 'boy')
>>>
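
A few of the syntax elements listed above can be tried out in a short script. This is only a sketch; the sample strings and variable names are made up for illustration.

```python
import re

# \d{3} requires exactly three digits; ^ and $ anchor both ends of the string.
three_digits = re.search(r'^\d{3}$', '800')
too_long = re.search(r'^\d{3}$', '8000')

# x{n,m}: between two and three 'a' characters.
two_to_three = re.search(r'^a{2,3}$', 'aaa')

# \b is a word boundary, so the 'cat' inside 'concatenate' is skipped.
whole_word = re.search(r'\bcat\b', 'concatenate a cat')

# (a|b|c) matches one of the alternatives and remembers it as group 1.
alternative = re.search(r'(a|b|c)', 'xyzb')

print(three_digits is not None)  # True
print(too_long is None)          # True
print(two_to_three.group())      # aaa
print(whole_word.span())         # (14, 17)
print(alternative.group(1))      # b
```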

2. Raw strings

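If you want to write a string in which backslashes get no special processing such as escape sequences, you need a raw string. A raw string is specified by prefixing the string with r or R, for example r"Newlines are indicated by \n".
Always use raw strings for regular expressions; otherwise you will need a lot of backslashes. For example, a backreference can be written as '\\1' or r'\1'.
Using Python's r prefix is strongly recommended, so that you do not have to think about escaping in the string literal.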
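
The difference shows up as soon as a pattern contains \b. The sketch below (sample text and variable names are my own) compares a normal string with a raw string:

```python
import re

# Without the r prefix, the backslash itself has to be escaped.
same_backreference = ('\\1' == r'\1')

# In a normal string '\b' is a single backspace character;
# in a raw string it stays as backslash + 'b', which re reads as a word boundary.
lengths = (len('\b'), len(r'\b'))

plain = re.findall('\bcat\b', 'a cat sat')  # searches for backspace, 'cat', backspace
raw = re.findall(r'\bcat\b', 'a cat sat')   # searches for the whole word 'cat'

print(same_backreference)  # True
print(lengths)             # (1, 2)
print(plain)               # []
print(raw)                 # ['cat']
```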

3. re.search(pattern, string, flags=0) and re.match(pattern, string, flags=0)

Help text for search(): Scan through string looking for a match to the pattern, returning a match object, or None if no match was found. Help text for match(): Try to apply the pattern at the start of the string, returning a match object, or None if no match was found.

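The most basic function in the re module is search() (match() is similar). It matches a regular expression against a string (here, 'M'). If the match succeeds, search() returns a match object.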

>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M')
<_sre.SRE_Match object; span=(0, 1), match='M'>

The match() function gives the same result here.

>>> pattern = '^M?M?M?$'
>>> re.match(pattern, 'M')
<_sre.SRE_Match object; span=(0, 1), match='M'>

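re.match() only matches at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search() scans through the whole string until it finds a match.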

#!/usr/bin/python
import re

line = "Cats are smarter than dogs"

# match() only tries the pattern at the start of the string.
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# search() scans the whole string for the first match.
matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# Output:
# No match!!
# search --> matchObj.group() :  dogs

4. re.compile(pattern, flags=0)

Compile a regular expression pattern, returning a pattern object.
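The compile() function of the re module compiles a regular expression and returns a pattern object.
The compiled pattern object already contains the regular expression, so you do not pass the regular expression again when calling its methods.
Match a string with the compiled pattern object using patternobj.search(string). If the match succeeds, calling groups() on the result returns a tuple of all the groups defined in the regular expression.
If search() does not find a match, it returns None, which has no groups() method, so calling None.groups() raises an exception.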

>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

>>> dir(phonePattern)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'findall', 'finditer', 'flags', 'fullmatch', 'groupindex', 'groups', 'match', 'pattern', 'scanner', 'search', 'split', 'sub', 'subn']

>>> phonePattern
re.compile('^(\\d{3})-(\\d{3})-(\\d{4})$')
>>> m = phonePattern.search('800-555-1212')
>>> m
<_sre.SRE_Match object; span=(0, 12), match='800-555-1212'>
>>> m.groups()
('800', '555', '1212')

# Match fails
>>> phonePattern.search('800-555-1212-1234').groups()
Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    phonePattern.search('800-555-1212-1234').groups()
AttributeError: 'NoneType' object has no attribute 'groups'
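
One way to avoid this AttributeError is to check for None before calling groups(). The helper name parse_phone below is hypothetical, and this is only a minimal sketch of that check:

```python
import re

phone_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

def parse_phone(text):
    """Return the three number groups, or None when text does not match."""
    m = phone_pattern.search(text)
    if m is None:
        # No match: return None instead of calling groups() on None.
        return None
    return m.groups()

print(parse_phone('800-555-1212'))       # ('800', '555', '1212')
print(parse_phone('800-555-1212-1234'))  # None
```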

5. re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.

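The re.sub() function of the regular expression module performs string substitution. Below, it searches the string s for the regular expression 'ROAD$' and replaces it with 'RD.'.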

>>> s = '100 NORTH BROAD ROAD'
>>> import re
>>> re.sub('ROAD$', 'RD.', s)
'100 NORTH BROAD RD.'

>>> re.sub('[abc]', 'o', 'caps')
'oops'

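You might expect this call to turn caps into oaps, but that is not what happens. re.sub() replaces every match, not just the first one. So this regular expression turns caps into oops, because both c and a are replaced with o.
The replacement string in the next example uses a new syntax: \1, which means "hey, that first remembered group? Put it here." In that example, the c before the y is remembered; during substitution, c is replaced with c and y is replaced with ies. (If there is more than one remembered group, you can use \2, \3, and so on.)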

>>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy')
'vacancies'
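
The help text above also says that repl may be a callable, and the signature includes a count parameter. The sketch below tries both; the function name double and the sample strings are my own, chosen only for illustration.

```python
import re

def double(match):
    # Receives the match object and returns the replacement string.
    return str(int(match.group()) * 2)

# Callable repl: every run of digits is replaced by twice its value.
doubled = re.sub(r'\d+', double, '3 apples and 12 pears')

# count=1 limits the substitution to the first match only.
first_only = re.sub('[abc]', 'o', 'caps', count=1)

print(doubled)     # 6 apples and 24 pears
print(first_only)  # oaps
```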

6. re.findall(pattern, string, flags=0)

Return a list of all non-overlapping matches in the string.
    
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    
    Empty matches are included in the result.

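re.findall() takes a regular expression and a string as arguments and finds every place in the string where the pattern occurs. In this example, the pattern matches sequences of digits. findall() returns a list of all the substrings that match the pattern.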

>>> re.findall('[0-9]+', '16 2-by-4s in rows of 8')
['16', '2', '4', '8']

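Here the regular expression matches sequences of uppercase letters. Again, the return value is a list in which each element is a string that matched the regular expression.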

>>> re.findall('[A-Z]+', 'SEND + MORE == MONEY')
['SEND', 'MORE', 'MONEY']
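
As the help text notes, the shape of the result changes once the pattern contains capturing groups. A small sketch with a made-up sample string:

```python
import re

text = 'width=640, height=480'

# No group: each element is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One group: each element is what that group captured.
one_group = re.findall(r'\w+=(\d+)', text)

# Two groups: each element is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(whole)       # ['width=640', 'height=480']
print(one_group)   # ['640', '480']
print(two_groups)  # [('width', '640'), ('height', '480')]
```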

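Lazy and greedy modes
.* is greedy: it first consumes as much as it can, then backtracks as far as the rest of the regular expression requires.
.*? does the opposite: it consumes as little as possible and extends only when the rest of the regular expression cannot match yet, so it gives the shortest match.
One use of ? is to match zero or one time. ? also has a second meaning as the lazy modifier: an extra ? placed after a quantifier (*, +, ?, or {n,m}) makes that quantifier lazy.
Regular expressions in re have two modes, greedy (the default) and lazy. Take the following string as an example:
(abc)dfe(gh)
Applying \(.*\) to this string (where \( and \) match literal parentheses) matches the whole string, because by default a regular expression matches as much as possible.
(abc) satisfies the expression, but so does (abc)dfe(gh), and the regular expression picks the longer one.
To match only (abc) and (gh), use the expression \(.*?\) instead.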

Lazy matching examples:

>>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
[' sixth s', " sheikh's s", " sheep's s"]
>>> re.findall(' s.* s', "The sixth sick sheikh's sixth sheep's sick.")
[" sixth sick sheikh's sixth sheep's s"]

With a group added:

>>> re.findall(' s(.*?) s', "The sixth sick sheikh's sixth sheep's sick.")
['ixth', "heikh's", "heep's"]
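
The (abc)dfe(gh) string from the explanation above can be checked the same way. This sketch assumes the parentheses in that string are meant literally, so they are escaped in the patterns:

```python
import re

text = '(abc)dfe(gh)'

# Greedy: .* runs from the first '(' to the last ')'.
greedy = re.findall(r'\(.*\)', text)

# Lazy: .*? stops at the nearest ')'.
lazy = re.findall(r'\(.*?\)', text)

# A capturing group inside the escaped parentheses returns only the contents.
inner = re.findall(r'\((.*?)\)', text)

print(greedy)  # ['(abc)dfe(gh)']
print(lazy)    # ['(abc)', '(gh)']
print(inner)   # ['abc', 'gh']
```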
