Using Regular Expressions in Python (Repost)

Date: 2019-11-05


Original article link:

http://www.javashuo.com/article/p-zzacjhkc-p.html

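Using Regular Expressions in Python (Part 1)

In Python, regular expressions are used through the built-in re library, which provides the full set of regular-expression features.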

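I. A note up front: escaping

Regular expressions use "\" as the escape character, and so do Python string literals. When a special character needs escaping, you have to work out how many backslashes are actually required. To avoid this, it is strongly recommended to write regular expressions as raw strings.

This is simple: just put an "r" in front of the expression, as follows: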

r'\d{2}-\d{8}'

r'\bt\w*\b'
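
To make the escaping point concrete, here is a minimal sketch comparing a raw-string pattern with the same pattern written without the r prefix. The sample text and variable names are illustrative and not from the original article.

```python
import re

text = "this is a test"

raw_pattern = r'\bt\w*\b'          # raw string: backslashes reach re unchanged
escaped_pattern = '\\bt\\w*\\b'    # the same pattern with every backslash doubled
same = raw_pattern == escaped_pattern

words = re.findall(raw_pattern, text)
print(same, words)

# Without the r prefix, '\b' is the backspace character, not a word boundary,
# so this pattern looks for a literal backspace followed by "t".
backspace_pattern = '\bt'
no_match = re.findall(backspace_pattern, text)
print(no_match)
```

The raw form and the doubled-backslash form are the same string; the raw form is simply easier to read and harder to get wrong.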

II. Commonly used functions of the re library

1. re.match()

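Matches from the start of the string. On success it returns a match object; otherwise it returns None.

Syntax: re.match(pattern, string, flags=0)
pattern: the regular expression to match
string: the string to match against
flags: flags that control how the pattern is matched, such as case-insensitivity or multi-line matching; flags=0 means no special options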

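Optional flags:

Flags are optional modifiers, for example re.I (ignore case) and re.M (multi-line). Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags.

Example: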

Without flags:
>>> import re
>>> re.match(r'\d{2}','123')
<_sre.SRE_Match object; span=(0, 2), match='12'>
>>> re.match(r'\d{2}','ab123')
>>> print(re.match(r'\d{2}','ab123'))
None

With a flag:
>>> re.match(r'a','ab123').group()
'a'
>>> re.match(r'a','Ab123').group()
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    re.match(r'a','Ab123').group()
AttributeError: 'NoneType' object has no attribute 'group'
>>> re.match(r'a','Ab123',re.I).group()
'A'
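
Combining flags with | can be sketched as follows. The sample text is made up for illustration; re.I and re.M are the standard flags for ignoring case and for letting ^ match at the start of every line.

```python
import re

text = "first line\nSecond line\nSECOND try"

# re.I ignores case; re.M lets ^ match at the start of each line
both_flags = re.findall(r'^second', text, re.I | re.M)
print(both_flags)

# With re.I alone, ^ only matches at the very start of the string
only_ignorecase = re.findall(r'^second', text, re.I)
print(only_ignorecase)
```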

2. re.search()

Scans the whole string and returns a match object for the first successful match; otherwise it returns None.

Syntax: re.search(pattern, string, flags=0)

Example:

>>> re.search(r'\d{2}','Ab123')
<_sre.SRE_Match object; span=(2, 4), match='12'>
>>> re.search(r'\d{2}','Abcde')
>>> print(re.search(r'\d{2}','Abcde'))
None

As shown, match() and search() return a match object, and the matched text can be obtained with its group() method.

>>> re.search(r'\d{2}','Ab12c34d56e78').group()
'12'
>>> re.match(r'\d{2}','12c34d56e78').group(0)
'12'

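group() is the same as group(0): it returns the whole match of the regular expression, that is, all of the matched characters.

group() is more often used together with groups. If the regular expression defines groups (each left parenthesis "(" starts one; see an introduction to regular expressions for details), group() can be used on the match object to extract the substrings. The usage of group() and groups() is covered separately later; this is only a brief introduction.

Difference between re.match and re.search:

re.match only matches at the start of the string: if the start of the string does not fit the pattern, the match fails and the function returns None. re.search scans the whole string until it finds a match (note: only the first one).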
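
The difference can be sketched in a few lines. Because both functions return None on failure, the sketch also checks the result before calling group(), which avoids the AttributeError seen earlier. The helper name first_two_digits is hypothetical and exists only for this illustration.

```python
import re

def first_two_digits(text, anchored):
    """Return the first two-digit run, or None when nothing matches."""
    finder = re.match if anchored else re.search
    m = finder(r'\d{2}', text)
    # Both functions return None on failure, so test before calling group()
    return m.group() if m else None

at_start = first_two_digits('Ab123', anchored=True)    # match: start only
anywhere = first_two_digits('Ab123', anchored=False)   # search: whole string
print(at_start, anywhere)
```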

3. re.findall()

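Finds all substrings of the string that match the regular expression and returns them as a list. If nothing matches, it returns an empty list.

Note: match and search match once, whereas findall finds all matches.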

>>> re.findall(r'\d{2}','21c34d56e78')
['21', '34', '56', '78']
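
One detail worth knowing, shown here as a small sketch with made-up input: when the pattern contains capturing groups, findall returns the groups rather than the whole match, as a tuple per match when there is more than one group.

```python
import re

text = 'a=1, b=22, c=333'

# No groups: each list item is the whole match
whole = re.findall(r'\w+=\d+', text)
print(whole)

# Two groups: each list item is a (name, value) tuple
pairs = re.findall(r'(\w+)=(\d+)', text)
print(pairs)
```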

4. re.finditer()

Similar to findall: finds all substrings that match the regular expression, but returns them as an iterator of match objects.

Example:

>>> match = re.finditer(r'\d{2}','21c34d56e78')
>>> for t in match:
    print(t.group())

    
21
34
56
78
>>>
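
Because finditer yields match objects rather than plain strings, each result also carries its position. A minimal sketch using the same input as above; the variable name found is illustrative.

```python
import re

# Collect each match together with its (start, end) position in the string
found = [(m.group(), m.span()) for m in re.finditer(r'\d{2}', '21c34d56e78')]
for value, span in found:
    print(value, span)
```

This is the main reason to prefer finditer over findall when the location of each match matters.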

5. re.split()

Splits the string wherever the regular expression matches a separator and returns the resulting pieces as a list.

Example:

>>> match = re.split(r'\.|-','hello-world.data')   # use . or - as the separator
>>> print(match)
['hello', 'world', 'data']

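Strings also have a split method. For comparison:

The string split method:
>>> 'a b   c'.split(' ')  # there are 3 spaces between b and c
['a', 'b', '', '', 'c']

If the spaces are hard to follow, replace them with x: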
>>> 'axbxxxc'.split('x')
['a', 'b', '', '', 'c']
>>>

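As shown, the string split method with an explicit separator treats every single space as a delimiter, so consecutive spaces produce empty strings. (Calling split() with no argument does split on runs of whitespace.)

With a regular expression:

>>> re.split(r'\s+', 'a b   c')  # \s+ matches one or more whitespace characters (\s is whitespace, + means one or more repetitions)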
['a', 'b', 'c']
>>>
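
Two further options of re.split, sketched here with the earlier sample string: the maxsplit argument limits the number of splits, and wrapping the separator in a capturing group keeps the separators in the result.

```python
import re

text = 'hello-world.data'

# maxsplit=1: split at the first separator only
first_only = re.split(r'\.|-', text, maxsplit=1)
print(first_only)

# A capturing group around the separator keeps the separators in the list
with_separators = re.split(r'(\.|-)', text)
print(with_separators)
```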

6. re.sub()

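Replaces the matches of a pattern in a string.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

pattern: the regular expression pattern.
repl: the replacement string; it can also be a function.
string: the original string to search and replace in.
count: the maximum number of replacements; the default 0 replaces all matches.
flags: the same matching flags as for re.match.

Example:

>>> match = re.sub(r'a', 'b','aaccaa')   # replace every a in the string with b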
>>> print(match)
bbccbb
>>>
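
The count and function forms of repl described above can be sketched as follows. The doubling function is a made-up example; it receives a match object and returns the replacement text.

```python
import re

# count=2: replace only the first two matches
limited = re.sub(r'a', 'b', 'aaccaa', count=2)
print(limited)

def double(match):
    """Hypothetical repl function: double the number that was matched."""
    return str(int(match.group()) * 2)

# repl as a function: it is called once for each match
doubled = re.sub(r'\d+', double, 'a1b22')
print(doubled)
```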

References: https://www.cnblogs.com/yan-lei/p/7653362.html and the Runoob tutorial (菜鳥教程)

This section covers the compile() function and the group() method.

1. re.compile()

The compile function compiles a regular expression into a Pattern object, which can then be used to match strings.

Its signature:
>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.
>>>

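pattern: a regular expression in string form
flags: optional; the matching mode, such as ignoring case or multi-line mode

Example:

>>> test_pattern = re.compile(r'\d{2}')   # compile a regular expression and assign it to a variable
>>> m = test_pattern.match('12bc34')   # match a string directly with the compiled pattern object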
>>> m
<_sre.SRE_Match object; span=(0, 2), match='12'>

>>> test_pattern = re.compile(r'a\w+')  # build a pattern object (an a followed by one or more word characters)
>>> m = test_pattern.findall('apple,blue,alone,shot,attack') # use findall() to get every substring that fits the pattern
>>> m
['apple', 'alone', 'attack']

2.group()和groups()

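After matching with match() or search(), you usually call group() on the resulting match object to get the matched text. It can also extract the strings captured by groups (parentheses () create groups in a regular expression).

Example:

>>> pattern = re.compile(r'^(\d{3})-(\d{3,8})$')  # matches a string that starts with 3 digits, then a -, then 3 to 8 digits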
>>> m = pattern.match('020-1234567')
>>> m
<_sre.SRE_Match object; span=(0, 11), match='020-1234567'>
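>>> m.group()   # the whole matched text
'020-1234567'
>>> m.group(0)  # also the whole matched text
'020-1234567'
>>> m.group(1)   # the substring captured by group 1
'020'
>>> m.group(2)   # the substring captured by group 2
'1234567'
>>> m.group(3)   # there is no group 3, so this raises an error: no such group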
Traceback (most recent call last):
  File "<pyshell#73>", line 1, in <module>
    m.group(3)
IndexError: no such group

>>> m.groups()
('020', '1234567')
>>>
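
Groups can also be given names, which is often easier to read than numeric indexes. A minimal sketch using the same phone-number pattern; the group names area and number are chosen here for illustration and do not appear in the original article.

```python
import re

# (?P<name>...) creates a named group; names are illustrative
pattern = re.compile(r'^(?P<area>\d{3})-(?P<number>\d{3,8})$')
m = pattern.match('020-1234567')

area = m.group('area')      # look a group up by name
both = m.group(1, 2)        # several indexes at once return a tuple
as_dict = m.groupdict()     # all named groups as a dict
print(area, both, as_dict)
```

Named groups still have numbers, so m.group(1) and m.groups() keep working as shown above.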