Getting to Know Python's Regular Expression Module re

Date: 2019-11-19

Python's regular expression module re

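　　The matching, searching, and replacing functions built into Python's re module are listed in the blog post "python附錄-re.py模塊源碼（含re官方文檔連接）" (Python appendix: source of the re.py module, with a link to the official re documentation).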

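　　Regular expressions are used to process strings. The earlier post "python-基礎學習篇（二）" (Python basics, part 2) covered the built-in methods of the string type, but note that those methods only suit simple string handling; complex string processing remains the domain of regular expressions. As for why the built-in methods exist at all, my personal view is that simple string handling in simple applications should not require a full, systematic knowledge of regular expressions, and that this is also part of what makes Python easy to pick up.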

　　Python's regular expression support lives in the re module, which must be imported before use.

import re

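　　With the regular expression basics covered earlier, we can already write some regular expressions (pattern). How do we apply one to a string (string)? Through the functions built into the re module.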

　　Functions built into the re module

　　re.compile(pattern, flags=0)

　　re.compile() compiles a regular expression into a pattern object (PatternObj). The returned pattern object is what the other string-processing methods are called on.

pattern_obj = re.compile(pattern)
match_obj = pattern_obj.match(string)

　　is equivalent to

match_obj = re.match(pattern,string)

　　In fact, re.match() compiles the pattern internally as part of its work. Source of the match function:

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

　　As the source shows, what re.match() returns is simply the result of calling match() on the compiled pattern object pattern_obj.
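　　The equivalence described above can be checked with a small sketch. The pattern and the sample string are illustrative choices, not taken from the original post; the point is only that a pattern compiled once can be reused, and that its methods mirror the module-level functions.

```python
import re

# Compile once, then reuse the pattern object for several calls.
word_pattern = re.compile(r'[a-z]+', re.I)

# The pattern object's match() mirrors the module-level re.match().
via_object = word_pattern.match('Snow Stack')
via_module = re.match(r'[a-z]+', 'Snow Stack', re.I)

print(via_object.group(0))   # Snow
print(via_module.group(0))   # Snow

# The same compiled object can be reused with other methods.
print(word_pattern.findall('Snow Stack 42'))  # ['Snow', 'Stack']
```

　　Compiling explicitly is mostly a readability and reuse choice when the same pattern is applied many times.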

　　re.search(pattern, string, flags=0)

　　re.search() scans the whole string for the first location where the regular expression matches and returns a match object (MatchObject); if nothing matches, it returns None.

import re


pattern = r'the'
match_obj = re.search(pattern, 'The dog is eating the bone', re.I)
print(match_obj.group(0))
print(match_obj)


# The
# <re.Match object; span=(0, 3), match='The'>

　　re.match(pattern, string, flags=0)

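　　re.match() matches only at the beginning of the string. If the regular expression matches some portion starting at the first character, it returns a match object (MatchObject); otherwise it returns None.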

import re


pattern = r'the'
match_obj = re.match(pattern, 'Dog is eating the bone', re.I)
print(match_obj)

# None

　　For comparison:

import re


pattern = r'the'
match_obj = re.match(pattern, 'The dog is eating the bone', re.I)
print(match_obj.group(0))
print(match_obj)

# The
# <re.Match object; span=(0, 3), match='The'>

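　　Differences between re.search() and re.match(). Position: search() can match anywhere in the string, while match() must match at the beginning of the string and fails as soon as the beginning does not match. Result: search() returns a match object for the first (leftmost) part of the string that satisfies the pattern, which is a separate matter from greedy or non-greedy quantifiers; match() returns only a match that starts at the beginning of the string. Multi-line mode: with re.M, match() still matches only at the beginning of the string, whereas search() combined with "^" can match at the start of every line.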
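　　The multi-line behaviour can be seen in a short sketch. The two-line sample text is made up for illustration.

```python
import re

text = 'first line\nsecond line'

# match() is anchored at the very start of the string, even with re.M.
match_result = re.match(r'second', text, re.M)

# search() with '^' and re.M can match at the start of any line.
search_result = re.search(r'^second', text, re.M)

print(match_result)          # None
print(search_result.span())  # (11, 17)
```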

　　re.fullmatch(pattern, string, flags=0)

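　　re.fullmatch() is like re.match() in that matching starts at the beginning of the string. re.match() may match a prefix of the string or all of it, whereas re.fullmatch() returns a match object (MatchObject) if and only if the regular expression matches the entire string; otherwise it returns None.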
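　　A minimal sketch of the difference, using an illustrative three-digit pattern and made-up input strings:

```python
import re

pattern = r'\d{3}'

# match() succeeds as long as the start of the string matches.
prefix_only = re.match(pattern, '12345')

# fullmatch() requires the pattern to cover the whole string.
whole_fail = re.fullmatch(pattern, '12345')
whole_ok = re.fullmatch(pattern, '123')

print(prefix_only.group(0))  # 123
print(whole_fail)            # None
print(whole_ok.group(0))     # 123
```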

　　re.split(pattern, string, maxsplit=0, flags=0)

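　　re.split() splits string at every place the regular expression pattern matches and returns the pieces as a list. maxsplit is the maximum number of splits: the default 0 means no limit, while a nonzero value makes at most maxsplit splits, working from left to right, and returns the rest of the string as the last list element.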

import re


split_list_default = re.split(r'\W+', 'Words, words, words.')
print(split_list_default)

# ['Words', 'words', 'words', '']
# \W+ splits on runs of one or more non-word characters. The trailing empty
# string is what is left after the final separator '.'.

split_list_max = re.split(r'\W+', 'Words, words, words.', maxsplit=1)
print(split_list_max)

# ['Words', 'words, words.']
# With maxsplit given, splitting runs left to right and stops after at most
# maxsplit splits; the remainder is returned unsplit.

split_list_couple = re.split(r'(\W+)', 'Words, words, words.')
print(split_list_couple)

# ['Words', ', ', 'words', ', ', 'words', '.', '']
# With a capturing group in the pattern, the separators matched by (\W+),
# such as ', ', are included in the returned list as well.

　　re.findall(pattern, string, flags=0)

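　　re.findall() is similar to re.search(). re.search() returns a match object (MatchObject) for the first part of the string that matches the regular expression, whereas re.findall() finds all non-overlapping matches in the string and returns them as a list, in the order they are found from left to right. If nothing matches, it returns an empty list.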

import re


pattern = r'\d{3}'
find = re.findall(pattern, 'include21321exclude13243alert213lib32')
print(find)

# ['213', '132', '213']

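　　Note: when the regular expression passed to re.findall() contains two or more groups, each match contributes a tuple holding the groups' contents in left-to-right order, so the returned list is a list of tuples.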

import re


pattern = r'(\d{3})(1)'
find = re.findall(pattern, 'include21321exclude13243alert213lib32')
print(find)

# [('132', '1')]

　　re.finditer(pattern, string, flags=0)

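　　re.finditer() is similar to re.findall(): it searches the string for everything that matches the regular expression, but returns an iterator (Iterator) that yields a match object (MatchObject) for each match. In other words, each piece of matched text is wrapped in a match object, and the match objects are delivered through the iterator.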

import re


pattern = r'\d{3}'
find = re.finditer(pattern, 'include21321exclude13243alert213lib32')
print(find)
for i in find:
    print(i)
    print(i.group(0))

# <callable_iterator object at 0x00000000028FB0F0>
# <re.Match object; span=(7, 10), match='213'>
# 213
# <re.Match object; span=(19, 22), match='132'>
# 132
# <re.Match object; span=(29, 32), match='213'>
# 213

　　re.sub(pattern, repl, string, count=0, flags=0)

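　　re.sub() finds the parts of string that match the regular expression, replaces them with repl, and returns the resulting string. Matching runs from left to right and may find several matches, and replacement proceeds in the same order. To replace only the leftmost matches, set count, which must be a non-negative integer: the default 0 replaces every match, and a value larger than the number of matches simply replaces all of them. If nothing matches, no replacement is made and the original string is returned.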

import re


pattern = r'\d+'
find_default = re.sub(pattern, ' ', 'include21321exclude13243alert213lib32')
print(find_default)

find_count = re.sub(pattern, ' ', 'include21321exclude13243alert213lib32', 2)
print(find_count)

# include exclude alert lib
# include exclude alert213lib32

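　　Note: repl may be a string or a function. If it is a function, it must accept a single match object (MatchObject) argument; it is called for each match, and the string it returns is used as the replacement in the result.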

import re


def replace_func(match_obj):
    if match_obj.group(0).isdigit():
        return ' '
    else:
        return '-'


pattern = r'\d+'
find_default = re.sub(pattern, replace_func,  'include21321exclude13243alert213lib32')
print(find_default)

# include exclude alert lib

　　re.subn(pattern, repl, string, count=0, flags=0)

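　　re.subn() does the same job as re.sub() and differs only in its return value: re.sub() returns the string after replacement, whereas re.subn() returns a tuple of the string after replacement and the number of replacements made.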

import re


def replace_func(match_obj):
    if match_obj.group(0).isdigit():
        return ' '
    else:
        return '-'


pattern = r'\d+'
find_default = re.subn(pattern, replace_func,  'include21321exclude13243alert213lib32')
print(find_default)

# ('include exclude alert lib ', 4)

　　re.escape(pattern)

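　　Escapes the characters in pattern that have special meaning in a regular expression. This is mainly useful when the literal text to be matched may itself contain regular expression metacharacters.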

import re


result = re.escape(r'\d*')
print(result)

# \\d\*
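　　A typical use is searching for literal text that happens to contain metacharacters. The sketch below uses a made-up sentence and the literal '1+1=2', in which '+' would otherwise be read as a quantifier.

```python
import re

literal = '1+1=2'
text = 'We know that 1+1=2, while 11=2 is false.'

# Unescaped, '1+' means "one or more 1s", so the pattern matches '11=2'.
unescaped = re.findall(literal, text)

# Escaped, '+' is treated as a plain character.
escaped = re.findall(re.escape(literal), text)

print(unescaped)  # ['11=2']
print(escaped)    # ['1+1=2']
```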

　　re.purge()

　　Clears the regular expression cache.

　　The flags parameter

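　　The functions above take a parameter flags with a default of 0. Passing a different value when calling the function selects a matching mode. Commonly used values are: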

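　　re.I (re.IGNORECASE): case-insensitive matching;

　　re.M (re.MULTILINE): multi-line mode, in which "^" and "$" also match at the start and end of each line;

　　re.S (re.DOTALL): single-line mode, in which "." also matches a newline;

　　re.X (re.VERBOSE): verbose mode, in which whitespace in the pattern is ignored and comments are allowed.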
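　　The four flags listed above can be tried out as follows. The sample text and patterns are illustrative; flags are combined with the "|" operator.

```python
import re

text = 'Name: Snow\nname: Stack'

# re.I ignores case; re.M lets '^' match at the start of every line.
names = re.findall(r'^name: (\w+)', text, re.I | re.M)
print(names)  # ['Snow', 'Stack']

# Without re.S, '.' does not match the newline between the two lines.
no_dotall = re.search(r'Snow.name', text)
dotall = re.search(r'Snow.name', text, re.S)
print(no_dotall)              # None
print(repr(dotall.group(0)))  # 'Snow\nname'

# re.X ignores whitespace in the pattern and allows '#' comments.
verbose = re.compile(r'''
    (\w+)   # key
    :\s     # separator
    (\w+)   # value
''', re.X)
print(verbose.match(text).groups())  # ('Name', 'Snow')
```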

　　Pattern objects (Pattern)

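　　A pattern object can call the methods above directly, as described under re.compile() and re.match(). Everything from match() through subn() is available as an instance method of a Pattern object. The instance attributes of a Pattern object are: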

　　Pattern.flags

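　　Returns the matching flags the pattern was compiled with. The attribute is read-only: flags are set by passing them to re.compile(), for example re.compile(pattern, re.I), so Pattern.flags is used only to inspect the matching mode.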

　　Pattern.groups

　　Returns the number of capturing groups.

　　Pattern.pattern

　　Returns the original regular expression string.
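　　A small sketch of reading these three attributes. The pattern is illustrative; the sketch assumes CPython, where attributes of a compiled pattern cannot be assigned, and it tests the flag with "&" because the integer value of flags may include other bits besides the ones passed in.

```python
import re

pattern_obj = re.compile(r'(\w+) (\w+)', re.I)

print(pattern_obj.pattern)   # (\w+) (\w+)
print(pattern_obj.groups)    # 2

# Check for a specific flag rather than comparing the whole integer.
print(bool(pattern_obj.flags & re.I))  # True

# Assumption: assigning to the attribute is rejected (CPython behaviour).
try:
    pattern_obj.flags = re.M
    assignable = True
except AttributeError:
    assignable = False
print(assignable)  # False
```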

　　Match objects (MatchObject)

　　A match object wraps the content that was matched.

　　MatchObject.group(num)

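　　Returns the matched content held by the match object. group(0) returns the entire match; a value of 1 or greater returns the content of the corresponding capturing group.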

import re


pattern = r'(\w+) (\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj.group(0))
print(match_obj.group(1))
print(match_obj.group(2))

# Snow Stack
# Snow
# Stack

　　MatchObject.__getitem__(num)

　　Same as MatchObject.group(num), written with index syntax: match_obj[num].

import re


pattern = r'(\w+) (\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj[0])
print(match_obj[1])
print(match_obj[2])

# Snow Stack
# Snow
# Stack

　　MatchObject.groups()

　　Returns the contents of all capturing groups as a tuple. Only what the capturing groups matched is included, not the rest of the match.

import re


pattern = r'(\w+) (\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj.groups())

# ('Snow', 'Stack')

　　MatchObject.groupdict()

　　Returns a dictionary of all named groups: each key is a group name and each value is the content that group matched.

import re


pattern = r'(?P<first_name>\w+) (?P<last_name>\w+)'
match_obj = re.match(pattern, 'Snow Stack')
print(match_obj.groupdict())

# {'first_name': 'Snow', 'last_name': 'Stack'}