Python 3.5 Regular Expressions

Date: 2019-11-30


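When we browse the web, we often have to register an account on a site, and registration usually places some requirements on the account information.

For example, a sign-up form may let the registration succeed only when the email address, password, and phone number entered all meet its requirements.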

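If we are building the site ourselves, we need to check whether the user's input is valid.

So how do we check whether what the user typed is valid? Write our own functions that test each rule in turn?

Python gives us a more convenient tool: the re module, that is, the regular expression module.

Before using the re module, we need some basic knowledge of regular expressions.

What is a regular expression?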

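A regular expression is essentially code that describes a rule for strings. For example, the rule for our mobile phone numbers is: 11 digits, starting with 1. The simplest regular expression is an ordinary string, which matches itself. For instance, the regular expression 'hello' matches the string 'hello'.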
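The phone-number rule above can be sketched as a pattern. This is only an illustration: the helper name is_valid_phone and the sample numbers are made up, and the rule checked is exactly the one stated in the text (11 digits, the first being 1), not a full validation of real numbers.

```python
import re

# The rule from the text: 11 digits, the first of which is 1.
# [0-9] is used so that only ASCII digits are accepted.
PHONE_RE = re.compile(r"1[0-9]{10}")


def is_valid_phone(text):
    """Return True when the whole string is an 11-digit number starting with 1."""
    return PHONE_RE.fullmatch(text) is not None


print(is_valid_phone("13800138000"))   # True
print(is_valid_phone("23800138000"))   # False: does not start with 1
print(is_valid_phone("1380013800"))    # False: only 10 digits
```

fullmatch() is used so that extra characters before or after the number make the check fail.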

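Note: a regular expression is not a program, nor is it part of Python itself. It is a pattern language for processing strings, with its own syntax rules, and it is very powerful. Programming languages that offer regular expressions share largely the same core syntax; they differ mainly in how much of that syntax each one supports.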

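Let's go through the basics of regular expressions first.

With the basics in place, we can learn how to use the re module.

The re module gives us a number of functions and constants.

Matching flags:

Regular expressions offer several matching modes, such as case-insensitive matching and multi-line matching: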

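re.U (re.UNICODE): match using Unicode rules. This is the default for str patterns in Python 3, and it affects \w, \W, \b, \B, \d, \D, \s and \S.
re.A (re.ASCII): make \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only instead of Unicode characters.
re.I (re.IGNORECASE): ignore case.
re.L (re.LOCALE): make the predefined classes \w, \W, \b, \B, \s and \S depend on the current locale.
re.M (re.MULTILINE): multi-line mode. It changes the behavior of '^' and '$': '^' additionally matches at the start of each line (right after a newline), and '$' additionally matches at the end of each line (right before a newline).
re.S (re.DOTALL): dot-matches-all mode. It changes the behavior of '.', letting it match any character, including '\n'.
re.X (re.VERBOSE): verbose mode. The pattern may span several lines, whitespace in it is ignored, and comments may be added. The following two regular expressions are equivalent: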

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
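To make the effect of the most common flags concrete, here is a small sketch. The sample text is made up for illustration; the flags and functions are the ones from the re module described above.

```python
import re

text = "Hello\nworld"

# re.I: case-insensitive matching.
print(re.findall(r"hello", text))          # []
print(re.findall(r"hello", text, re.I))    # ['Hello']

# re.M: '^' also matches right after each newline.
print(re.findall(r"^\w+", text))           # ['Hello']
print(re.findall(r"^\w+", text, re.M))     # ['Hello', 'world']

# re.S: '.' also matches the newline character.
print(re.search(r"Hello.world", text))                      # None
print(repr(re.search(r"Hello.world", text, re.S).group()))  # 'Hello\nworld'

# Flags are combined with the bitwise OR operator.
print(re.findall(r"^[a-z]+$", text, re.I | re.M))           # ['Hello', 'world']
```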

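Now let's look at how to use the re module.

The re module gives us many functions; here are the most commonly used ones:

re.search(pattern, string, flags=0)

Scans the whole string and returns a match object for the first place where the pattern matches, or None if nothing matches.

pattern : the regular expression to use

string : the string to search

flags : controls how the regular expression matches, for example whether matching is case-sensitive

Example:

>>> import re
>>> str1 = "Hello world"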
>>> print(re.search(r"e", str1))
<_sre.SRE_Match object; span=(1, 2), match='e'>

re.match(pattern, string, flags=0)

Matches only at the beginning of the string and returns a match object for the matched text. Returns None when there is no match.

Example:

>>> str1 = "Hello world"
>>> print(re.match(r"e", str1))
None
>>> print(re.match(r"He", str1))
<_sre.SRE_Match object; span=(0, 2), match='He'>

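re.fullmatch(pattern, string, flags=0)

If the regular expression matches the entire string, returns the corresponding match object; otherwise returns None. Note that None is different from a zero-length match.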

>>> str1 = "Hello world"
>>> print(re.fullmatch(r"[a-z ]", str1, re.I))
None
>>> print(re.fullmatch(r"[a-z ]+", str1, re.I))
<_sre.SRE_Match object; span=(0, 11), match='Hello world'>

re.split(pattern, string, maxsplit=0, flags=0)

Splits the string at each match of the pattern and returns a list. If maxsplit is non-zero, at most maxsplit splits are made, and the rest of the string becomes the last element of the list.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

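## If the pattern contains a capturing group and matches at the very start of the string, the first element of the result is an empty string (the same holds at the end of the string).
>>> re.split(r'(\W+)', '...words, words...')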
['', '...', 'words', ', ', 'words', '...', '']


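## Note: in Python 3.5 the zero-length matches of x* are ignored here (since Python 3.7, split() also splits on empty matches, so the result differs)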
>>> re.split('x*', 'axbc')
['a', 'bc']

## A pattern that can only match the empty string never splits the string. Because that is rarely what is intended, Python 3.5 raises ValueError
>>> re.split("^$", "foo\n\nbar\n", flags=re.M)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
ValueError: split() requires a non-empty pattern match.

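re.findall(pattern, string, flags=0)

Returns a list of all non-overlapping matches in the string. If the pattern contains groups, the list holds the text of the groups (tuples when there is more than one group) instead of the whole match.

Example: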

>>> str1 = "Hello world"
>>> print(re.findall(r"e", str1))
['e']
>>> print(re.findall(r"z", str1))
[]
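What findall() returns depends on how many capturing groups the pattern has. A minimal sketch, with a made-up sample string:

```python
import re

text = "width=640, height=480"

# No group: each item is the whole match.
print(re.findall(r"\w+=\d+", text))        # ['width=640', 'height=480']

# One group: each item is the text of that group.
print(re.findall(r"\w+=(\d+)", text))      # ['640', '480']

# Several groups: each item is a tuple of the groups.
print(re.findall(r"(\w+)=(\d+)", text))    # [('width', '640'), ('height', '480')]
```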

re.finditer(pattern, string, flags=0)

Returns an iterator that yields a match object for every non-overlapping match.

Example:

>>> str1 = "Hello python, hello python."
>>> ite = re.finditer(r"[a-z]+", str1 , re.I )
>>> print(ite)
<callable_iterator object at 0x7f79690b92e8>
>>> for i in ite:print(i)
... 
<_sre.SRE_Match object; span=(0, 5), match='Hello'>
<_sre.SRE_Match object; span=(6, 12), match='python'>
<_sre.SRE_Match object; span=(14, 19), match='hello'>
<_sre.SRE_Match object; span=(20, 26), match='python'>

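re.sub(pattern, repl, string, count=0, flags=0)

Replaces every match in string with repl and returns the resulting string.

When repl is a string, it can refer to groups with \id, \g<id> or \g<name>. \0 is not a group reference; use \g<0> to insert the whole match.
When repl is a function, it must accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded).
count sets the maximum number of replacements; when it is omitted, all matches are replaced.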

>>> str1 = "hello 123, hello 456."
>>> re.sub(r"\d+", "world",str1)
'hello world, hello world.'
>>> def func(matchObj):
...     if matchObj:return "--"
...     else: return "++"
... 
>>> re.sub(r"\d+", func,str1)
'hello --, hello --.'
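Group references in repl are easier to follow with an example. The sketch below uses made-up input strings and a hypothetical helper named double; it shows \id, \g<name>, \g<0>, a function as repl, and count.

```python
import re

date = "2019-11-30"

# \1, \2, \3 refer to the numbered groups.
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date))   # 30/11/2019

# \g<name> refers to a named group; \g<0> is the whole match.
print(re.sub(r"(?P<year>\d{4})", r"[\g<year>]", date))          # [2019]-11-30
print(re.sub(r"\d+", r"<\g<0>>", date))                         # <2019>-<11>-<30>


# A function used as repl receives the Match object and returns the replacement.
def double(match):
    return str(int(match.group()) * 2)


print(re.sub(r"\d+", double, "3 apples, 12 pears"))             # 6 apples, 24 pears

# count limits the number of replacements.
print(re.sub(r"\d+", "N", date, count=1))                       # N-11-30
```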

re.subn(pattern, repl, string, count=0, flags=0)

Works like sub(), but returns a tuple (new_string, number_of_subs_made).

>>> str1 = "hello 123, hello 456."
>>> re.subn(r"\d+", "world",str1)
('hello world, hello world.', 2)

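re.escape(string)

Escapes every character in the string except ASCII letters, digits and '_' (the Python 3.5 behavior). In other words, it puts a \ in front of special characters for us.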

>>> re.escape('www.python.org')
'www\\.python\\.org'
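A typical use of re.escape() is searching for text that should be taken literally, such as user input. A small sketch, where keyword and the sentence are made-up values:

```python
import re

keyword = "1+1"   # hypothetical user input containing a special character
sentence = "what is 1+1?"

# Used directly as a pattern, '+' is a quantifier, so "1+1" means
# "one or more '1' followed by '1'" and does not match here.
print(re.search(keyword, sentence))                      # None

# After escaping, the keyword is matched literally.
print(re.search(re.escape(keyword), sentence).group())   # 1+1
```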

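re.purge()

Clears the regular expression cache.

re.compile(pattern, flags=0)

Compiles a regular expression into a regular expression object. Compiling a frequently used pattern once and reusing the object is more efficient.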

>>> regex = re.compile(r"\d+")
>>> match = regex.search("hello 123")
>>> if match: print(match.group())
... 
123

Regular expression objects

re.compile() gives us a compiled regular expression object, which supports the following methods and attributes:

regex.search(string[, pos[, endpos]])

>>> pattern = re.compile("d")
>>> pattern.search("dog")     # Match at index 0
<_sre.SRE_Match object; span=(0, 1), match='d'>
>>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

regex.match(string[, pos[, endpos]])

>>> pattern = re.compile("o")
>>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
<_sre.SRE_Match object; span=(1, 2), match='o'>

regex.fullmatch(string[, pos[, endpos]])

>>> pattern = re.compile("o[gh]")
>>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
>>> pattern.fullmatch("ogre")     # No match as not the full string matches.
>>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
<_sre.SRE_Match object; span=(1, 3), match='og'>

regex.split(string, maxsplit=0)

regex.findall(string[, pos[, endpos]])

regex.finditer(string[, pos[, endpos]])

regex.sub(repl, string, count=0)

regex.subn(repl, string, count=0)
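The methods above behave like the module-level functions of the same name, except that the flags are fixed at compile time and several of them accept pos and endpos. A minimal sketch with made-up patterns and text:

```python
import re

word = re.compile(r"[a-z]+", re.I)
text = "Hello python, hello world."

print(word.findall(text))            # ['Hello', 'python', 'hello', 'world']
print(word.findall(text, 6))         # ['python', 'hello', 'world']
print(word.findall(text, 6, 12))     # ['python']

# finditer() starting at index 14.
print([m.span() for m in word.finditer(text, 14)])   # [(14, 19), (20, 25)]

print(word.sub("*", text, count=2))  # * *, hello world.
print(word.subn("*", text))          # ('* *, * *.', 4)

comma = re.compile(r",\s*")
print(comma.split("a, b,c,  d", maxsplit=2))   # ['a', 'b', 'c,  d']
```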

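regex.flags

The flags the regex matches with: the flags passed to compile(), any inline (?...) flags in the pattern, and implicit flags such as UNICODE for str patterns.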
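Since the exact way the flags value is displayed differs between Python versions, it is safer to test individual flags with the bitwise AND operator. A short sketch, with arbitrary patterns:

```python
import re

p = re.compile(r"a", re.I)

# The flag passed to compile() is set.
print(bool(p.flags & re.I))   # True
# str patterns are matched in Unicode mode by default.
print(bool(p.flags & re.U))   # True
# re.M was not requested.
print(bool(p.flags & re.M))   # False

# An inline flag in the pattern shows up as well.
q = re.compile(r"(?m)^a")
print(bool(q.flags & re.M))   # True
```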

regex.groups

The number of capturing groups in the pattern.

>>> aa = re.compile(r'(\d{3})(\d{3})(\d{3})(\d{3})(\d{3})' )
>>> aa.groups
5

regex.groupindex

A dictionary mapping the names of the named groups to their group numbers.

>>> aa = re.compile(r'(?P<id01>\d{3})(?P<id02>\d{3})' )
>>> aa.groupindex
mappingproxy({'id01': 1, 'id02': 2})

regex.pattern

The pattern string the object was compiled from.

>>> aa = re.compile(r'(?P<id01>\d{3})(?P<id02>\d{3})' )
>>> aa.pattern
'(?P<id01>\\d{3})(?P<id02>\\d{3})'

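A successful match gives us a match object (Match).

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test for a match with a simple if statement.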

match = re.search(pattern, string)
if match:
    process(match)

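Match objects support the following methods and attributes:

match.expand(template)

Substitutes the matched groups into template and returns the result. template can refer to groups with \id, \g<id> or \g<name>. \0 is not a group reference; use \g<0> for the whole match. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', you have to write \g<1>0.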

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.expand(r"\2 \1 \2 \1")
'Newton Isaac Newton Isaac'

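match.group([group1, ...])

Returns the text matched by one or more groups; the arguments are group numbers or names. group(0), or group() with no argument, is the whole match, group(1) is the text matched by the first group, and so on; with several arguments the result is a tuple. A number that is negative or larger than the number of groups in the pattern raises IndexError. If a group sits in a part of the pattern that did not match, its result is None; if a group matched several times, its last match is returned.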

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group()
'Isaac Newton'
>>> m.group(0)
'Isaac Newton'
>>> m.group(1)
'Isaac'
>>> m.group(2)
'Newton'
>>> m.group(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
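The remaining cases of group() (names, several arguments, a group that took no part in the match, and a group that matched repeatedly) can be sketched as follows. The group names first and last are chosen only for this illustration.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton, physicist")
print(m.group("first"))        # Isaac
print(m.group(2, 1))           # ('Newton', 'Isaac')

# A group that took no part in the match yields None.
m2 = re.match(r"(a)|(b)", "b")
print(m2.group(1), m2.group(2))  # None b

# A group that matched several times keeps only its last match.
m3 = re.match(r"(..)+", "a1b2c3")
print(m3.group(1))             # c3
```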

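match.groups(default=None)

Returns a tuple with the results of all groups. A group that took no part in the match is reported as default, which is None unless you pass something else; with default set to 0, for instance, such a group is reported as 0.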

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.groups()
('Isaac', 'Newton')
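The default argument only matters when some group did not take part in the match. A minimal sketch, using a pattern with an optional fractional part:

```python
import re

# The second group is optional, so it does not take part when matching "24".
m = re.match(r"(\d+)\.?(\d+)?", "24")

print(m.groups())       # ('24', None)
print(m.groups("0"))    # ('24', '0')
```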

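match.groupdict(default=None)

Similar to groups(), but returns a dictionary of all named groups and the text they matched; a named group that took no part in the match gets default, which is None unless specified.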

>>> m = re.match(r"(?P<id>\w+) (\w+)", "Isaac Newton, physicist")
>>> m.groupdict()
{'id': 'Isaac'}

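match.start([group]) / match.end([group])

Return the index in string where the substring captured by the given group starts (index of its first character) / ends (index of its last character plus 1). group defaults to 0, the whole match. If the group exists but took no part in the match, -1 is returned.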

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

match.span([group])

For the given group, returns the 2-tuple (m.start(group), m.end(group)); if the group took no part in the match, this is (-1, -1). group defaults to 0.

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> m.span()
(7, 18)
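start(), end() and span() also accept a group number. The sketch below uses a deliberately simplified, made-up address pattern to show both a normal group and one that takes no part in the match.

```python
import re

# Group 3 (the ".xxx" suffix) is optional and absent from this input.
m = re.match(r"(\w+)@(\w+)(\.\w+)?", "tony@tiger")

print(m.span(1))                          # (0, 4)
print(m.span(2))                          # (5, 10)
print(m.start(3), m.end(3), m.span(3))    # -1 -1 (-1, -1)

# For a group that matched, slicing with start/end gives back its text.
print(m.string[m.start(2):m.end(2)])      # tiger
```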

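match.pos

The index in string at which the regex engine started looking for a match (the pos argument given to search() or match() of a regex object).

match.endpos

The index in string beyond which the regex engine does not look (the endpos argument).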

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> len(email)
25
>>> m.pos
0
>>> m.endpos
25

match.lastindex

The number of the last capturing group that matched (a group index, not a position in the text), or None if no group matched at all.

>>> m = re.match(r"(?P<id>\w+) (\w+)", "Isaac Newton, physicist")
>>> m.lastindex
2
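A few small patterns suggest how lastindex behaves as a group number, including with nested and optional groups. The patterns are arbitrary examples:

```python
import re

# Two groups side by side: the last one to match is group 2.
print(re.match(r"(a)(b)", "ab").lastindex)      # 2

# Nested groups: the outer group 1 closes last, so lastindex is 1.
print(re.match(r"((a)(b))", "ab").lastindex)    # 1

# The optional group 2 takes no part in the match.
print(re.match(r"(a)(b)?", "a").lastindex)      # 1

# No capturing groups at all.
print(re.match(r"ab", "ab").lastindex)          # None
```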

match.lastgroup

The name of the last capturing group that matched, or None if no group matched or that group has no name.

>>> m = re.match(r"(?P<id>\w+) (\w+)", "Isaac Newton, physicist")
>>> m.lastgroup
>>> m = re.match(r"(?P<id>\w+) (?P<word>\w+)", "Isaac Newton, physicist")
>>> m.lastgroup
'word'

match.re

The regular expression object whose match() or search() method produced this match object.

>>> m = re.match(r"(?P<id>\w+) (\w+)", "Isaac Newton, physicist")
>>> m.re
re.compile('(?P<id>\\w+) (\\w+)')

match.string

The string passed to match() or search().

>>> m = re.match(r"(?P<id>\w+) (\w+)", "Isaac Newton, physicist")
>>> m.string
'Isaac Newton, physicist'
