Python Basics: Regular Expressions

Date: 2020-12-01


Regular expressions

Regular expressions are used by calling the re module.

import re

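Character matching (ordinary characters and metacharacters):

Ordinary characters: most letters and digits match themselves.

Metacharacters:

# Metacharacters:
.  ^  $  *  +  ?  {  }  [  ]  |  (  )  \

"." matches any single character except a newline

res = re.findall('ale.x', 'kwaleexsandra')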
print(res)

>>>
['aleex']

"^" matches at the start of the string

res = re.findall('^alex', 'alexsandra')
print(res)

>>>
['alex']

"$" matches at the end of the string

res = re.findall('alex$', 'sandraalex')
print(res)

>>>
['alex']

"*" matches the preceding character zero or more times, so it may match nothing at all; same as {0,}

res = re.findall('alex*', 'wwwalex')
print(res)

>>>
['alex']

res = re.findall('alex*', 'wwwale')
print(res)

>>>
['ale']

res = re.findall('alex*', 'wwwalexxxx')
print(res)

>>>
['alexxxx']

"+" matches the preceding character one or more times; same as {1,}

res = re.findall('alex+', 'wwwalex')
print(res)

>>>
['alex']

res = re.findall('alex+', 'wwwalexxxx')
print(res)

>>>
['alexxxx']

res = re.findall('alex+', 'wwwale')
print(res)

>>>
[]

"?" matches the preceding character zero or one time; same as {0,1}

res = re.findall('alex?', 'wwwalex')
print(res)

>>>
['alex']

res = re.findall('alex?', 'wwwalexxxx')
print(res)

>>>
['alex']

res = re.findall('alex?', 'wwwale')
print(res)

>>>
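['ale']

"{min,max}" matches the preceding character at least min and at most max times; "{n}" matches exactly n times

res = re.findall('alex{3}', 'wwwalexxxx')  # "x" must occur exactly 3 times
print(res)

>>>
['alexxx']


res = re.findall('alex{3,5}', 'wwwalexxxx')  # "x" must occur at least 3 and at most 5 times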
print(res) 

>>>
['alexxxx']
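
The quantifiers shown so far are greedy: they take as many repetitions as they can. The module docstring quoted further down also lists non-greedy variants (*?, +?, ??, {m,n}?). A minimal sketch of the difference, reusing the test string from this section:

```python
import re

text = 'wwwalexxxx'

# Greedy: x{3,5} takes as many x's as are available (four here).
greedy = re.findall('alex{3,5}', text)

# Non-greedy: x{3,5}? stops as soon as the minimum of three is reached.
lazy = re.findall('alex{3,5}?', text)

# The same contrast for * and +.
star_lazy = re.findall('alex*?', text)  # zero x's is enough
plus_lazy = re.findall('alex+?', text)  # one x is enough

print(greedy, lazy, star_lazy, plus_lazy)
```

A non-greedy quantifier accepts the same strings as its greedy counterpart; it only changes how much of the input a single match consumes.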

"[ ]" matches any one of the characters listed inside the brackets

res = re.findall('a[bc]d', 'wwwabd')
print(res)

>>>
['abd']

res = re.findall('a[bc]d', 'wwwacd')
print(res)

>>>
['acd']

res = re.findall('a[bc]d', 'wwwabcd')
print(res)

>>>
[]

"-" denotes a range of characters in order; it has this meaning only inside "[ ]"

res = re.findall('a-z', 'wwwabd')  # outside brackets, "-" is a literal character
print(res)

>>>
[]

res = re.findall('[a-z]', 'wwwabd')
print(res)

>>>
['w', 'w', 'w', 'a', 'b', 'd']

res = re.findall('1-9', '127.0.0.1')
print(res)

>>>
[]

res = re.findall('[1-9]', '127.0.0.1')
print(res)

>>>
['1', '2', '7', '1']

"^" as the first character inside "[ ]" negates the set

res = re.findall('[^0-9]', "portpostid9987-alex")
print(res)

>>>
['p', 'o', 'r', 't', 'p', 'o', 's', 't', 'i', 'd', '-', 'a', 'l', 'e', 'x']


res = re.findall('[^a-z]', "portpostid9987-alex")
print(res)

>>>
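['9', '9', '8', '7', '-']

"\"
A backslash followed by a metacharacter removes its special meaning;
a backslash followed by certain ordinary characters gives them a special meaning;
a backslash followed by a number refers to the text matched by the group with that number.

\d matches any decimal digit; equivalent to [0-9]
res = re.findall(r'\d', "id9987-alex")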
print(res)

>>>
['9', '9', '8', '7']

[\d] sequences such as "\d" keep their special meaning inside a character class
res = re.findall(r'[\d]', "id9987-alex")
print(res)

>>>
['9', '9', '8', '7']

\D matches any character that is not a digit; equivalent to [^0-9]
res = re.findall(r'\D', "id9987-alex")
print(res)

>>>
['i', 'd', '-', 'a', 'l', 'e', 'x']
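
The whitespace classes are not demonstrated in this section. A minimal sketch of \s and \S, which the module docstring quoted further down describes; the sample string is made up for illustration:

```python
import re

text = "id 9987\talex\n"

# \s matches one whitespace character: space, tab, newline, and so on.
spaces = re.findall(r'\s', text)

# \S matches one non-whitespace character; \S+ collects whole runs.
words = re.findall(r'\S+', text)

print(spaces)
print(words)
```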

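\w matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]
res = re.findall(r'\w', "id9987-alex")
print(res)

>>>
['i', 'd', '9', '9', '8', '7', 'a', 'l', 'e', 'x']

\W matches any character that \w does not match; equivalent to [^a-zA-Z0-9_]
res = re.findall(r'\W', "id9987-alex")
print(res)

>>>
['-']

"( )" matches the regular expression enclosed in the parentheses and saves it as a group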

res = re.search("(ab)*", "aba").group()
print(res)

>>>
ab

\b matches a word boundary, as in \bXXX\b

ret = re.findall(r"\babc\b", "sdasdssd abc asdssdasds")
print(ret)

>>>
['abc']

>>> re.findall("abc\b", "asdas abc")
[]
>>> re.findall(r"abc\b", "asdas abc")  # the r prefix passes "\b" through to re, where it means a word boundary
['abc']
>>> re.findall(r"\bI", "IMISS IOU")  # "I" at the left edge of a word
['I', 'I']
>>> re.findall(r"I\b", "IMISS IOU")  # "I" at the right edge of a word
[]

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B, creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) Set the A, I, L, M, S, U, or X flag for the RE (see below).
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a match object for each match.
    compile   Compile a pattern into a RegexObject.
    purge     Clear the regular expression cache.
    escape    Backslash all non-alphanumerics in a string.

Some of the functions in this module takes flags as optional parameters:
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.

This module also defines an exception 'error'.

"""

Main regular-expression functions

search() function

Scans the whole string and returns the first match.

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

origin = "hello alex bac alex leg alex tms alex 19"

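# "a(\w+)" matches the first word starting with "a" and captures the characters after the "a" as a group
# ".*" greedily matches everything that follows, backing off just enough to leave the final digit
# "(?P<name>\d)$" captures the last digit under the key "name" (see groupdict())
r2 = re.search(r"a(\w+).*(?P<name>\d)$", origin)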

print(r2.group())
>>>alex bac alex leg alex tms alex 19

print(r2.groups())  # the captured groups, in order
>>>('lex', '9')

print(r2.groupdict())
>>>{'name': '9'}

origin = "hello alex bad alrx lge alex acd 19w"

n = re.search("(a)(\w+)", origin)

print(n.group())
>>>
alex

print(n.groups())
>>>
('a', 'lex')
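
Because search() returns None when nothing matches, calling .group() directly on its result can raise AttributeError. A small sketch of guarded access by index and by name, reusing the first origin string from this section; the second pattern is made up to show the no-match case:

```python
import re

origin = "hello alex bac alex leg alex tms alex 19"

m = re.search(r"a(\w+).*(?P<name>\d)$", origin)
if m is not None:
    whole = m.group(0)            # the entire match, same as m.group()
    first = m.group(1)            # first parenthesised group
    last_digit = m.group("name")  # named group, also reachable as m.group(2)
    start, end = m.span("name")   # position of the named group in origin

# No run of five digits exists, so search() returns None here.
missing = re.search(r"\d{5}", origin)

print(whole, first, last_digit, (start, end), missing)
```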

match()

Matches only at the beginning of the string.

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

# flags
A = ASCII = sre_compile.SRE_FLAG_ASCII # assume ascii "locale"
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode "locale"
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline, so "^" and "$" match at every line
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline, i.e. every character
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comments in the pattern

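How matching reports its results:

With groups: when the pattern contains "()", the overall match is still available from group(), and the text captured by each pair of parentheses is additionally returned by groups(). "?P<name>" attaches a key to a group, so that groupdict() returns the captured text as key-value pairs.

Without groups: everything that matched is available from group() only.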

origin = "hello alex bac alex leg alex tms alex"
r = re.match(r"h\w+", origin)
r2 = re.match(r"(h)(\w+)", origin)
print(r.group())  # the whole match
print(r2.group())

>>>
hello
hello

print(r.groups())  # the captured groups
print(r2.groups())
>>>
()
('h', 'ello')


# (?P<key>X): the name inside the angle brackets is the key, the matched text is the value
r3 = re.match(r"(?P<n1>h)(?P<n2>\w+)", origin)
print(r.groupdict())  # the named groups as a dict
print(r3.groupdict())
>>>
{}
{'n1': 'h', 'n2': 'ello'}
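
To make the difference between the anchoring rules concrete, here is a short sketch comparing match(), search() and fullmatch() (listed in the module docstring above) on the same string. The patterns are chosen only for illustration:

```python
import re

origin = "hello alex bac alex leg alex tms alex"

at_start = re.match(r"alex", origin)        # None: origin does not start with "alex"
anywhere = re.search(r"alex", origin)       # found further into the string
whole = re.fullmatch(r"h\w+", origin)       # None: the pattern must cover the entire string
whole_ok = re.fullmatch(r"[\w ]+", origin)  # matches: only word characters and spaces

print(at_start, anywhere.start(), whole, whole_ok.group() == origin)
```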

findall()

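Scans the string from left to right; after each successful match, scanning resumes right after the last matched character. All matches are returned in a list.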

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

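findall handles groups specially; a pattern that can match the empty string also produces an extra empty match, as the "(\w)*" example further down shows.

(1) If the pattern has one group, each list element is the text captured by that group.

(2) If the pattern has several groups, each list element is a tuple holding the text of every group.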

origin = "hello alex bad alrx lge alex acd 19w"


r = re.findall(r"a\w+", origin)  # every "a" followed by one or more word characters, returned as a list
print(r)

>>>
['alex', 'ad', 'alrx', 'alex', 'acd']

r1 = re.findall(r"a(\w+)", origin)  # one group: only the characters after each "a" are returned
print(r1)

>>>
['lex', 'd', 'lrx', 'lex', 'cd']

r2 = re.findall(r"(a)(\w+)", origin)  # two groups: each match becomes a tuple of "a" and the characters after it
print(r2)

>>>
[('a', 'lex'), ('a', 'd'), ('a', 'lrx'), ('a', 'lex'), ('a', 'cd')]

r3 = re.findall("(a)(\w+)(x)", origin)
print(r3)

>>>
[('a', 'le', 'x'), ('a', 'lr', 'x'), ('a', 'le', 'x')]

r4 = re.findall(r"(a)(\w+(e))(x)", origin)
print(r4)

>>>
[('a', 'le', 'e', 'x'), ('a', 'le', 'e', 'x')]

ret = re.findall("a(\d+)", "a23b")
print(ret, type(ret))

>>>
['23'] <class 'list'>

ret = re.search("a(\d+)", "a23b").group()
print(ret, type(ret))

>>>
a23 <class 'str'>

ret = re.match("a(\d+)", "a23b").group()
print(ret, type(ret))

>>>
a23 <class 'str'>

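With a greedy repeated group such as "()*", findall yields one extra (empty) match.

>>> re.findall(r"www\.(baidu|laonanhai)\.com", "asdqwerasdf www.baidu.com")  # with a capturing group, only the group's text is returned
['baidu']
>>> re.findall(r"www\.(?:baidu|laonanhai)\.com", "asdqwerasdf www.baidu.com")  # "?:" makes the group non-capturing, so the whole match is returned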
['www.baidu.com']
>>>

r = re.findall(r"\d+\w\d+", "a2b3c4d5")  # matches do not overlap: after "2b3", scanning resumes at "c"
print(r)

>>>
['2b3', '4d5']

a = "alex"
n = re.findall("(\w)(\w)(\w)(\w)", a)
print(n)

>>>
[('a', 'l', 'e', 'x')]

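n2 = re.findall(r"(\w)*", a)  # the greedy "*" repeats the group over all four characters, but the group keeps only the last one; an empty match then follows at the end of the string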
print(n2)

>>>
['x', '']

n3 = re.findall("", "abcd")
print(n3)

>>>
['', '', '', '', '']
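
The last two results can be surprising, so here is a small sketch of two common ways to avoid them: a non-capturing group returns the whole match, and "+" instead of "*" rules out empty matches. This is one option among several, not the only fix:

```python
import re

a = "alex"

# A repeated capturing group only remembers its last repetition.
last_only = re.findall(r"(\w)*", a)

# A non-capturing group makes findall return the whole match instead.
whole = re.findall(r"(?:\w)*", a)

# Requiring at least one character avoids the trailing empty match.
non_empty = re.findall(r"(?:\w)+", a)

print(last_only, whole, non_empty)
```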

finditer()

Returns an iterator that yields a match object for each match.

origin = "hello alex bad alrx lge alex acd 19w"

r = re.finditer("(a)(\w+(e))(?P<N1>x)", origin)
print(r)

>>>
<callable_iterator object at 0x00275A30>

for i in r:
    print(i)
    >>>
    <_sre.SRE_Match object; span=(6, 10), match='alex'>
    <_sre.SRE_Match object; span=(24, 28), match='alex'>

    print(i.group())
    >>>
    alex
    alex

    print(i.groups())
    >>>
    ('a', 'le', 'e', 'x')
    ('a', 'le', 'e', 'x')

    print(i.groupdict())
    >>>
    {'N1': 'x'}
    {'N1': 'x'}
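
The listing above interleaves code and output. A runnable sketch that collects the same information for each match in one pass (the rows list is a name introduced here for illustration):

```python
import re

origin = "hello alex bad alrx lge alex acd 19w"

rows = []
for m in re.finditer(r"(a)(\w+(e))(?P<N1>x)", origin):
    # span(), group(), groups() and groupdict() are all read from the same match object.
    rows.append((m.span(), m.group(), m.groups(), m.groupdict()))

for row in rows:
    print(row)
```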

sub()

def sub(pattern, repl, string, count=0, flags=0):

    return _compile(pattern, flags).sub(repl, string, count)

old_str  = "123askdjf654lkasdfasdfwer456"
new_str = re.sub(r"\d+", "TMD", old_str, count=2)  # replace the first two runs of digits, in order, with "TMD"
print(new_str)

>>>
TMDaskdjfTMDlkasdfasdfwer456
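
The replacement does not have to be a fixed string. A brief sketch of two other forms that re.sub accepts: a template that refers back to the match, and a function that computes the replacement. The helper name double is hypothetical:

```python
import re

old_str = "123askdjf654lkasdfasdfwer456"

# \g<0> in the replacement template stands for the whole match.
bracketed = re.sub(r"\d+", r"[\g<0>]", old_str)

# A callable receives each match object and returns the replacement text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, old_str, count=1)

print(bracketed)
print(doubled)
```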

subn()

def subn(pattern, repl, string, count=0, flags=0):

    return _compile(pattern, flags).subn(repl, string, count)

old_str  = "123askdjf654lkasdf789asdfwer456"
new_str, count = re.subn(r"\d+", "TMD", old_str)  # returns the new string and the number of replacements
print(new_str, count)

>>>
TMDaskdjfTMDlkasdfTMDasdfwerTMD 4

split()

origin = "hello alex bcd abcd lge acd 19"

n = re.split(r"a\w+", origin)  # split at every run of word characters starting with "a"; the separators themselves are dropped
print(n)
>>>
['hello ', ' bcd ', ' lge ', ' 19']

n1 = re.split(r"(a\w+)", origin, maxsplit=1)  # a capturing group keeps the separator in the result; maxsplit=1 splits at the first occurrence only
print(n1)
>>>
['hello ', 'alex', ' bcd abcd lge acd 19']

n2 = re.split("a(\w+)", origin, 1)
print(n2)
>>>
['hello ', 'lex', ' bcd abcd lge acd 19']

origin2 = "1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)"

n1 = re.split("(\([^()]+\))", origin2, 1)
print(n1)
>>>
['1-2*((60-30+', '(-40.0/5)', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)']

n2 = re.split("\(([^()]+)\)", origin2, 1)
print(n2)
>>>
['1-2*((60-30+', '-40.0/5', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)']

# Using split as the core of a calculator

origin2 = "1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)"

def f1(content):  # placeholder for the real calculation; always returns "1"
    return ("1")


while True:
    print(origin2)
    result = re.split("\(([^()]+)\)", origin2,1)
    if len(result) == 3:
        before = result[0]
        content = result[1]
        after = result[2]
        r = f1(content)
        new_str = before + str(r) + after
        origin2 = new_str
    else:
        f_result = f1(origin2)
        print(f_result)
        break

>>>
1-2*((60-30+(-40.0/5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)
1-2*((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2)
1-2*((60-30+1*1)-(-4*3)/(16-3*2)
1-2*(1-(-4*3)/(16-3*2)
1-2*(1-1/(16-3*2)
1-2*(1-1/1
1
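
In the loop above f1 is only a stub. As a rough sketch of what a working version might look like, the following hypothetical f1 evaluates a parenthesis-free expression with +, -, * and /, and a hypothetical evaluate() repeatedly replaces the innermost parentheses with their value. It uses re.subn rather than re.split, assumes balanced parentheses (the sample expression is a shortened, balanced variant of origin2), and does not handle powers, malformed input, or values whose repr uses scientific notation:

```python
import re

TOKEN = re.compile(r"\d+(?:\.\d+)?|[-+*/]")
INNERMOST = re.compile(r"\(([^()]+)\)")


def f1(content):
    """Evaluate a parenthesis-free expression of numbers and + - * /."""
    tokens = TOKEN.findall(content)
    pos = 0

    def factor():
        # A factor is a number, optionally preceded by unary signs.
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "-":
            return -factor()
        if tok == "+":
            return factor()
        return float(tok)

    def term():
        # A term is a chain of factors joined by * or /.
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    # An expression is a chain of terms joined by + or -.
    value = term()
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        pos += 1
        rhs = term()
        value = value + rhs if op == "+" else value - rhs
    return value


def evaluate(expression):
    # Replace one innermost parenthesised part at a time with its value.
    while True:
        expression, count = INNERMOST.subn(
            lambda m: repr(f1(m.group(1))), expression, count=1
        )
        if count == 0:
            return f1(expression)


result = evaluate("1-2*((60-30+(-40.0/5))-(-4*3)/(16-3*2))")
print(result)
```

Splitting the work this way keeps the regular expression responsible only for locating the innermost parentheses, while operator precedence is handled by ordinary code.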

compile

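Compile a pattern into a RegexObject, which can then be reused.

How to use "r"

The "r" prefix on a pattern marks a raw string: Python leaves the backslashes in it untouched, so they reach the re module with their regular-expression meaning.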

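>>> re.match('\bblow', "blow")
# no match: Python has already turned "\b" into a backspace character

>>> re.match('\\bblow', 'blow')  # "\\" escapes the backslash for Python, so re receives \b, the word boundary
<_sre.SRE_Match object; span=(0, 4), match='blow'>

>>> re.match(r'\bblow','blow')  # a raw string leaves the backslash alone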
<_sre.SRE_Match object; span=(0, 4), match='blow'>
>>>
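
A quick way to see what Python hands to the re module in each of the three cases is to inspect the strings themselves. A minimal sketch; the variable names are chosen here for illustration:

```python
import re

plain = '\bblow'     # Python converts \b into one backspace character (\x08)
escaped = '\\bblow'  # the doubled backslash survives as a single backslash
raw = r'\bblow'      # the raw string keeps the backslash and "b" as written

print(len(plain), len(escaped), len(raw))
print(escaped == raw)
print(re.match(plain, 'blow'), re.match(raw, 'blow'))
```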

Matching the contents of parentheses

# Match the innermost pair of parentheses first
# "\(" and "\)" match the literal parentheses
# "[^()]" matches one character that is not a parenthesis, such as a digit or an operator
# "[^()]*" matches any run of such characters
# Together: "\([^()]*\)"

source = "2*3+4(2*(3+4.5*5.5)-5)"
res = re.search(r"\([^()]*\)", source).group()
print(res)

>>>
(3+4.5*5.5)

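Matching a floating-point number

# Match a floating-point number
# "\d+" matches one or more digits
# "\.?" matches an optional decimal point; "\." removes the special meaning of "."
# "\d*" matches the digits after the decimal point; because the point is optional there may be none, hence "*"

source_f = "abc3.5555abc"
r = re.search(r"(\d+\.?\d*)", source_f).group()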
print(r)

>>>
3.5555
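
One consequence of making the point and the fractional digits independently optional is that a trailing point is accepted. A short sketch comparing that pattern with a stricter alternative that requires digits after the point; the sample text and the names loose and strict are made up for illustration:

```python
import re

text = "7. 3.14 42"

# Pattern from the section above: the point may appear without digits after it.
loose = re.findall(r"\d+\.?\d*", text)

# Stricter variant: the point is only consumed together with following digits.
strict = re.findall(r"\d+(?:\.\d+)?", text)

print(loose)
print(strict)
```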

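Matching multiplication, division and power operations

# Match one multiplication, division or power operation
# "[*/]" matches "*" or "/"
# "[*/]|\*\*" also matches the power operator "**"
# "([*/]|\*\*)" groups the operator alternatives
# Full pattern: \d+\.?\d+ ([*/]|\*\*) \d+\.?\d+ (written without the spaces)
# Note: "\d+\.?\d+" needs at least two digits per operand, so "4.5" and "12" match but a lone "3" does not


res = "(3+4.5*5.5)"

res1 = re.search(r'\d+\.?\d+([*/]|\*\*)\d+\.?\d+', res)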
print(res1, type(res1))

>>><_sre.SRE_Match object; span=(3, 10), match='4.5*5.5'> <class '_sre.SRE_Match'>

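Matching an IP address

# Match an IPv4 address
# Each field is 0-255, covered by three alternatives: (1) 0-199, (2) 200-249, (3) 250-255
# "[01]?\d?\d" matches an optional leading 0 or 1 followed by one or two digits
# "2[0-4]\d" matches 2, then 0-4, then any digit
# "25[0-5]" matches 25 followed by 0-5; "\." matches the dot after each of the first three fields

w = re.search(r"(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])", '192.168.1.1').group()
print(w, type(w))

>>>
192.168.1.1 <class 'str'>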
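
search() only finds an address somewhere in the text; it does not check that the whole string is a valid address. For validation, a sketch along the following lines may be closer to what is needed. The names OCTET, IPV4 and is_ipv4 are hypothetical, the alternatives are ordered from longest to shortest, and rejecting leading zeros is a choice made here, not something the original pattern does:

```python
import re

# One field: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_ipv4(text):
    # fullmatch succeeds only if the pattern covers the entire string.
    return IPV4.fullmatch(text) is not None


for candidate in ["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3"]:
    print(candidate, is_ipv4(candidate))
```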

Patterns for the calculation

l1_expression = re.compile(r'(-?\d+)(\.\d+)?[-+](-?\d+)(\.\d+)?')             # addition and subtraction

l2_expression = re.compile(r'(-?\d+)(\.\d+)?[/*](-?\d+)(\.\d+)?')             # multiplication and division

l3_expression = re.compile(r'(-?\d+)(\.\d+)?\*-(-?\d+)(\.\d+)?')              # multiplication by a negative number

l4_expression = re.compile(r'(-?\d+)(\.\d+)?/-(-?\d+)(\.\d+)?')               # division by a negative number

l5_expression = re.compile(r'\([^()]*\)')                                     # innermost parentheses
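
A compiled pattern exposes the same methods as the module-level functions (search, findall, sub and so on) and can be reused without recompiling. As a small sketch of how two of these patterns might be combined for one reduction step; the sample string is the one from the parentheses example earlier, and this usage is an assumption rather than the author's calculator:

```python
import re

l2_expression = re.compile(r'(-?\d+)(\.\d+)?[/*](-?\d+)(\.\d+)?')  # multiplication and division
l5_expression = re.compile(r'\([^()]*\)')                          # innermost parentheses

source = "2*3+4(2*(3+4.5*5.5)-5)"

# Step 1: isolate the innermost parenthesised part.
inner = l5_expression.search(source).group()

# Step 2: find the first multiplication or division inside it.
step = l2_expression.search(inner)

print(inner)
print(step.group())
print(step.groups())  # integer and fractional parts of both operands
```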