Regular Expression Matching with the re Module

Date: 2019-11-11


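import re

re.M enables multiline mode; several flags can be combined with bitwise OR (|).

pattern is the regular expression string and flags holds the options. The expression has to be compiled: it is parsed and analysed, then converted into an internal form that can be matched against strings.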

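re module functions

Compilation is mainly about speed: the other functions in re all call the compile step internally for efficiency.

Methods of re

Single match

re.compile and re.match

def compile(pattern, flags=0):

    return _compile(pattern, flags)

As shown, re.compile simply returns the result of the internal _compile function, which performs the conversion.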

def match(pattern, string, flags=0):

    return _compile(pattern, flags).match(string)
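Since both entry points go through the same compile step, a compiled pattern's match method and the module-level re.match should give the same result. A minimal sketch of that equivalence (the variable names are illustrative only):

```python
import re

s = '0123abc'

# Compile once, then reuse the pattern object.
digit = re.compile(r'\d')
m1 = digit.match(s)

# The module-level function compiles (or looks up) the pattern on each call.
m2 = re.match(r'\d', s)

print(m1.group(), m2.group())   # 0 0
print(m1.span() == m2.span())   # True
```

Compiling explicitly mostly pays off when the same pattern is used many times, for example inside a loop.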

Using the re module

import re

s = '0123abc'

regx = re.compile(r'\d')

print(type(regx))

print(re.match(regx,s))

<_sre.SRE_Match object; span=(0, 1), match='0'>

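This means that a match object is returned.

match scans from the beginning of the string; as soon as it finds a match it stops and does not scan any further.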

import re

s = '0123abc'

regx = re.compile(r'\d')

matcher = re.match(r'\d', s)

# inspect the result of the match

print(matcher)

matcher.endpos

<_sre.SRE_Match object; span=(0, 1), match='0'>

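match has to be called on every run.

re.match calls the compile step internally, so re.match and the match method of a compiled pattern process the pattern in the same way.

In essence, match works forward from the start of the string and returns as soon as it finds a match.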

s = '0123abc'

regx = re.compile(r'[ab]')  # inside [...] a | is a literal character, so [ab] is what is meant

matcher = regx.match(s)

print(type(matcher))

print(matcher)

None

The result is None: match requires the pattern to match at the very start of the string.

Improved:

s = 'a0123abc'

regx = re.compile(r'^[ab]')

matcher = re.match(regx,s)

matcher = regx.match(s)

print(type(matcher))

print(matcher)

<_sre.SRE_Match object; span=(0, 1), match='a'>

The match method of a compiled pattern accepts a start index (pos); the module-level re.match function does not.

import re

s = '0123abc'

regex = re.compile(r'[ab]')

matcher = re.match(r'\d', s)

print(type(matcher))

print(matcher)

matcher = regex.match(s, 4)  # start at index 4, where 'a' is

print(matcher)

<_sre.SRE_Match object; span=(4, 5), match='a'>
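To make the effect of the start index concrete, here is a small sketch comparing a call without pos, a call with pos, and matching on a slice. The variable names are illustrative only.

```python
import re

s = '0123abc'
regex = re.compile(r'[ab]')

no_pos = regex.match(s)        # index 0 holds '0', so this is None
with_pos = regex.match(s, 4)   # start matching at index 4
print(no_pos)                  # None
print(with_pos.span(), with_pos.group())   # (4, 5) a

# Slicing first also matches, but the span is relative to the slice.
on_slice = regex.match(s[4:])
print(on_slice.span())         # (0, 1)
```

Passing pos keeps the reported indices in terms of the original string, which is usually what later code needs.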

The search method

import re

s = '012abc'

matcher = re.search('[ab]',s)

print(matcher)

<_sre.SRE_Match object; span=(3, 4), match='a'>

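As shown, a was matched.

match vs. search

match returns as soon as it finds a match at the starting position and matches only once. In the example above the first a sits at index 3, so match without a start position would return None there.

search scans forward from wherever it starts and returns the first match it finds. It behaves much like match, except that the match does not have to begin at a fixed position.

The usual practice is to compile once and reuse the compiled pattern, so that compile is called as few times as possible.

If you know that the text you want is at the beginning of the string, use match directly; otherwise search is the one used most often.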
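The difference between the two can be seen side by side in a short sketch (the sample string is the one used in the search example; the variable names are illustrative):

```python
import re

s = '012abc'
pattern = re.compile(r'[ab]')

m_match = pattern.match(s)     # None: the string does not start with a or b
m_search = pattern.search(s)   # scans forward to the first hit
print(m_match)                 # None
print(m_search.span())         # (3, 4)

# When the target is known to sit at the start, match is enough.
leading = re.match(r'\d+', s)
print(leading.group())         # 012
```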

fullmatch: matching the whole string

fullmatch succeeds only if the pattern matches the entire string.

import re

s = '0123abc'

regx = re.compile('[ab]')

matcher = re.fullmatch(r'\w',s)

print(matcher)

matcher = regx.fullmatch(s)

print(matcher)

None

# Improved:

matcher = regx.fullmatch(s,4,5)

print(matcher)

<_sre.SRE_Match object; span=(4, 5), match='a'>

res = re.fullmatch('bag',s)

print(res)

None

Because fullmatch has to match the whole string, you must either give it a range or know the length of the target in advance.

s = '''bottle\nbag\nbig\nable'''

regex = re.compile('bag')

res = regex.fullmatch(s,7,10)  # the range covers exactly the length of 'bag'

print(res)

<_sre.SRE_Match object; span=(7, 10), match='bag'>

When using fullmatch, it is best to:

· make the pattern cover as many characters as possible

· slice first, then match
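A brief sketch of the "slice first, then match" advice, compared with passing a range to the compiled pattern. Splitting into lines is shown as one possible alternative, not something the original text prescribes.

```python
import re

s = 'bottle\nbag\nbig\nable'
regex = re.compile(r'bag')

by_range = regex.fullmatch(s, 7, 10)   # pos=7, endpos=10
by_slice = regex.fullmatch(s[7:10])    # slice first, then match

print(by_range.span())   # (7, 10): indices refer to the original string
print(by_slice.span())   # (0, 3): indices refer to the slice

# Splitting into lines avoids index arithmetic altogether.
whole_lines = [line for line in s.split('\n') if regex.fullmatch(line)]
print(whole_lines)       # ['bag']
```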

If the index of each character is not known, print the indices, wrapping the output every 8 characters:

s = '''bottle\nbag\nbig\nable'''

for k in enumerate(s):

    if k[0] % 8 == 0:

        print()

    print(k,end=' ')

findall

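findall returns a list.

s = '0123abcd'

regx = re.compile(r'^b\w+', re.M)  # flags belong in compile; the second argument of regx.findall is a start position

matcher = regx.findall(s)

print(type(matcher))

Find every occurrence of the character b: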

s = '''bottle\nbag\nbig\nable'''

rest = re.findall('b',s)

print(rest)

['b', 'b', 'b', 'b']

s = '''bottle\nbag\nbig\nable'''

regx = re.compile('^b')

rest = re.findall(regx,s)

print(rest)

['b']

import re

s = '''bottle\nbag\nbig\nable'''

regx = re.compile('^b',re.M)

rest = re.findall(regx,s)

print(rest)

['b', 'b', 'b']

re.M

SRE_FLAG_MULTILINE = 8 # treat target as multiline string
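Because each flag is a distinct bit, several options can be combined with bitwise OR. A minimal sketch, assuming a sample string with mixed case that is not part of the original examples:

```python
import re

s = 'Bottle\nbag\nBIG\nable'   # hypothetical sample with mixed case

print(int(re.M), int(re.I))    # 8 2

both = re.M | re.I             # bitwise OR combines the two options
regex = re.compile(r'^b\w+', both)
found = regex.findall(s)
print(found)                   # ['Bottle', 'bag', 'BIG']
```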

Match at least one:

s = '0123abcd'

regex = re.compile('[ab]+')

matcher = regex.findall(s)  # call the method on the compiled regex directly, passing only the string

print(matcher)

['ab']

Match every non-digit character:

s = '0123abcd'

regex = re.compile(r'\D')

matcher = regex.findall(s)

print(matcher)

['a', 'b', 'c', 'd']

finditer

Returns an iterator that yields match objects.

# coding:utf-8

import re

# s = '''bottle\nbag\nbig\nable'''

s = '0123abcd'

regex = re.compile(r'\D')

matcher = regex.findall(s)

print(matcher)

matcher = regex.finditer(s)

print(type(matcher))

print(matcher)

['a', 'b', 'c', 'd']

<callable_iterator object at 0x0000000000B87128>  # an iterator is returned

An iterator can be advanced with next(), or consumed directly in a for loop:

for i in matcher:

    print(i)

<_sre.SRE_Match object; span=(4, 5), match='a'>

<_sre.SRE_Match object; span=(5, 6), match='b'>

<_sre.SRE_Match object; span=(6, 7), match='c'>

<_sre.SRE_Match object; span=(7, 8), match='d'>
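The same iterator can also be stepped by hand with next(). A short sketch (variable names are illustrative):

```python
import re

s = '0123abcd'
it = re.compile(r'\D').finditer(s)

first = next(it)                     # advance the iterator by hand
print(first.span(), first.group())   # (4, 5) a

rest = [m.group() for m in it]       # the loop continues where next() stopped
print(rest)                          # ['b', 'c', 'd']

leftover = next(it, None)            # the iterator is exhausted, so the default is returned
print(leftover)                      # None
```

An iterator can only be consumed once; call finditer again to start over.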

regex = re.compile(r'\D')

matcher = re.match(r'\d', s)

print(matcher.span())

(0, 1)

re.M: multiline mode

import re

s = '''bottle\nbag\nbig\nable'''

regex = re.compile(r'^b\w+',re.M)

matcher = regex.findall(s)

print(matcher)

['bottle', 'bag', 'big']

Now the same pattern without re.M:

s = '''bottle\nbag\nbig\nable'''

regex = re.compile(r'^b\w+')

matcher = regex.findall(s)

print(matcher)

['bottle']

Select the lines that end with e:

import re

s = '''bottle\nbag\nbig\nable'''

regex = re.compile(r'^\w+e$',re.M)

matcher = regex.findall(s)

print(matcher)

To avoid problems, separate the lines with \n.

Specify the re.M option whenever the pattern anchors at line boundaries.
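The effect of re.M on the ^ and $ anchors can be checked by running the same pattern with and without the flag. A minimal sketch:

```python
import re

s = 'bottle\nbag\nbig\nable'

with_m = re.compile(r'^\w+e$', re.M).findall(s)
without_m = re.compile(r'^\w+e$').findall(s)

print(with_m)      # ['bottle', 'able']
print(without_m)   # []
```

Without re.M, ^ and $ only anchor at the ends of the whole string, and \w+ cannot cross a newline, so nothing matches.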

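Match and replace

sub

re.sub replaces the text matched by the pattern; the number of replacements can be limited with count.

Replace every digit with a string: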

import re

s = '''bottle\n123\nbag\nbig\nable'''

regx = re.compile(r'\d')

res = regx.sub('haha',s)

print(res)

bottle

hahahahahaha

bag

big

able

Replace only once:

res = regx.sub('haha',s,1)

haha23

Replace big and bag:

s = '''bottle\n123\nbag\nbig\nable'''

regx = re.compile(r'\w+g')

res = regx.sub('hh',s)

print(res)

A new string is returned.

Using subn

s = '''bottle\n123\nbag\nbig\nable'''

regx = re.compile(r'\w+g')

print(regx.subn('ha',s))

A tuple is returned:

('bottle\n123\nha\nha\nable', 2)

subn packs the new string and the number of replacements into a tuple.
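Since subn returns a (string, count) tuple, it can be unpacked directly. A small sketch, also showing the count argument (variable names are illustrative):

```python
import re

s = 'bottle\n123\nbag\nbig\nable'
regx = re.compile(r'\w+g')

new_s, n = regx.subn('ha', s)            # unpack the (string, count) tuple
print(n)                                 # 2
print(new_s == regx.sub('ha', s))        # True

limited, n1 = regx.subn('ha', s, count=1)   # stop after the first replacement
print(limited.split('\n'), n1)           # ['bottle', '123', 'ha', 'big', 'able'] 1
```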

s = '''os.path([path])'''

regx = re.compile('[.]')

print(type(regx))

print(regx)

re.compile('[.]')

Replace it with a space:

newstr = regx.sub(' ', s)

print(newstr)

newstr = regx.subn(' ', s)

print(newstr)

os path([path])

('os path([path])', 1)

Splitting strings

re.split() takes a pattern, a string and a maximum number of splits (maxsplit).

s = '''\
01 bottle
02 bag
03 big1
100 bale
'''

regx = re.compile(r'^[\s\d]+')

res = re.split(regx,s)

print(res)

['', 'bottle\n02 bag\n03 big1\n100 bale\n']
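Two variations may help here: limiting the number of cuts with maxsplit, and adding re.M so that ^ anchors at every line rather than only at the start of the string. This is a sketch with a slightly different pattern from the one above, chosen for illustration.

```python
import re

s = '01 bottle\n02 bag\n03 big1\n100 bale\n'

# maxsplit caps the number of cuts; the rest of the string is left intact.
first_two = re.split(r'\s+', s, maxsplit=2)
print(first_two)
# ['01', 'bottle', '02 bag\n03 big1\n100 bale\n']

# With re.M, ^ also anchors after each newline, so every leading number is a separator.
per_line = re.split(r'^\d+\s+', s, flags=re.M)
print(per_line)
# ['', 'bottle\n', 'bag\n', 'big1\n', 'bale\n']
```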

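Getting rid of the \n as well requires an assertion (lookaround).

Each number after the first is preceded by \n.

# Note: (?<=!\w) is a lookbehind for a literal '!' followed by a word character, which never occurs in s, so only \s+ takes effect here; the negative lookbehind is written (?<!\w)

regx = re.compile(r'\s+|(?<=!\w)\d')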

res = re.split(regx,s)

print(res)

['01', 'bottle', '02', 'bag', '03', 'big1', '100', 'bale', '']

With the negative lookbehind (?<!\w) and \d+, the numbers are consumed as separators too:

regx = re.compile(r'\s+|(?<!\w)\d+')

res = re.split(regx,s)

print(res)

['', '', 'bottle', '', '', 'bag', '', '', 'big1', '', '', 'bale', '']

Select the numbers that have whitespace on both sides:

\s+\d+\s+

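The assertion approach used earlier is somewhat complicated. There is a simpler regularity: every number has whitespace on both sides, except on the first line.

For 02 and 03, for example, there is a \n in front.

So if a number has whitespace on both sides, it can be matched and replaced directly.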
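A sketch of this whitespace-on-both-sides idea using split. Prepending a newline so that the first line follows the same rule is an assumption made here for illustration, not a step taken in the original text.

```python
import re

s = '01 bottle\n02 bag\n03 big1\n100 bale\n'
regx = re.compile(r'\s+\d+\s+')

plain = regx.split(s)
print(plain)
# ['01 bottle', 'bag', 'big1', 'bale\n']

# Assumption: pad the front with a newline and strip the tail
# so that the first number also has whitespace on both sides.
padded = regx.split('\n' + s.strip())
print(padded)
# ['', 'bottle', 'bag', 'big1', 'bale']
```

Note that the 1 in big1 is untouched, because it has no whitespace in front of it.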

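Groups

Data captured by a parenthesised sub-pattern is stored in a group.

The match and search functions return a match object.

findall returns a list of strings.

finditer yields match objects one at a time.

group(N) returns the corresponding group: 1 to N are the capture groups, and 0 is the whole matched text.

The argument of group() is an integer >= 0: 1 stands for group 1, and so on.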


import re

s = '''bottle\nbag\nbig1\nbale'''

regex = re.compile(r'(b\wg)')

matcher = regex.search(s)

print(matcher)

print(matcher.groups())

<_sre.SRE_Match object; span=(7, 10), match='bag'>

('bag',)

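group(0) is the whole match. The argument of groups() is not an index but a default value for groups that did not take part in the match, so the calls below raise no error and all print the same tuple.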

print(matcher.groups(0))

print(matcher.groups(1))

print(matcher.groups(2))

print(matcher.groups(3))

('bag',)
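The difference between group(N) and groups(default) is easier to see with an optional group that does not take part in the match. A minimal sketch; the pattern with three groups is hypothetical and chosen only for illustration.

```python
import re

m = re.search(r'(b\w)(g)(\d)?', 'bottle\nbag\nbig1')

print(m.group(0))      # bag  (the whole match)
print(m.group(1, 2))   # ('ba', 'g')
print(m.groups())      # ('ba', 'g', None): group 3 did not participate
print(m.groups('-'))   # ('ba', 'g', '-'): the argument is only a default

try:
    m.group(4)         # there is no group 4
    raised = False
except IndexError:
    raised = True
print(raised)          # True
```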

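Change the pattern so that it starts with b and ends with e:

regex = re.compile(r'(b\w+e)', re.M)

We are using search, so only one match is returned.

Now move the parentheses inward:

Group 0 is still the whole match starting at b, but b is no longer inside the captured group, and neither is e.

regex = re.compile(r'b(\w+)e', re.M)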

('ottl',)

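Switching to findall:

import re

s = '''bottle\nbag\nbig1\nbale'''

regex = re.compile(r'\b(\w+)e')  # raw string, otherwise '\b' is a backspace character

matcher = regex.findall(s)

print(matcher.groups(1))

This raises an error:

AttributeError: 'list' object has no attribute 'groups'

The pattern does match, but findall returns a plain list, which has no groups() method; the list has to be processed by hand.

Improved:

Use finditer and loop over the iterator it returns: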

import re

s = '''bottle\nbag\nbig1\nbale'''

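regex = re.compile(r'\b(\w+)(e)')

matchers = regex.finditer(s)

print(matchers)

for matcher in matchers:

    print(matcher.groups())

This captures more groups than needed. A group that matters can be given a name, which makes it easier to pick out:

regex = re.compile(r'b(\w+)(?P<TAIL>e)')

groupdict

A named group can still be accessed by its index, and in addition all named groups are available through groupdict().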

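Regex exercises

Match an email address

Extract the text from HTML tags

Get the <a> tag

Improved

A stricter check:

Note that \w does not match the equals sign or quotation marks.

Capture the value with a negated character class.

A more complex implementation:

Extract the parts through groups.

Improved:

To capture an attribute value that is quoted, use the quote as the boundary and negate it.

If the URL is followed by further attributes, the pattern can be changed to: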
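A minimal sketch of the quote-as-boundary idea, assuming a hypothetical HTML fragment and group names (url, text) that are not from the original. Regular expressions are only suitable for simple, well-formed fragments like this one.

```python
import re

# Hypothetical sample: an <a> tag whose href is followed by another attribute.
html = '<a href="http://www.example.com/index.html" target="_blank">Example</a>'

# [^"]+ takes everything up to, but not including, the closing quote;
# [^>]* skips any attributes before or after href.
regex = re.compile(r'<a\s+[^>]*?href="(?P<url>[^"]+)"[^>]*>(?P<text>[^<]+)</a>')

m = regex.search(html)
print(m.group('url'))    # http://www.example.com/index.html
print(m.group('text'))   # Example
```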

Match a URL

\S+ matches characters that are not whitespace, occurring at least once.
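One way to apply \S+ to URLs, as a rough sketch: anchor on the scheme and let \S+ consume everything up to the next whitespace. The sample text and the http/https restriction are assumptions made for illustration.

```python
import re

# Hypothetical sample text containing two URLs.
text = 'see http://www.example.com/a/b?x=1 and https://example.org/ for details'

# \S+ consumes every non-whitespace character after the scheme.
regex = re.compile(r'https?://\S+')
urls = regex.findall(text)
print(urls)
# ['http://www.example.com/a/b?x=1', 'https://example.org/']
```

This simple pattern would also swallow trailing punctuation that directly follows a URL, so real input may need a stricter character class.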

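Match ×××

Judge password strength

Replace every illegal character with the empty string, which shortens the string, then check the length.

Use \W to look for special symbols; everything else is a legal character, so this is sure to catch invalid passwords.

Require the underscore, the symbol people use least; in other words, the password must contain it.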
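The three checks above could be combined roughly as follows. The function name check_password, the minimum length of 8 and the returned labels are all hypothetical choices for this sketch.

```python
import re

def check_password(pw, min_len=8):
    """Hypothetical rules: length >= min_len, only word characters, must contain '_'."""
    if len(pw) < min_len:
        return 'too short'
    # Removing illegal (non-word) characters shortens the string if any are present.
    if len(re.sub(r'\W', '', pw)) != len(pw):
        return 'illegal character'
    # The underscore counts as a word character, so it survives the check above.
    if '_' not in pw:
        return 'weak'
    return 'ok'

for candidate in ('abc', 'abcdefg!h', 'abcdefgh', 'abc_defgh'):
    print(candidate, '->', check_password(candidate))
```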

Word count

import re

from collections import defaultdict

file_path = r'C:\Users\Administrator\Desktop\sample.txt'

s = '''index modules | next | previous | Python 3.5.3 Documentation The Python Standard Library 11. File and Directory Access '''

regex = re.compile(r'[^-\w]+')  # separators: anything that is neither a word character nor a hyphen

d = defaultdict(int)  # missing words start at 0

with open(file_path, 'r', encoding='utf-8') as f:

    for line in f:

        for x in regex.split(line):

            if x:  # skip the empty strings that split() yields at line edges

                d[x] += 1

print(sorted(d.items(), key=lambda x: x[1], reverse=True))
