Regular Expressions: Zero-Width Assertions

Date: 2021-01-19


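1. Introduction to zero-width assertions

A zero-width assertion matches a position rather than characters: the text it checks is not included in the match result. Its job is to test whether the text before or after a given position in the string matches a pattern.

1.1 Use cases

Exclusion search: find lines that do not contain a given string
Inclusion search: find lines that contain a given string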
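The two use cases can be sketched with a pair of lookaheads. This is only a minimal illustration: the sample lines and the variable names are made up for the example and do not come from a real log.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["error: disk full", "ok: backup done", "error: timeout"]

# Inclusion search: keep lines that contain "error".
contains_error = re.compile(r"^(?=.*error).*$")
# Exclusion search: keep lines that do not contain "error".
lacks_error = re.compile(r"^(?!.*error).*$")

included = [line for line in lines if contains_error.search(line)]
excluded = [line for line in lines if lacks_error.search(line)]

print(included)  # ['error: disk full', 'error: timeout']
print(excluded)  # ['ok: backup done']
```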

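2. Types of assertions

2.1 Positive lookahead

A positive lookahead, written (?=exp), matches the position immediately before text that matches exp. In other words, it requires that exp follows the current position, without consuming it.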

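import re

text = 'abcgwcab'
pattern = r'bc(?=gw)'
result = re.search(pattern, text)
print(result.group())

# Output
bc

Explanation: the engine matches bc, then checks that gw comes immediately after it. The lookahead succeeds without consuming gw, so the match is bc, and any pattern written after the assertion continues matching from the position right after bc.

Example:

pattern = r'bc(?=gw)ca'
# No match: after bc is matched and the assertion for gw succeeds, matching resumes right after bc, where the text is gwca, not ca.

pattern = r'bc(?=gw)gwca'
# Matches. Output
bcgwca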
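To make the "zero-width" part visible, the following sketch inspects the span of the match. It reuses the sample string from this section; the variable names are only illustrative.

```python
import re

text = 'abcgwcab'

match = re.search(r'bc(?=gw)', text)
print(match.group())  # bc
print(match.span())   # (1, 3)

# The characters checked by the lookahead are not consumed,
# so they start exactly where the match ended.
rest = text[match.end():]
print(rest)  # gwcab

# Because gw is not consumed, a pattern that skips it cannot match.
no_match = re.search(r'bc(?=gw)ca', text)
print(no_match)  # None
```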

2.2 Negative lookahead

A negative lookahead, written (?!exp), matches a position that is not followed by exp.

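import re

text = 'abcgwcab'
pattern = r'bc(?!ww)gw'
result = re.search(pattern, text)
print(result.group())

# Output
bcgw

Explanation: the engine matches bc, then checks that the text right after it is not ww. The assertion succeeds, and gw is then matched starting from the position right after bc.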
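As a further sketch, a negative lookahead can pick out only those occurrences of bc that are not followed by gw. The sample string below is hypothetical and was chosen so that one occurrence passes and one fails.

```python
import re

# Hypothetical sample: the first bc is followed by gw, the second is not.
text = 'abcgw abcxy'

# Collect the start index of every bc that is NOT followed by gw.
starts = [m.start() for m in re.finditer(r'bc(?!gw)', text)]
print(starts)  # [7]
```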

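2.3 Positive lookbehind

A positive lookbehind, written (?<=exp), matches the position immediately after text that matches exp. In other words, it requires that exp comes right before the current position.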

import re

text = 'abcgwcab'
pattern = r'(?<=gw)ca'
result = re.search(pattern, text)
print(result.group())

# Output
ca

Explanation: the engine looks for a position that is immediately preceded by gw. At that position the assertion succeeds and ca is matched starting there; gw itself is not part of the match.

Example:

import re

text = 'abcgwcab'
pattern = r'gw(?<=gw)cab'
result = re.search(pattern, text)
print(result.group())

# Output
gwcab
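A common use of a positive lookbehind is to extract a value that follows a fixed marker without capturing the marker itself. The sketch below assumes a made-up price string; it is not taken from the original examples.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'apple $30, pear $12, total 42'

# (?<=\$) requires a dollar sign right before the digits
# without including it in the match.
amounts = re.findall(r'(?<=\$)\d+', text)
print(amounts)  # ['30', '12']
```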

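2.4 Negative lookbehind

A negative lookbehind, written (?<!exp), matches a position that is not immediately preceded by exp. For example, in (?<!exp)gw the assertion fails if gw is preceded by exp and succeeds otherwise.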

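import re

text = 'abcgwcab'
pattern = r'(?<!bc)gw'
result = re.search(pattern, text)
# re.search returns None when nothing matches, so print the result itself;
# calling result.group() here would raise AttributeError.
print(result)

# Output
None

Explanation: the engine looks for gw and checks whether it is preceded by bc. Here the only gw is preceded by bc, so the assertion fails and re.search returns None. If gw were not preceded by bc, the assertion would succeed and gw would be matched.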
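When a negative lookbehind rules out every candidate, re.search returns None rather than a match object, so the result should be checked before calling group(). A minimal sketch, with a hypothetical helper name and a second made-up input string:

```python
import re

def find_gw_not_after_bc(text):
    """Return the matched text, or None if every gw is preceded by bc."""
    match = re.search(r'(?<!bc)gw', text)
    return match.group() if match else None

blocked = find_gw_not_after_bc('abcgwcab')  # gw is preceded by bc
allowed = find_gw_not_after_bc('axxgwcab')  # gw is preceded by xx

print(blocked)  # None
print(allowed)  # gw
```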

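Example:

import re

text = 'abcgwcab'
pattern = r'gw(?<!bc)ca'
result = re.search(pattern, text)
print(result.group())

# Output
gwca

'''
After gw is matched, the lookbehind checks the two characters before the current position. They are gw, not bc, so the assertion succeeds, and ca is then matched right after it. The output is therefore gwca.
'''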
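One practical limit worth knowing when writing lookbehinds for Python's built-in re module: the expression inside (?<=...) or (?<!...) must match text of a fixed length. The sketch below checks this with a small hypothetical helper, compiles, which is not part of the re module.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_ok = compiles(r'(?<!bc)gw')     # lookbehind of exactly two characters
variable_ok = compiles(r'(?<!b+)gw')  # lookbehind of unbounded length

print(fixed_ok)     # True
print(variable_ok)  # False
```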

3. Exclusion searches

3.1 Find strings that do not start with `baidu`

Source text

baidu.com
sina.com.cn

Code

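import re

source_str = 'baidu.com\nsina.com.cn'
str_list = source_str.split('\n')
print(str_list)
# ['baidu.com', 'sina.com.cn']

pattern = r'^(?!baidu).*$'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
sina.com.cn

Explanation: in ^(?!baidu).*$, ^ anchors the match at the start of the line, and the negative lookahead (?!baidu) requires that the text at that position is not baidu. If the assertion holds, .*$ matches the rest of the line.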
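As an alternative sketch, the same filter can run over the whole text at once, without splitting it into lines first. This assumes the re.MULTILINE flag, which makes ^ and $ match at the start and end of every line.

```python
import re

source_str = 'baidu.com\nsina.com.cn'

# With re.MULTILINE, ^ and $ anchor at each line, so one findall call
# returns every line that does not start with baidu.
matches = re.findall(r'^(?!baidu).*$', source_str, flags=re.MULTILINE)
print(matches)  # ['sina.com.cn']
```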

3.2 Find strings that do not end with `com`

Source text

baidu.com
sina.com.cn
www.educ.org
www.hao.cc
www.redhat.com

Code

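import re

source_str = 'baidu.com\nsina.com.cn\nwww.educ.org\nwww.hao.cc\nwww.redhat.com'
str_list = source_str.split('\n')
print(str_list)
# ['baidu.com', 'sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.redhat.com']

pattern = r'^.*?(?<!com)$'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
sina.com.cn
www.educ.org
www.hao.cc

Explanation: in '^.*?(?<!com)$', ^ anchors the match at the start of the line, and .*? is a lazy (non-greedy) quantifier that matches as few characters as possible. (?<!com) is a negative lookbehind requiring that the text right before the current position is not com, and $ anchors the end of the line. Together, (?<!com)$ matches the end of a string only when the characters just before it are not com.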
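One way to sanity-check this pattern is to compare it with a plain string test. The sketch below uses the same host names as the source text and assumes that str.endswith is an acceptable reference for "ends with com".

```python
import re

hosts = ['baidu.com', 'sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.redhat.com']

# Filter with the negative lookbehind from this section.
by_regex = [h for h in hosts if re.search(r'^.*?(?<!com)$', h)]
# Filter with a plain string method for comparison.
by_endswith = [h for h in hosts if not h.endswith('com')]

print(by_regex)                 # ['sina.com.cn', 'www.educ.org', 'www.hao.cc']
print(by_regex == by_endswith)  # True
```

For a single fixed suffix, the string method is simpler; the lookbehind becomes useful when the check has to be part of a larger pattern.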

3.3 Find lines that do not contain `world`

Source text

I hope the world will be peaceful
Thepeoplestheworldoverlovepeace
Imissyoueveryday
Aroundtheworldin80Days
I usually eat eggs at breakfast

Code

import re

source_str = 'I hope the world will be peaceful\nThepeoplestheworldoverlovepeace\nImissyoueveryday\nAroundtheworldin80Days\nI usually eat eggs at breakfast'
str_list = source_str.split('\n')
print(str_list)
# ['I hope the world will be peaceful', 'Thepeoplestheworldoverlovepeace', 'Imissyoueveryday', 'Aroundtheworldin80Days', 'I usually eat eggs at breakfast']

pattern = r'^(?!.*world).*$'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
Imissyoueveryday
I usually eat eggs at breakfast

Explanation: ^ anchors the match at the start of the line. (?!.*world) asserts that, from the start of the line, no run of characters ending in world follows, which means world does not occur anywhere in the line. Lines containing world are therefore excluded, and .*$ matches every other line in full.
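The same idea can be wrapped in a small reusable function. This is only a sketch: lines_without is a hypothetical helper name, and it assumes re.escape and re.MULTILINE so that any literal word can be excluded from a multi-line text.

```python
import re

def lines_without(word, text):
    """Return the lines of text that do not contain word."""
    # re.escape keeps characters such as '.' in word from acting as regex syntax.
    pattern = re.compile(r'^(?!.*' + re.escape(word) + r').*$', re.MULTILINE)
    return pattern.findall(text)

source_str = (
    'I hope the world will be peaceful\n'
    'Thepeoplestheworldoverlovepeace\n'
    'Imissyoueveryday\n'
    'Aroundtheworldin80Days\n'
    'I usually eat eggs at breakfast'
)

kept = lines_without('world', source_str)
print(kept)  # ['Imissyoueveryday', 'I usually eat eggs at breakfast']
```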

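4. Practical example

4.1 Log matching (1)

Extract the [ERROR] entries from a log file. There are two kinds of error entries: those with an _eMsg parameter and those without.

The requirement is to keep all error entries except the lines with _eMsg=400.

Source text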

[ERROR][2020-04-02T10:27:05.370+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=ac85e854d7600001b6970||spanid=8a0a0084||cspanid=||serviceName=||errormsg=get-driver-online-status timeou||_eMsg=Read timed out||_eTrace=java.net.SocketTimeoutException: Read timed out

[ERROR][2020-04-02T10:30:17.353+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=0f05e854e38984b3f1f20||spanid=8a980083||cspanid=||serviceName=||errormsg=Handle request failed||_eMsg=400 Bad Request||_eTrace=sprin.web.Exception$BadRequest: 400 Bad Request

[ERROR][2020-03-25T09:21:16.186+0800][spring.util.HttpPoolClientUtil] http get error

Code

import re

source_str = '[ERROR][2020-04-02T10:27:05.370+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=ac85e854d7600001b6970||spanid=8a0a0084||cspanid=||serviceName=||errormsg=get-driver-online-status timeou||_eMsg=Read timed out||_eTrace=java.net.SocketTimeoutException: Read timed out\
\n[ERROR][2020-04-02T10:30:17.353+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=0f05e854e38984b3f1f20||spanid=8a980083||cspanid=||serviceName=||errormsg=Handle request failed||_eMsg=400 Bad Request||_eTrace=sprin.web.Exception$BadRequest: 400 Bad Request\
\n[ERROR][2020-03-25T09:21:16.186+0800][spring.util.HttpPoolClientUtil]'
str_list = source_str.split('\n')
# print(str_list)

# A raw string keeps \[ and \] from being treated as string escape sequences.
pattern = r'(^\[ERROR\].*?_eMsg(?!=400).*$)|^\[ERROR\](?!.*_eMsg).*'
for line in str_list:
    result = re.search(pattern, line)
    if result:
        print(result.group())

# Output
[ERROR][2020-04-02T10:27:05.370+0800][clojure.fn__147.core.clj:1] _com_im_error||traceid=ac85e854d7600001b6970||spanid=8a0a0084||cspanid=||serviceName=||errormsg=get-driver-online-status timeou||_eMsg=Read timed out||_eTrace=java.net.SocketTimeoutException: Read timed out
[ERROR][2020-03-25T09:21:16.186+0800][spring.util.HttpPoolClientUtil]

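Explanation: (^\[ERROR\].*?_eMsg(?!=400).*$) matches [ERROR] at the start of the line; .*? is lazy and matches as few characters as possible until _eMsg is found. _eMsg(?!=400) then matches _eMsg only when it is not immediately followed by =400; otherwise this branch fails.

After the alternation operator |, ^\[ERROR\](?!.*_eMsg).* matches [ERROR] at the start of the line and then accepts only lines that do not contain _eMsg anywhere.

The second branch exists to match error entries that have no _eMsg field.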
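For this particular requirement, a single negative lookahead may be enough. The sketch below compares the two-branch pattern with a one-branch alternative on shortened, hypothetical log lines that follow the shape of the source text. The two patterns are not identical in general: they differ on a line that contains both _eMsg=400 and a second _eMsg with another value.

```python
import re

# Shortened, hypothetical log lines in the same shape as the source text.
log_lines = [
    '[ERROR][2020-04-02T10:27:05.370+0800] errormsg=timeout||_eMsg=Read timed out',
    '[ERROR][2020-04-02T10:30:17.353+0800] errormsg=failed||_eMsg=400 Bad Request',
    '[ERROR][2020-03-25T09:21:16.186+0800] http get error',
    '[INFO][2020-03-25T09:21:17.000+0800] started',
]

# Two-branch pattern from this section.
two_branch = re.compile(r'(^\[ERROR\].*?_eMsg(?!=400).*$)|^\[ERROR\](?!.*_eMsg).*')
# One-branch alternative: an [ERROR] line with no _eMsg=400 anywhere in it.
one_branch = re.compile(r'^\[ERROR\](?!.*_eMsg=400).*$')

kept_two = [line for line in log_lines if two_branch.search(line)]
kept_one = [line for line in log_lines if one_branch.search(line)]

print(kept_two == kept_one)  # True
print(len(kept_one))         # 2
```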