Essential Knowledge for Web Crawlers: Regular Expressions

Date: 2019-11-10

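In terms of libraries, I consider urllib, requests, re, BeautifulSoup, and concurrent.futures essential for writing web crawlers. This post summarizes how to use regular expressions with the re module.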

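1. What a regular expression is

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses filtering logic to apply to other strings.

Many programming languages support regular expressions for string processing, so they are not unique to Python. In Python, support is provided by the re module.

Regular expressions are a deep topic, so what follows only summarizes the points I have found most important in everyday use: common matching patterns, generic matching, greedy matching, group matching (exp), and the functions of the re library.

2. Common matching patterns in Python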

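\w      Matches a letter, digit, or underscore
\W      Matches any character that is not a letter, digit, or underscore
\s      Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S      Matches any non-whitespace character
\d      Matches any digit
\D      Matches any non-digit
\A      Matches the start of the string
\Z      Matches the end of the string (in Python's re only at the very end; in Perl-style engines also before a final newline)
\z      Matches the very end of the string in Perl-style engines; Python's re has traditionally used \Z for this
\G      Matches the position where the previous match ended (Perl-style engines; not supported by Python's re)
\n      Matches a newline
\t      Matches a tab
^       Matches the start of the string (and of each line when re.M is set)
$       Matches the end of the string, or just before a trailing newline
.       Matches any character except a newline; with re.DOTALL it matches any character, including a newline
[...]   Matches any one character from a set: [amk] matches a, m, or k
[^...]  Matches any one character not in the set: [^abc] matches anything except a, b, or c
*       Matches 0 or more repetitions of the preceding expression (greedy)
+       Matches 1 or more repetitions of the preceding expression (greedy)
?       Matches 0 or 1 repetition of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
{n}     Matches exactly n repetitions of the preceding expression
{n,m}   Matches n to m repetitions of the preceding expression (greedy)
a|b     Matches a or b
()      Groups the enclosed expression and captures what it matches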
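
To make the table more concrete, here is a minimal sketch that exercises a handful of these patterns with re.findall and re.split. The sample string is invented for illustration and does not come from the original post.

```python
import re

# Sample text invented for illustration; "\t" here is a real tab character.
text = "user_42 paid 19 USD\ton 2019-11-10"

words = re.findall(r"\w+", text)                # runs of letters, digits, underscores
numbers = re.findall(r"\d+", text)              # runs of digits
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)  # {n}: exactly n repetitions
letters = re.findall(r"[^\d\s]+", "a1 b2")      # negated set: neither digit nor whitespace
currency = re.findall(r"USD|EUR", text)         # alternation: either side of |
fields = re.split(r"\s+", text)                 # \s covers both the spaces and the tab

print(words)
print(numbers)
print(dates)
print(letters)
print(currency)
print(fields)
```

Writing patterns as raw strings (r"...") keeps the backslashes from being interpreted by Python before the re module sees them.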

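3. Using the re library

(1) The match function

Function prototype: def match(pattern, string, flags=0)

match tries to match the pattern at the start of the string. If the start of the string does not match, it returns None.

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())  # the matched text
print(result.span())   # the (start, end) indices of the match

Output: the match object, then the matched text (here the whole string), then the span (0, 41).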
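
Because match returns None when the start of the string does not fit the pattern, calling group() directly on the result can fail. The sketch below guards against that; describe is a hypothetical helper introduced only for this illustration.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"


def describe(pattern, text):
    """Hypothetical helper: return (matched text, span), or None if the start does not match."""
    result = re.match(pattern, text)
    if result is None:
        return None
    return result.group(), result.span()


print(describe(r"hello\s\d{3}", content))  # the string starts with "hello 123"
print(describe(r"\d{3}", content))         # the string does not start with digits
```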

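(2) Generic matching

The regular expression above is more complicated than it needs to be. It can be simplified like this:

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

The output is the same, and the pattern is much shorter: it starts with hello, matches any character zero or more times in the middle, and ends with Demo.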

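(3) Group matching

To extract a specific part of the string, wrap that part of the pattern in parentheses () to form a group:

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+).*Demo$', content)
print(result.group())   # the whole match
print(result.group(1))  # the first group

Output: the whole string, followed by 123.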
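
A pattern can contain several groups. As a small sketch on the same sample string, groups() returns all captured groups as a tuple, and group(n) and span(n) address a single group by its number.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r"^hello\s(\d+)\s(\d+)\s(\w+)", content)

all_groups = result.groups()   # every captured group, in order
second = result.group(2)       # groups are numbered from 1
word_span = result.span(3)     # (start, end) indices of the third group

print(all_groups)
print(second)
print(word_span)
```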

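(4) Named group matching

(?P<name>exp) matches exp and captures the text into a group called name. (Some other engines write this as (?<name>exp) or (?'name'exp); Python's re uses the (?P<name>...) form.)

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(?P<num>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

Output: the whole string, then 123, then {'num': '123'}.

With a named group, the captured text can be looked up by the key 'num'.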
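
As a short sketch of looking up captures by name, the group names first and second below are chosen for this illustration only. A named group can be read with group("name"), by indexing the match object (Python 3.6 and later), or through groupdict().

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r"^hello\s(?P<first>\d+)\s(?P<second>\d+)", content)

first = result.group("first")   # look up by name
second = result["second"]       # indexing the match object, Python 3.6+
as_dict = result.groupdict()    # all named groups as a dict

print(first)
print(second)
print(as_dict)
```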

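(5) Greedy matching

Greedy matching means a quantifier consumes as much text as it can and only gives characters back when the rest of the pattern would otherwise fail.

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

Output: the whole string, then 7, then {'name': '7'}.

The group captures only 7 because the leading .* has swallowed everything up to the last digit, leaving a single digit for \d+. That is greedy matching.

To match non-greedily, add a question mark (?) after the quantifier:

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello.*?(?P<name>\d+).*Demo$', content)
print(result.group())
print(result.group(1))
print(result.groupdict())

This time the group captures 123.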
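
The difference matters a lot when scraping markup. The following sketch uses an invented HTML fragment to compare the greedy .* with the non-greedy .*? between a pair of tags.

```python
import re

# HTML fragment invented for illustration.
html = "<b>first</b> and <b>second</b>"

greedy = re.findall(r"<b>(.*)</b>", html)   # .* runs to the last </b>
lazy = re.findall(r"<b>(.*?)</b>", html)    # .*? stops at the nearest </b>

print(greedy)
print(lazy)
```

The greedy version returns one long capture that spans both tags, while the non-greedy version returns each tag's content separately.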

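(6) Passing matching flags to a function

The third parameter of def match(pattern, string, flags=0), flags, sets the matching mode:

re.I: makes matching case-insensitive

re.L: makes matching locale-dependent (in Python 3 this applies only to bytes patterns)

re.S: makes . match every character, including a newline

re.M: multi-line mode; ^ and $ also match at the start and end of each line

re.U: interprets characters according to the Unicode character set, which affects \w, \W, \b, and \B (the default for str patterns in Python 3)

re.X: verbose mode; whitespace and comments are allowed inside the pattern so that it can be written more readably

Here are examples using re.I and re.S:

content = "heLLo 123 4567 World_This is a regex Demo"
result = re.match(r'hello', content, re.I)
print(result.group())

Output: heLLo

Without re.S:

content = '''heLLo 123 4567 World_This is 
a regex Demo'''
result = re.match(r'.*', content)
print(result.group())

Output: heLLo 123 4567 World_This is

With re.S, . also matches the newline, so the whole two-line string is matched:

content = '''heLLo 123 4567 World_This is 
a regex Demo'''
result = re.match(r'.*', content, re.S)
print(result.group())

Most functions in the re library accept this flags parameter.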
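
The flags not demonstrated above work the same way, and several flags can be combined with the | operator. This is a minimal sketch with invented sample text, showing re.I together with re.M, and a pattern written in re.X (verbose) style.

```python
import re

# Sample text invented for illustration.
text = "Title: Regex\ntitle: Crawler"

# re.I ignores case; re.M lets ^ and $ match at every line boundary.
titles = re.findall(r"^title:\s*(\w+)$", text, re.I | re.M)

# re.X ignores whitespace in the pattern and allows # comments.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)
date_parts = date_pattern.search("Date 2019-11-10").groups()

print(titles)
print(date_parts)
```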

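(7) The search function

Function prototype: def search(pattern, string, flags=0)

search scans the whole string and returns the first successful match.

content = '''hahhaha hello 123 4567 world'''
result = re.search(r'hello.*world', content)
print(result.group())

Output: hello 123 4567 world. If search is replaced with match, nothing is matched because the string does not start with hello; match returns None, and calling result.group() then raises an AttributeError.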
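
The contrast between the two functions can be checked directly. This sketch runs the same pattern through match and search on the string above and inspects the results without calling group() on None.

```python
import re

content = "hahhaha hello 123 4567 world"

at_start = re.match(r"hello.*world", content)   # anchored at position 0
anywhere = re.search(r"hello.*world", content)  # scans forward for the first match

print(at_start)          # None: the string starts with "hahhaha"
print(anywhere.group())
print(anywhere.span())
```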

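(8) The findall function

Function prototype: def findall(pattern, string, flags=0)

findall searches the string and returns every matching substring as a list. If the pattern contains a group, the list holds the text captured by that group.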

content= '''
    <url>
        <loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Aland-Islands-2</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Albania-3</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/Algeria-4</loc>
    </url>
    <url>
        <loc>http://example.webscraping.com/places/default/view/American-Samoa-5</loc>
    </url>'''
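# Use ASCII parentheses for the group; full-width parentheses would be matched literally
urls = re.findall(r'<loc>(.*)</loc>', content)
for url in urls:
    print(url)

Output: the five URLs inside the <loc> tags, one per line, from .../view/Afghanistan-1 through .../view/American-Samoa-5.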
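
When a pattern has more than one group, findall returns a list of tuples, and re.finditer yields match objects instead of plain strings. The sketch below uses a shortened two-entry copy of the sitemap above; the pattern that splits off the country name and number is my own assumption about how one might use these URLs.

```python
import re

# Shortened copy of the sitemap above, two entries only.
sitemap = """
<url><loc>http://example.webscraping.com/places/default/view/Afghanistan-1</loc></url>
<url><loc>http://example.webscraping.com/places/default/view/Aland-Islands-2</loc></url>
"""

# Two groups: findall returns one (name, number) tuple per match.
pairs = re.findall(r"<loc>.*?/view/([\w-]+)-(\d+)</loc>", sitemap)

# finditer yields match objects, so group() and span() remain available.
urls = [m.group(1) for m in re.finditer(r"<loc>(.*?)</loc>", sitemap)]

print(pairs)
print(urls)
```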

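(9) The sub function

Function prototype: def sub(pattern, repl, string, count=0, flags=0)

sub replaces every matching substring and returns the resulting string.

content = '''hahhaha hello 123 4567 world'''
new_content = re.sub(r'hello.*world', 'zhangsan', content)  # avoid shadowing the built-in str
print(new_content)

Output: hahhaha zhangsan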
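
The remaining parameters of sub are worth a quick look. As a sketch on the same string: count limits the number of replacements, the replacement text can refer back to captured groups with \1 and \2, and the related function re.subn also reports how many replacements were made.

```python
import re

content = "hahhaha hello 123 4567 world"

masked = re.sub(r"\d", "#", content, count=3)            # replace only the first 3 digits
swapped = re.sub(r"(\d+) (\d+)", r"\2 \1", content)      # backreferences swap the two numbers
replaced, how_many = re.subn(r"\d+", "N", content)       # subn returns (new string, count)

print(masked)
print(swapped)
print(replaced, how_many)
```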

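(10) compile

Function prototype: def compile(pattern, flags=0)

compile turns a regular expression into a pattern object so that the same expression can be reused conveniently.

content = '''hahhaha hello 123 4567 world'''
pattern = r'hello.*'
regex = re.compile(pattern)
new_content = regex.sub('zhangsan', content)  # equivalent to re.sub(regex, 'zhangsan', content)
print(new_content)

Output: hahhaha zhangsan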
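
A compiled pattern object offers the same operations as the module-level functions, as methods. The sketch below reuses one compiled pattern for findall, search, and sub; the variable name number is chosen only for this illustration.

```python
import re

content = "hahhaha hello 123 4567 world"
number = re.compile(r"\d+")  # compile once, reuse below

all_numbers = number.findall(content)        # every run of digits
first_number = number.search(content).group()  # the first run of digits
without_numbers = number.sub("N", content)   # replace every run of digits

print(all_numbers)
print(first_number)
print(without_numbers)
```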

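Regular expressions can look complicated at first, but taken one step at a time they turn out to be less difficult than they seem, and they make our code a good deal more concise.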
