Notes on regex-based web scraping in Python

Date: 2020-06-27

1. re.match() matches from the start of the string. For example, to match "hello", the first character of the pattern must be h.

(1) re.match() with the pattern '^hello.*Demo$' matches the whole string, since the whole string fits the regex

import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match('^hello.*Demo$', content)
print(result.group())
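
One thing the example above does not show is what happens when the pattern fails at the start of the string: re.match() returns None, and calling .group() on None raises AttributeError. A minimal sketch of a guarded call follows; the helper name match_or_none is hypothetical and exists only for illustration.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

def match_or_none(pattern, text):
    """Return the matched text, or None if re.match() finds nothing at the start."""
    result = re.match(pattern, text)
    return result.group() if result else None

# re.match() is already anchored at the start, so the leading ^ is optional here
whole = match_or_none(r'hello.*Demo$', content)
# "ello" does not occur at position 0, so re.match() returns None
missing = match_or_none(r'ello', content)

print(whole)
print(missing)
```

The trailing $ is still needed when the pattern must reach the end of the string, because re.match() anchors only the start.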

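(2) Parentheses () and group(1): capture a substring inside the string, for example digits with (\d+)

group() returns the whole match; group(1) returns the text captured by the first pair of parentheses, group(2) the second, and so on.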

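import re
content = "hello 123 4567 World_This is a regex Demo"
# raw string (r'...') keeps \s and \d from being read as string escapes
result = re.match(r'^hello\s(\d+)\s(\d+)\sWorld.*Demo$', content)
print(result.group(2))  # second capture group: 4567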
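
To make the relationship between group(), group(1) and group(2) concrete, here is a small sketch on the same string. It uses only standard match-object methods; the variable names are mine.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+)\s(\d+)\sWorld.*Demo$', content)

whole = result.group(0)    # same as result.group(): the entire match
first = result.group(1)    # text captured by the first ()
second = result.group(2)   # text captured by the second ()
both = result.groups()     # every captured group, as a tuple

print(first, second, both)
```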

(3) \s matches exactly one whitespace character. If there are several spaces in a row (for example between hello and 123), use \s+ instead.
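
A short sketch of the difference, assuming a sample string with four spaces between hello and 123 (the input is made up for illustration):

```python
import re

# Assumed sample input: several spaces between "hello" and "123"
content = "hello    123 4567 World_This is a regex Demo"

single = re.match(r'^hello\s(\d+)', content)     # exactly one whitespace character
multiple = re.match(r'^hello\s+(\d+)', content)  # one or more whitespace characters

print(single)             # no match: a space follows the first space, not a digit
print(multiple.group(1))
```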

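(4) To match spaces or any other text, use .* . It is greedy, which affects what the rest of the pattern can match, as in .*(\d+), so the non-greedy .*? can be used in place of \s+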

4.1 Greedy mode

import re
content = "hello 123 4567 World_This is a regex Demo"
# greedy .* swallows " 12" and leaves only one digit for the first (\d+)
result = re.match(r'^hello.*(\d+)\s(\d+)\sWorld.*Demo$', content)
print(result.group(1))

Output: 3

4.2 Non-greedy mode

import re
content = "hello 123 4567 World_This is a regex Demo"
# non-greedy .*? consumes as little as possible, so (\d+) gets all of 123
result = re.match(r'^hello.*?(\d+).*?(\d+)\sWorld.*Demo$', content)
print(result.group(1))

Output: 123

(5) Matching "123 4567" with (.*?)

import re
content = "hello 123 4567 World_This is a regex Demo"
# (.*?) captures everything between the whitespace after hello and the whitespace before World
result = re.match(r'^hello\s+(.*?)\s+World.*Demo$', content)
print(result.group(1))

Output: 123 4567

To match special characters literally, escape them with a backslash: $5.00 becomes \$5\.00
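
As a sketch of the escaping rule, the snippet below compares a hand-escaped pattern, the standard-library helper re.escape(), and the unescaped form. The sample sentence is an assumption made up for the example.

```python
import re

price_text = "The price is $5.00 today"  # assumed sample sentence

escaped_by_hand = r'\$5\.00'
escaped_by_lib = re.escape('$5.00')  # re.escape() adds the backslashes itself

hand = re.search(escaped_by_hand, price_text)
lib = re.search(escaped_by_lib, price_text)
# Unescaped, $ means "end of string", so nothing can follow it and the search fails
unescaped = re.search('$5.00', price_text)

print(hand.group(), lib.group(), unescaped)
```

re.escape() is handy when the literal text comes from a variable rather than being typed into the pattern.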

2. re.search() scans the whole string. For "hello", the pattern does not have to start with h; it can start with e.

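Other articles online tend to be muddled: they do not explain the difference between re.match and re.search and only show a single re.search example, which does not help beginners really understand how the two differ.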

(1) Reusing the earlier example to match "123 4567"

import re
content = "hello 123 4567 World_This is a regex Demo"
# the pattern starts at "ello"; with re.match() it would have to start at h
result = re.search(r'ello\s+(.*?)\s+World.*Demo$', content)
print(result.group(1))

Output: 123 4567
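
To put the two functions side by side, the sketch below runs the same pattern through re.match() and re.search(). Only standard re calls are used; the variable names are mine.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"
pattern = r'ello\s+(.*?)\s+World.*Demo$'

by_match = re.match(pattern, content)    # only tries position 0, where "h" is
by_search = re.search(pattern, content)  # tries each position until one matches

print(by_match)             # None, because the string starts with h, not e
print(by_search.group(1))
print(by_search.start())    # index where the match begins
```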

(2) Matching the content of a specific tag, for example <li data-view="4" class="active">, using the form .*?active.*?xxxxx

re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)  # when there are several <li> tags and only the target has active, putting active in the pattern picks out that tag: .*?active.*?xxxxx

Any tag can be targeted this way. When the tags differ only in something like active, a regex can be simpler than BeautifulSoup.
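
The note does not include the page being scraped, so the sketch below runs the same pattern against a hypothetical HTML fragment. The markup, attribute values and names are all made up for illustration.

```python
import re

# Hypothetical HTML fragment, invented for this example
html = '''
<ul>
  <li data-view="2"><a href="/1.mp3" singer="Singer A">Song A</a></li>
  <li data-view="4" class="active">
    <a href="/2.mp3" singer="Singer B">Song B</a>
  </li>
  <li data-view="6"><a href="/3.mp3" singer="Singer C">Song C</a></li>
</ul>
'''

# re.S lets . match newlines, which is needed because the active <li> spans several lines
result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
singer, song = result.group(1), result.group(2)
print(singer, song)
```

Note that the overall match starts at the first <li in the fragment; it is the captured groups that come from the part after active.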

3. re.findall returns its matches as a list instead of through group(1)

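It returns the matched content as a list; the call syntax is the same as for re.search.

re.search: read the captured text through group(1)

re.findall: read the captured text through list indexing, not group(1)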
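
The two return shapes can be compared in a small sketch. The HTML fragment here is hypothetical and shortened so that the pattern finds two hits.

```python
import re

# Hypothetical HTML fragment, invented for this example
html = '''
<li><a singer="Singer A">Song A</a></li>
<li><a singer="Singer B">Song B</a></li>
'''
pattern = r'singer="(.*?)">(.*?)</a>'

first = re.search(pattern, html, re.S)        # one match object, for the first hit only
everything = re.findall(pattern, html, re.S)  # a list with one tuple per hit

print(first.group(1), first.group(2))
for singer, song in everything:
    print(singer, song)
```

With two capture groups in the pattern, each list element is a 2-tuple; with a single group, re.findall returns a plain list of strings.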

result = re.findall('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)

Output: [('齊秦', '往事隨風')]. Each element of the list is a tuple.

print(result)

for x in result: