Regular Expressions Commonly Used in Web Scraping, and Using re.findall

Date: 2019-11-06

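Regular expressions commonly used in web scraping

These are some regular expressions that come up often in web scraping; they help us process strings more effectively.

Regex symbols

Single characters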

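. : any character except a newline
[] : [aoe] [a-w] matches any one character in the set
\d : a digit, [0-9]
\D : a non-digit
\w : a digit, letter, underscore, or (for str patterns in Python 3) a Chinese character
\W : anything \w does not match
\s : any whitespace character, including space, tab, form feed, and so on. For ASCII text, equivalent to [ \f\n\r\t\v]
\S : any non-whitespace character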
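
The character classes above can be tried out with a short sketch like the following. The sample strings are made up for illustration; only `re.findall` from the standard library is used.

```python
import re

# Hypothetical sample text, chosen only to exercise each class.
text = "id_42 costs 3 yuan\n"

digits = re.findall(r"\d", text)       # one match per digit
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)       # spaces and the trailing newline
vowels = re.findall(r"[aoe]", text)    # any one character from the set
no_newline = re.findall(r".", "a\nb")  # . skips the newline by default
chinese = re.findall(r"\w+", "身高 170")  # \w also matches Chinese characters

print(digits)      # ['4', '2', '3']
print(words)       # ['id_42', 'costs', '3', 'yuan']
print(spaces)      # [' ', ' ', ' ', '\n']
print(vowels)      # ['o', 'a']
print(no_newline)  # ['a', 'b']
print(chinese)     # ['身高', '170']
```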

Quantifiers

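* : any number of times, >= 0
+ : at least once, >= 1
? : optional, 0 or 1 time
{m} : exactly m times
{m,} : at least m times, e.g. hello{3,} (the quantifier applies only to the final o)
{m,n} : between m and n times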
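
As a rough illustration of the quantifiers listed above, the sketch below runs each one against the same made-up string, so the differences show up side by side.

```python
import re

# Hypothetical sample: the letter a repeated 0, 1, 2 and 3 times between two s.
text = "ss sas saas saaas"

star = re.findall(r"sa*s", text)        # a repeated 0 or more times
plus = re.findall(r"sa+s", text)        # a repeated 1 or more times
optional = re.findall(r"sa?s", text)    # a repeated 0 or 1 time
exact = re.findall(r"sa{2}s", text)     # a repeated exactly 2 times
at_least = re.findall(r"sa{2,}s", text) # a repeated at least 2 times
ranged = re.findall(r"sa{1,2}s", text)  # a repeated 1 to 2 times

print(star)      # ['ss', 'sas', 'saas', 'saaas']
print(plus)      # ['sas', 'saas', 'saaas']
print(optional)  # ['ss', 'sas']
print(exact)     # ['saas']
print(at_least)  # ['saas', 'saaas']
print(ranged)    # ['sas', 'saas']
```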

Boundaries

$ : ends with the preceding pattern
^ : starts with the following pattern
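
A small sketch of how the two anchors behave, using a made-up two-line string. By default `^` and `$` refer to the start and end of the whole string; with `re.M` they also apply to each line.

```python
import re

# Hypothetical two-line sample.
text = "i love you\nyou love me"

starts_with_i = re.findall(r"^i", text)        # start of the whole string
ends_with_me = re.findall(r"me$", text)        # end of the whole string
you_default = re.findall(r"^you", text)        # second line is not the string start
you_multiline = re.findall(r"^you", text, re.M)  # re.M: ^ matches at each line start
you_line_end = re.findall(r"you$", text, re.M)   # re.M: $ matches at each line end

print(starts_with_i)  # ['i']
print(ends_with_me)   # ['me']
print(you_default)    # []
print(you_multiline)  # ['you']
print(you_line_end)   # ['you']
```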

Grouping

(ab)
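
Grouping changes what `re.findall` returns, which is easy to miss. The sketch below (sample strings are made up) shows that with one capturing group the result holds the group's text, with several groups it holds tuples, and with a non-capturing group `(?:...)` it holds the whole match.

```python
import re

# Hypothetical sample strings.
text = "abab ab abc"

whole = re.findall(r"(?:ab)+", text)  # non-capturing: returns the full match
captured = re.findall(r"(ab)+", text) # capturing: returns the last repetition of the group
pairs = re.findall(r"(\w+)=(\d+)", "a=1 b=22")  # two groups: list of tuples

print(whole)     # ['abab', 'ab', 'ab']
print(captured)  # ['ab', 'ab', 'ab']
print(pairs)     # [('a', '1'), ('b', '22')]
```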

Greedy mode

.*

Non-greedy (lazy) mode

.*?
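
The difference between `.*` and `.*?` is clearest when a string contains more than one candidate ending. A minimal sketch, using a made-up HTML fragment:

```python
import re

# Hypothetical fragment with two <h1> elements.
html = "<h1>first</h1><h1>second</h1>"

# Greedy: .* runs to the LAST </h1>, swallowing the tags in between.
greedy = re.findall(r"<h1>(.*)</h1>", html)

# Lazy: .*? stops at the FIRST </h1> after each <h1>.
lazy = re.findall(r"<h1>(.*?)</h1>", html)

print(greedy)  # ['first</h1><h1>second']
print(lazy)    # ['first', 'second']
```

When scraping pages that repeat the same tag, the lazy form is usually the one that gives one result per element.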

import re

# 1 Extract 'python'
'''
key = 'javapythonc++php'

re.findall('python',key)
re.findall('python',key)[0]
'''
# 2 Extract 'hello word'
'''
key = '<html><h1>hello word</h1></html>'
print(re.findall('<h1>.*</h1>', key))
print(re.findall('<h1>(.*)</h1>', key))
print(re.findall('<h1>(.*)</h1>', key)[0])
'''
# 3 Extract 170
'''
key = 'This girl is 170 cm tall'
print(re.findall(r'\d+', key)[0])
'''
# 4 Extract http:// and https://
'''
key = 'http://www.baidu.com and https://www.cnblogs.com'
print(re.findall('https?://', key))
'''
# 5 Extract hello together with its mixed-case html tags
'''
key = 'lalala<hTml>hello</HtMl>hahaha'   # expected result: <hTml>hello</HtMl>
print(re.findall('<[hH][tT][mM][lL]>.*</[hH][tT][mM][lL]>',key))
'''
# 6 Extract 'hit.' with a non-greedy match (greedy mode matches as much as possible)
'''
key = 'qiang@hit.edu.com'                # with ? the match is non-greedy; without ? it is greedy
print(re.findall(r'h.*?\.', key))
'''
# 7 Match every saas and sas
'''
key = 'saas and sas and saaas'
print(re.findall('sa{1,2}s',key))
'''
# 8 Match the lines that start with i
'''
key = """fall in love with you
i love you very much 
i love she
i love her
"""
print(re.findall('^i.*', key, re.M))
'''
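# 9 Match all lines
'''
key = """
<div>Scary when you think about it
Your teammate is studying,
Your best friend is losing weight,
Your enemy is sharpening a knife,
Old Wang next door is working on his core.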
</div>
"""
print(re.findall('.*', key, re.S))
'''

Example exercises (the numbered cases above)

Using re.findall

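1. re.findall can match across multiple lines, and it returns different results depending on the flag passed as the third argument.

re.findall(pattern, string, re.M)
    - re.M : multiline mode; ^ and $ match at the start and end of every line
    - re.S : single-line (dot-all) mode; . also matches newlines, so line breaks show up as \n in the result
    - re.I : ignore case
    - re.sub(pattern, replacement, string) : not a flag but a separate function that replaces every match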
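
The flags above can be compared on one made-up string, as in the sketch below. It also shows `re.sub` and combining flags with `|`; all sample text is hypothetical.

```python
import re

# Hypothetical two-line sample.
text = "Hello\nhello world"

ignore_case = re.findall(r"hello", text, re.I)          # re.I: case-insensitive
line_starts = re.findall(r"^hello", text, re.M | re.I)  # flags combine with |
no_dotall = re.findall(r"Hello.*world", text)           # . stops at the newline
dotall = re.findall(r"Hello.*world", text, re.S)        # re.S: . crosses the newline

# re.sub replaces every match of the pattern with the replacement text.
masked = re.sub(r"\d+", "N", "room 12, floor 3")

print(ignore_case)  # ['Hello', 'hello']
print(line_starts)  # ['Hello', 'hello']
print(no_dotall)    # []
print(dotall)       # ['Hello\nhello world']
print(masked)       # room N, floor N
```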
