Python_Crawler_Foundation1-2_MYSQL_Regular Expression

Date: 2019-11-12

MySQL

https://www.tutorialspoint.com/python/python_database_access.htm

Regular Expression

2. Regular expression examples

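^[A-Za-z]+$                  a string made up of the 26 letters
^[A-Za-z0-9]+$               a string made up of the 26 letters and digits
^-?\d+$                      a string in integer form
^[0-9]*[1-9][0-9]*$          a string in positive-integer form
[1-9]\d{5}                   a mainland China postal code, 6 digits
[\u4e00-\u9fa5]              matches a Chinese character
\d{3}-\d{8}|\d{4}-\d{7}      a domestic (China) phone number, e.g. 010-68913536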
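As a quick check of a few of the patterns in this list, here is a minimal sketch. The helper names (is_letters, is_integer, find_phones) and the sample inputs are made up for illustration and are not part of the original list.

```python
import re

# Patterns copied from the list, written as raw strings.
LETTERS = re.compile(r'^[A-Za-z]+$')
INTEGER = re.compile(r'^-?\d+$')
PHONE = re.compile(r'\d{3}-\d{8}|\d{4}-\d{7}')


def is_letters(text):
    """True if text consists only of the 26 letters (hypothetical helper)."""
    return LETTERS.match(text) is not None


def is_integer(text):
    """True if text looks like an integer, with an optional minus sign."""
    return INTEGER.match(text) is not None


def find_phones(text):
    """Return every phone-number-shaped substring found in text."""
    return PHONE.findall(text)


print(is_letters('Python'))    # True
print(is_letters('Python3'))   # False
print(is_integer('-42'))       # True
print(find_phones('Call 010-68913536 today'))  # ['010-68913536']
```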



A regular expression for an IP address in string form (an IP address has 4 segments, each in the range 0-255)
\d+\.\d+\.\d+\.\d+ or \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Exact form: (([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])
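The difference between the loose and the exact form can be sketched as follows. The ^ and $ anchors, the names OCTET, IP_LOOSE, IP_STRICT and is_ipv4, and the sample addresses are assumptions added here so that the whole string is checked; they are not in the original patterns.

```python
import re

# One segment in the range 0-255, as in the exact form.
OCTET = r'([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'

# Note the escaped dot: a bare '.' would match any character.
IP_LOOSE = re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')
IP_STRICT = re.compile(r'^(' + OCTET + r'\.){3}' + OCTET + r'$')


def is_ipv4(text):
    """True if text is four dot-separated segments, each 0-255."""
    return IP_STRICT.match(text) is not None


print(is_ipv4('192.168.0.1'))                     # True
print(is_ipv4('256.1.1.1'))                       # False
print(IP_LOOSE.match('256.1.1.1') is not None)    # True: the loose form accepts 256
```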

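3. Notes on regular expressions

(1) Greedy and non-greedy quantifiers

Regular expressions are generally used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, searching "abbbc" with the regular expression "ab*" finds "abbb", whereas the non-greedy quantifier "ab*?" finds "a".

Note: we usually use non-greedy mode when extracting content.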
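The "abbbc" example can be run directly. The second half is a sketch of why non-greedy mode is handy for extraction; the html string is made-up sample data.

```python
import re

text = 'abbbc'
greedy = re.search(r'ab*', text).group()    # takes as many b's as possible
lazy = re.search(r'ab*?', text).group()     # takes as few b's as possible
print(greedy)  # abbb
print(lazy)    # a

# Made-up sample: extract the content of each <b> tag.
html = '<b>one</b><b>two</b>'
greedy_tags = re.findall(r'<b>(.*)</b>', html)
lazy_tags = re.findall(r'<b>(.*?)</b>', html)
print(greedy_tags)  # ['one</b><b>two']
print(lazy_tags)    # ['one', 'two']
```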

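(2) The backslash problem

As in most programming languages, regular expressions use "\" as the escape character, which can cause backslash trouble. If you need to match a literal "\" in text, the regular expression written as an ordinary string literal needs four backslashes, "\\\\": the first two and the last two are each turned into a single backslash by the programming language, and the resulting two backslashes are then turned into a single backslash by the regular expression engine.

Python's raw strings solve this problem neatly: the regular expression in this example can be written as r"\\". Likewise, "\\d", which matches one digit, can be written as r"\d". With raw strings you no longer need to worry about a missing backslash, and the expressions you write are easier to read.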
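A small sketch showing that the ordinary and the raw spelling produce the same pattern. The path string is a made-up sample.

```python
import re

# The text itself contains ONE backslash: C:\temp
path = 'C:\\temp'

# Four backslashes in an ordinary literal, two in a raw string:
plain_pattern = '\\\\'
raw_pattern = r'\\'
same_pattern = plain_pattern == raw_pattern
print(same_pattern)  # True

found = re.search(raw_pattern, path).group()
print(found == '\\')  # True: a single backslash was matched

# The same holds for \d: '\\d' and r'\d' are the same two characters.
same_digit = '\\d' == r'\d'
print(same_digit)  # True
```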

4. The Python re module

Python ships with the re module, which provides regular expression support. The main functions are listed below.

# Returns a pattern object
re.compile(string[, flag])
# The functions below perform matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])


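Before introducing these functions, let us first look at the concept of a pattern. A pattern can be thought of as a matching rule. How do we obtain one? Simply call re.compile. For example:

pattern = re.compile(r'hello')

We pass a raw string as the argument; compile turns it into a pattern object, which we then use for further matching.

You may also have noticed another parameter, flags. Here is what it means:

The flags parameter sets the matching mode. Values can be combined with the bitwise OR operator '|' so that several take effect at once, for example re.I | re.M.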

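Possible values:

• re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
• re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
• re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag is only accepted with bytes patterns)
• re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (already the default for str patterns in Python 3)
• re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.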
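A minimal sketch of a few of these flags in action. The sample strings and variable names are made up for illustration.

```python
import re

text = 'Hello\nhello world'

# Without flags, ^ only matches at the very start and case matters.
no_flags = re.findall(r'^hello', text)
# re.I | re.M: ignore case, and let ^ match at the start of every line.
both_flags = re.findall(r'^hello', text, re.I | re.M)
print(no_flags)    # []
print(both_flags)  # ['Hello', 'hello']

# re.S lets '.' match a newline as well.
dot_default = re.search(r'a.b', 'a\nb')
dot_all = re.search(r'a.b', 'a\nb', re.S)
print(dot_default)           # None
print(dot_all is not None)   # True

# re.X allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+     # integer part
    \.      # decimal point
    \d*     # fractional digits
""", re.X)
decimal = number.search('pi is 3.14').group()
print(decimal)  # 3.14
```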

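The other functions mentioned earlier, such as re.match, are where this pattern is used. We go through them one by one below.

Note: in the seven functions below, flags likewise means the matching mode. If flags were already specified when the pattern was compiled, there is no need to pass them again in the functions below.

(1) re.match(pattern, string[, flags])

This function starts at the beginning of string (the string we want to match) and tries to match pattern, moving forward. If it hits a character that cannot be matched, it returns None immediately; if it reaches the end of string before the pattern is complete, it also returns None. Both results mean the match failed. Otherwise the pattern matches successfully, matching stops, and the rest of string is not examined. Let us look at an example.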

__author__ = 'CQC'
# -*- coding: utf-8 -*-

# Import the re module
import re

# Compile the regular expression into a Pattern object;
# the r in front of 'hello' marks a raw string
pattern = re.compile(r'hello')

# Match text with re.match; None is returned when there is no match
result1 = re.match(pattern, 'hello')
result2 = re.match(pattern, 'helloo CQC!')
result3 = re.match(pattern, 'helo CQC!')
result4 = re.match(pattern, 'hello CQC!')

# If match 1 succeeded
if result1:
    # Use the Match object to get the group information
    print(result1.group())
else:
    print('Match 1 failed!')

# If match 2 succeeded
if result2:
    print(result2.group())
else:
    print('Match 2 failed!')

# If match 3 succeeded
if result3:
    print(result3.group())
else:
    print('Match 3 failed!')

# If match 4 succeeded
if result4:
    print(result4.group())
else:
    print('Match 4 failed!')

# Output

hello
hello
Match 3 failed!
hello

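Match analysis

1. In the first match, the pattern is 'hello' and the target string is also hello. They match completely from start to end, so the match succeeds.

2. In the second match, string is helloo CQC!. Matching from the start of string, the pattern is fully matched; the pattern ends, matching stops, the trailing o CQC! is not examined, and a successful match is returned.

3. In the third match, string is helo CQC!. Matching from the start of string, the match cannot be completed at 'o' (the pattern expects a second 'l' there), so matching stops and None is returned.

4. The fourth match works the same way as the second; the space makes no difference.

We also printed result.group() at the end. What does that mean? Let us look at the attributes and methods of the match object.
A Match object is the result of one match and holds a lot of information about it. This information can be read through the attributes and methods that Match provides.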

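Attributes:
1. string: the text used for matching.
2. re: the Pattern object used for matching.
3. pos: the index in the text where the regular expression started searching. It equals the parameter of the same name in Pattern.match() and Pattern.search().
4. endpos: the index in the text where the regular expression stopped searching. It equals the parameter of the same name in Pattern.match() and Pattern.search().
5. lastindex: the number of the last captured group. None if no group was captured.
6. lastgroup: the name of the last captured group. None if that group has no name or no group was captured.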

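Methods:
1. group([group1, …]):
Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a group number or a group name; number 0 stands for the whole matched substring; with no argument, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns the last captured substring.
2. groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]):
Returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured; unnamed groups are not included. default has the same meaning as above.
4. start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
5. end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
6. span([group]):
Returns (start(group), end(group)).
7. expand(template):
Substitutes the matched groups into template and returns the result. template can refer to groups with \id, \g<id>, or \g<name>, but \0 cannot be used for the whole match (\g<0> can). \id and \g<id> are equivalent; however, \10 is read as group 10, so to write \1 followed by the character '0' you must use \g<1>0.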
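The group-reference forms accepted by expand can be sketched like this. The pattern, the group name second, and the sample text are made up for illustration.

```python
import re

m = re.match(r'(\w+) (?P<second>\w+)', 'hello world')

# \id, \g<id> and \g<name> all refer to captured groups.
swapped = m.expand(r'\2 \1')
swapped_named = m.expand(r'\g<second> \g<1>')
print(swapped)        # world hello
print(swapped_named)  # world hello

# \g<1>0 means "group 1 followed by the character 0".
group_then_zero = m.expand(r'\g<1>0')
print(group_then_zero)  # hello0

# \g<0> stands for the whole match.
whole = m.expand(r'[\g<0>]')
print(whole)  # [hello world]

# \10 is read as group 10, which does not exist in this pattern.
try:
    m.expand(r'\10')
    ten_is_group_ten = False
except re.error:
    ten_is_group_ten = True
print(ten_is_group_ten)  # True
```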

Let us get a feel for these with an example.

# -*- coding: utf-8 -*-
# A simple match example

import re
# Match the following: word + space + word + any characters
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group():", m.group())
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1,2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!

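(2) re.search(pattern, string[, flags])

search is very similar to match. The difference is that match() only checks whether the regular expression matches at the start of string, while search() scans the whole string for a match. match() returns a result only if the match succeeds at position 0; otherwise it returns None. The object returned by search has the same methods and attributes as the one returned by match(). Here is an example.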

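# Import the re module
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')
# Use search() to look for a matching substring; None is returned if there is none
# In this example match() would not succeed
match = re.search(pattern, 'hello world!')
if match:
    # Use the Match object to get the group information
    print(match.group())
### output ###
# world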
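To see the difference between the two functions side by side, here is a minimal sketch using the same 'hello world!' text; the variable names are made up.

```python
import re

text = 'hello world!'

# match() only tries position 0, so 'world' is not found.
at_start = re.match(r'world', text)
print(at_start)  # None

# search() scans the whole string and finds it at index 6.
anywhere = re.search(r'world', text)
print(anywhere.span())  # (6, 11)

# When the pattern does occur at position 0, both agree.
same_span = re.match(r'hello', text).span() == re.search(r'hello', text).span()
print(same_span)  # True
```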

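(3) re.split(pattern, string[, maxsplit])

Splits string at every substring that matches and returns the pieces as a list. maxsplit sets the maximum number of splits; if it is omitted, the string is split at every match. The example below shows this.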

import re

pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']
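The effect of maxsplit, and one way to drop the trailing empty string, can be sketched as follows. Filtering with a list comprehension is just one option and is not something the original prescribes.

```python
import re

text = 'one1two2three3four4'

# Split at most twice; the remainder is returned unsplit as the last item.
limited = re.split(r'\d+', text, maxsplit=2)
print(limited)  # ['one', 'two', 'three3four4']

# The text ends with a match, so a full split ends with ''.
# Dropping empty pieces is a common follow-up step.
pieces = [p for p in re.split(r'\d+', text) if p]
print(pieces)  # ['one', 'two', 'three', 'four']
```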

(4) re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list. Here is an example.

import re

pattern = re.compile(r'\d+')
print(re.findall(pattern, 'one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']

(5) re.finditer(pattern, string[, flags])

Searches string and returns an iterator that yields each match result (a Match object) in order. Here is an example.

import re

pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    print(m.group(), end=' ')

### output ###
# 1 2 3 4

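(6) re.sub(pattern, repl, string[, count])

Replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, it can refer to groups with \id, \g<id>, or \g<name>, but \0 cannot be used for the whole match (\g<0> can).
When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded).
count sets the maximum number of replacements; if it is omitted, every match is replaced.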

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

# Swap the two words of each match
print(re.sub(pattern, r'\2 \1', s))

def func(m):
    # Capitalize both words of the match
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.sub(pattern, func, s))

### output ###
# say i, world hello!
# I Say, Hello World!
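The count parameter and the named-group form of the reference can be sketched as follows. The group names first and second are made up for this illustration; the sample string is the one used in the example above.

```python
import re

s = 'i say, hello world!'
named = re.compile(r'(?P<first>\w+) (?P<second>\w+)')

# \g<name> refers to a named group in the replacement string.
swapped_all = re.sub(named, r'\g<second> \g<first>', s)
print(swapped_all)  # say i, world hello!

# count=1 replaces only the first match and leaves the rest untouched.
swapped_once = re.sub(named, r'\2 \1', s, count=1)
print(swapped_once)  # say i, hello world!
```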

(7) re.subn(pattern, repl, string[, count])

Returns (sub(repl, string[, count]), number of replacements).

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.subn(pattern, r'\2 \1', s))

def func(m):
    # Capitalize both words of the match
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.subn(pattern, func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

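5. Another way to use the Python re module

Above we introduced 7 utility functions such as match and search, always called as re.match, re.search, and so on. There is another way to call them: through the pattern object, as pattern.match, pattern.search, and so on. Called this way, the pattern no longer needs to be passed as the first argument. Use whichever style you prefer.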
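The two calling styles can be compared with a short sketch. The sample text and variable names are made up; the optional pos argument of Pattern.search is the one mentioned in the description of the Match attributes.

```python
import re

pattern = re.compile(r'\d+')
text = 'one1two2three3four4'

# Module-level function: the pattern is the first argument.
via_module = re.findall(pattern, text)
# Method on the compiled pattern: no pattern argument needed.
via_method = pattern.findall(text)
print(via_module == via_method)  # True

dashed = pattern.sub('-', text)
print(dashed)  # one-two-three-four-

# The method form also accepts a start position (pos).
from_pos_4 = pattern.search(text, 4).group()
print(from_pos_4)  # 2

# Flags given at compile time apply to every later call.
hello = re.compile(r'hello', re.I)
ignores_case = hello.match('HELLO world') is not None
print(ignores_case)  # True
```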
