String Processing in Python

Date: 2019-11-12


Splitting strings

Splitting on a single delimiter:

>>> a = '1,2,3,4'
>>> a.split(',')
['1', '2', '3', '4']

Splitting on multiple delimiters:

>>> line = 'asdf fjdk; afed, fjek,asdf, foo' 
>>> import re
>>> re.split(r'[;,\s]\s*', line)  # match the delimiters with a regular expression
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

To keep the delimiters in the result list, use a capture group:

>>> fields = re.split(r'(;|,|\s)\s*', line)
>>> fields
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

If you do not want to keep the delimiters but still need to group parts of the pattern, use a non-capturing group:

>>> re.split(r'(?:,|;|\s)\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
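
When the delimiters are captured, the result alternates between fields and delimiters, so the two can be separated by slicing and stitched back together. The following is a minimal sketch that reuses the line string from above; the names values, delimiters, and rejoined are illustrative only.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# The capture group keeps each delimiter character in the output list.
fields = re.split(r'(;|,|\s)\s*', line)

values = fields[::2]               # even positions: the actual fields
delimiters = fields[1::2] + ['']   # odd positions: delimiters, padded to equal length

# Re-join each value with the delimiter that followed it.
rejoined = ''.join(v + d for v, d in zip(values, delimiters))

print(values)
print(rejoined)
```

The trailing whitespace after each delimiter is not captured, so the rejoined string keeps the delimiter characters but drops the extra spaces.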

Matching text at the start or end of a string

To check whether a string starts or ends with given text, use startswith() and endswith():

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True

If there are several possible matches to check, pass a tuple of candidates:

>>> import os
>>> filenames = os.listdir('.')
>>> filenames
['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

>>> [name for name in filenames if name.endswith(('.c', '.h'))]
['foo.c', 'spam.c', 'spam.h']
>>> any(name.endswith('.py') for name in filenames)
True

The same checks can also be written with slices or with re:

>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True

>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<_sre.SRE_Match object at 0x101253098>
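
One detail that is easy to trip over: startswith() and endswith() accept a single string or a tuple of strings, but not a list. If the candidates are held in a list, it has to be converted first. A short sketch, where the choices list is a made-up example:

```python
choices = ['http:', 'ftp:']   # hypothetical list, e.g. assembled at runtime
url = 'http://www.python.org'

error = None
try:
    url.startswith(choices)   # a list is rejected with TypeError
except TypeError as exc:
    error = str(exc)

# Converting the list to a tuple makes the call valid.
matched = url.startswith(tuple(choices))

print(error)
print(matched)
```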

Matching strings with shell wildcard patterns

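*	matches any number of characters, including zero
?	matches exactly one character
[char]	matches any one character listed in the brackets
[!char]	matches any one character not listed in the brackets
[:alnum:]	matches any one letter or digit
[:alpha:]	matches any one letter
[:digit:]	matches any one digit
[:lower:]	matches any one lowercase letter
[:upper:]	matches any one uppercase letter

Note that Python's fnmatch module supports only the first four forms. The [:class:] forms are POSIX shell character classes and are not recognized by fnmatch.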

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']

fnmatch() follows the case-sensitivity rules of the underlying operating system, which differ between systems:

>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True

If this difference matters, use fnmatchcase() instead. It matches exactly according to the case of the pattern you supply. For example:

>>> fnmatchcase('foo.txt', '*.TXT')
False

>>> fnmatchcase('foo.txt', '*.txt')
True

This function is also useful for strings that are not filenames:

addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

>>> from fnmatch import fnmatchcase
>>> [addr for addr in addresses if fnmatchcase(addr, '* ST')]
['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
>>> [addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
['5412 N CLARK ST']

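Summary: fnmatch sits between plain string methods and regular expressions in power. If simple wildcards are enough for the data at hand, fnmatch or fnmatchcase is a good choice. For matching actual filenames on disk, the glob module is the better tool.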
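
The fnmatch module has two more helpers that fit this middle ground. The sketch below, which reuses the names list from the earlier example, shows fnmatch.filter() applying one pattern to a whole list and fnmatch.translate() turning a wildcard pattern into a regular expression. The exact regex string produced by translate() varies between Python versions, so it is only compiled here, not printed.

```python
import fnmatch
import re

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatch.filter() applies one pattern to every name in the list.
csv_names = fnmatch.filter(names, 'Dat*.csv')

# fnmatch.translate() converts a wildcard pattern into regex source,
# which can be compiled once and reused.
pattern = re.compile(fnmatch.translate('Dat[0-9]*.csv'))
matches = [name for name in names if pattern.match(name)]

print(csv_names)
print(matches)
```

Like fnmatch(), fnmatch.filter() follows the operating system's case rules, whereas the compiled regex is always case-sensitive unless a flag is added.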

Matching and searching strings

For simple matching, string methods are enough, for example str.find(), str.startswith(), and str.endswith().

More complex matching calls for regular expressions and the re module:

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>

re.match() always tries to match at the beginning of the string. It returns a Match object on success and None otherwise.

To reuse the same regular expression, compile the pattern string into a pattern object:

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if datepat.match(text2):
...     print('yes')
... else:
...     print('no')
...
no

To match somewhere other than the beginning of the string, use re.search() or re.findall(). re.search() returns a Match object for the first location that matches, or None if nothing matches.

re.findall() returns all matching substrings in a list.

When the pattern contains more than one capture group, re.findall() returns a list of tuples instead, each tuple holding the groups of one match.

>>> # Redefine the pattern with a capture group around each field
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datepat.match('11/27/2012')
>>> m
<_sre.SRE_Match object at 0x1005d2750>
>>> # Extract the contents of each group
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups()
('11', '27', '2012')
>>> month, day, year = m.groups()
>>>
>>> # Find all matches (notice splitting into tuples)
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> text
'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>> for month, day, year in datepat.findall(text):
...     print('{}-{}-{}'.format(year, month, day))
...
2012-11-27
2013-3-13

findall() returns its results as a list. To process the matches one at a time instead, use finditer():

>>> for m in datepat.finditer(text):
...     print(m.groups())
...
('11', '27', '2012')
('3', '13', '2013')
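
Because match() only anchors at the start of the string, it accepts input with trailing text after the date. When the whole string must match, fullmatch() (available since Python 3.4) is one option; ending the pattern with $ is another. A small sketch, assuming a date pattern with one capture group per field:

```python
import re

# Assumed pattern: one capture group per date field.
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# match() only anchors at the start, so trailing text is ignored.
m = datepat.match('11/27/2012abcdef')
prefix = m.group()

# fullmatch() requires the entire string to match the pattern.
rejected = datepat.fullmatch('11/27/2012abcdef')
accepted = datepat.fullmatch('11/27/2012')

print(prefix)
print(rejected)
print(accepted.groups())
```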

Searching and replacing text

For simple search and replace, use str.replace():

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'

For more complex replacements, use re.sub():

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'

Backslashed digits such as \3 refer to the capture groups in the pattern.

For even more complex replacements, pass a callback function:

>>> from calendar import month_abbr
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(1))]
...     return '{} {} {}'.format(m.group(2), mon_name, m.group(3))
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'

If you also want to know how many substitutions were made, use re.subn() instead:

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2
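
Numbered references such as \3 become hard to read as patterns grow. Named groups are one alternative: the pattern labels each group with (?P<name>...) and the replacement refers to it with \g<name>. This is a sketch of the same date rewrite; the group names month, day, and year are chosen here for illustration.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each group gets a name instead of relying on its position.
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# \g<name> in the replacement refers to the named group.
newtext = datepat.sub(r'\g<year>-\g<month>-\g<day>', text)

print(newtext)
```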

To ignore case while matching, pass the re.IGNORECASE flag:

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'

This example has one small flaw: the replacement text does not follow the case of the matched text. It can be fixed like this:

def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'

A regular expression may not match what you intend. For example:

>>> str_pat = re.compile(r'\"(.*)\"')
>>> text1 = 'Computer says "no."'
>>> str_pat.findall(text1)
['no.']
>>> text2 = 'Computer says "no." Phone says "yes."'
>>> str_pat.findall(text2)
['no." Phone says "yes.']

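The result we want is ['no.', 'yes.'], but that is clearly not what we get. The * operator is greedy and matches as much text as possible. To match only the text inside each pair of quotes: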

>>> str_pat = re.compile(r'\"(.*?)\"')
>>> str_pat.findall(text2)
['no.', 'yes.']

Adding ? after .* makes the match non-greedy, so it consumes as little text as possible.
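
A common alternative to the non-greedy quantifier is a negated character class, which simply forbids the closing delimiter inside the match. The sketch below compares the two on the same input; the variable names lazy and negated are illustrative.

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# Non-greedy: take as few characters as possible before the next quote.
lazy = re.compile(r'"(.*?)"')

# Negated class: take any run of characters that are not a quote.
negated = re.compile(r'"([^"]*)"')

lazy_result = lazy.findall(text2)
negated_result = negated.findall(text2)

print(lazy_result)
print(negated_result)
```

Both give the same result here. One difference to keep in mind is that [^"] also matches newlines, while . does not unless re.DOTALL is set.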

Multiline matching with re

When . is used to match any character, it does not match a newline (\n). For example:

>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
...  multiline comment */
... '''
>>>
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]

One fix is to change the pattern so that it also matches newlines:

>>> comment = re.compile(r'/\*((?:.|\n)*?)\*/')
>>> comment.findall(text2)
[' this is a\n multiline comment ']

The ?: marks the group as non-capturing: it is used only for matching and cannot be captured separately or referenced by number.

Alternatively, pass a flag so that . also matches newlines:

>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\n multiline comment ']

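In simple cases re.DOTALL works well, but with a complicated pattern it can cause problems. It is therefore usually better to write the pattern so that it works without extra flags; the flag is just an additional option.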
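
The same behaviour can also be requested from inside the pattern with the inline flag (?s), which is equivalent to passing re.DOTALL. The following sketch runs both forms against a made-up snippet containing a multiline and a single-line comment:

```python
import re

# Hypothetical input with one multiline and one single-line comment.
code = '''/* first
comment */ int x; /* second */'''

# Flag passed as an argument.
with_flag = re.compile(r'/\*(.*?)\*/', re.DOTALL)

# Same flag written inline at the start of the pattern.
inline = re.compile(r'(?s)/\*(.*?)\*/')

flag_result = with_flag.findall(code)
inline_result = inline.findall(code)

print(flag_result)
print(inline_result)
```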

Stripping unwanted characters from strings

Use str.strip(), str.lstrip(), and str.rstrip():

>>> # Whitespace stripping
>>> s = ' hello world \n'
>>> s.strip()
'hello world'
>>> s.lstrip()
'hello world \n'
>>> s.rstrip()
' hello world'
>>>
>>> # Character stripping
>>> t = '-----hello====='
>>> t.lstrip('-')
'hello====='
>>> t.strip('-=')
'hello'

These methods do not remove characters from the middle of a string. For that, use str.replace() instead.
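
Two patterns come up often in practice: stripping every line of an input stream, and collapsing runs of inner whitespace that strip() leaves alone. A brief sketch with made-up input lines:

```python
import re

# Hypothetical input, as might be read from a file.
raw_lines = ['  alpha  \n', '\tbeta\n', 'gamma   delta\n']

# The generator expression strips each line lazily as it is consumed.
stripped = (line.strip() for line in raw_lines)
cleaned = list(stripped)

# strip() leaves inner whitespace alone; re.sub() can collapse it.
collapsed = re.sub(r'\s+', ' ', 'gamma   delta')

print(cleaned)
print(collapsed)
```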

Normalizing Unicode text

In Unicode, some characters have more than one valid representation:

>>> s1 = 'Spicy Jalape\u00f1o'
>>> s2 = 'Spicy Jalapen\u0303o'
>>> s1
'Spicy Jalapeño'
>>> s2
'Spicy Jalapeño'
>>> s1 == s2
False
>>> len(s1)
14
>>> len(s2)
15

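The text 'Spicy Jalapeño' is represented in two forms here. The first uses the precomposed character 'ñ' (U+00F1). The second uses the letter n followed by a combining tilde (U+0303).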

This is a problem when comparing strings. The solution is to normalize the text with the unicodedata module:

>>> import unicodedata
>>> t1 = unicodedata.normalize('NFC', s1)
>>> t2 = unicodedata.normalize('NFC', s2)
>>> t1 == t2
True
>>> print(ascii(t1))
'Spicy Jalape\xf1o'
>>> t3 = unicodedata.normalize('NFD', s1)
>>> t4 = unicodedata.normalize('NFD', s2)
>>> t3 == t4
True
>>> print(ascii(t3))
'Spicy Jalapen\u0303o'

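'NFC' and 'NFD' are normalization forms. 'NFC' means characters should be fully composed where possible; 'NFD' means characters should be fully decomposed into base characters plus combining characters.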

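Python also supports the compatibility normalization forms NFKC and NFKD, which additionally replace certain compatibility characters with their plain equivalents: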

>>> s = '\ufb01' # A single character
>>> s
'ﬁ'
>>> unicodedata.normalize('NFD', s)
'ﬁ'
>>> # Notice how the combined letters are broken apart here
>>> unicodedata.normalize('NFKD', s)
'fi'
>>> unicodedata.normalize('NFKC', s)
'fi'
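
Normalization is also a useful first step for cleaning text. For instance, decomposing with NFD and then dropping the combining characters removes diacritical marks. This is a minimal sketch using unicodedata.combining(); it is not suitable for every language, since some scripts rely on combining marks.

```python
import unicodedata

s1 = 'Spicy Jalape\u00f1o'

# Decompose so that the tilde becomes a separate combining character.
decomposed = unicodedata.normalize('NFD', s1)

# combining() returns a nonzero class for combining characters; drop those.
plain = ''.join(c for c in decomposed if not unicodedata.combining(c))

print(plain)
```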
