String Operations

Date: 2021-08-13


1. Splitting a string on multiple delimiters

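The split() method of string objects only handles very simple cases: it does not allow multiple delimiters, and it does not cope with a variable amount of whitespace around a delimiter. When you need more flexible splitting, use re.split() instead: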

>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
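
The difference between the two approaches can be sketched as follows. The variable names are only illustrative; the third call uses a capture group, which makes re.split() keep the matched delimiters in the result.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# str.split() only understands one literal separator at a time.
by_comma = line.split(',')

# re.split() takes a pattern: one of ; , or whitespace, then optional whitespace.
fields = re.split(r'[;,\s]\s*', line)

# With a capture group, the matched delimiters are kept in the result.
with_delims = re.split(r'(;|,|\s)\s*', line)

print(by_comma)
print(fields)
print(with_delims)
```

Keeping the delimiters is mainly useful when the string has to be reassembled later with the same separators.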

2. Matching the start or end of a string

A simple way to check the start or end of a string is to use the str.startswith() or str.endswith() method. For example:

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True

To check against several possible matches, put all of them into a tuple and pass it to startswith() or endswith():

[name for name in filenames if name.endswith(('.c', '.h'))]
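
A minimal sketch of the tuple form, using a made-up list of file names. Note that the argument really has to be a tuple: a list of choices must be converted with tuple() first, otherwise a TypeError is raised.

```python
# Hypothetical directory listing, used only for illustration.
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# Keep the names that end with any of the given suffixes.
c_sources = [name for name in filenames if name.endswith(('.c', '.h'))]

# Check whether at least one Python file is present.
has_python = any(name.endswith('.py') for name in filenames)

# A list of prefixes must be converted to a tuple before use.
choices = ['http:', 'ftp:']
url = 'http://www.python.org'
is_remote = url.startswith(tuple(choices))

print(c_sources, has_python, is_remote)
```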

3. Matching text with the wildcards commonly used in Unix shells (such as *.py, Dat[0-9]*.csv, and so on)

The fnmatch module provides two functions, fnmatch() and fnmatchcase(), that can be used for this kind of matching. Usage:

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
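
The text imports fnmatchcase() but does not show it. As a rough sketch: fnmatch() follows the case-sensitivity rules of the underlying operating system, while fnmatchcase() always compares case exactly as written. The file names below are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only for illustration.
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# Filter a list with a shell-style pattern.
csv_files = [name for name in names if fnmatchcase(name, 'Dat[0-9]*.csv')]

# fnmatchcase() never ignores case, whatever the platform.
same_case = fnmatchcase('foo.txt', '*.txt')
other_case = fnmatchcase('foo.txt', '*.TXT')

print(csv_files, same_case, other_case)
```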

4. Regular expressions

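match() always tries to match at the start of the string. If you want to find every occurrence of a pattern anywhere in the string, use the findall() method instead. For example: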

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']
>>>

When defining a regular expression, it is common to use parentheses to introduce capture groups. For example:

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

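The findall() method searches the text and returns all matches as a list. When the pattern contains capture groups, each match is returned as a tuple of the groups. If you want to get the matches one at a time by iteration, use the finditer() method instead.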
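
A short sketch of how match(), findall(), and finditer() behave with the grouped date pattern above. The list name iso_dates is just a name picked for this example.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() only looks at the start of the string, so nothing is found here.
at_start = datepat.match(text)

# With capture groups, findall() returns one tuple of groups per match.
tuples = datepat.findall(text)

# finditer() yields match objects one at a time.
iso_dates = []
for m in datepat.finditer(text):
    month, day, year = m.groups()
    iso_dates.append('{}-{}-{}'.format(year, month, day))

print(at_start, tuples, iso_dates)
```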

5. Searching and replacing text

For simple literal patterns, use the str.replace() method. For more complicated patterns, use the sub() function in the re module.

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'

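The first argument to sub() is the pattern to match and the second argument is the replacement pattern. A backslash followed by a digit, such as \3, refers to the capture group with that number in the pattern. If you also want to know how many substitutions were made in addition to getting the result, use re.subn() instead. For example: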

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2
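
The literal case and the pattern case can be put side by side as below. The named groups (month, day, year) are an addition made for this sketch; they are referenced in the replacement with \g<name> instead of a group number.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Literal replacement: no regular expression needed.
literal = text.replace('PyCon', 'The conference')

# Pattern replacement with named groups instead of numbered ones.
named_pat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
rewritten, count = named_pat.subn(r'\g<year>-\g<month>-\g<day>', text)

print(literal)
print(rewritten, count)
```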

6. Case-insensitive search and replace

To ignore case in text operations, use the re module and pass the re.IGNORECASE flag to those operations. For example:

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'

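The last example reveals a limitation: the replacement text does not automatically follow the case of the matched text. To fix this, you may need a helper function like the following: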

def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

Here is how to use this function:

>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'

matchcase('snake') returns a callback function (whose argument must be a match object). Besides a replacement string, sub() also accepts such a callback function.

7. Matching across multiple lines

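By default the dot (.) in a regular expression does not match a newline, so a pattern such as (.*?) fails on text that spans several lines. One fix is to replace (.*?) with ((?:.|\n)*?). In this pattern, (?:.|\n) specifies a non-capturing group (that is, a group used only for matching, which is not captured separately or numbered). Alternatively, the re.compile() function accepts a flag, re.DOTALL, which is very useful here. It makes the dot (.) in a regular expression match any character, including a newline. For example: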

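>>> text2 = '/* this is a\n multiline comment */'  # assumed sample input, consistent with the output shown
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)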
[' this is a\n multiline comment ']
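
To make the effect visible, the sketch below runs three variants of the comment pattern against a single-line and a multi-line input. The sample strings are assumptions chosen for the example.

```python
import re

text1 = '/* this is a comment */'
text2 = '/* this is a\n multiline comment */'

plain = re.compile(r'/\*(.*?)\*/')               # dot does not match newline
noncapture = re.compile(r'/\*((?:.|\n)*?)\*/')   # explicit "dot or newline"
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL)   # dot matches everything

single_line = plain.findall(text1)
plain_multi = plain.findall(text2)       # empty: the newline blocks the match
noncapture_multi = noncapture.findall(text2)
dotall_multi = dotall.findall(text2)

print(single_line, plain_multi, noncapture_multi, dotall_multi)
```

For simple patterns re.DOTALL is the shorter option; the non-capturing group keeps the behavior local to one part of the pattern.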

8. Stripping unwanted characters from a string

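The strip() method removes characters from the beginning or end of a string. lstrip() and rstrip() perform the removal from the left side and the right side only. By default these methods remove whitespace, but you can also specify other characters.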
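
A small sketch of the three methods, with made-up sample strings. Stripping only touches the two ends of the string; whitespace in the middle is left alone.

```python
s = ' hello world \n'

both = s.strip()     # whitespace removed from both ends
left = s.lstrip()    # only the left side
right = s.rstrip()   # only the right side

# Explicit characters to strip instead of whitespace.
t = '-----hello====='
no_dashes = t.lstrip('-')
bare = t.strip('-=')

# Inner whitespace is not affected.
inner = ' hello     world '.strip()

print(repr(both), repr(left), repr(right), no_dashes, bare, repr(inner))
```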

9. Sanitizing and cleaning up text

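The str.translate() method converts the characters of a string according to a translation table. In Python 2 the table was a 256-character string and the characters to filter out were passed as a second argument, with the syntax str.translate(table[, deletechars]). In Python 3, translate() takes only the table; build it with str.maketrans(), whose optional third argument lists the characters to delete:

intab = "aeiou"
outtab = "12345"
# Map each vowel to a digit and delete the characters 'x' and 'm'.
trantab = str.maketrans(intab, outtab, 'xm')
s = "this is string example....wow!!!"
print(s.translate(trantab))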

Output: th3s 3s str3ng 21pl2....w4w!!!
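
In Python 3 the translation table can also be an ordinary dictionary keyed by code point. The sketch below is one possible clean-up table, not something taken from the original: it turns tabs and form feeds into spaces and drops carriage returns (mapping a code point to None deletes it).

```python
# Sample input with stray control characters (made up for illustration).
s = 'python\fis\tawesome\r\n'

remap = {
    ord('\t'): ' ',    # tab -> space
    ord('\f'): ' ',    # form feed -> space
    ord('\r'): None,   # carriage return -> deleted
}

cleaned = s.translate(remap)
print(repr(cleaned))
```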

10. Aligning text strings

For basic alignment of strings, use the ljust(), rjust(), and center() methods of strings. For example:

>>> text = 'Hello World'
>>> text.ljust(20)
'Hello World         '
>>> text.rjust(20)
'         Hello World'
>>> text.center(20)
'    Hello World     '

All of these methods accept an optional fill character. For example:

>>> text.rjust(20,'=')
'=========Hello World'

The format() function can also be used to align strings easily. All you need to do is use the <, >, or ^ character followed by the desired width. For example:

>>> format(text, '>20')
'         Hello World'
>>> format(text, '<20')
'Hello World         '
>>> format(text, '^20')
'    Hello World     '

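If you want a fill character other than a space, write it immediately before the alignment character:

>>> format(text, '=>20s')
'=========Hello World'

These format codes can also be used in the format() method when formatting multiple values. For example: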

>>> '{:>10s} {:>10s}'.format('Hello', 'World')
'     Hello      World'
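
The same format codes work for values other than strings, which makes them handy for lining up columns. A minimal sketch with invented data, left-aligning a name in 10 columns and right-aligning a price in 8 columns with two decimals:

```python
# Hypothetical rows, used only for illustration.
rows = [('apple', 1.5), ('kiwi', 12.25)]

# Name left-aligned in 10 columns, price right-aligned in 8 with 2 decimals.
lines = ['{:<10s}{:>8.2f}'.format(name, price) for name, price in rows]

for line in lines:
    print(line)
```

Because the width is part of the format code, every line has the same length regardless of the values.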

11. Interpolating variables in strings

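Use the format() method, or combine format_map() with vars(). vars() returns a dictionary of an object's attributes and their values. Called without an argument, it returns the names and values at the current call site, which is equivalent to locals(). Called with an argument, it returns only the attributes and values of that object.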

>>> s = '{name} has {n} messages.'
>>> s.format(name='Guido', n=37)
'Guido has 37 messages.'
>>> name = 'Guido'
>>> n = 37
>>> s.format_map(vars())
'Guido has 37 messages.'

Another interesting feature of vars() is that it also works with object instances. For example:

>>> class Info:
...     def __init__(self, name, n):
...         self.name = name
...         self.n = n
...
>>> a = Info('Guido',37)
>>> s.format_map(vars(a))
'Guido has 37 messages.'
>>>

One downside of format() and format_map() is that they do not handle missing variables gracefully. For example:

>>> s.format(name='Guido')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'n'

One way to avoid this error is to define a separate dictionary class with a __missing__() method, like this:

class safesub(dict):
    """Prevent errors when a key cannot be found"""
    def __missing__(self, key):
        return '{' + key + '}'

Now you can use this class to wrap the inputs before passing them to format_map():

>>> del n # Make sure n is undefined
>>> s.format_map(safesub(vars()))
'Guido has {n} messages.'

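If you find yourself performing these steps frequently in your code, you can wrap the variable substitution step in a small utility function, like this: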

import sys

def sub(text):
    return text.format_map(safesub(sys._getframe(1).f_locals))

Now you can write things like this:

>>> name = 'Guido'
>>> n = 37
>>> print(sub('Hello {name}'))
Hello Guido
>>> print(sub('You have {n} messages.'))
You have 37 messages.
>>> print(sub('Your favorite color is {color}'))
Your favorite color is {color}

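The sub() function uses sys._getframe(1) to return the caller's stack frame. From that frame, the f_locals attribute is accessed to get the local variables. It is also worth noting that, in CPython versions before 3.13, f_locals is a dictionary that is a copy of the local variables of the calling function. Although you can modify the contents of f_locals, in those versions the modification has no effect on later variable access.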
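
The frame lookup is easier to see when sub() is called from inside another function. The sketch below repeats safesub and sub() from above so that it runs on its own; greet() and its local variables are invented for the example. sys._getframe() is a CPython implementation detail, so this should be read as an illustration rather than a portable technique.

```python
import sys

class safesub(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""
    def __missing__(self, key):
        return '{' + key + '}'

def sub(text):
    # Frame 1 is the caller of sub(); its locals supply the values.
    return text.format_map(safesub(sys._getframe(1).f_locals))

def greet():
    # Hypothetical caller: name and n are locals of greet(), not globals.
    name = 'Guido'
    n = 37
    return sub('{name} has {n} messages, favorite color {color}.')

message = greet()
print(message)
```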

12. Reformatting text to a fixed number of columns

Use the textwrap module to reformat text for output. For example, suppose you have the following long string:

s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."

The following shows several ways of reformatting the string with textwrap:

>>> import textwrap
>>> print(textwrap.fill(s, 70))
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.
>>> print(textwrap.fill(s, 40))
Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, initial_indent='    '))
    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.
>>> print(textwrap.fill(s, 40, subsequent_indent='    '))
Look into my eyes, look into my eyes,
    the eyes, the eyes, the eyes, not
    around the eyes, don't look around
    the eyes, look into my eyes, you're
    under.

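The textwrap module is very useful for printing text, especially when you want the output to fit the size of the terminal. You can get the terminal size with os.get_terminal_size(). For example: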

>>> import os
>>> os.get_terminal_size().columns
80
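
Putting the two pieces together might look like the sketch below. It uses shutil.get_terminal_size() rather than os.get_terminal_size(), because the shutil version accepts a fallback size and so also works when output is not attached to a terminal. The cap of 40 columns is an arbitrary choice for the example.

```python
import shutil
import textwrap

s = ("Look into my eyes, look into my eyes, the eyes, the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Falls back to 80x24 when no terminal size can be determined.
columns = shutil.get_terminal_size(fallback=(80, 24)).columns

# Wrap to the terminal width, but never wider than 40 columns.
width = min(columns, 40)
wrapped = textwrap.fill(s, width=width)
print(wrapped)
```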

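13. Handling HTML and XML in text

You want to replace HTML or XML entities such as &entity; or &#code; with their corresponding text. Alternatively, you need to escape certain characters in the text (such as <, >, or &). If you want to replace the < or > characters in a text string, the html.escape() function makes this easy.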

>>> s = 'Elements are written as "<tag>text</tag>".'
>>> import html
>>> print(html.escape(s))
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.
>>> print(html.escape(s, quote=False))  # do not escape quotes
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;".
>>> s = 'Spicy &quot;Jalape&#241;o&quot.'
>>> html.unescape(s)  # HTMLParser.unescape() was removed in Python 3.9
'Spicy "Jalapeño".'
>>> t = 'The prompt is &gt;&gt;&gt;'
>>> from xml.sax.saxutils import unescape
>>> unescape(t)
'The prompt is >>>'
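
As a rough sketch of a round trip: html.escape() and html.unescape() undo each other for the characters discussed here. The last step is an addition not covered in the text above: encoding with errors='xmlcharrefreplace' turns non-ASCII characters into numeric character references. The sample string is made up.

```python
import html

s = 'Spicy "Jalapeño" <hot> & tasty'

# Escape the markup-significant characters (quotes included by default).
escaped = html.escape(s)

# Unescaping restores the original text.
restored = html.unescape(escaped)

# Produce pure ASCII by replacing non-ASCII characters with &#NNN; references.
ascii_bytes = escaped.encode('ascii', errors='xmlcharrefreplace')

print(escaped)
print(restored == s)
print(ascii_bytes)
```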
