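Approach: first build a dictionary whose keys are the elements of the list and whose values are their occurrence counts, then sort the dictionary's items by count.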
>>> from random import randint
>>> data = [randint(1,21) for _ in xrange(30)]
>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
>>> dicData = dict.fromkeys(data,0)
>>> dicData
{1: 0, 2: 0, 3: 0, 5: 0, 6: 0, 8: 0, 9: 0, 11: 0, 12: 0, 13: 0, 14: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}
>>> for x in data:
	dicData[x] += 1

>>> dicData
{1: 1, 2: 2, 3: 2, 5: 1, 6: 3, 8: 2, 9: 2, 11: 1, 12: 1, 13: 1, 14: 3, 16: 1, 17: 1, 18: 3, 19: 1, 20: 3, 21: 2}
>>> sortDicData = sorted(dicData.iteritems(),key=lambda x:x[1],reverse=True)
>>> sortDicData
[(6, 3), (14, 3), (18, 3), (20, 3), (2, 2), (3, 2), (8, 2), (9, 2), (21, 2), (1, 1), (5, 1), (11, 1), (12, 1), (13, 1), (16, 1), (17, 1), (19, 1)]
>>> newdicData = dict(sortDicData[:4])
>>> newdicData
{18: 3, 20: 3, 14: 3, 6: 3}
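The session above is Python 2.7. As a rough sketch of the same approach in Python 3 syntax (an assumption on my part, not the version used in these notes), `range` replaces `xrange` and `items()` replaces `iteritems()`. The list is hard-coded here so the result is reproducible; the names `counts`, `by_count` and `top4` are illustrative only.

```python
# Count by hand with a plain dict, then sort the (element, count) pairs.
data = [18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21,
        11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]

counts = dict.fromkeys(data, 0)   # every distinct element starts at 0
for x in data:
    counts[x] += 1

# items() takes the place of Python 2's iteritems(); sort by count, highest first.
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
top4 = by_count[:4]               # the four most frequent elements
print(top4)
```

Keeping `top4` as a list of pairs preserves the ranking; converting it back to a dict in Python 2, as in the session above, loses the order.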
Using the same list as in the previous example. Counter is a subclass of the dictionary type dict.
>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
>>> from collections import Counter
>>> dict1 = Counter(data)
>>> dict1
Counter({6: 3, 14: 3, 18: 3, 20: 3, 2: 2, 3: 2, 8: 2, 9: 2, 21: 2, 1: 1, 5: 1, 11: 1, 12: 1, 13: 1, 16: 1, 17: 1, 19: 1})
>>> dict1[6]
3
>>> dict1[20]
3
>>> dict1[2]
2
>>> dict1.most_common(3)
[(6, 3), (14, 3), (18, 3)]
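A small sketch, in Python 3 syntax (an assumption; the session above is Python 2.7), of two Counter behaviours that the hand-written dictionary does not give for free: a missing key reads as 0 instead of raising KeyError, and most_common(n) returns the ranking directly. The variable names are illustrative.

```python
from collections import Counter

data = [18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21,
        11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
counter = Counter(data)

print(counter[6])      # count of an element that occurs in the list
print(counter[100])    # 0: a missing key counts as zero, no KeyError

top3 = counter.most_common(3)   # list of (element, count) pairs
print(top3)
```

Which of the four elements tied at three occurrences appear in `top3` depends on the tie-breaking order, so only the counts should be relied on.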
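Approach: read the article into a string, then use the split function of the regular expression module to separate it into individual words.
>>> from collections import Counter
>>> import re   # regular expression module
# Note: a Word .doc file cannot be read like a plain text file; it needs a module dedicated to reading .doc files
# Open collections.txt, read the whole file and assign it to txt; txt is then one very long string
>>> txt = open(r"C:\視頻\python高效實踐技巧筆記\collections.txt").read()
# Then split with a regular expression: splitting the whole string on runs of non-word characters, re.split('\W+',txt), yields a list of the individual words. Counter() then counts word frequencies in that list, as described above
>>> c3 = Counter(re.split(r'\W+',txt))
# Get the list of the 10 most frequent words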
>>> c3.most_common(10)
[('the', 177), ('a', 126), ('to', 96), ('and', 93), ('is', 73), ('d', 73), ('in', 72), ('for', 69), ('of', 64), ('2', 53)]
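The following is a minimal sketch of the same word count on an in-memory string, in Python 3 syntax (an assumption). The sample text is hypothetical and merely stands in for the contents of collections.txt. It also illustrates one design choice: re.split(r'\W+', ...) produces an empty string when the text begins or ends with a non-word character, so re.findall(r'\w+', ...) is used here to collect the words directly.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the file contents.
txt = "The cat and the hat. The end!"

# Splitting leaves a trailing '' because the text ends with '!'.
pieces = re.split(r'\W+', txt)

# findall(r'\w+') returns only the words; lower() merges 'The' and 'the'.
words = re.findall(r'\w+', txt.lower())
word_counts = Counter(words)
print(word_counts.most_common(1))   # [('the', 3)]
```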
>>> help(dict)
Help on class dict in module __builtin__:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list. For example: dict(one=1, two=2)
 |
 |  Methods defined here:
 |
 |  __cmp__(...)
 |      x.__cmp__(y) <==> cmp(x,y)
 |
 |  __contains__(...)
 |      D.__contains__(k) -> True if D has a key k, else False
 |
 |  __delitem__(...)
 |      x.__delitem__(y) <==> del x[y]
 |
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |
 |  __init__(...)
 |      x.__init__(...) initializes x; see help(type(x)) for signature
 |
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |
 |  __setitem__(...)
 |      x.__setitem__(i, y) <==> x[i]=y
 |
 |  __sizeof__(...)
 |      D.__sizeof__() -> size of D in memory, in bytes
 |
 |  clear(...)
 |      D.clear() -> None. Remove all items from D.
 |
 |  copy(...)
 |      D.copy() -> a shallow copy of D
 |
 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
 |
 |  get(...)
 |      D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.
 |
 |  has_key(...)
 |      D.has_key(k) -> True if D has a key k, else False
 |
 |  items(...)
 |      D.items() -> list of D's (key, value) pairs, as 2-tuples
 |
 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
 |
 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D
 |
 |  itervalues(...)
 |      D.itervalues() -> an iterator over the values of D
 |
 |  keys(...)
 |      D.keys() -> list of D's keys
 |
 |  pop(...)
 |      D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
 |      If key is not found, d is returned if given, otherwise KeyError is raised
 |
 |  popitem(...)
 |      D.popitem() -> (k, v), remove and return some (key, value) pair as a
 |      2-tuple; but raise KeyError if D is empty.
 |
 |  setdefault(...)
 |      D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
 |
 |  update(...)
 |      D.update([E, ]**F) -> None. Update D from dict/iterable E and F.
 |      If E present and has a .keys() method, does: for k in E: D[k] = E[k]
 |      If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v
 |      In either case, this is followed by: for k in F: D[k] = F[k]
 |
 |  values(...)
 |      D.values() -> list of D's values
 |
 |  viewitems(...)
 |      D.viewitems() -> a set-like object providing a view on D's items
 |
 |  viewkeys(...)
 |      D.viewkeys() -> a set-like object providing a view on D's keys
 |
 |  viewvalues(...)
 |      D.viewvalues() -> an object providing a view on D's values
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  __hash__ = None
 |
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T
 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
Builds a dictionary that uses the values of a sequence as its keys.
>>> data = [3,1,56]
>>> data1 = dict.fromkeys(data)
>>> data1
{56: None, 1: None, 3: None}
>>> data2 = dict.fromkeys(data,3)
>>> data2
{56: 3, 1: 3, 3: 3}
>>>
 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
Continuing the example above: this is an iterator over the (key, value) pairs.
>>> data2.iteritems()
<dictionary-itemiterator object at 0x02D812A0>
 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D
Continuing the example above: this is an iterator over the keys. Note that the method has to be called; without the parentheses you only get the bound method itself.
>>> data2.iterkeys
<built-in method iterkeys of dict object at 0x02E3BDB0>
>>> data2.iterkeys()
<dictionary-keyiterator object at 0x02E27F00>
 |  itervalues(...)
 |      D.itervalues() -> an iterator over the values of D
Continuing the example above: this is an iterator over the values.
>>> data2.itervalues()
<dictionary-valueiterator object at 0x02D81810>
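For readers on Python 3 (an assumption; these notes use Python 2.7), iteritems(), iterkeys() and itervalues() no longer exist. A short sketch of the equivalents: items(), keys() and values() return lazy view objects that play the same role. The variable names are illustrative.

```python
data = [3, 1, 56]
d = dict.fromkeys(data, 3)      # same call as in the session above

# In Python 3 these return views, which are iterated lazily
# much like the Python 2 iter*() methods.
pairs = list(d.items())
keys = list(d.keys())
values = list(d.values())

print(pairs)
print(keys, values)
```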
>>> import collections
>>> help(collections)
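This prints the entire official documentation for the module. The official documentation remains the most convenient learning resource.
namedtuple is introduced in "2-2 Naming the elements of a tuple".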
>>> import collections
>>> help(collections.namedtuple)
Help on function namedtuple in module collections:

namedtuple(typename, field_names, verbose=False, rename=False)
    Returns a new subclass of tuple with named fields.

    >>> Point = namedtuple('Point', ['x', 'y'])
    >>> Point.__doc__                   # docstring for the new class
    'Point(x, y)'
    >>> p = Point(11, y=22)             # instantiate with positional args or keywords
    >>> p[0] + p[1]                     # indexable like a plain tuple
    33
    >>> x, y = p                        # unpack like a regular tuple
    >>> x, y
    (11, 22)
    >>> p.x + p.y                       # fields also accessible by name
    33
    >>> d = p._asdict()                 # convert to a dictionary
    >>> d['x']
    11
    >>> Point(**d)                      # convert from a dictionary
    Point(x=11, y=22)
    >>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
    Point(x=100, y=22)
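namedtuple is a function that creates a custom tuple class with a fixed number of elements, whose elements can be referenced by attribute name instead of by index.
This makes it easy to define a data type with namedtuple: it is immutable like a tuple, yet its fields can be referenced by attribute, which is very convenient.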
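A minimal sketch of those two properties, attribute access and immutability, in Python 3 syntax (an assumption). The type name `Student` and its fields are hypothetical, chosen only for illustration.

```python
from collections import namedtuple

# 'Student' and its field names are made up for this sketch.
Student = namedtuple('Student', ['name', 'age', 'email'])
s = Student('Jim', 16, 'jim@example.com')

print(s.name, s[1])           # access by attribute or by index

older = s._replace(age=17)    # returns a new tuple; s itself is unchanged

try:
    s.age = 17                # tuples are immutable, so assignment fails
except AttributeError as exc:
    print('cannot assign:', exc)
```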
>>> import collections
>>> help(collections.Counter)
The printed help text is very long. The entry for most_common() reads:
 |  most_common(self, n=None)
 |      List the n most common elements and their counts from the most
 |      common to the least. If n is None, then list all element counts.
 |
 |      >>> Counter('abcdeabcdabcaba').most_common(3)
 |      [('a', 5), ('b', 4), ('c', 3)]
Official documentation for the re module:
Py2.7:https://docs.python.org/2.7/library/re.html
Py3 :https://docs.python.org/3/library/re.html
>>> help(re)
Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    c:\python27\lib\re.py

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl. It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves. You can
    concatenate ordinary characters, so last matches the string 'last'.

    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        *?,+?,?? Non-greedy versions of the previous three special characters.
        {m,n}    Matches from m to n repetitions of the preceding RE.
        {m,n}?   Non-greedy version of the above.
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
        "|"      A|B, creates an RE that will match either A or B.
        (...)    Matches the RE inside the parentheses.
                 The contents can be retrieved or matched later in the string.
        (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
        (?:...)  Non-grouping version of regular parentheses.
        (?P<name>...) The substring matched by the group is accessible by name.
        (?P=name)     Matches the text matched earlier by the group named name.
        (?#...)  A comment; ignored.
        (?=...)  Matches if ... matches next, but doesn't consume the string.
        (?!...)  Matches if ... doesn't match next.
        (?<=...) Matches if preceded by ... (must be fixed length).
        (?<!...) Matches if not preceded by ... (must be fixed length).
        (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                           the (optional) no pattern otherwise.

    The special sequences consist of "\\" and a character from the list
    below. If the ordinary character is not on the list, then the
    resulting RE will match the second character.
        \number  Matches the contents of the group of the same number.
        \A       Matches only at the start of the string.
        \Z       Matches only at the end of the string.
        \b       Matches the empty string, but only at the start or end of a word.
        \B       Matches the empty string, but not at the start or end of a word.
        \d       Matches any decimal digit; equivalent to the set [0-9].
        \D       Matches any non-digit character; equivalent to the set [^0-9].
        \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
        \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
        \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
                 With LOCALE, it will match the set [0-9_] plus characters defined
                 as letters for the current locale.
        \W       Matches the complement of \w.
        \\       Matches a literal backslash.

    This module exports the following functions:
        match    Match a regular expression pattern to the beginning of a string.
        search   Search a string for the presence of a pattern.
        sub      Substitute occurrences of a pattern found in a string.
        subn     Same as sub, but also return the number of substitutions made.
        split    Split a string by the occurrences of a pattern.
        findall  Find all occurrences of a pattern in a string.
        finditer Return an iterator yielding a match object for each match.
        compile  Compile a pattern into a RegexObject.
        purge    Clear the regular expression cache.
        escape   Backslash all non-alphanumerics in a string.

    Some of the functions in this module takes flags as optional parameters:
        I  IGNORECASE  Perform case-insensitive matching.
        L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
        M  MULTILINE   "^" matches the beginning of lines (after a newline)
                       as well as the string.
                       "$" matches the end of lines (before a newline) as well
                       as the end of the string.
        S  DOTALL      "." matches any character at all, including the newline.
        X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
        U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.

    This module also defines an exception 'error'.

CLASSES
    exceptions.Exception(exceptions.BaseException)
        sre_constants.error

    class error(exceptions.Exception)
     |  Method resolution order:
     |      error
     |      exceptions.Exception
     |      exceptions.BaseException
     |      __builtin__.object
     |
     |  Data descriptors defined here:
     |
     |  __weakref__
     |      list of weak references to the object (if defined)
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.Exception:
     |
     |  __init__(...)
     |      x.__init__(...) initializes x; see help(type(x)) for signature
     |
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from exceptions.Exception:
     |
     |  __new__ = <built-in method __new__ of type object>
     |      T.__new__(S, ...) -> a new object with type S, a subtype of T
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.BaseException:
     |
     |  __delattr__(...)
     |      x.__delattr__('name') <==> del x.name
     |
     |  __getattribute__(...)
     |      x.__getattribute__('name') <==> x.name
     |
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |
     |  __getslice__(...)
     |      x.__getslice__(i, j) <==> x[i:j]
     |
     |      Use of negative indices is not supported.
     |
     |  __reduce__(...)
     |
     |  __repr__(...)
     |      x.__repr__() <==> repr(x)
     |
     |  __setattr__(...)
     |      x.__setattr__('name', value) <==> x.name = value
     |
     |  __setstate__(...)
     |
     |  __str__(...)
     |      x.__str__() <==> str(x)
     |
     |  __unicode__(...)
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from exceptions.BaseException:
     |
     |  __dict__
     |
     |  args
     |
     |  message

FUNCTIONS
    compile(pattern, flags=0)
        Compile a regular expression pattern, returning a pattern object.

    escape(pattern)
        Escape all non-alphanumeric characters in pattern.

    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.

        If one or more groups are present in the pattern, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.

        Empty matches are included in the result.

    finditer(pattern, string, flags=0)
        Return an iterator over all non-overlapping matches in the
        string. For each match, the iterator returns a match object.

        Empty matches are included in the result.

    match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found.

    purge()
        Clear the regular expression cache

    search(pattern, string, flags=0)
        Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found.

    split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.

    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl. repl can be either a string or a callable;
        if a string, backslash escapes in it are processed. If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.

    subn(pattern, repl, string, count=0, flags=0)
        Return a 2-tuple containing (new_string, number).
        new_string is the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in the source
        string by the replacement repl. number is the number of
        substitutions that were made. repl can be either a string or a
        callable; if a string, backslash escapes in it are processed.
        If it is a callable, it's passed the match object and must
        return a replacement string to be used.

    template(pattern, flags=0)
        Compile a template pattern, returning a pattern object

DATA
    DOTALL = 16
    I = 2
    IGNORECASE = 2
    L = 4
    LOCALE = 4
    M = 8
    MULTILINE = 8
    S = 16
    U = 32
    UNICODE = 32
    VERBOSE = 64
    X = 64
    __all__ = ['match', 'search', 'sub', 'subn', 'split', 'findall', 'comp...
    __version__ = '2.2.1'

VERSION
    2.2.1
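Source: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Regular expressions are a powerful tool for processing strings. They have their own syntax and an independent processing engine; they may be less efficient than the built-in str methods, but they are far more capable. Because the engine is independent, the regular expression syntax is largely the same in every language that provides it; the difference lies only in how much of the syntax each language's implementation supports. There is no need to worry, though: the unsupported syntax is usually the rarely used part.
(Figure: the process of matching with a regular expression.)
(Figure: the regular expression metacharacters and syntax supported by Python.)
Regular expressions are usually used to find matching strings in text. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default) and always try to match as many characters as possible; non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy "ab*?" finds "a".
Test: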
>>> print re.match('ab*','abbbc').group()
abbb
>>> print re.match('ab*?','abbbc').group()
a
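The difference matters most when a pattern contains a wildcard between two delimiters. The sketch below, in Python 3 syntax (an assumption), uses a made-up HTML-like string to show the greedy quantifier running to the last '>' while the non-greedy one stops at the first.

```python
import re

# Hypothetical sample text for illustration.
text = '<b>bold</b> and <i>italic</i>'

greedy = re.search(r'<.*>', text).group()    # runs to the LAST '>'
lazy = re.search(r'<.*?>', text).group()     # stops at the FIRST '>'

print(greedy)   # <b>bold</b> and <i>italic</i>
print(lazy)     # <b>
```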
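As in most programming languages, regular expressions use "\" as the escape character, which can cause backslash trouble. If you need to match a literal "\" in the text, the regular expression written in the programming language needs four backslashes, "\\\\": the first two and the last two are each turned into one backslash by the programming language, and the resulting two backslashes are then interpreted by the regular expression engine as a single literal backslash. Python's raw strings solve this problem neatly: the regular expression in this example can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer have to worry about a missing backslash, and the expressions you write are easier to read.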
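To make the two layers of escaping concrete, here is a small sketch in Python 3 syntax (an assumption). The sample text is hypothetical; both patterns find the same single backslash.

```python
import re

text = 'C:\\temp'     # the 7-character text C:\temp (one real backslash)

# Ordinary string: four backslashes in the source become two in the
# pattern, which the regex engine reads as one literal backslash.
plain = re.search('\\\\', text)

# Raw string: Python leaves the backslashes alone, so two are enough.
raw = re.search(r'\\', text)

print(plain.span(), raw.span())   # both find the backslash at index 2
print('\\d' == r'\d')             # True: the same two-character pattern
```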
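Regular expressions offer several matching modes, such as case-insensitive matching and multi-line matching. These are covered together with re.compile(pattern[, flags]), the factory method of the Pattern class.
Python supports regular expressions through the re module. The usual steps are: compile the string form of the regular expression into a Pattern instance, use the Pattern instance to process text and obtain a match result (a Match instance), and finally use the Match instance to extract information and carry out further operations.
# Compile the regular expression into a Pattern object
>>> pattern = re.compile(r'hello')
# Match the text with the Pattern to obtain the result; None is returned when it cannot match
>>> match = pattern.match('hello world!')
# Use the Match to obtain group information
>>> print (match.group())
hello
This approach is mostly used when writing scripts or modules: rules that are complex or used repeatedly are compiled first and then reused.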
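>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

re.compile(strPattern[, flag]):
This function is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Several modes can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M. You can also specify the mode inside the pattern string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern'). (See the special constructs that do not form groups.)
The available values, as listed in the help(re) output, are:
re.I (re.IGNORECASE): perform case-insensitive matching.
re.L (re.LOCALE): make \w, \W, \b, \B dependent on the current locale.
re.M (re.MULTILINE): "^" also matches at the beginning of each line and "$" at the end of each line.
re.S (re.DOTALL): "." matches any character at all, including the newline.
re.U (re.UNICODE): make \w, \W, \b, \B dependent on the Unicode character properties.
re.X (re.VERBOSE): ignore whitespace and comments in the pattern, so that it can be spread over several lines. The following two regular expressions are equivalent:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")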
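A short sketch, in Python 3 syntax (an assumption), checking the two equivalences stated above: flags passed as an argument versus inline `(?im)`, and a verbose pattern versus its compact form. The sample strings are made up.

```python
import re

text = 'Hello\nhello'

# Flags passed as an argument ...
with_flags = re.compile(r'^hello$', re.I | re.M)
# ... and the same flags written inline at the start of the pattern.
inline = re.compile(r'(?im)^hello$')

print(with_flags.findall(text))   # ['Hello', 'hello']
print(inline.findall(text))       # ['Hello', 'hello']

# re.X lets the pattern carry whitespace and comments.
verbose = re.compile(r"""\d +   # the integral part
                         \.     # the decimal point
                         \d *   # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

print(verbose.match('3.14').group(), compact.match('3.14').group())
```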
>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

>>> m = re.match(r'hello', 'hello world!')
>>> m.group()
'hello'
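A Match object is the result of one match and holds a good deal of information about it, which can be read through the read-only attributes and methods that Match provides.
Attributes:
(1) string: the text used for matching.
(2) re: the Pattern object used for matching.
(3) pos: the index in the text at which the regular expression starts searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
(4) endpos: the index in the text at which the regular expression stops searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
(5) lastindex: the index (group number) of the last captured group. None if no group was captured.
(6) lastgroup: the alias of the last captured group. None if that group has no alias or if no group was captured.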
>>> m.string
'hello world!'
>>> m.re
<_sre.SRE_Pattern object at 0x02CC6D40>
>>> m.pos
0
>>> m.endpos
12
>>> m.lastindex
>>> m.lastgroup
>>>
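Methods:
(1) group([group1, …]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 may be a group number or an alias; number 0 stands for the whole matched substring; with no argument, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns the last substring it captured.
(2) groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1,2,…last). default is the value substituted for groups that captured nothing; it defaults to None.
(3) groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings those groups captured; groups without an alias are not included. default has the same meaning as above.
(4) start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
(5) end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
(6) span([group]):
Returns (start(group), end(group)).
(7) expand(template):
Substitutes the captured groups into template and returns the result. template may refer to groups with \id, \g<id> or \g<name>; \0 cannot be used to refer to the whole match. \id and \g<id> are equivalent, but \10 is read as group 10, so to express \1 followed by the character '0' you must write \g<1>0.
Example:
The pattern has three groups: (1) one or more word characters, (2) one or more word characters, (3) a group with the alias "sign" that matches zero or more of any character. The string to match is 'hello world!'.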
>>> m2 = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
>>> m2.string      # the text used for matching, i.e. the string to match
'hello world!'
>>> m2.re          # the Pattern object used for matching, i.e. the compiled rule
<_sre.SRE_Pattern object at 0x02CB8B00>
>>> m2.pos         # index in the text where the regular expression starts searching
0
>>> m2.endpos      # index in the text where the regular expression stops searching
12
>>> m2.lastindex   # group number of the last captured group
3
>>> m2.lastgroup   # alias of the last captured group; None if it has no alias or nothing was captured
'sign'
>>> m3 = re.match(r'(\w+) (\w+)(.*)', 'hello world!')
>>> m3.lastgroup
>>>
>>> m2.group()     # string captured by one or more groups; several arguments give a tuple
'hello world!'
>>> m2.group(0)
'hello world!'
>>> m2.group(1)
'hello'
>>> m2.group(2)
'world'
>>> m2.group(3)
'!'
>>> m2.group(1,2)
('hello', 'world')
>>> m2.group(1,3)
('hello', '!')
>>> m2.group(1,2,3)
('hello', 'world', '!')
>>> m2.groups()    # strings captured by all groups, as a tuple
('hello', 'world', '!')
>>> m2.groups(1)   # the default only replaces groups that captured nothing, so it has no effect here
('hello', 'world', '!')
>>> m2.groups(2)
('hello', 'world', '!')
# dictionary keyed by the aliases of the named groups; groups without an alias are not included
>>> m2.groupdict()
{'sign': '!'}
# start index in string of the substring captured by the given group (index of its first character)
>>> m2.start()
0
>>> m2.start(0)
0
>>> m2.start(1)
0
>>> m2.start(2)
6
>>> m2.start(3)
11
# end index in string of the substring captured by the given group (index of its last character + 1)
>>> m2.end()
12
>>> m2.end(0)
12
>>> m2.end(1)
5
>>> m2.end(2)
11
>>> m2.end(3)
12
# substitute the captured groups into the template, in the rearranged order it specifies
>>> m2.expand(r'\3\2\1')
'!worldhello'
>>> m2.expand(r'\3 \2 \1')
'! world hello'
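In the session above every group captures something, so the default argument of groups() never shows an effect. The sketch below, in Python 3 syntax (an assumption), uses a hypothetical variant of the pattern in which the "sign" group is optional, so that it captures nothing and the default becomes visible.

```python
import re

# Variant of the pattern above with an OPTIONAL 'sign' group (illustrative).
m = re.match(r'(\w+) (\w+)(?P<sign>!)?', 'hello world')

print(m.groups())                 # ('hello', 'world', None)
print(m.groups('?'))              # ('hello', 'world', '?')
print(m.groupdict('?'))           # {'sign': '?'}
print(m.lastindex)                # 2: group 3 did not take part in the match
print(m.span(2))                  # (6, 11), i.e. (start(2), end(2))
print(m.expand(r'\g<2> \g<1>'))   # world hello
```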
A Pattern object is a compiled regular expression; the methods it provides are used to match and search text.
>>> help(m2.re)
Help on SRE_Pattern object:

class SRE_Pattern(__builtin__.object)
 |  Compiled regular expression objects
 |
 |  Methods defined here:
 |
 |  __copy__(...)
 |
 |  __deepcopy__(...)
 |
 |  findall(...)
 |      findall(string[, pos[, endpos]]) --> list.
 |      Return a list of all non-overlapping matches of pattern in string.
 |
 |  finditer(...)
 |      finditer(string[, pos[, endpos]]) --> iterator.
 |      Return an iterator over all non-overlapping matches for the
 |      RE pattern in string. For each match, the iterator returns a
 |      match object.
 |
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |      Matches zero or more characters at the beginning of the string
 |
 |  scanner(...)
 |
 |  search(...)
 |      search(string[, pos[, endpos]]) --> match object or None.
 |      Scan through string looking for a match, and return a corresponding
 |      match object instance. Return None if no position in the string matches.
 |
 |  split(...)
 |      split(string[, maxsplit = 0]) --> list.
 |      Split string by the occurrences of pattern.
 |
 |  sub(...)
 |      sub(repl, string[, count = 0]) --> newstring
 |      Return the string obtained by replacing the leftmost non-overlapping
 |      occurrences of pattern in string by the replacement repl.
 |
 |  subn(...)
 |      subn(repl, string[, count = 0]) --> (newstring, number of subs)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing
 |      the leftmost non-overlapping occurrences of pattern with the
 |      replacement repl.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  flags
 |
 |  groupindex
 |
 |  groups
 |
 |  pattern
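Pattern cannot be instantiated directly; it must be constructed with re.compile(). Its attributes are:
(1) pattern: the expression string used at compile time.
(2) flags: the matching mode used at compile time, as a number.
(3) groups: the number of groups in the expression.
(4) groupindex: a dictionary whose keys are the aliases of the named groups in the expression and whose values are the corresponding group numbers; groups without an alias are not included.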
import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print "p.pattern:", p.pattern
print "p.flags:", p.flags
print "p.groups:", p.groups
print "p.groupindex:", p.groupindex

### output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 16
# p.groups: 3
# p.groupindex: {'sign': 3}
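3.3.2.3.2 Instance methods [ | re module functions]:
1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
This method tries to match pattern starting at index pos of string. If the whole pattern can be matched, a Match object is returned; if pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.
pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
Note: this method does not require a full match. If string still has characters left when pattern has been matched completely, the match still counts as successful. To require a full match, append the boundary anchor '$' to the expression.
See section 3.3.2.1 for an example.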
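The points above can be sketched as follows, in Python 3 syntax (an assumption; the notes use Python 2.7). The sample strings are made up. re.fullmatch() is a Python 3.4+ addition that is not available in the Python 2.7 used elsewhere in these notes; it is shown only as an alternative to the trailing '$'.

```python
import re

p = re.compile(r'hello')

prefix_ok = p.match('hello world!')                    # succeeds: only the start must match
anchored = re.compile(r'hello$').match('hello world!') # None: '$' demands the end of the string
full = re.fullmatch(r'hello', 'hello world!')          # None (Python 3.4+ only)

shifted = p.match('say hello', 4)    # pos=4: matching starts at index 4
unshifted = p.match('say hello')     # None: index 0 is 's'
cut_short = p.match('hello', 0, 3)   # None: endpos=3 is reached before the pattern ends

print(prefix_ok.group(), shifted.span())
```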
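2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that the pattern can match. It tries to match pattern starting at index pos of string; if the whole pattern can be matched, a Match object is returned; if not, pos is increased by 1 and the match is tried again; if no match has been found by the time pos reaches endpos, None is returned.
pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.
# Compile the regular expression into a Pattern object
>>> pattern = re.compile(r'world')
# Use search() to find a matching substring; None is returned if no substring matches
# match() would fail in this example, because the string starts with 'hello'; only a pattern such as r'hello' would succeed with match()
>>> match = pattern.search('hello world!')
# Use the Match to obtain group information
>>> match.group()
'world'
>>>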
3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at every substring that the pattern matches and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
 |  split(...)
 |      split(string[, maxsplit = 0]) --> list.
 |      Split string by the occurrences of pattern.
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> p = re.compile(r'\d+')
>>> p
<_sre.SRE_Pattern object at 0x02D53F70>
>>> p.split('one1two2three3four4five5six6seven7eight8nine9ten10')
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', '']
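The trailing '' in the output above appears because the string ends with a match. A brief sketch in Python 3 syntax (an assumption), on a shorter made-up string, shows that effect, the maxsplit parameter, and one way to drop the empty pieces.

```python
import re

p = re.compile(r'\d+')
s = 'one1two2three3'

all_parts = p.split(s)       # ['one', 'two', 'three', '']: trailing '' after the final '3'
first_only = p.split(s, 1)   # ['one', 'two2three3']: at most one split
non_empty = [part for part in p.split(s) if part]   # drop the empty pieces

print(all_parts)
print(first_only)
print(non_empty)
```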
4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
>>> p = re.compile(r'\d+')
>>> p.findall('one1two2three3four4five5six6seven7eight8nine9ten10')
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (a Match object) in order.
>>> p = re.compile(r'\d+')
>>> piter = p.finditer('one1two2three3four4')
>>> piter
<callable-iterator object at 0x02E153B0>
>>> for x in piter:
	print x

<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>
<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>
>>> piter = p.finditer('one1two2three3four4')
>>> for x in piter:
	print x.group(),

1 2 3 4
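Because finditer() yields Match objects rather than plain strings, the position of each match is available as well. A minimal sketch in Python 3 syntax (an assumption), on a made-up string:

```python
import re

p = re.compile(r'\d+')

# Collect each matched substring together with its (start, end) span.
found = [(m.group(), m.span()) for m in p.finditer('one1two22three333')]
print(found)   # [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```

findall() would return only ['1', '22', '333'], so finditer() is the one to reach for when the indices matter.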
6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces every matching substring of string with repl and returns the resulting string.
When repl is a string, it may refer to groups with \id, \g<id> or \g<name>; \0 cannot be used to refer to the whole match.
When repl is a function, it must accept a single argument (a Match object) and return the string to use as the replacement (group references inside the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.
(1) When repl is a string
>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world'
>>> p.sub(r'\2 \1',s)
'say i, world hello'
Note: the pattern has only two groups; referring to a group number beyond them raises an exception.
>>> p.sub(r'\3 \1',s)

Traceback (most recent call last):
  File "<pyshell#207>", line 1, in <module>
    p.sub(r'\3 \1',s)
  File "C:\Python27\lib\re.py", line 291, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python27\lib\sre_parse.py", line 833, in expand_template
    raise error, "invalid group reference"
error: invalid group reference
(2) When repl is a function
>>> def fun(m):
	return m.group(1).title()+ ' ' + m.group(2).title()

>>> p.sub(fun,s)
'I Say, Hello World'
>>> help(str.title)
Help on method_descriptor:

title(...)
    S.title() -> string

    Return a titlecased version of S, i.e. words start with uppercase
    characters, all remaining cased characters have lowercase.

That is, title() returns the string with the first letter of each word capitalized.
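The three forms of replacement described above can be sketched together, in Python 3 syntax (an assumption). The group names `first` and `second` are hypothetical, added here only to show the \g<name> form.

```python
import re

# Same pattern as above, with illustrative group names added.
p = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
s = 'i say, hello world'

# repl as a string, referring to groups by name.
swapped = p.sub(r'\g<second> \g<first>', s)

# repl as a function: it receives the Match object and returns the replacement.
shouted = p.sub(lambda m: m.group(0).upper(), s)

# count limits how many matches are replaced.
once = p.sub(r'\2 \1', s, count=1)

print(swapped)   # say i, world hello
print(shouted)   # I SAY, HELLO WORLD
print(once)      # say i, hello world
```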
7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of replacements).
>>> help(p.subn)
Help on built-in function subn:

subn(...)
    subn(repl, string[, count = 0]) --> (newstring, number of subs)
    Return the tuple (new_string, number_of_subs_made) found by replacing
    the leftmost non-overlapping occurrences of pattern with the
    replacement repl.

>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world!'
>>> p.subn(r'\2 \1', s)
('say i, world hello!', 2)
>>> p.subn(r'\2',s)
('say, world!', 2)
>>> p.subn(r'\1',s)
('i, hello!', 2)
>>> p.subn(r'\1 \2',s)
('i say, hello world!', 2)
# This function only prints and returns nothing, so every match is replaced by an empty string
>>> def funn(m):
	print(m.group(1)+' '+ m.group(2))

>>> p.subn(funn,s)
i say
hello world
(', !', 2)
# Returning the string instead puts the matched text back
>>> def funn(m):
	return(m.group(1)+' '+ m.group(2))

>>> p.subn(funn,s)
('i say, hello world!', 2)