2-3 Counting how often each element occurs in a sequence

Date: 2019-11-11


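1. Ways to count occurrences in a sequence

1.1 Initialize a dict with the fromkeys method, then count occurrences in a for loop.

Approach: first build a dictionary whose keys are the list's elements and whose values are their occurrence counts, then sort the dictionary.

(1) Generate a list of 30 random integers from 1 to 21 (randint includes both endpoints)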

>>> from random import randint
>>> data = [randint(1,21) for _ in xrange(30)] 
>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]

(2) Use the list's elements as keys to build a dictionary whose values are all 0; duplicate values in the list collapse into a single key

>>> dicData = dict.fromkeys(data,0) 
>>> dicData
{1: 0, 2: 0, 3: 0, 5: 0, 6: 0, 8: 0, 9: 0, 11: 0, 12: 0, 13: 0, 14: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}

(3) Iterate over the original list and count each occurrence, storing the count as the dictionary value

>>> for x in data:
    dicData[x] += 1
>>> dicData
{1: 1, 2: 2, 3: 2, 5: 1, 6: 3, 8: 2, 9: 2, 11: 1, 12: 1, 13: 1, 14: 3, 16: 1, 17: 1, 18: 3, 19: 1, 20: 3, 21: 2}

(4) Sort the dictionary items by value (the key function returns the count) in descending order, producing a list of tuples

>>> sortDicData = sorted(dicData.iteritems(),key=lambda x:x[1],reverse=True)
>>> sortDicData
[(6, 3), (14, 3), (18, 3), (20, 3), (2, 2), (3, 2), (8, 2), (9, 2), (21, 2), (1, 1), (5, 1), (11, 1), (12, 1), (13, 1), (16, 1), (17, 1), (19, 1)]

(5) Slice the first 4 tuples from the sorted list (the four elements tied at 3 occurrences) and convert them back into a dictionary.

>>> newdicData = dict(sortDicData[:4])
>>> newdicData
{18: 3, 20: 3, 14: 3, 6: 3}
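The five steps above can be collected into one function. This is a minimal sketch, assuming Python 3 (the session above is Python 2, so xrange becomes range and iteritems becomes items); the function name top_n_by_fromkeys is a hypothetical name chosen for illustration.

```python
from random import randint


def top_n_by_fromkeys(data, n):
    """Count elements with dict.fromkeys plus a loop, then keep the n most frequent."""
    counts = dict.fromkeys(data, 0)  # every distinct element starts at 0
    for x in data:
        counts[x] += 1
    # items() replaces Python 2's iteritems(); sort by count, largest first
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


data = [randint(1, 20) for _ in range(30)]  # randint includes both endpoints
top3 = top_n_by_fromkeys(data, 3)
print(top3)
```

The function returns a list of (element, count) tuples rather than a dict, so the ranking order is kept even on interpreters where dicts are unordered.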

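1.2 Using a collections.Counter object

This uses the same list as the previous example. Counter is a subclass of dict.

(1) Pass the sequence to the Counter constructor; the resulting Counter object is a dictionary of element frequencies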

>>> data
[18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21, 11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
>>> from collections import Counter
>>> dict1 = Counter(data)
>>> dict1
Counter({6: 3, 14: 3, 18: 3, 20: 3, 2: 2, 3: 2, 8: 2, 9: 2, 21: 2, 1: 1, 5: 1, 11: 1, 12: 1, 13: 1, 16: 1, 17: 1, 19: 1})

(2) Look up the value for a key (that is, its occurrence count)

>>> dict1[6]
3
>>> dict1[20]
3
>>> dict1[2]
2

(3) Use the most_common(n) method of dict1 to get a list of the n most frequent elements

>>> dict1.most_common(3)
[(6, 3), (14, 3), (18, 3)]
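Two Counter behaviours are worth seeing next to the lookups above: a key that never occurred reads as 0, and update() adds to the existing counts. A small sketch, assuming Python 3 and reusing the same list; the key 99 and the extra values passed to update() are made up for the demonstration.

```python
from collections import Counter

data = [18, 5, 21, 6, 13, 18, 3, 20, 14, 3, 8, 20, 12, 16, 21,
        11, 9, 17, 14, 1, 19, 2, 6, 9, 6, 20, 8, 14, 18, 2]
counts = Counter(data)

# A key that never occurred reads as 0 instead of raising KeyError,
# and the lookup does not add the key to the Counter.
missing = counts[99]

# update() adds counts from another iterable rather than replacing them.
counts.update([6, 6])

top = counts.most_common(1)
print(missing, top)
```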

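2. Word-frequency statistics for an English text

Approach: read the text into one string, then use the split function of the regular-expression module to break it into individual words.

>>> from collections import Counter

>>> import re  # regular-expression module

# Note: a Word .doc file cannot be read like a plain text file; it needs a dedicated module for reading doc files.

# Open collections.txt and read it all into txt, which is then one very long string.

>>> txt = open(r"C:\視頻\python高效實踐技巧筆記\collections.txt").read()

# Split the string on runs of non-word characters with re.split(r'\W+', txt), which yields a list of words. Note that \W excludes letters, digits and the underscore, so digits such as '2' are counted as words too. Then count the list with Counter(), as described above.

>>> c3 = Counter(re.split(r'\W+', txt))

# Get the list of the 10 most frequent words

>>> c3.most_common(10)

[('the', 177), ('a', 126), ('to', 96), ('and', 93), ('is', 73), ('d', 73), ('in', 72), ('for', 69), ('of', 64), ('2', 53)]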
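The result above contains entries such as 'd' and '2', because splitting on \W+ keeps digits and cuts words at apostrophes. A possible variant, sketched here for Python 3, collects runs of letters with re.findall and lower-cases the text first. The sample string is a hypothetical stand-in for the contents of collections.txt.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of collections.txt.
txt = "The cat and the hat. The end, and that's that."

# findall keeps only runs of letters, so digits and empty strings never
# reach the Counter; lower() merges "The" and "the".
words = re.findall(r"[a-z]+", txt.lower())
c3 = Counter(words)
print(c3.most_common(1))
```

Apostrophes still split a word ("that's" becomes "that" and "s"), which is probably how single letters like 'd' end up in the original ranking.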

3. Further reading

3.1 Dictionary basics

>>> help(dict)
Help on class dict in module __builtin__:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list.  For example:  dict(one=1, two=2)
 |  
 |  Methods defined here:
 |  
 |  __cmp__(...)
 |      x.__cmp__(y) <==> cmp(x,y)
 |  
 |  __contains__(...)
 |      D.__contains__(k) -> True if D has a key k, else False
 |  
 |  __delitem__(...)
 |      x.__delitem__(y) <==> del x[y]
 |  
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |  
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |  
 |  __getattribute__(...)
 |      x.__getattribute__('name') <==> x.name
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |  
 |  __init__(...)
 |      x.__init__(...) initializes x; see help(type(x)) for signature
 |  
 |  __iter__(...)
 |      x.__iter__() <==> iter(x)
 |  
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |  
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |  
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |  
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __setitem__(...)
 |      x.__setitem__(i, y) <==> x[i]=y
 |  
 |  __sizeof__(...)
 |      D.__sizeof__() -> size of D in memory, in bytes
 |  
 |  clear(...)
 |      D.clear() -> None.  Remove all items from D.
 |  
 |  copy(...)
 |      D.copy() -> a shallow copy of D
 |  
 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
 |  
 |  get(...)
 |      D.get(k[,d]) -> D[k] if k in D, else d.  d defaults to None.
 |  
 |  has_key(...)
 |      D.has_key(k) -> True if D has a key k, else False
 |  
 |  items(...)
 |      D.items() -> list of D's (key, value) pairs, as 2-tuples
 |  
 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
 |  
 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D
 |  
 |  itervalues(...)
 |      D.itervalues() -> an iterator over the values of D
 |  
 |  keys(...)
 |      D.keys() -> list of D's keys
 |  
 |  pop(...)
 |      D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
 |      If key is not found, d is returned if given, otherwise KeyError is raised
 |  
 |  popitem(...)
 |      D.popitem() -> (k, v), remove and return some (key, value) pair as a
 |      2-tuple; but raise KeyError if D is empty.
 |  
 |  setdefault(...)
 |      D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D
 |  
 |  update(...)
 |      D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
 |      If E present and has a .keys() method, does:     for k in E: D[k] = E[k]
 |      If E present and lacks .keys() method, does:     for (k, v) in E: D[k] = v
 |      In either case, this is followed by: for k in F: D[k] = F[k]
 |  
 |  values(...)
 |      D.values() -> list of D's values
 |  
 |  viewitems(...)
 |      D.viewitems() -> a set-like object providing a view on D's items
 |  
 |  viewkeys(...)
 |      D.viewkeys() -> a set-like object providing a view on D's keys
 |  
 |  viewvalues(...)
 |      D.viewvalues() -> an object providing a view on D's values
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T


3.1.1 fromkeys

 |  fromkeys(...)
 |      dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
 |      v defaults to None.
Builds a dictionary that uses the values of a sequence as its keys.
>>> data = [3,1,56]
>>> data1 = dict.fromkeys(data)
>>> data1
{56: None, 1: None, 3: None}
>>> data2 = dict.fromkeys(data,3)
>>> data2
{56: 3, 1: 3, 3: 3}
>>>
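One thing to watch with fromkeys: the default value is evaluated once and shared by every key. That is harmless for 0 or None, but surprising for a mutable value such as a list. A short Python 3 sketch; the variable names are chosen only for illustration.

```python
keys = [3, 1, 56]

# Every key shares the one list object passed as the default value.
shared = dict.fromkeys(keys, [])
shared[3].append("x")

# A dict comprehension evaluates [] once per key, so each key gets its own list.
separate = {k: [] for k in keys}
separate[3].append("x")

print(shared)
print(separate)
```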


3.1.2 iteritems

 |  iteritems(...)
 |      D.iteritems() -> an iterator over the (key, value) items of D
Continuing the example above, this returns an iterator over (key, value) pairs:
>>> data2.iteritems()
<dictionary-itemiterator object at 0x02D812A0>

3.1.3 iterkeys

 |  iterkeys(...)
 |      D.iterkeys() -> an iterator over the keys of D

Continuing the example above, this returns an iterator over the keys:
>>> data2.iterkeys
<built-in method iterkeys of dict object at 0x02E3BDB0>
>>> data2.iterkeys()
<dictionary-keyiterator object at 0x02E27F00>

3.1.4 itervalues

 |      D.itervalues() -> an iterator over the values of D
Continuing the example above, this returns an iterator over the values:
>>> data2.itervalues()
<dictionary-valueiterator object at 0x02D81810>

3.2 collections

>>> import collections

>>> help(collections)

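The result is a dump of the entire official online documentation; the official documentation remains the most convenient learning resource.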

3.2.1 namedtuple

It is introduced in "2-2 Naming the elements of a tuple".

>>> import collections
>>> help(collections.namedtuple)
Help on function namedtuple in module collections:

namedtuple(typename, field_names, verbose=False, rename=False)
    Returns a new subclass of tuple with named fields.
    
    >>> Point = namedtuple('Point', ['x', 'y'])
    >>> Point.__doc__                   # docstring for the new class
    'Point(x, y)'
    >>> p = Point(11, y=22)             # instantiate with positional args or keywords
    >>> p[0] + p[1]                     # indexable like a plain tuple
    33
    >>> x, y = p                        # unpack like a regular tuple
    >>> x, y
    (11, 22)
    >>> p.x + p.y                       # fields also accessible by name
    33
    >>> d = p._asdict()                 # convert to a dictionary
    >>> d['x']
    11
    >>> Point(**d)                      # convert from a dictionary
    Point(x=11, y=22)
    >>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
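    Point(x=100, y=22)

namedtuple is a function that creates a custom tuple subclass with a fixed number of elements, whose elements can be referenced by attribute name instead of by index.

This makes namedtuple a convenient way to define a data type that keeps the immutability of a tuple while allowing access by attribute.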
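A brief sketch of both properties, attribute access and immutability, assuming Python 3. Student and its fields are illustrative names, not taken from the original text.

```python
from collections import namedtuple

# Student and its fields are illustrative names, not taken from the original text.
Student = namedtuple("Student", ["name", "age", "email"])
s = Student("Jim", 16, "jim@example.com")

print(s.name, s[1])          # access by attribute or by index
older = s._replace(age=17)   # returns a new tuple; s itself is unchanged

try:
    s.age = 17               # tuples are immutable
except AttributeError as exc:
    print("cannot assign:", exc)
```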

3.2.2 Counter

>>> import collections

>>> help(collections.Counter)

The printed documentation is long. The entry for most_common():

 |  most_common(self, n=None)
 |      List the n most common elements and their counts from the most
 |      common to the least.  If n is None, then list all element counts.
 |      
 |      >>> Counter('abcdeabcdabcaba').most_common(3)
 |      [('a', 5), ('b', 4), ('c', 3)]

3.3 The re regular-expression module

Official documentation:

Py2.7: https://docs.python.org/2.7/library/re.html

Py3: https://docs.python.org/3/library/re.html

>>> help(re)
Help on module re:

NAME
    re - Support for regular expressions (RE).

FILE
    c:\python27\lib\re.py

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.
    
    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        *?,+?,?? Non-greedy versions of the previous three special characters.
        {m,n}    Matches from m to n repetitions of the preceding RE.
        {m,n}?   Non-greedy version of the above.
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
        "|"      A|B, creates an RE that will match either A or B.
        (...)    Matches the RE inside the parentheses.
                 The contents can be retrieved or matched later in the string.
        (?iLmsux) Set the I, L, M, S, U, or X flag for the RE (see below).
        (?:...)  Non-grouping version of regular parentheses.
        (?P<name>...) The substring matched by the group is accessible by name.
        (?P=name)     Matches the text matched earlier by the group named name.
        (?#...)  A comment; ignored.
        (?=...)  Matches if ... matches next, but doesn't consume the string.
        (?!...)  Matches if ... doesn't match next.
        (?<=...) Matches if preceded by ... (must be fixed length).
        (?<!...) Matches if not preceded by ... (must be fixed length).
        (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                           the (optional) no pattern otherwise.
    
    The special sequences consist of "\\" and a character from the list
    below.  If the ordinary character is not on the list, then the
    resulting RE will match the second character.
        \number  Matches the contents of the group of the same number.
        \A       Matches only at the start of the string.
        \Z       Matches only at the end of the string.
        \b       Matches the empty string, but only at the start or end of a word.
        \B       Matches the empty string, but not at the start or end of a word.
        \d       Matches any decimal digit; equivalent to the set [0-9].
        \D       Matches any non-digit character; equivalent to the set [^0-9].
        \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v].
        \S       Matches any non-whitespace character; equiv. to [^ \t\n\r\f\v].
        \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
                 With LOCALE, it will match the set [0-9_] plus characters defined
                 as letters for the current locale.
        \W       Matches the complement of \w.
        \\       Matches a literal backslash.
    
    This module exports the following functions:
        match    Match a regular expression pattern to the beginning of a string.
        search   Search a string for the presence of a pattern.
        sub      Substitute occurrences of a pattern found in a string.
        subn     Same as sub, but also return the number of substitutions made.
        split    Split a string by the occurrences of a pattern.
        findall  Find all occurrences of a pattern in a string.
        finditer Return an iterator yielding a match object for each match.
        compile  Compile a pattern into a RegexObject.
        purge    Clear the regular expression cache.
        escape   Backslash all non-alphanumerics in a string.
    
    Some of the functions in this module takes flags as optional parameters:
        I  IGNORECASE  Perform case-insensitive matching.
        L  LOCALE      Make \w, \W, \b, \B, dependent on the current locale.
        M  MULTILINE   "^" matches the beginning of lines (after a newline)
                       as well as the string.
                       "$" matches the end of lines (before a newline) as well
                       as the end of the string.
        S  DOTALL      "." matches any character at all, including the newline.
        X  VERBOSE     Ignore whitespace and comments for nicer looking RE's.
        U  UNICODE     Make \w, \W, \b, \B, dependent on the Unicode locale.
    
    This module also defines an exception 'error'.

CLASSES
    exceptions.Exception(exceptions.BaseException)
        sre_constants.error
    
    class error(exceptions.Exception)
     |  Method resolution order:
     |      error
     |      exceptions.Exception
     |      exceptions.BaseException
     |      __builtin__.object
     |  
     |  Data descriptors defined here:
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.Exception:
     |  
     |  __init__(...)
     |      x.__init__(...) initializes x; see help(type(x)) for signature
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from exceptions.Exception:
     |  
     |  __new__ = <built-in method __new__ of type object>
     |      T.__new__(S, ...) -> a new object with type S, a subtype of T
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from exceptions.BaseException:
     |  
     |  __delattr__(...)
     |      x.__delattr__('name') <==> del x.name
     |  
     |  __getattribute__(...)
     |      x.__getattribute__('name') <==> x.name
     |  
     |  __getitem__(...)
     |      x.__getitem__(y) <==> x[y]
     |  
     |  __getslice__(...)
     |      x.__getslice__(i, j) <==> x[i:j]
     |      
     |      Use of negative indices is not supported.
     |  
     |  __reduce__(...)
     |  
     |  __repr__(...)
     |      x.__repr__() <==> repr(x)
     |  
     |  __setattr__(...)
     |      x.__setattr__('name', value) <==> x.name = value
     |  
     |  __setstate__(...)
     |  
     |  __str__(...)
     |      x.__str__() <==> str(x)
     |  
     |  __unicode__(...)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from exceptions.BaseException:
     |  
     |  __dict__
     |  
     |  args
     |  
     |  message

FUNCTIONS
    compile(pattern, flags=0)
        Compile a regular expression pattern, returning a pattern object.
    
    escape(pattern)
        Escape all non-alphanumeric characters in pattern.
    
    findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
        
        If one or more groups are present in the pattern, return a
        list of groups; this will be a list of tuples if the pattern
        has more than one group.
        
        Empty matches are included in the result.
    
    finditer(pattern, string, flags=0)
        Return an iterator over all non-overlapping matches in the
        string.  For each match, the iterator returns a match object.
        
        Empty matches are included in the result.
    
    match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found.
    
    purge()
        Clear the regular expression cache
    
    search(pattern, string, flags=0)
        Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found.
    
    split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.
    
    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.
    
    subn(pattern, repl, string, count=0, flags=0)
        Return a 2-tuple containing (new_string, number).
        new_string is the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in the source
        string by the replacement repl.  number is the number of
        substitutions that were made. repl can be either a string or a
        callable; if a string, backslash escapes in it are processed.
        If it is a callable, it's passed the match object and must
        return a replacement string to be used.
    
    template(pattern, flags=0)
        Compile a template pattern, returning a pattern object

DATA
    DOTALL = 16
    I = 2
    IGNORECASE = 2
    L = 4
    LOCALE = 4
    M = 8
    MULTILINE = 8
    S = 16
    U = 32
    UNICODE = 32
    VERBOSE = 64
    X = 64
    __all__ = ['match', 'search', 'sub', 'subn', 'split', 'findall', 'comp...
    __version__ = '2.2.1'

VERSION
    2.2.1


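A Guide to Python Regular Expressions

Source: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

3.3.1 Regular-expression basics

3.3.1.1 Brief introduction

Regular expressions are a powerful tool for processing strings. They have their own syntax and a separate processing engine; they may be slower than the built-in str methods, but they are far more capable. Because of this, the regular-expression syntax is largely the same in every language that provides it; the main difference is how much of the syntax each language's implementation supports. There is no need to worry, though: the unsupported syntax is usually the rarely used part.

[Figure: the flow of matching with a regular expression]

[Figure: regular-expression metacharacters and syntax supported by Python]

3.3.1.2 Greedy and non-greedy quantifiers

Regular expressions are usually used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy "ab*?" finds "a".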

Test:

>>> print re.match('ab*','abbbc').group()

abbb

>>> print re.match('ab*?','abbbc').group()

a
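The difference matters most when a pattern has a closing delimiter. A small sketch, assuming Python 3; the tagged sample string is made up for the demonstration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", text)   # .+ runs to the last '>' in the string
lazy = re.findall(r"<.+?>", text)    # .+? stops at the first '>' it can

print(greedy)
print(lazy)
```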

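3.3.1.3 The backslash plague

As in most programming languages, regular expressions use "\" as the escape character, which can lead to a backslash plague. To match a literal "\" in text, the regular expression written in a programming-language string needs four backslashes, "\\\\": the first two and the last two each become one backslash in the language's string literal, and the resulting two backslashes then become one literal backslash in the regular expression. Python's raw strings solve this neatly: the regular expression in this example can be written r"\\". Likewise, "\\d", which matches a digit, can be written r"\d". With raw strings you no longer need to worry about a missing backslash, and the expressions are easier to read.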
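The equivalence can be checked directly. A minimal sketch, assuming Python 3; the sample strings are invented for the check.

```python
import re

text = "C:\\temp"   # the string itself contains one backslash

plain = "\\\\"      # four backslashes in source, two characters in the string
raw = r"\\"         # the same two characters written as a raw string

print(plain == raw)
print(re.findall(raw, text))
print(re.findall(r"\d", "a1b2") == re.findall("\\d", "a1b2"))
```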

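3.3.1.4 Matching modes

Regular expressions provide several matching modes, such as case-insensitive and multi-line matching; these are covered together with the Pattern factory function re.compile(pattern[, flags]).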

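3.3.2 The re module

3.3.2.1 re.compile

Python supports regular expressions through the re module. The usual workflow is to compile the string form of a regular expression into a Pattern instance, use that Pattern to process text and obtain a match result (a Match instance), and finally use the Match instance to extract information and carry out further operations.

# Compile the regular expression into a Pattern object
>>> pattern = re.compile(r'hello')
# Match text with the Pattern; returns None when there is no match
>>> match = pattern.match('hello world!')
# Use the Match to get group information
>>> print (match.group())
hello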

This approach is mostly used when writing scripts or modules: compile complex or frequently used patterns first, then reuse them.

>>> help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
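    Compile a regular expression pattern, returning a pattern object.

re.compile(strPattern[, flag]):

This function is the factory for Pattern objects: it compiles a regular expression given as a string into a Pattern object. The second argument, flag, is the matching mode; several modes can be combined with the bitwise OR operator '|', for example re.I | re.M. Modes can also be specified inside the pattern string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern'). (See the (?iLmsux) special construct, which does not count as a group.)
Available values:

re.I (re.IGNORECASE): ignore case (the full name is in parentheses, likewise below)
re.M (re.MULTILINE): multi-line mode, changes the behaviour of '^' and '$'
re.S (re.DOTALL): dot-matches-all mode, changes the behaviour of '.'
re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the Unicode character properties
re.X (re.VERBOSE): verbose mode. In this mode a regular expression can span several lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent: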

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
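The flag equivalences described above can be verified with a few lines. A sketch assuming Python 3, where the inline (?im) group has to sit at the very start of the pattern; the test strings are invented.

```python
import re

text = "Hello\nhello"

# Flags passed as an argument versus flags written inside the pattern.
with_flags = re.compile(r"^hello$", re.I | re.M)
inline = re.compile(r"(?im)^hello$")

print(with_flags.findall(text))
print(inline.findall(text))

# re.X: the verbose and the compact spelling accept the same input.
verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")
print(verbose.match("3.14").group(), compact.match("3.14").group())
```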

3.3.2.2 re.match

>>> help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.
>>> m = re.match(r'hello', 'hello world!')
>>> m.group()
'hello'

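A Match object is the result of one match and holds a lot of information about it; that information is available through the read-only attributes and methods that Match provides.

Attributes:

(1) string: the text used for matching.

(2) re: the Pattern object used for matching.

(3) pos: the index in the text where the regular expression started searching. It has the same value as the parameter of the same name passed to Pattern.match() and Pattern.search().

(4) endpos: the index in the text where the regular expression stopped searching. It has the same value as the parameter of the same name passed to Pattern.match() and Pattern.search().

(5) lastindex: the integer index (group number) of the last captured group. None if no group was captured.

(6) lastgroup: the name of the last captured group. None if that group has no name or if no group was captured.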

>>> m.string
'hello world!'
>>> m.re
<_sre.SRE_Pattern object at 0x02CC6D40>
>>> m.pos
0
>>> m.endpos
12
>>> m.lastindex
>>> m.lastgroup
>>>

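Methods:

(1) group([group1, …]):
Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a number or a name; number 0 stands for the whole matched substring; with no argument it returns group(0); a group that captured nothing returns None; a group that captured several times returns the last captured substring.

(2) groups([default]):
Returns a tuple of the strings captured by all groups, equivalent to group(1,2,…last). default is the value used for groups that captured nothing; it defaults to None.

(3) groupdict([default]):
Returns a dictionary whose keys are the names of the named groups and whose values are the substrings they captured; unnamed groups are not included. default has the same meaning as above.

(4) start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.

(5) end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.

(6) span([group]):
Returns (start(group), end(group)).

(7) expand(template):
Substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id> or \g<name>; the form \0 is not available, but \g<0> refers to the whole match. \id and \g<id> are equivalent, but \10 is read as group 10; to express \1 followed by the character '0', use \g<1>0.

Example:

The pattern has 3 groups: (1) one or more word characters, (2) one or more word characters, (3) a group with the additional name "sign" that matches zero or more arbitrary characters. The string to match is "hello world!".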

>>> m2 = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')
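>>> m2.string  # the text used for matching, i.e. the string being matched
'hello world!'
>>> m2.re     # the Pattern object used for matching, i.e. the compiled pattern
<_sre.SRE_Pattern object at 0x02CB8B00>
>>> m2.pos  # index in the text where the search started
0
>>> m2.endpos # index in the text where the search stopped
12
>>> m2.lastindex  # group number of the last captured group
3
>>> m2.lastgroup  # name of the last captured group; None if that group has no name or nothing was captured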
'sign'
>>> m3 = re.match(r'(\w+) (\w+)(.*)', 'hello world!')
>>> m3.lastgroup
>>> 


>>> m2.group()  # returns the string captured by one or more groups; with several arguments the result is a tuple
'hello world!'
>>> m2.group(0)
'hello world!'
>>> m2.group(1)
'hello'
>>> m2.group(2)
'world'
>>> m2.group(3)
'!'
>>> m2.group(1,2)
('hello', 'world')
>>> m2.group(1,3)
('hello', '!')
>>> m2.group(1,2,3)
('hello', 'world', '!')

>>> m2.groups()  # returns a tuple of the strings captured by all groups
('hello', 'world', '!')
>>> m2.groups(1)
('hello', 'world', '!')
>>> m2.groups(2)
('hello', 'world', '!')
# Returns a dictionary keyed by group name with the captured substrings as values; unnamed groups are not included.
>>> m2.groupdict()  
{'sign': '!'}

# Returns the start index in string of the substring captured by the given group (the index of its first character)
>>> m2.start()
0
>>> m2.start(0)
0
>>> m2.start(1)
0
>>> m2.start(2)
6
>>> m2.start(3)
11

# Returns the end index in string of the substring captured by the given group (the index of its last character + 1)
>>> m2.end()
12
>>> m2.end(0)
12
>>> m2.end(1)
5
>>> m2.end(2)
11
>>> m2.end(3)
12
# Substitute the matched groups into the template and return them in the rearranged order
>>> m2.expand(r'\3\2\1')
'!worldhello'
>>> m2.expand(r'\3 \2 \1')
'! world hello'
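A few of the less obvious points (lastindex as a group number, the \g<1>0 form and \g<0> in expand) can be checked on the same pattern. A sketch assuming Python 3; the templates are made up for the demonstration.

```python
import re

m = re.match(r"(\w+) (\w+)(?P<sign>.*)", "hello world!")

print(m.lastindex, m.lastgroup)   # group number and name of the last captured group
print(m.span(2))                  # same as (m.start(2), m.end(2))
print(m.expand(r"\g<1>0"))        # group 1 followed by a literal '0'
print(m.expand(r"[\g<0>]"))       # \g<0> is the whole match
print(m.expand(r"\g<sign> \2"))   # named and numbered references can be mixed
```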

3.3.2.3 Pattern

A Pattern object is a compiled regular expression; the methods it provides can be used to match and search text.

>>> help(m2.re)
Help on SRE_Pattern object:

class SRE_Pattern(__builtin__.object)
 |  Compiled regular expression objects
 |  
 |  Methods defined here:
 |  
 |  __copy__(...)
 |  
 |  __deepcopy__(...)
 |  
 |  findall(...)
 |      findall(string[, pos[, endpos]]) --> list.
 |      Return a list of all non-overlapping matches of pattern in string.
 |  
 |  finditer(...)
 |      finditer(string[, pos[, endpos]]) --> iterator.
 |      Return an iterator over all non-overlapping matches for the 
 |      RE pattern in string. For each match, the iterator returns a
 |      match object.
 |  
 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |      Matches zero or more characters at the beginning of the string
 |  
 |  scanner(...)
 |  
 |  search(...)
 |      search(string[, pos[, endpos]]) --> match object or None.
 |      Scan through string looking for a match, and return a corresponding
 |      match object instance. Return None if no position in the string matches.
 |  
 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.
 |  
 |  sub(...)
 |      sub(repl, string[, count = 0]) --> newstring
 |      Return the string obtained by replacing the leftmost non-overlapping
 |      occurrences of pattern in string by the replacement repl.
 |  
 |  subn(...)
 |      subn(repl, string[, count = 0]) --> (newstring, number of subs)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing
 |      the leftmost non-overlapping occurrences of pattern with the
 |      replacement repl.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  flags
 |  
 |  groupindex
 |  
 |  groups
 |  
 |  pattern

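Pattern cannot be instantiated directly; it must be constructed with re.compile().

3.3.2.3.1 Pattern provides several read-only attributes for inspecting the expression:

(1) pattern: the expression string used at compile time.

(2) flags: the matching mode used at compile time, as a number.

(3) groups: the number of groups in the expression.

(4) groupindex: a dictionary keyed by the names of the named groups, with each group's number as the value; unnamed groups are not included.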

import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
 
print "p.pattern:", p.pattern
print "p.flags:", p.flags
print "p.groups:", p.groups
print "p.groupindex:", p.groupindex
 
### output ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 16
# p.groups: 3
# p.groupindex: {'sign': 3}

3.3.2.3.2 Instance methods [ | re module functions]:

1. match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

 |  match(...)
 |      match(string[, pos[, endpos]]) --> match object or None.
 |

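This method tries to match pattern starting at index pos of string. If the whole pattern matches, it returns a Match object; if the pattern fails to match at some point, or endpos is reached before the match completes, it returns None.
pos and endpos default to 0 and len(string) respectively; re.match() cannot take these two arguments, and its flags argument sets the matching mode used when compiling pattern.
Note: this method does not require a full match. If string still has characters left when pattern ends, the match still succeeds. To require a full match, add the boundary anchor '$' at the end of the expression.

See section 3.3.2.1 for an example.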
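The effect of pos, endpos and the '$' anchor can be seen in a few calls. A sketch assuming Python 3; fullmatch() is not mentioned in the original text and exists only from Python 3.4 on.

```python
import re

p = re.compile(r"hello")

print(p.match("hello world!") is not None)   # a prefix match is enough
print(p.match("say hello") is None)          # match() only tries at index 0 ...
print(p.match("say hello", 4).group())       # ... unless pos moves the start
print(p.match("hello world!", 0, 3))         # endpos cuts the text to 'hel'
print(re.match(r"hello$", "hello world!"))   # '$' demands the end of the string
print(p.fullmatch("hello") is not None)      # Python 3.4+: the whole string must match
```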

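2. search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):

This method looks for a substring of string that matches. It tries to match pattern starting at index pos; if the whole pattern matches, it returns a Match object; otherwise it advances pos by 1 and tries again; if there is still no match when pos reaches endpos, it returns None.
pos and endpos default to 0 and len(string) respectively; re.search() cannot take these two arguments, and its flags argument sets the matching mode used when compiling pattern.

# Compile the regular expression into a Pattern object
>>> pattern  = re.compile(r'world')
# Use search() to find a matching substring; returns None when no substring matches
# match() cannot match in this example, because the string starts with 'hello'; a pattern of 'hello' would match() successfully
>>> match = pattern.search('hello world!')
# Use the Match to get group information
>>> match.group()
'world'
>>>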

3. split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):

Splits string at every substring that matches and returns the pieces as a list. maxsplit sets the maximum number of splits; if omitted, all matches are split.

 |  split(...)
 |      split(string[, maxsplit = 0])  --> list.
 |      Split string by the occurrences of pattern.

>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> p = re.compile(r'\d+')
>>> p
<_sre.SRE_Pattern object at 0x02D53F70>
>>> p.split('one1two2three3four4five5six6seven7eight8nine9ten10')
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', '']
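The trailing empty string in the result appears because the input ends with a match. The sketch below, assuming Python 3, shows that case together with maxsplit and a capturing group; the shortened input string is made up.

```python
import re

s = "one1two2three3"

print(re.split(r"\d+", s))                     # trailing '' because the string ends with a match
print(re.split(r"\d+", s, maxsplit=1))         # stop after the first split
print(re.split(r"(\d+)", s))                   # a capturing group keeps the separators
print([w for w in re.split(r"\d+", s) if w])   # drop empty pieces
```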

4. findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):

Searches string and returns all matching substrings as a list.

>>> p = re.compile(r'\d+')
>>> p.findall('one1two2three3four4five5six6seven7eight8nine9ten10')
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

5. finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):

Searches string and returns an iterator that yields each match result (a Match object) in order.

>>> p = re.compile(r'\d+')
>>> piter = p.finditer('one1two2three3four4')
>>> piter
<callable-iterator object at 0x02E153B0>
>>> for x in piter:
    print x

    
<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>
<_sre.SRE_Match object at 0x02EAE800>
<_sre.SRE_Match object at 0x02EAE838>

>>> piter = p.finditer('one1two2three3four4')
>>> for x in piter:
    print x.group(),
1 2 3 4
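Unlike findall(), finditer() gives access to the position of each match as well as its text. A sketch assuming Python 3 (print is a function there); the input string is made up.

```python
import re

p = re.compile(r"\d+")
text = "one1two22three333"

# Each item is a Match object, so both the text and its span are available.
for m in p.finditer(text):
    print(m.group(), m.span())

spans = [(m.group(), m.span()) for m in p.finditer(text)]
```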

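6. sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):

Replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, it can reference groups with \id, \g<id> or \g<name>; \0 is not available, but \g<0> stands for the whole match.
When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded).
count sets the maximum number of replacements; if omitted, every match is replaced.

(1) When repl is a string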

>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world'
>>> p.sub(r'\2 \1',s)
'say i, world hello'

Note: the pattern has only two groups; referencing a group number beyond that raises an exception

>>> p.sub(r'\3 \1',s)

Traceback (most recent call last):
  File "<pyshell#207>", line 1, in <module>
    p.sub(r'\3 \1',s)
  File "C:\Python27\lib\re.py", line 291, in filter
    return sre_parse.expand_template(template, match)
  File "C:\Python27\lib\sre_parse.py", line 833, in expand_template
    raise error, "invalid group reference"
error: invalid group reference

(2) When repl is a function

>>> def fun(m):
    return m.group(1).title()+ ' ' + m.group(2).title()

>>> p.sub(fun,s)
'I Say, Hello World'
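Named groups and the count argument work the same way. A sketch assuming Python 3; the group names first and second are invented for this illustration.

```python
import re

# The group names first and second are illustrative, not from the original text.
p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world"

swapped = p.sub(r"\g<second> \g<first>", s)    # reference groups by name
once = p.sub(r"\2 \1", s, count=1)             # replace only the first match
upper = p.sub(lambda m: m.group(0).upper(), s) # the function's return value is used literally
wrapped = p.sub(r"<\g<0>>", s)                 # \g<0> is the whole match

print(swapped, once, upper, wrapped, sep="\n")
```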

>>> help(str.title)
Help on method_descriptor:

title(...)
    S.title() -> string
    
    Return a titlecased version of S, i.e. words start with uppercase
    characters, all remaining cased characters have lowercase.
That is, it returns the string with the first letter of each word in upper case.

7. subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):

Returns (sub(repl, string[, count]), number of substitutions).

>>> help(p.subn)
Help on built-in function subn:

subn(...)
    subn(repl, string[, count = 0]) --> (newstring, number of subs)
    Return the tuple (new_string, number_of_subs_made) found by replacing
    the leftmost non-overlapping occurrences of pattern with the
    replacement repl.
>>> p = re.compile(r'(\w+) (\w+)')
>>> s = 'i say, hello world!'
>>> p.subn(r'\2 \1', s)
('say i, world hello!', 2)

>>> p.subn(r'\2',s)
('say, world!', 2)
>>> p.subn(r'\1',s)
('i, hello!', 2)
>>> p.subn(r'\1 \2',s)
('i say, hello world!', 2)

>>> def funn(m):
    print(m.group(1)+' '+ m.group(2))

    
>>> p.subn(funn,s)
i say
hello world
(', !', 2)

>>> def funn(m):
    return(m.group(1)+' '+ m.group(2))

>>> p.subn(funn,s)
('i say, hello world!', 2)
