CPython Standard Library Source Analysis: collections.Counter

Date: 2019-12-05


Counter is a dict subclass for counting hashable objects. Elements are stored as dictionary keys and their counts are stored as dictionary values.

This comes from Counter's docstring, which states outright that elements are stored as dict keys. In other words, only hashable elements can be counted with a Counter.

>>> c = Counter()
>>> c = Counter('gallahad') 
>>> c = Counter({'a': 4, 'b': 2})
>>> c = Counter(a=4, b=2)

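Counter can be initialized in four ways: with no arguments, from an iterable (a string here), from a mapping, or from keyword arguments.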
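As a quick check of the four forms and of the hashability requirement, a minimal sketch might look like this. The variable names are only illustrative and do not come from the CPython source.

```python
from collections import Counter

empty = Counter()                         # no arguments
from_iterable = Counter('gallahad')       # any iterable of hashable items
from_mapping = Counter({'a': 4, 'b': 2})  # a mapping of element -> count
from_kwargs = Counter(a=4, b=2)           # keyword arguments

print(from_iterable['a'])           # 3
print(from_mapping == from_kwargs)  # True

# Elements become dict keys, so an unhashable element (a list) is rejected.
try:
    Counter([['x'], ['y']])
    rejected = False
except TypeError:
    rejected = True
print(rejected)                     # True
```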

class Counter(dict):
    def __init__(*args, **kwds):
        # self is unpacked from *args, which keeps 'self' usable as a keyword name
        if not args:
            raise TypeError("...")
        self, *args = args
        if len(args) > 1:
            raise TypeError("...")

        super(Counter, self).__init__()
        self.update(*args, **kwds)

The initialization code shows that Counter inherits from dict and that, in the end, the arguments are loaded into the Counter through self.update().

def update(*args, **kwds):
    ...  # argument checks
    iterable = args[0] if args else None
    if iterable is not None:
        if isinstance(iterable, _collections_abc.Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.items():
                    self[elem] = count + self_get(elem, 0)
            else:
                # the counter is empty, so a plain dict update is enough
                super(Counter, self).update(iterable)
        else:
            _count_elements(self, iterable)

    if kwds:
        self.update(kwds)

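update accepts an iterable, that is, any object that can be used in a for ... in loop. It then checks whether the iterable is a Mapping; if so, it walks the keys and values with .items(). A dict or Counter instance passed as the argument takes this branch, and the counts are updated with mapping semantics: the values are added to the existing counts.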
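The difference between the two branches can be seen in a short sketch (the sample data is made up for illustration):

```python
from collections import Counter

c = Counter('aab')             # Counter({'a': 2, 'b': 1})

# Mapping argument: each value is added to the existing count.
c.update({'a': 3, 'z': 1})

# Non-mapping iterable: every occurrence counts as 1.
c.update(['b', 'b'])

print(c['a'], c['b'], c['z'])  # 5 3 1
```

Note that this differs from dict.update, which would overwrite the value of 'a' instead of adding to it.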

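Mapping is an abstract base class defined in collections.abc. Here it is not used as a base class to inherit from but as a type check, as in if isinstance(iterable, _collections_abc.Mapping). In spirit this is an application of duck typing: any class that inherits from Mapping or is registered with it is accepted, not only dict.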

To implement a custom dict type, the usual approach is to inherit from collections.UserDict.
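To illustrate how the Mapping check treats a custom dict-like type, here is a small sketch. Inventory is a hypothetical class invented for this example; it is not part of the standard library.

```python
from collections import Counter, UserDict
from collections.abc import Mapping


class Inventory(UserDict):
    """Hypothetical dict-like class, used only for this illustration."""


inv = Inventory({'apple': 2, 'pear': 1})

# UserDict derives from MutableMapping, so the isinstance check succeeds
# even though Inventory is not a dict subclass.
print(isinstance(inv, Mapping))  # True
print(isinstance(inv, dict))     # False

c = Counter(apple=1)
c.update(inv)                    # Mapping branch: values are added as counts
print(c['apple'], c['pear'])     # 3 1
```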

def _count_elements(mapping, iterable):
    'Tally elements from the iterable.'
    mapping_get = mapping.get
    for elem in iterable:
        mapping[elem] = mapping_get(elem, 0) + 1

Non-Mapping arguments are counted by _count_elements, which simply iterates over the argument using the iterator protocol that every iterable implements.

Keyword arguments are passed as a dict to the same update method, so they also take the Mapping branch.
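A short sketch of the keyword path, with illustrative values: inside update, kwds is an ordinary dict, so it is handled exactly like a mapping argument.

```python
from collections import Counter

c = Counter('ab')    # Counter({'a': 1, 'b': 1})

# Inside update(), kwds is the dict {'a': 2, 'c': 5}.
c.update(a=2, c=5)

# A positional iterable is processed first, then the keywords.
c.update('a', b=1)

print(c['a'], c['b'], c['c'])  # 4 2 5
```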

def subtract(*args, **kwds):
    ...  # argument checks
        
    iterable = args[0] if args else None
    if iterable is not None:
        self_get = self.get
        if isinstance(iterable, _collections_abc.Mapping):
            for elem, count in iterable.items():
                self[elem] = self_get(elem, 0) - count
        else:
            for elem in iterable:
                self[elem] = self_get(elem, 0) - 1
    if kwds:
        self.subtract(kwds)

subtract is the inverse of update. The implementation is very similar, with addition replaced by subtraction, and the resulting counts may be zero or negative.

>>> c = Counter("abcd")
>>> c.subtract(d=10)
>>> c
Counter({'a': 1, 'b': 1, 'c': 1, 'd': -9})
>>>

def elements(self):
    return _chain.from_iterable(_starmap(_repeat, self.items()))

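elements turns the Counter into an iterator that repeats each element as many times as its count, skipping elements whose count is zero or negative.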

>>> for e in c.elements():
...     print(e)
...
a
b
c
>>>

_chain.from_iterable(_starmap(_repeat, self.items())) builds the iterator from three itertools functions:

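_repeat: itertools.repeat, creates an iterator that returns the same object repeatedly, repeat('A', 2) => 'A', 'A'
_starmap: itertools.starmap, creates an iterator that calls a function with arguments unpacked from each item of an iterable, starmap(pow, [(2, 3), (3, 2)]) => 8, 9
_chain.from_iterable: itertools.chain.from_iterable, creates a single iterator that chains together the items of every iterable in an iterable, chain.from_iterable(['AB', 'CD']) => 'A', 'B', 'C', 'D'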
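The three functions can be tried step by step outside of Counter. The sketch below uses a hand-written list of pairs standing in for self.items(); the data is made up for illustration.

```python
from itertools import chain, repeat, starmap

# Pairs shaped like Counter.items(): (element, count)
items = [('a', 2), ('b', 1), ('c', 0), ('d', -3)]

# starmap unpacks each pair and calls repeat(element, count).
# repeat yields nothing when the count is zero or negative.
repeated = [list(r) for r in starmap(repeat, items)]
print(repeated)  # [['a', 'a'], ['b'], [], []]

# chain.from_iterable flattens the per-element iterators into one stream.
flat = list(chain.from_iterable(starmap(repeat, items)))
print(flat)      # ['a', 'a', 'b']
```

This is why elements() needs no explicit check for non-positive counts: repeat already produces an empty iterator for them.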

The rest of Counter consists of operators that simplify working with counts: "+", "-", "&", "|" and the corresponding in-place versions "+=", "-=", "&=", "|=".

def __add__(self, other):
    if not isinstance(other, Counter):
        return NotImplemented
    result = Counter()
    for elem, count in self.items():
        newcount = count + other[elem]
        if newcount > 0:
            result[elem] = newcount
    for elem, count in other.items():
        if elem not in self and count > 0:
            result[elem] = count
    return result

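Every non-in-place operator builds a new Counter instance. Addition first sums the counts for every element in self (an element missing from other counts as 0) and keeps only the positive results, then copies over the elements that appear only in other and have a count greater than 0.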
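A small worked example of addition, with counts chosen to hit each branch (the values are illustrative):

```python
from collections import Counter

a = Counter(x=3, y=-2, z=1)
b = Counter(x=1, y=1, w=2, v=-1)

total = a + b
# x: 3 + 1 = 4   kept
# y: -2 + 1 = -1 dropped (not positive)
# z: 1 + 0 = 1   kept (z is missing from b)
# w: only in b, 2 > 0, kept
# v: only in b, -1, dropped
print(sorted(total.items()))  # [('w', 2), ('x', 4), ('z', 1)]
```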

def __sub__(self, other):
    if not isinstance(other, Counter):
        return NotImplemented
    result = Counter()
    for elem, count in self.items():
        newcount = count - other[elem]
        if newcount > 0:
            result[elem] = newcount
    for elem, count in other.items():
        if elem not in self and count < 0:
            result[elem] = 0 - count
    return result

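Non-in-place subtraction subtracts other's counts from self's and keeps a result only if it is greater than 0. If other holds a negative count for an element that is absent from self, the sign is flipped and the positive value is stored in the new Counter.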
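The sign flip for negative counts in the subtrahend is easy to miss, so here is a brief sketch with illustrative values:

```python
from collections import Counter

a = Counter(x=3, y=1)
b = Counter(x=1, y=2, z=-4)

diff = a - b
# x: 3 - 1 = 2   kept
# y: 1 - 2 = -1  dropped (not positive)
# z: absent from a and negative in b, stored as 0 - (-4) = 4
print(sorted(diff.items()))  # [('x', 2), ('z', 4)]
```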

def __or__(self, other):
    if not isinstance(other, Counter):
        return NotImplemented
    result = Counter()
    for elem, count in self.items():
        other_count = other[elem]
        newcount = other_count if count < other_count else count
        if newcount > 0:
            result[elem] = newcount
    for elem, count in other.items():
        if elem not in self and count > 0:
            result[elem] = count
    return result

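The union works as follows: an element present in only one Counter is copied into the new Counter, and for an element present in both, the larger count wins. In either case only counts greater than 0 are kept.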
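A minimal sketch of the union, again with made-up counts:

```python
from collections import Counter

a = Counter(x=3, y=1, n=-1)
b = Counter(x=1, y=2, z=5, m=-2)

union = a | b
# x: max(3, 1) = 3
# y: max(1, 2) = 2
# n: max(-1, 0) = 0, dropped (not positive)
# z: only in b, 5 > 0, kept
# m: only in b, -2, dropped
print(sorted(union.items()))  # [('x', 3), ('y', 2), ('z', 5)]
```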

def __and__(self, other):
    if not isinstance(other, Counter):
        return NotImplemented
    result = Counter()
    for elem, count in self.items():
        other_count = other[elem]
        newcount = count if count < other_count else other_count
        if newcount > 0:
            result[elem] = newcount
    return result

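The intersection takes the elements present in both Counters and puts the smaller of the two counts into the new Counter, again keeping only counts greater than 0.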
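For comparison with the union, a short intersection sketch (the key names only_a and only_b are illustrative):

```python
from collections import Counter

a = Counter(x=3, y=1, only_a=2)
b = Counter(x=1, y=2, only_b=4)

common = a & b
# x: min(3, 1) = 1
# y: min(1, 2) = 1
# only_a: min(2, 0) = 0, dropped
# only_b: never visited, because the loop runs over a only
print(sorted(common.items()))  # [('x', 1), ('y', 1)]
```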

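What remains are the corresponding in-place methods, which modify the existing Counter instead of creating a new one. The implementations are similar, but they finish by calling self._keep_positive() so that the returned Counter contains no zero or negative counts.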

def _keep_positive(self):
    nonpositive = [elem for elem, count in self.items() if not count > 0]
    for elem in nonpositive:
        del self[elem]
    return self

def __iadd__(self, other):
    for elem, count in other.items():
        self[elem] += count
    return self._keep_positive()
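The in-place behaviour can be observed with a short sketch; alias is just a second name for the same object, introduced here to show that no new Counter is created.

```python
from collections import Counter

c = Counter(x=2, y=1)
alias = c                        # second reference to the same object

c += Counter(x=-2, y=3, z=-1)
# x: 2 + (-2) = 0   removed by _keep_positive
# y: 1 + 3 = 4      kept
# z: 0 + (-1) = -1  removed by _keep_positive

print(c)           # Counter({'y': 4})
print(alias is c)  # True
```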

The last method, and the most frequently used one, is most_common(), which returns the n elements with the highest counts.

def most_common(self, n=None):
    if n is None:
        return sorted(self.items(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.items(), key=_itemgetter(1))

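The implementation is straightforward: when n is None, all items are sorted by count in descending order; otherwise heapq.nlargest is used to fetch the n items with the largest counts.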
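Both paths can be reproduced by hand. The sketch below calls heapq.nlargest directly with the same key function to show that it matches most_common(n); the variable names are illustrative.

```python
import heapq
from collections import Counter
from operator import itemgetter

words = Counter('abracadabra')

top3 = words.most_common(3)
print(top3)  # [('a', 5), ('b', 2), ('r', 2)]

# Same result using heapq.nlargest with the count as the key
by_heap = heapq.nlargest(3, words.items(), key=itemgetter(1))
print(by_heap == top3)  # True

# With n omitted, every item is returned, sorted by count
everything = words.most_common()
print(len(everything))  # 5
```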

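To sum up, Counter is a dict subclass that stores each element as a key, so every countable element must be hashable. The core method is update(), which handles each of the accepted argument types in a duck-typing style. The overloaded operators always make sure that the result has no negative counts; negative counts typically come from calling subtract(), although update() with negative values or direct assignment can produce them as well. So when iterating over the elements of a Counter that may hold zero or negative counts, use c.elements() rather than c.items().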
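The difference between the two ways of iterating can be seen in a short sketch that reuses the subtract example from earlier:

```python
from collections import Counter

c = Counter('abcd')
c.subtract(d=10)

# items() exposes the raw counts, including the negative one
raw = list(c.items())
print(raw)       # [('a', 1), ('b', 1), ('c', 1), ('d', -9)]

# elements() skips zero and negative counts
positive = list(c.elements())
print(positive)  # ['a', 'b', 'c']
```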
