[Optimized Python] 17 - Bottle-neck in performance

時間 2019-11-18

標籤 optimized python bottle neck performance 欄目 Python 简体版

原文原文鏈接

前言

對於一門編程語言，沒接觸到「優化」和「庫代碼」的源碼理解程度，不足以談「掌握」二字。html

本篇的學習筆記，同時也意味着本身終於觸及到了Python的junior國際水準。(joke)python

要學的東西有不少，尤爲是年紀大了後對編程語言要有一個「體系」：[Code] 大蛇之數據工程，方便本身複習。git

編程語言的理解沒大問題後，就是穩固「實踐能力」，將編程模塊化：[Code] 變態之人鍵合一，編程變爲「堆積木」的效果。github

工程能力的最終目的仍是迎合市場，解決現實中的問題：[AI] 深度數據 - Data，產出高質量代碼。數據庫

要學的清楚，走的明白。可見，大體的套路即是這三步走：編程

充分認識一門編程語言特性。
大量實踐之並達到手腦協調。
基於合適平臺解決市場問題。

經過性能分析找到程序瓶頸，讀完本篇後你將可以回答下列問題：bootstrap

如何找到代碼中速度和 RAM 的瓶頸？
如何分析 CPU 和內存使用狀況？
我應該分析到什麼深度？
如何分析一個長期運行的應用程序？
在 CPython 檯面下發生了什麼？
如何在調整性能的同時確保功能的正確？

度量時間

1、程序時間計量

"一次時間" 計量

經過「裝飾器」方法，只需給所測量的函數「戴個帽子」便可。ubuntu

import math
import time
from functools import wraps
  
def fn_timer(function):

    @wraps(function)     def function_timer(*args, **kwargs):
        start = time.time()
        result = function(*args, **kwargs)
        end   = time.time()
        print ("Total time: %s seconds" % (str(end-start)))
        return result

    return function_timer

「平均時間」計量

ipython命令行模式能夠直接支持，獲得更爲準確的「平均時間」。數組

import timeit

%timeit calc_pure_python(desired_width=1000, max_iterations=300)

"運行時細節" 計量

細分爲用戶、系統時間，貌似意義不大。緩存

$ /usr/bin/time -p python julia1_nopil.py

Length of x: 1000
Total elements: 1000000
calculate_z_serial_purepython took 12.7298331261 seconds

real 13.46
user 13.40
sys 0.04

2、函數時間計量

cProfile 是標準庫內建的三個分析工具之一，另外兩個是 hotshot 和 profile。

hotshot 還處於實驗階段， profile 則是原始的純 Python 分析器。

cProfile 具備跟 profile 同樣的接口，且是默認的分析工具。

import math
import time
from functools import wraps
  
def fn_timer(function):

    @wraps(function)
    def function_timer(*args, **kwargs):
        start = time.time()
        result = function(*args, **kwargs)
        end   = time.time()
        print ("Total time: %s seconds" % (str(end-start)))
        return result

    return function_timer


@fn_timer
def check_prime(number):
    sqrt_number = math.sqrt(number)
    number_float = float(number)
    for i in range(2, int(sqrt_number) + 1):
        if (number_float / i).is_integer():
            return False
    return True

if __name__ == "__main__":
    print("check_prime(10000000) = ", check_prime(10000000))
    print("check_prime(10000019) = ", check_prime(10000019))

完整測試代碼

函數運行時的詳細計量結果。

$  python -m cProfile -s cumulative 70-measure.py 

Total time: 1.239776611328125e-05 seconds
check_prime(10000000) =  False
Total time: 0.0006530284881591797 seconds
check_prime(10000019) =  True
         3412 function calls in 0.001 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        1    0.000    0.000    0.001    0.001 70-measure.py:5(<module>)　　# 進入__main__函數 2    0.000    0.000    0.001    0.000 70-measure.py:11(function_timer)
        2    0.000    0.000    0.001    0.000 70-measure.py:22(check_prime)
        1    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:978(_find_and_load)

可視化 runsnake，安裝過程以下：

sudo apt-get update
sudo apt-get install runsnakerun

3、逐行分析

分析CPU密集型的性能問題，速度比較慢，因此一次測試僅考慮部分關鍵函數。

用修飾器（@profile）標記選中的函數。用 kernprof.py 腳本運行你的代碼，被選函數每一行花費的 CPU 時間以及其餘信息就會被記錄下來。

$ kernprof -l -v 70-measure.py

Total time: 5.10215759277e-05 seconds
('check_prime(10000000) = ', False)
Total time: 0.000823020935059 seconds
('check_prime(10000019) = ', True)
Wrote profile results to 70-measure.py.lprof
Timer unit: 1e-06 s

Total time: 0.000898 s
File: 70-measure.py
Function: function_timer at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                               @wraps(function)
    12                                               def function_timer(*args, **kwargs):
    13         2          2.0      1.0      0.2          start = time.time()
    14         2        870.0    435.0     96.9          result = function(*args, **kwargs)
    15         2          1.0      0.5      0.1          end   = time.time()
    16         2         24.0     12.0      2.7          print ("Total time: %s seconds" % (str(end-start)))
    17         2          1.0      0.5      0.1          return result

度量空間

咱們能不能重寫這個函數讓它使用更少的 RAM 來工做得更有效率？

咱們能不能使用更多 RAM 緩存來節省 CPU 週期？

1、進程內存使用

memory_profiler

安裝：https://anaconda.org/anaconda/memory_profiler

再強調一遍，這個數字不必定是數組的真實大小，只是進程在建立這些列表的過程當中增加的大小。

$ python -m memory_profiler 70-measure.py

Total time: 0.00010919570922851562 seconds
check_prime(10000000) =  False
Total time: 0.0031642913818359375 seconds
check_prime(10000019) =  True
Filename: 70-measure.py

Line #    Mem usage    Increment   Line Contents
================================================
    11   36.855 MiB   36.855 MiB       @wraps(function)
    12                                 def function_timer(*args, **kwargs):
    13   36.855 MiB    0.000 MiB           start = time.time()
    14   36.855 MiB    0.000 MiB           result = function(*args, **kwargs)
    15   36.855 MiB    0.000 MiB           end   = time.time()
    16   36.855 MiB    0.000 MiB           print ("Total time: %s seconds" % (str(end-start)))
    17   36.855 MiB    0.000 MiB           return result

mprof

memory_profiler有一個功能叫 mprof，用於對內存使用狀況進行採樣和畫圖。它的採樣基於時間而不是代碼行，於是不會影響代碼的運行時間。

mprof run 70-measure.py 

mprof plot 

mprof clean

ipython_memory_usage

經過一例子，查看一個模塊會吃掉多少內存。

Below you can see an example showing exact memory usage with and without __slots__ done in IPython thanks to https://github.com/ianozsvald/ipython_memory_usage

In [1]: import ipython_memory_usage.ipython_memory_usage as imu

In [2]: imu.start_watching_memory()
In [2] used 0.0000 MiB RAM in 5.31s, peaked 0.00 MiB above current, total RAM usage 15.57 MiB

---------------------------------------------------------------------------------------------

In [3]: %cat slots.py
class MyClass(object):
        __slots__ = ['name', 'identifier']
        def __init__(self, name, identifier):
                self.name = name
                self.identifier = identifier

num = 1024*256
x = [MyClass(1,1) for i in range(num)]
In [3] used 0.2305 MiB RAM in 0.12s, peaked 0.00 MiB above current, total RAM usage 15.80 MiB

In [4]: from slots import *
In [4] used 9.3008 MiB RAM in 0.72s, peaked 0.00 MiB above current, total RAM usage 25.10 MiB

---------------------------------------------------------------------------------------------

In [5]: %cat noslots.py
class MyClass(object):
        def __init__(self, name, identifier):
                self.name = name
                self.identifier = identifier

num = 1024*256
x = [MyClass(1,1) for i in range(num)]
In [5] used 0.1758 MiB RAM in 0.12s, peaked 0.00 MiB above current, total RAM usage 25.28 MiB

In [6]: from noslots import *
In [6] used 22.6680 MiB RAM in 0.80s, peaked 0.00 MiB above current, total RAM usage 47.95 MiB

2、內存堆上的對象

Ref: Getting started with Heapy

In [8]: import sys                                                                             

In [9]: sys.getsizeof(set())                                                                   
Out[9]: 224

In [10]: sys.getsizeof({})                                                                     
Out[10]: 240

In [11]: sys.getsizeof([])                                                                     
Out[11]: 64

Diagnosing Memory 「Leaks」 in Python

經過」差值" 判斷其間代碼帶來的影響，這裏的「減法」就是精髓。

from guppy import hpy
hp = hpy()
before = hp.heap()

# critical section here

after = hp.heap()
leftover = after - before

The best heapy tutorial I have found is How to use guppy/heapy for tracking down memory usage.

3、追蹤對象

Robert Brewer 的 dowser 能夠在代碼運行時鉤入名字空間並經過 CherryPy 接口 在一個 Web 服務器上提供一個實時的變量實例圖。

每一個被追蹤對象都有一個走勢圖，讓你能夠看到某個對象的數量是否在增加，這在分析長期運行的進程時頗有用。

安裝對應的python3相配的packages。

pip install dowser-py3
pip3 install cherrypy

相似 heapy, 在正式代碼開始前加入以下代碼：

Ref: Debugging Python’s Memory Usage with Dowser

Then doing launch_memory_usage_server() somewhere early in my code launched the HTTP interface (http://localhost:8080/) to see memory usage while the import process was running. This helped me narrow down where the issue showed up (as we were leaking MySQLdb cursors at an alarming rate), and digging deeper into the structure hinted to the underlying cause (the debug toolbar was active for a console script).

添加以下代碼：

def launch_memory_usage_server(port = 8080):
    import cherrypy
    import dowser

    cherrypy.tree.mount(dowser.Root())
    cherrypy.config.update({
        'environment': 'embedded',
        'server.socket_port': port
    })
    
    cherrypy.engine.start()

在網頁端能看到一大片的objects，以下：

系統級性能剖析

Perf是一款綜合性分析工具，大到系統全局性性能，再小到進程線程級別，甚至到函數及彙編級別。

Goto: 系統級性能分析工具perf的介紹與使用【很是強大的工具，此處只關注一部分】

1、使用方法

$ sudo perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branches,branch-misses,task-clock,faults,minor-faults,cs,migrations -r 3 python diffusion_python_memory.py

 Performance counter stats for 'python diffusion_python_memory.py' (3 runs):

         224752582      cycles                    #    3.186 GHz                      ( +-  0.71% )  (65.99%)
   <not supported>      stalled-cycles-frontend                                     
   <not supported>      stalled-cycles-backend                                      
         668607787      instructions              #    2.97  insn per cycle           ( +-  0.63% )  (83.00%)
           1696860      cache-references          #   24.054 M/sec                    ( +-  6.48% )  (82.99%)
            216056      cache-misses              #   12.733 % of all cache refs      ( +-  7.12% )  (83.00%)
         124467767      branches                  # 1764.388 M/sec                    ( +-  0.94% )  (84.02%)
            272637      branch-misses             #    0.22% of all branches          ( +-  9.35% )  (84.00%)
             70.54 msec task-clock                #    0.996 CPUs utilized            ( +-  2.57% )
               801      faults                    #    0.011 M/sec                  
               801      minor-faults              #    0.011 M/sec                  
                 2      cs                        #    0.024 K/sec                    ( +- 20.00% )
                 0      migrations                #    0.000 K/sec                  

           0.07081 +- 0.00184 seconds time elapsed  ( +-  2.59% )

2、根據 perf 輸出作出抉擇

考慮連續內存數據結構 (以下)；而 List 其實是「指針數組」。

import array

因爲實現細節，使用 array 類型來建立數據列表實際上比使用 list 要慢；array 對象不適用於數學計算而更適用於在內存中存放類型固定的數據。

爲了解決 perf 發現的分片問題，咱們必須找到一個能夠進行高效矢量操做的模塊。

幸運的是， numpy 擁有咱們須要的全部特性——它能將數據連續存儲在內存中並支持數據的矢量操做。

結果就是，任何咱們對 numpy 數組的數學操做都能自動矢量化而無須咱們顯式遍歷每個元素。

更多內容，請見：[Pandas] 01 - A guy based on NumPy

單元測試

程序能一次寫完並正常運行的機率很小，基本不超過1%。總會有各類各樣的bug須要修正。

單元測試部分代碼正確性的同時，也能夠測試其性能瓶頸。

More details: unittest --- 單元測試框架

`1、測試對象` `mydict.py`

編寫一個Dict類，行爲和dict一致。

class Dict(dict):

    def __init__(self, **kw):
        super().__init__(**kw)

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(r"'Dict' object has no attribute '%s'" % key)

    def __setattr__(self, key, value):
        self[key] = value

2、測試程序 mydict_test.py

規則：以test開頭的方法就是測試方法，不以test開頭的方法不被認爲是測試方法，測試的時候不會被執行。

設想你的測試須要啓動一個數據庫，這時，就能夠在setUp()方法中鏈接數據庫，在tearDown()方法中關閉數據庫，這樣，沒必要在每一個測試方法中重複相同的代碼。

執行順序以下：setUp---test_init---setUp---test_key---setUp---test_attr---setUp---test_keyerror---setUp---test_attrerror---tearDown

import unittest from mydict import Dict class TestDict(unittest.TestCase):

    def setUp():
        pass

    def tearDown():
        pass

    def test_init(self):
        d = Dict(a=1, b='test')
        self.assertEqual(d.a, 1)
        self.assertEqual(d.b, 'test')
        self.assertTrue(isinstance(d, dict))

    def test_key(self):
        d = Dict()
        d['key'] = 'value'
        self.assertEqual(d.key, 'value')

    def test_attr(self):
        d = Dict()
        d.key = 'value'
        self.assertTrue('key' in d)
        self.assertEqual(d['key'], 'value')

    def test_keyerror(self):
        d = Dict()
        with self.assertRaises(KeyError):
            value = d['empty']　　　　　　　　　　# <---- 訪問不存在的key時但願拋出KeyError異常 def test_attrerror(self):
        d = Dict()
        with self.assertRaises(AttributeError):
            value = d.empty

3、開始測試 unittest.main()

最簡單的運行方式是在mydict_test.py的最後加上兩行代碼：

if __name__ == '__main__':
    unittest.main()

另外一種方法是在命令行經過參數-m unittest直接運行單元測試：

$ python -m unittest mydict_test
.....
----------------------------------------------------------------------
Ran 5 tests in 0.000s

OK

4、文檔測試

Python內置的「文檔測試」（doctest）模塊能夠直接提取註釋中的代碼並執行測試。

# mydict2.py
class Dict(dict):
    '''
    Simple dict but also support access as x.y style.

    >>> d1 = Dict()
    >>> d1['x'] = 100
    >>> d1.x
    100
    >>> d1.y = 200
    >>> d1['y']
    200
    >>> d2 = Dict(a=1, b=2, c='3')
    >>> d2.c
    '3'
    >>> d2['empty']
    Traceback (most recent call last):
        ...
    KeyError: 'empty'
    >>> d2.empty
    Traceback (most recent call last):
        ...
    AttributeError: 'Dict' object has no attribute 'empty'
    '''
    def __init__(self, **kw):
        super(Dict, self).__init__(**kw)

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(r"'Dict' object has no attribute '%s'" % key)

    def __setattr__(self, key, value):
        self[key] = value


if __name__=='__main__':
    import doctest
    doctest.testmod()