測試環境：Ubuntu16.04（在VMWare虛擬機使用perf top存在沒法顯示問題）html

Kernel：3.13.0-32前端

系統級性能優化一般包括兩個階段：性能剖析（performance profiling）和代碼優化。性能剖析的目標是尋找性能瓶頸，查找引起性能問題的緣由及熱點代碼。代碼優化的目標是針對具體性能問題而優化代碼或編譯選項，以改善軟件性能。java

在性能剖析階段，須要藉助於現有的profiling工具，如perf等。在代碼優化階段每每須要藉助開發者的經驗，編寫簡潔高效的代碼，甚至在彙編級別合理使用各類指令，合理安排各類指令的執行順序。node

perf是一款Linux性能分析工具。Linux性能計數器是一個新的基於內核的子系統，它提供一個性能分析框架，好比硬件（CPU、PMU(Performance Monitoring Unit)）功能和軟件(軟件計數器、tracepoint)功能。
經過perf，應用程序能夠利用PMU、tracepoint和內核中的計數器來進行性能統計。它不但能夠分析制定應用程序的性能問題（per thread），也能夠用來分析內核的性能問題，固然也能夠同事分析應用程序和內核，從而全面理解應用程序中的性能瓶頸。python

使用perf，能夠分析程序運行期間發生的硬件事件，好比instructions retired、processor clock cycles等；也能夠分析軟件時間，好比page fault和進程切換。linux

背景知識

tracepoints

tracepoints是散落在內核源碼中的一些hook，它們能夠在特定的代碼被執行到時觸發，這一特定能夠被各類trace/debug工具所使用。程序員

perf將tracepoint產生的時間記錄下來，生成報告，經過分析這些報告，條有人緣即可以瞭解程序運行期間內核的各類細節，對性能症狀作出準確的診斷。算法

硬件特性之cache

內存讀寫是很快的，可是仍是沒法和處理器指令執行速度相比。爲了從內存中讀取指令和數據，處理器須要等待，用處理器時間來衡量，這種等待很是漫長。cache是一種SRAM，讀寫速度很是快，能和處理器相匹配。所以將經常使用的數據保存在cache中，處理器便無需等待，從而提升性能。cache的尺寸通常都很小，充分利用cache是軟件調優很是重要部分。ubuntu

主要關注點

基於性能分析，能夠進行算法優化（空間複雜度和時間複雜度權衡）、代碼優化（提升執行速度、減小內存佔用）。api

評估程序對硬件資源的使用狀況，例如各級cache的訪問次數、各級cache的丟失次數、流水線停頓週期、前端總線訪問次數等。

評估程序對操做系統資源的使用狀況，系統調用次數、上下文切換次數、任務遷移次數。

事件能夠分爲三種：

Hardware Event由PMU部件產生，在特定的條件下探測性能事件是否發生以及發生的次數。好比cache命中。
Software Event是內核產生的事件，分佈在各個功能模塊中，統計和操做系統相關性能事件。好比進程切換，tick數等。
Tracepoint Event是內核中靜態tracepoint所觸發的事件，這些tracepoint用來判斷程序運行期間內核的行爲細節，好比slab分配器的分配次數等。

perf的使用

perf --help以後能夠看到perf的二級命令。

序號	命令	做用
1	annotate	解析perf record生成的perf.data文件，顯示被註釋的代碼。
2	archive	根據數據文件記錄的build-id，將全部被採樣到的elf文件打包。利用此壓縮包，能夠再任何機器上分析數據文件中記錄的採樣數據。
3	bench	perf中內置的benchmark，目前包括兩套針對調度器和內存管理子系統的benchmark。
4	buildid-cache	管理perf的buildid緩存，每一個elf文件都有一個獨一無二的buildid。buildid被perf用來關聯性能數據與elf文件。
5	buildid-list	列出數據文件中記錄的全部buildid。
6	diff	對比兩個數據文件的差別。可以給出每一個符號（函數）在熱點分析上的具體差別。
7	evlist	列出數據文件perf.data中全部性能事件。
8	inject	該工具讀取perf record工具記錄的事件流，並將其定向到標準輸出。在被分析代碼中的任何一點，均可以向事件流中注入其它事件。
9	kmem	針對內核內存（slab）子系統進行追蹤測量的工具
10	kvm	用來追蹤測試運行在KVM虛擬機上的Guest OS。
11	list	列出當前系統支持的全部性能事件。包括硬件性能事件、軟件性能事件以及檢查點。
12	lock	分析內核中的鎖信息，包括鎖的爭用狀況，等待延遲等。
13	mem	內存存取狀況
14	record	收集採樣信息，並將其記錄在數據文件中。隨後可經過其它工具對數據文件進行分析。
15	report	讀取perf record建立的數據文件，並給出熱點分析結果。
16	sched	針對調度器子系統的分析工具。
17	script	執行perl或python寫的功能擴展腳本、生成腳本框架、讀取數據文件中的數據信息等。
18	stat	執行某個命令，收集特定進程的性能概況，包括CPI、Cache丟失率等。
19	test	perf對當前軟硬件平臺進行健全性測試，可用此工具測試當前的軟硬件平臺是否能支持perf的全部功能。
20	timechart	針對測試期間系統行爲進行可視化的工具
21	top	相似於linux的top命令，對系統性能進行實時分析。
22	trace	關於syscall的工具。
23	probe	用於定義動態檢查點。

下面開始逐一分析。

perf list

perf list不能徹底顯示全部支持的事件類型，須要sudo perf list。

同時還能夠顯示特定模塊支持的perf事件,hw/cache/pmu都是硬件相關的；tracepoint基於內核的ftrace；sw其實是內核計數器。

hw/hardware顯示支持的硬件事件相關，如：

al@ubuntu:~/common-use/example/perf-test$ perf list hardware
cpu-cycles OR cycles                               [Hardware event]
instructions                                       [Hardware event]
cache-references                                   [Hardware event]
cache-misses                                       [Hardware event]
branch-instructions OR branches                    [Hardware event]
branch-misses                                      [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
stalled-cycles-backend OR idle-cycles-backend      [Hardware event]

sw/software顯示支持的軟件事件列表：

al@ubuntu:~/common-use/example/perf-test$ perf list sw
cpu-clock                                          [Software event]
task-clock                                         [Software event]
page-faults OR faults                              [Software event]
context-switches OR cs                             [Software event]
cpu-migrations OR migrations                       [Software event]
minor-faults                                       [Software event]
major-faults                                       [Software event]
alignment-faults                                   [Software event]
emulation-faults                                   [Software event]
dummy                                              [Software event]

cache/hwcache顯示硬件cache相關事件列表：

al@ubuntu:~/common-use/example/perf-test$ perf list cache
L1-dcache-loads                                    [Hardware cache event]
L1-dcache-load-misses                              [Hardware cache event]
L1-dcache-stores                                   [Hardware cache event]
L1-dcache-prefetches                               [Hardware cache event]
L1-dcache-prefetch-misses                          [Hardware cache event]
L1-icache-loads                                    [Hardware cache event]
L1-icache-load-misses                              [Hardware cache event]
L1-icache-prefetches                               [Hardware cache event]
LLC-loads                                          [Hardware cache event]
LLC-load-misses                                    [Hardware cache event]
LLC-stores                                         [Hardware cache event]
dTLB-loads                                         [Hardware cache event]
dTLB-load-misses                                   [Hardware cache event]
iTLB-loads                                         [Hardware cache event]
iTLB-load-misses                                   [Hardware cache event]
branch-loads                                       [Hardware cache event]
branch-load-misses                                 [Hardware cache event]
node-loads                                         [Hardware cache event]
node-load-misses                                   [Hardware cache event]

pmu顯示支持的PMU事件列表：

al@ubuntu:~/common-use/example/perf-test$ perf list pmu
branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
branch-misses OR cpu/branch-misses/                [Kernel PMU event]
cache-misses OR cpu/cache-misses/                  [Kernel PMU event]
cache-references OR cpu/cache-references/          [Kernel PMU event]
cpu-cycles OR cpu/cpu-cycles/                      [Kernel PMU event]
instructions OR cpu/instructions/                  [Kernel PMU event]

tracepoint顯示支持的全部tracepoint列表，這個列表就比較龐大：

drm:drm_vblank_event                               [Tracepoint event]
drm:drm_vblank_event_queued                        [Tracepoint event]
drm:drm_vblank_event_delivered                     [Tracepoint event]
skb:kfree_skb                                      [Tracepoint event]
skb:consume_skb                                    [Tracepoint event]
skb:skb_copy_datagram_iovec                        [Tracepoint event]
net:net_dev_xmit                                   [Tracepoint event]
net:net_dev_queue                                  [Tracepoint event]
net:netif_receive_skb                              [Tracepoint event]
net:netif_rx                                       [Tracepoint event]
napi:napi_poll                                     [Tracepoint event]
…

perf top

默認狀況下perf top是沒法顯示信息的，須要sudo perf top或者echo -1 > /proc/sys/kernel/perf_event_paranoid(在Ubuntu16.04，還須要echo 0 > /proc/sys/kernel/kptr_restrict)。

便可以正常顯示perf top以下：

第一列：符號引起的性能事件的比例，指佔用的cpu週期比例。

第二列：符號所在的DSO(Dynamic Shared Object)，能夠是應用程序、內核、動態連接庫、模塊。

第三列：DSO的類型。[.]表示此符號屬於用戶態的ELF文件，包括可執行文件與動態連接庫)。[k]表述此符號屬於內核或模塊。

第四列：符號名。有些符號不能解析爲函數名，只能用地址表示。

關於perf top界面經常使用命令以下：

h：顯示幫助，便可顯示詳細的幫助信息。

UP/DOWN/PGUP/PGDN/SPACE：上下和翻頁。

a：annotate current symbol，註解當前符號。可以給出彙編語言的註解，給出各條指令的採樣率。

d：過濾掉全部不屬於此DSO的符號。很是方便查看同一類別的符號。

P：將當前信息保存到perf.hist.N中。

perf top經常使用選項有：

-e <event>：指明要分析的性能事件。

-p <pid>：Profile events on existing Process ID (comma sperated list). 僅分析目標進程及其建立的線程。

-k <path>：Path to vmlinux. Required for annotation functionality. 帶符號表的內核映像所在的路徑。

-K：不顯示屬於內核或模塊的符號。

-U：不顯示屬於用戶態程序的符號。

-d <n>：界面的刷新週期，默認爲2s，由於perf top默認每2s從mmap的內存區域讀取一次性能數據。

-g：獲得函數的調用關係圖。

perf top -g [fractal]，路徑機率爲相對值，加起來爲100%，調用順序爲從下往上。

perf top -g graph，路徑機率爲絕對值，加起來爲該函數的熱度。

perf stat

perf stat用於運行指令，並分析其統計結果。雖然perf top也能夠指定pid，可是必須先啓動應用才能查看信息。

perf stat能完整統計應用整個生命週期的信息。

命令格式爲：

perf stat [-e <EVENT> | --event=EVENT] [-a] <command>
perf stat [-e <EVENT> | --event=EVENT] [-a] — <command> [<options>]

下面簡單看一下perf stat ls的輸出：

Performance counter stats for 'ls':

          1.051536      task-clock (msec)         #    0.670 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                93      page-faults               #    0.088 M/sec
         1,942,492      cycles                    #    1.847 GHz
           532,500      stalled-cycles-frontend   #   27.41% frontend cycles idle
           724,600      stalled-cycles-backend    #   37.30% backend cycles idle
     <not counted>      instructions               (0.00%)
     <not counted>      branches                   (0.00%)
     <not counted>      branch-misses              (0.00%)

       0.001569326 seconds time elapsed

task-clock：任務真正佔用的處理器時間，單位爲ms。CPUs utilized = task-clock / time elapsed，CPU的佔用率。

context-switches：程序在運行過程當中上下文的切換次數。

CPU-migrations：程序在運行過程當中發生的處理器遷移次數。Linux爲了維持多個處理器的負載均衡，在特定條件下會將某個任務從一個CPU遷移到另外一個CPU。

CPU遷移和上下文切換：發生上下文切換不必定會發生CPU遷移，而發生CPU遷移時確定會發生上下文切換。發生上下文切換有可能只是把上下文從當前CPU中換出，下一次調度器仍是將進程安排在這個CPU上執行。

page-faults：缺頁異常的次數。當應用程序請求的頁面還沒有創建、請求的頁面不在內存中，或者請求的頁面雖然在內存中，但物理地址和虛擬地址的映射關係還沒有創建時，都會觸發一次缺頁異常。另外TLB不命中，頁面訪問權限不匹配等狀況也會觸發缺頁異常。

cycles：消耗的處理器週期數。若是把被ls使用的cpu cycles當作是一個處理器的，那麼它的主頻爲2.486GHz。能夠用cycles / task-clock算出。

stalled-cycles-frontend：指令讀取或解碼的質量步驟，未能按理想狀態發揮並行左右，發生停滯的時鐘週期。

stalled-cycles-backend：指令執行步驟，發生停滯的時鐘週期。

instructions：執行了多少條指令。IPC爲平均每一個cpu cycle執行了多少條指令。

branches：遇到的分支指令數。branch-misses是預測錯誤的分支指令數。

其餘經常使用參數

    -a, --all-cpus        顯示全部CPU上的統計信息
    -C, --cpu <cpu>       顯示指定CPU的統計信息
    -c, --scale           scale/normalize counters
    -D, --delay <n>       ms to wait before starting measurement after program start
    -d, --detailed        detailed run - start a lot of events
    -e, --event <event>   event selector. use 'perf list' to list available events
    -G, --cgroup <name>   monitor event in cgroup name only
    -g, --group           put the counters into a counter group
    -I, --interval-print <n>
                          print counts at regular interval in ms (>= 10)
    -i, --no-inherit      child tasks do not inherit counters
    -n, --null            null run - dont start any counters
    -o, --output <file>   輸出統計信息到文件
    -p, --pid <pid>       stat events on existing process id
    -r, --repeat <n>      repeat command and print average + stddev (max: 100, forever: 0)
    -S, --sync            call sync() before starting a run
    -t, --tid <tid>       stat events on existing thread id
    -T, --transaction     hardware transaction statistics
    -v, --verbose         be more verbose (show counter open errors, etc)
    -x, --field-separator <separator>
                          print counts with custom separator
        --append          append to the output file
        --filter <filter>
                          event filter
        --log-fd <n>      log output to fd, instead of stderr
        --per-core        aggregate counts per physical processor core
        --per-socket      aggregate counts per processor socket
        --per-thread      aggregate counts per thread
        --post <command> 執行待測程序前執行的程序
        --pre <command>   執行待測程序後執行的程序

示例

前面統計程序的示例，下面看一下統計CPU信息的示例：

執行sudo perf stat -C 0，統計CPU 0的信息。想要中止後，按下Ctrl+C終止。能夠看到統計項同樣，只是統計對象變了。

Performance counter stats for 'CPU(s) 0':

      10164.997531      task-clock (msec)         #    1.000 CPUs utilized            (100.00%)
             1,147      context-switches          #    0.113 K/sec                    (100.00%)
                62      cpu-migrations            #    0.006 K/sec                    (100.00%)
                76      page-faults               #    0.007 K/sec
       353,147,998      cycles                    #    0.035 GHz                      (49.97%)
        55,541,309      stalled-cycles-frontend   #   15.73% frontend cycles idle     (50.00%)
     1,163,503,673      stalled-cycles-backend    # 329.47% backend cycles idle     (50.04%)
        51,713,115      instructions              #    0.15 insns per cycle
                                                  #   22.50 stalled cycles per insn (50.04%)
        10,334,910      branches                  #    1.017 M/sec                    (50.00%)
         1,980,209      branch-misses             #   19.16% of all branches          (49.96%)

      10.163701471 seconds time elapsed

若是須要統計更多的項，須要使用-e，如：

perf stat -e task-clock,context-switches,cpu-migrations,page-faults,cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,dTLB-loads,dTLB-load-misses ls

結果以下，關注的特殊項也歸入統計。

Performance counter stats for 'ls':

          0.976790      task-clock (msec)         #    0.618 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                92      page-faults               #    0.094 M/sec
         1,801,371      cycles                    #    1.844 GHz
           458,450      stalled-cycles-frontend   #   25.45% frontend cycles idle
           665,732      stalled-cycles-backend    #   36.96% backend cycles idle
     <not counted>      instructions               (0.00%)
     <not counted>      branches                   (0.00%)
     <not counted>      branch-misses              (0.00%)
     <not counted>      L1-dcache-loads            (0.00%)
     <not counted>      L1-dcache-load-misses      (0.00%)
     <not counted>      LLC-loads                  (0.00%)
     <not counted>      LLC-load-misses            (0.00%)
     <not counted>      dTLB-loads                 (0.00%)
     <not counted>      dTLB-load-misses           (0.00%)

       0.001580341 seconds time elapsed

perf record

運行一個命令，並將其數據保存到perf.data中。隨後，可使用perf report進行分析。

perf record和perf report能夠更精確的分析一個應用，perf record能夠精確到函數級別。而且在函數裏面混合顯示彙編語言和代碼。

建立一個test1.c文件用於測試：

void longa()
{
int i,j;
for(i = 0; i < 10000000; i++)
    j=i; //am I silly or crazy? I feel boring and desperate.
}

void funcA()
{
int i;

for(i=0 ; i < 10; i++)
       longa();
}

void funcB()
{
int i;

for(i = 0; i< 100; i++)
     longa();
}

int main(void)
{
funcA();
funcB();
}

編譯test1.c文件gcc test1.c -o test1 -g -O0，同時可使用此方法分析是否選擇編譯優化產生的結果。-g是隻能callgraph功能，-O0是關閉優化。

經常使用選項

       -e record指定PMU事件
       --filter event事件過濾器
       -a 錄取全部CPU的事件
       -p 錄取指定pid進程的事件
       -o 指定錄取保存數據的文件名
       -g 使能函數調用圖功能
       -C 錄取指定CPU的事件

sudo perf record -g ./test1

會在當前目錄生成perf.data文件。

perf record -e cpu-clock ./test1 之路去cpu-clock事件

perf record -e raw_syscalls:sys_enter ./test1錄取sys_enter系統調用事件

perf record -g -o perf.test1.data ./test1 保存錄取信息到perf.test1.data中

perf report

解析perf record產生的數據，並給出分析結果。

經常使用參數：

-i 導入的數據文件名稱，若是沒有則默認爲perf.data

-g 生成函數調用關係圖

--sort 從更高層面顯示分類統計信息，好比： pid, comm, dso, symbol, parent, cpu,socket, srcline, weight, local_weight.

執行sudo perf report -i perf.data，能夠看出main函數所佔百分比，以及funcA和funcB分別所佔百分比。

在funcB執行過程當中，還產生了apic timer，佔用了一部分cpu資源。除此以外，佔比基本上是1：10。

funcA和funcB的佔比，基本符合預期。那麼進入longa，分析一下熱點。

在C和彙編混合顯示界面，能夠看出for循環佔用了69.92%，j=i賦值佔用了30.08%。

根據之上描述，能夠看出top適合監控整個系統的性能，stat比較適合單個程序的性能分析，record/report更適合對程序進行更細粒度的分析。

perf timechart

perf timechart是將以前的各類統計信息圖形化的一個工具。

perf timechart record <option> <command>用於記錄整個系統或者一個應用的事件，還能夠加option記錄指定類型的事件。

perf timechart用於將perf.data轉換成SVG格式的文件，SVG能夠經過Inkscape或者瀏覽器打開。

perf timechart record能夠指定特定類型的事件：

-P：記錄power相關事件

-T：記錄任務相關事件

-I：記錄io相關事件

-g：記錄函數調用關係

perf timechart用於將perf timechart record錄取的perf.data轉換成output.svg。

       -o 指定輸出文件名
       -i 指定待解析的文件名
       -w 輸出SVG文件寬度
       -P 只顯示power相關事件圖標
       -T , --tasks-only 顯示task信息，不顯示處理器信息
       -p 顯示指定進程名稱或者PID顯示
       --symfs=<directory> 指定系統符號表路徑
       -t, --topology 根據拓撲結構對CPU進行分類
       --highlight=<duration_nsecs|task_name> 對運行超過特定時間的task高亮

sudo perf timechart record -T && sudo perf timechart -w 1900結果以下，能夠看到相關task的名稱，開始時間/結束時間，每一個時間點的狀態（Running/Idle/Deeper Idle/Deepest Idle/Sleeping/Waiting for Cpu /Blocked on IO）。

記錄IO事件：sudo perf timechart record -I && sudo perf timechart -w 1800。

能夠看到按應用分類的，Disk/Network/Sync/Poll/Error信息。以及每一個應用數據吞吐量。

記錄Power狀態事件：sudo perf timechart record -I && sudo perf timechart -w 1800。能夠看到不一樣之處在於Idle之類的狀態裏面還有細分C/C2更詳細的顯示Power狀態。

抓取task-only信息 sudo perf timechart record -I && sudo perf timechart -w 1800，以下。顯示了Idle詳細信息，Running進程的名稱。

perf script

用於讀取perf record保存的裸trace數據。

使用方法：

       perf script [<options>]
       perf script [<options>] record <script> [<record-options>] <command>
       perf script [<options>] report <script> [script-args]
       perf script [<options>] <script> <required-script-args> [<record-options>] <command>
       perf script [<options>] <top-script> [script-args]

perf lock

Need to enable CONFIG_LOCKDEP and CONFIG_LOCK_STAT XXXXXXXXXXXXXXXXXXXXXXXXXX

分析內核鎖統計信息。

鎖是內核用於同步的方法，一旦加了鎖，其餘加鎖的內核執行路徑就必須等待，下降了並行。同時，若是加鎖不正確還會形成死鎖。

所以對於內核鎖進行分析是一項重要的調優工做。

要使用此功能，內核須要編譯選項的支持：CONFIG_LOCKDEP、CONFIG_LOCK_STAT。

CONFIG_LOCKDEP defines acquired and release events.

CONFIG_LOCK_STAT defines contended and acquired lock events.

perf lock record：抓取執行命令的lock事件信息到perf.data中

perf lock report：產生統計報告

perf lock script：顯示原始lock事件

perf lock info：

-k <value>：sorting key，默認爲acquired，還能夠按contended、wait_total、wait_max和wait_min來排序。

Name：內核鎖的名字。

aquired：該鎖被直接得到的次數，由於沒有其它內核路徑佔用該鎖，此時不用等待。

contended：該鎖等待後得到的次數，此時被其它內核路徑佔用，須要等待。

total wait：爲了得到該鎖，總共的等待時間。

max wait：爲了得到該鎖，最大的等待時間。

min wait：爲了得到該鎖，最小的等待時間。

perf kmem

perf kmem用於跟蹤測量內核slab分配器事件信息。
好比內存分配/釋放等。能夠用來研究程序在哪裏分配了大量內存，或者在什麼地方產生碎片之類的和內存管理相關的問題。

perf kmem和perf lock實際上都是perf tracepoint的子類，等同於perf record -e kmem:*和perf record -e lock:*。

可是這些工具在內部隊員是數據進行了慧聰和分析，所以統計報表更具可讀性。

perf kmem record：抓取命令的內核slab分配器事件

perf kmem stat：生成內核slab分配器統計信息

選項：

       --caller 顯示每一個調用點統計信息
       --alloc   顯示每次內存分配事件

       -s <key[,key2...]>, --sort=<key[,key2...]>
           Sort the output (default: frag,hit,bytes for slab and bytes,hit for page). Available sort keys are ptr, callsite, bytes, hit, pingpong, frag for slab and page, callsite, bytes, hit, order, migtype, gfp for page.
           This option should be preceded by one of the mode selection options - i.e. --slab, --page, --alloc and/or --caller.

       -l <num>, 只顯示固定行數
       --raw-ip
           Print raw ip instead of symbol
       --slab 分析slab分配器事件
       --page 分析頁分配事件

       --live
           Show live page stat. The perf kmem shows total allocation stat by default, but this option shows live (currently allocated) pages instead. (This option works with --page option only)

sudo perf kmem record ls

sudo perf kmem stat只顯示概要統計信息：

SUMMARY (SLAB allocator)
========================
Total bytes requested: 368,589
Total bytes allocated: 369,424
Total bytes wasted on internal fragmentation: 835
Internal fragmentation: 0.226028%
Cross CPU allocations: 0/2,256

sudo perf kmem --alloc --caller --slab stat顯示了更加詳細的分類信息：

---------------------------------------------------------------------------------------------------------
Callsite                           | Total_alloc/Per | Total_req/Per   | Hit      | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
proc_reg_open+32                   |        64/64    |        40/40    |        1 |         0 | 37.500%
seq_open+34                        |       384/192   |       272/136   |        2 |         0 | 29.167%
apparmor_file_alloc_security+5c    |       608/32    |       456/24    |       19 |         1 | 25.000%
ext4_readdir+8bd                   |        64/64    |        48/48    |        1 |         0 | 25.000%
ext4_htree_store_dirent+3e         |       896/68    |       770/59    |       13 |         0 | 14.062%
load_elf_phdrs+64                  |      1024/512   |       896/448   |        2 |         0 | 12.500%
load_elf_binary+222                |        32/32    |        28/28    |        1 |         0 | 12.500%
anon_vma_prepare+11b               |      1280/80    |      1152/72    |       16 |         0 | 10.000%
inotify_handle_event+75            |     73664/64    |     66758/58    |     1151 |         0 | 9.375%
do_execveat_common.isra.33+e5      |      2048/256   |      1920/240   |        8 |         1 | 6.250%
...                                | ...             | ...             | ...      | ...       | ...
---------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------
Alloc Ptr                          | Total_alloc/Per | Total_req/Per   | Hit      | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
0xffff8800ca4d86c0                 |       192/192   |       136/136   |        1 |         0 | 29.167%
0xffff8801ea05aa80                 |       192/192   |       136/136   |        1 |         0 | 29.167%
0xffff8801f6ad6540                 |        96/96    |        68/68    |        1 |         0 | 29.167%
0xffff8801f6ad6f00                 |        96/96    |        68/68    |        1 |         0 | 29.167%
0xffff880214e65e80                 |        96/32    |        72/24    |        3 |         0 | 25.000%
0xffff8801f45ddac0                 |        64/64    |        48/48    |        1 |         0 | 25.000%
0xffff8800ac4093c0                 |        32/32    |        24/24    |        1 |         1 | 25.000%
0xffff8800af5a4260                 |        32/32    |        24/24    |        1 |         0 | 25.000%
0xffff880214e651e0                 |        32/32    |        24/24    |        1 |         0 | 25.000%
0xffff880214e65220                 |        32/32    |        24/24    |        1 |         0 | 25.000%
0xffff880214e654e0                 |        32/32    |        24/24    |        1 |         0 | 25.000%
---------------------------------------------------------------------------------------------------------

SUMMARY (SLAB allocator)
========================
Total bytes requested: 409,260
Total bytes allocated: 417,008
Total bytes wasted on internal fragmentation: 7,748
Internal fragmentation: 1.857998%
Cross CPU allocations: 0/2,833

該報告有三個部分：根據 Callsite 顯示的部分，所謂 Callsite 即內核代碼中調用 kmalloc 和 kfree 的地方。

好比上圖中的函數proc_reg_open，Hit 欄爲 1，表示該函數在 record 期間一共調用了 kmalloc 一次。

對於第一行 Total_alloc/Per 顯示爲 1024/1024，第一個值 1024 表示函數 proc_reg_open 總共分配的內存大小，Per 表示平均值。

比較有趣的兩個參數是 Ping-pong 和 Frag。Frag 比較容易理解，即內部碎片。雖然相對於 Buddy System，Slab 正是要解決內部碎片問題，但 slab 依然存在內部碎片，好比一個 cache 的大小爲 1024，但須要分配的數據結構大小爲 1022，那麼有 2 個字節成爲碎片。Frag 即碎片的比例。

Ping-pong 是一種現象，在多 CPU 系統中，多個 CPU 共享的內存會出現」乒乓現象」。一個 CPU 分配內存，其餘 CPU 可能訪問該內存對象，也可能最終由另一個 CPU 釋放該內存對象。而在多 CPU 系統中，L1 cache 是 per CPU 的，CPU2 修改了內存，那麼其餘的 CPU 的 cache 都必須更新，這對於性能是一個損失。Perf kmem 在 kfree 事件中判斷 CPU 號，若是和 kmalloc 時的不一樣，則視爲一次 ping-pong，理想的狀況下 ping-pong 越小越好。Ibm developerworks 上有一篇講述 oprofile 的文章，其中關於 cache 的調優能夠做爲很好的參考資料。

Callsite：內核代碼中調用kmalloc和kfree的地方。

Total_alloc/Per：總共分配的內存大小，平均每次分配的內存大小。

Total_req/Per：總共請求的內存大小，平均每次請求的內存大小。

Hit：調用的次數。

Ping-pong：kmalloc和kfree不被同一個CPU執行時的次數，這會致使cache效率下降。

Frag：碎片所佔的百分比，碎片 = 分配的內存 - 請求的內存，這部分是浪費的。

有使用--alloc選項，還會看到Alloc Ptr，即所分配內存的地址。

後面則有根據被調用地點的顯示方式的部分。

最後一個部分是彙總數據，顯示總的分配的內存和碎片狀況，Cross CPU allocation 即 ping-pong 的彙總。

要分析--page事件，須要在record的時候加上--page選項。sudo perf kmem record --page ls，使用sudo perf kmem stat --page查看結果：

0xee318 [0x8]: failed to process type: 68
error during process events: -22

perf sched

perf sched專門用於跟蹤/測量調度器，包括延時等。

perf sched record <command>：錄製測試過程當中的調度事件

perf sched latency：報告線程調度延時和其餘調度相關屬性

perf sched script：查看執行過程當中詳細的trace信息

perf sched replay：回放perf sched record錄製的執行過程

perf sched map:用字符表示打印上下文切換

執行sudo perf sched record ls後，經過不一樣方式查看結果。

sudo perf sched latency，能夠看到ls進程的Average delay/Maximum delay時間。各個 column 的含義以下： Task: 進程的名字和 pid Runtime: 實際運行時間 Switches: 進程切換的次數 Average delay: 平均的調度延遲 Maximum delay: 最大延遲

這裏最值得人們關注的是 Maximum delay，通常從這裏能夠看到對交互性影響最大的特性：調度延遲，若是調度延遲比較大，那麼用戶就會感覺到視頻或者音頻斷斷續續的。

sudo perf sched script能看到更詳細的sched信息，包括sched_wakeup/sched_switch等等。每一列的含義依次是：進程名/pid/CPU ID/時間戳。

            perf 7801 [002] 5398.722314: sched:sched_stat_sleep: comm=perf pid=7806 delay=110095391 [ns]
            perf 7801 [002] 5398.722316: sched:sched_wakeup: comm=perf pid=7806 prio=120 target_cpu=004
         swapper     0 [004] 5398.722328: sched:sched_stat_wait: comm=perf pid=7806 delay=0 [ns]
         swapper     0 [004] 5398.722333: sched:sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7806 next_prio=120
            perf 7801 [002] 5398.722363: sched:sched_stat_runtime: comm=perf pid=7801 runtime=1255788 [ns] vruntime=3027478102 [ns]
            perf 7801 [002] 5398.722364: sched:sched_switch: prev_comm=perf prev_pid=7801 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
            perf 7806 [004] 5398.722568: sched:sched_wakeup: comm=migration/4 pid=27 prio=0 target_cpu=004
            perf 7806 [004] 5398.722571: sched:sched_stat_runtime: comm=perf pid=7806 runtime=254732 [ns] vruntime=1979611107 [ns]
            perf 7806 [004] 5398.722575: sched:sched_switch: prev_comm=perf prev_pid=7806 prev_prio=120 prev_state=R+ ==> next_comm=migration/4 next_pid=27 next_prio=0
     migration/4    27 [004] 5398.722582: sched:sched_stat_wait: comm=perf pid=7806 delay=13914 [ns]
     migration/4    27 [004] 5398.722586: sched:sched_migrate_task: comm=perf pid=7806 prio=120 orig_cpu=4 dest_cpu=2
         swapper     0 [002] 5398.722595: sched:sched_stat_wait: comm=perf pid=7806 delay=0 [ns]
         swapper     0 [002] 5398.722596: sched:sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7806 next_prio=120
     migration/4    27 [004] 5398.722611: sched:sched_switch: prev_comm=migration/4 prev_pid=27 prev_prio=0 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120
              ls 7806 [002] 5398.723421: sched:sched_stat_sleep: comm=kworker/u12:3 pid=7064 delay=1226675 [ns]
              ls 7806 [002] 5398.723423: sched:sched_wakeup: comm=kworker/u12:3 pid=7064 prio=120 target_cpu=003
         swapper     0 [003] 5398.723432: sched:sched_stat_wait: comm=kworker/u12:3 pid=7064 delay=0 [ns]
         swapper     0 [003] 5398.723434: sched:sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/u12:3 next_pid=7064 next_prio=120
   kworker/u12:3 7064 [003] 5398.723441: sched:sched_stat_sleep: comm=/usr/bin/termin pid=2511 delay=80833386 [ns]
   kworker/u12:3 7064 [003] 5398.723447: sched:sched_wakeup: comm=/usr/bin/termin pid=2511 prio=120 target_cpu=004
   kworker/u12:3 7064 [003] 5398.723449: sched:sched_stat_runtime: comm=kworker/u12:3 pid=7064 runtime=29315 [ns] vruntime=846439549943 [ns]
   kworker/u12:3 7064 [003] 5398.723451: sched:sched_switch: prev_comm=kworker/u12:3 prev_pid=7064 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
         swapper     0 [004] 5398.723462: sched:sched_stat_wait: comm=/usr/bin/termin pid=2511 delay=0 [ns]
         swapper     0 [004] 5398.723466: sched:sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=/usr/bin/termin next_pid=2511 next_prio=120
              ls 7806 [002] 5398.723503: sched:sched_migrate_task: comm=perf pid=7801 prio=120 orig_cpu=2 dest_cpu=3
              ls 7806 [002] 5398.723505: sched:sched_stat_sleep: comm=perf pid=7801 delay=1142537 [ns]
              ls 7806 [002] 5398.723506: sched:sched_wakeup: comm=perf pid=7801 prio=120 target_cpu=003
              ls 7806 [002] 5398.723508: sched:sched_stat_runtime: comm=ls pid=7806 runtime=920005 [ns] vruntime=3028398107 [ns]
         swapper     0 [003] 5398.723508: sched:sched_stat_wait: comm=perf pid=7801 delay=0 [ns]
         swapper     0 [003] 5398.723508: sched:sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=perf next_pid=7801 next_prio=120
              ls 7806 [002] 5398.723510: sched:sched_switch: prev_comm=ls prev_pid=7806 prev_prio=120 prev_state=x ==> next_comm=swapper/2 next_pid=0 next_prio=120
/usr/bin/termin 2511 [004] 5398.723605: sched:sched_stat_runtime: comm=/usr/bin/termin pid=2511 runtime=162720 [ns] vruntime=191386139371 [ns]
/usr/bin/termin 2511 [004] 5398.723611: sched:sched_switch: prev_comm=/usr/bin/termin prev_pid=2511 prev_prio=120 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120

sudo perf sched map的好處在於提供了一個的總的視圖，將成百上千的調度事件進行總結，顯示了系統任務在 CPU 之間的分佈，假若有很差的調度遷移，好比一個任務沒有被及時遷移到 idle 的 CPU 卻被遷移到其餘忙碌的 CPU，相似這種調度器的問題能夠從 map 的報告中一眼看出來。

星號表示調度事件發生所在的 CPU。

點號表示該 CPU 正在 IDLE。

                  *A0        5398.722333 secs A0 => perf:7806
          *.       A0        5398.722365 secs . => swapper:0
           .      *B0        5398.722575 secs B0 => migration/4:27
          *A0      B0        5398.722597 secs
           A0     *.         5398.722611 secs
           A0 *C0 .         5398.723434 secs C0 => kworker/u12:3:7064
           A0 *.   .         5398.723451 secs
           A0 . *D0        5398.723467 secs D0 => /usr/bin/termin:2511
           A0 *E0 D0        5398.723509 secs E0 => perf:7801
          *.   E0 D0        5398.723510 secs
           .   E0 *.         5398.723612 secs

perf sched replay 這個工具更是專門爲調度器開發人員所設計，它試圖重放 perf.data 文件中所記錄的調度場景。不少狀況下，通常用戶假如發現調度器的奇怪行爲，他們也沒法準確說明發生該情形的場景，或者一些測試場景不容易再次重現，或者僅僅是出於「偷懶」的目的，使用 perf replay，perf 將模擬 perf.data 中的場景，無需開發人員花費不少的時間去重現過去，這尤爲利於調試過程，由於須要一而再，再而三地重複新的修改是否能改善原始的調度場景所發現的問題。

run measurement overhead: 166 nsecs
sleep measurement overhead: 52177 nsecs
the run test took 999975 nsecs
the sleep test took 1064623 nsecs
nr_run_events:        11
nr_sleep_events:      581
nr_wakeup_events:     5
task      0 (             swapper:         0), nr_events: 11
task      1 (             swapper:         1), nr_events: 1
task      2 (             swapper:         2), nr_events: 1
task      3 (            kthreadd:         3), nr_events: 1
...
task    563 (            kthreadd:      7509), nr_events: 1
task    564 (                bash:      7751), nr_events: 1
task    565 (                 man:      7762), nr_events: 1
task    566 (            kthreadd:      7789), nr_events: 1
task    567 (                bash:      7800), nr_events: 1
task    568 (                sudo:      7801), nr_events: 4
task    569 (                perf:      7806), nr_events: 8
------------------------------------------------------------
#1 : 25.887, ravg: 25.89, cpu: 1919.68 / 1919.68
#2 : 27.994, ravg: 26.10, cpu: 2887.76 / 2016.49
#3 : 26.403, ravg: 26.13, cpu: 2976.09 / 2112.45
#4 : 29.400, ravg: 26.46, cpu: 1015.01 / 2002.70
#5 : 26.750, ravg: 26.49, cpu: 2942.80 / 2096.71
#6 : 27.647, ravg: 26.60, cpu: 3087.37 / 2195.78
#7 : 31.405, ravg: 27.08, cpu: 2762.43 / 2252.44
#8 : 23.770, ravg: 26.75, cpu: 2172.55 / 2244.45
#9 : 26.952, ravg: 26.77, cpu: 2794.93 / 2299.50
#10 : 30.904, ravg: 27.18, cpu: 973.26 / 2166.88

perf probe

Need to find vmlinux XXXXXXXXXXXXXXXXXX

能夠自定義探測點。

Define new dynamic tracepoints.

使用例子

(1) Display which lines in schedule() can be probed

# perf probe --line schedule

前面有行號的能夠探測，沒有行號的就不行了。

(2) Add a probe on schedule() function 12th line.

# perf probe -a schedule:12

在schedule函數的12處增長一個探測點。

perf bench

perf bench做爲benchmark工具的通用框架，包含sched/mem/numa/futex等子系統，all能夠指定全部。

perf bench sched：調度器和IPC機制。包含messaging和pipe兩個功能。

perf bench mem：內存存取性能。包含memcpy和memset兩個功能。

perf bench numa：NUMA架構的調度和內存處理性能。包含mem功能。

perf bench futex：futex壓力測試。包含hash/wake/wake-parallel/requeue/lock-pi功能。

perf bench all：全部bench測試的集合

perf bench sched all，顯示了messaging和pipi兩部分性能。

sched message 是從經典的測試程序 hackbench 移植而來，用來衡量調度器的性能，overhead 以及可擴展性。該 benchmark 啓動 N 個 reader/sender 進程或線程對，經過 IPC(socket 或者 pipe) 進行併發的讀寫。通常人們將 N 不斷加大來衡量調度器的可擴展性。Sched message 的用法及用途和 hackbench 同樣。

secsched pipe 從 Ingo Molnar 的 pipe-test-1m.c 移植而來。當初 Ingo 的原始程序是爲了測試不一樣的調度器的性能和公平性的。其工做原理很簡單，兩個進程互相經過 pipe 拼命地發 1000000 個整數，進程 A 發給 B，同時 B 發給 A。。。由於 A 和 B 互相依賴，所以假如調度器不公平，對 A 比 B 好，那麼 A 和 B 總體所須要的時間就會更長。

# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.158 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 10.089 [sec]

      10.089625 usecs/op
          99111 ops/sec

perf bench mem all

該測試衡量不一樣版本的memcpy/memset/ 函數處理一個 1M 數據的所花費的時間，轉換成吞吐率。

# Running mem/memcpy benchmark...
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

       1.242446 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...

       3.160396 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...

       3.090388 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...

       3.180985 GB/sec

# Running mem/memset benchmark...
# function 'default' (Default memset() provided by glibc)
# Copying 1MB bytes ...

       8.642146 GB/sec
# function 'x86-64-unrolled' (unrolled memset() in arch/x86/lib/memset_64.S)
# Copying 1MB bytes ...

       8.642146 GB/sec
# function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
# Copying 1MB bytes ...

       8.719308 GB/sec
# function 'x86-64-stosb' (movsb-based memset() in arch/x86/lib/memset_64.S)
# Copying 1MB bytes ...

       8.642146 GB/sec

perf bench futex，Futex是一種用戶態和內核態混合機制，因此須要兩個部分合做完成，linux上提供了sys_futex系統調用，對進程競爭狀況下的同步處理提供支持。

全部的futex同步操做都應該從用戶空間開始，首先建立一個futex同步變量，也就是位於共享內存的一個整型計數器。
當進程嘗試持有鎖或者要進入互斥區的時候，對futex執行"down"操做，即原子性的給futex同步變量減1。若是同步變量變爲0，則沒有競爭發生，進程照常執行。若是同步變量是個負數，則意味着有競爭發生，須要調用futex系統調用的futex_wait操做休眠當前進程。
當進程釋放鎖或者要離開互斥區的時候，對futex進行"up"操做，即原子性的給futex同步變量加1。若是同步變量由0變成1，則沒有競爭發生，進程照常執行。如果加以前同步變量是負數，則意味着有競爭發生，須要調用futex系統調用的futex_wake操做喚醒一個或者多個等待進程。

# Running futex/hash benchmark...
Run summary [PID 29778]: 5 threads, each operating on 1024 [private] futexes for 10 secs.

[thread 0] futexes: 0x3421d20 ... 0x3422d1c [ 4050227 ops/sec ]
[thread 1] futexes: 0x3422d30 ... 0x3423d2c [ 4015308 ops/sec ]
[thread 2] futexes: 0x3423eb0 ... 0x3424eac [ 3824537 ops/sec ]
[thread 3] futexes: 0x3425030 ... 0x342602c [ 4016025 ops/sec ]
[thread 4] futexes: 0x34261b0 ... 0x34271ac [ 4180992 ops/sec ]

Averaged 4017417 operations/sec (+- 1.42%), total secs = 10

# Running futex/wake benchmark...
Run summary [PID 29778]: blocking on 5 threads (at [private] futex 0x7ef6ac), waking up 1 at a time.

[Run 1]: Wokeup 5 of 5 threads in 0.0310 ms
[Run 2]: Wokeup 5 of 5 threads in 0.0350 ms
[Run 3]: Wokeup 5 of 5 threads in 0.0310 ms
[Run 4]: Wokeup 5 of 5 threads in 0.0320 ms
[Run 5]: Wokeup 5 of 5 threads in 0.0310 ms
[Run 6]: Wokeup 5 of 5 threads in 0.0300 ms
[Run 7]: Wokeup 5 of 5 threads in 0.0320 ms
[Run 8]: Wokeup 5 of 5 threads in 0.0300 ms
[Run 9]: Wokeup 5 of 5 threads in 0.0230 ms
[Run 10]: Wokeup 5 of 5 threads in 0.0310 ms
Wokeup 5 of 5 threads in 0.0306 ms (+-3.13%)

# Running futex/wake-parallel benchmark...
Run summary [PID 29778]: blocking on 5 threads (at [private] futex 0x7ef820), 5 threads waking up 1 at a time.

[Run 1]: Avg per-thread latency (waking 1/5 threads) in 0.0208 ms (+-71.27%)
[Run 2]: Avg per-thread latency (waking 1/5 threads) in 0.0166 ms (+-56.05%)
[Run 3]: Avg per-thread latency (waking 1/5 threads) in 0.0286 ms (+-45.52%)
[Run 4]: Avg per-thread latency (waking 1/5 threads) in 0.0168 ms (+-33.11%)
[Run 5]: Avg per-thread latency (waking 1/5 threads) in 0.0146 ms (+-45.26%)
[Run 6]: Avg per-thread latency (waking 1/5 threads) in 0.0158 ms (+-55.59%)
[Run 7]: Avg per-thread latency (waking 1/5 threads) in 0.0316 ms (+-74.73%)
[Run 8]: Avg per-thread latency (waking 1/5 threads) in 0.0346 ms (+-47.43%)
[Run 9]: Avg per-thread latency (waking 1/5 threads) in 0.0230 ms (+-73.03%)
[Run 10]: Avg per-thread latency (waking 1/5 threads) in 0.0270 ms (+-43.30%)
Avg per-thread latency (waking 1/5 threads) in 0.0229 ms (+-17.57%)

# Running futex/requeue benchmark...
Run summary [PID 29778]: Requeuing 5 threads (from [private] 0x7ef99c to 0x7ef998), 1 at a time.

[Run 1]: Requeued 5 of 5 threads in 0.0090 ms
[Run 2]: Requeued 5 of 5 threads in 0.0090 ms
[Run 3]: Requeued 5 of 5 threads in 0.0080 ms
[Run 4]: Requeued 5 of 5 threads in 0.0090 ms
[Run 5]: Requeued 5 of 5 threads in 0.0090 ms
[Run 6]: Requeued 5 of 5 threads in 0.0090 ms
[Run 7]: Requeued 5 of 5 threads in 0.0100 ms
[Run 8]: Requeued 5 of 5 threads in 0.0080 ms
[Run 9]: Requeued 5 of 5 threads in 0.0090 ms
[Run 10]: Requeued 5 of 5 threads in 0.0090 ms
Requeued 5 of 5 threads in 0.0089 ms (+-2.02%)

# Running futex/lock-pi benchmark...
Run summary [PID 29778]: 5 threads doing pi lock/unlock pairing for 10 secs.

[thread   0] futex: 0x7efaa0 [ 2550 ops/sec ]
[thread   1] futex: 0x7efaa0 [ 2550 ops/sec ]
[thread   2] futex: 0x7efaa0 [ 2549 ops/sec ]
[thread   3] futex: 0x7efaa0 [ 2550 ops/sec ]
[thread   4] futex: 0x7efaa0 [ 2550 ops/sec ]

Averaged 2549 operations/sec (+- 0.01%), total secs = 10

perf trace

一般，面對看似複雜，實則較有規律的計算機輸出，程序員們老是會用腳原本進行處理：好比給定一個文本文件，想從中找出有多少個數字 0125，人們不會打開文件而後用肉眼去一個一個地數，而是用 grep 命令來進行處理。

perf 的輸出雖然是文本格式，但仍是不太容易分析和閱讀。每每也須要進一步處理，perl 和 python 是目前最強大的兩種腳本語言。Tom Zanussi 將 perl 和 python 解析器嵌入到 perf 程序中，從而使得 perf 可以自動執行 perl 或者 python 腳本進一步進行處理，從而爲 perf 提供了強大的擴展能力。由於任何人均可以編寫新的腳本，對 perf 的原始輸出數據進行所須要的進一步處理。這個特性所帶來的好處很相似於 plug-in 之於 eclipse。

下面的命令能夠查看系統中已經安裝的腳本：

 # perf trace -l 
    List of available trace scripts: 
      syscall-counts [comm]                system-wide syscall counts 
      syscall-counts-by-pid [comm]         system-wide syscall counts, by pid 
      failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid 
。。。

好比 failed-syscalls 腳本，執行的效果以下：

 # perf trace record failed-syscalls 
    ^C[ perf record: Woken up 11 times to write data ]                         
    [ perf record: Captured and wrote 1.939 MB perf.data (~84709 samples) ]   

 perf trace report failed-syscalls 
    perf trace started with Perl script \ 
	 /root/libexec/perf-core/scripts/perl/failed-syscalls.pl 

    failed syscalls, by comm: 

    comm                    # errors 
    --------------------  ---------- 
    firefox                     1721 
    claws-mail                   149 
    konsole                       99 
    X                             77 
    emacs                         56 
    [...] 

    failed syscalls, by syscall: 

    syscall                           # errors 
    ------------------------------  ---------- 
    sys_read                              2042 
    sys_futex                              130 
    sys_mmap_pgoff                          71 
    sys_access                              33 
    sys_stat64                               5 
    sys_inotify_add_watch                    4 
    [...]

該報表分別按進程和按系統調用顯示失敗的次數。很是簡單明瞭，而若是經過普通的 perf record 加 perf report 命令，則須要本身手工或者編寫腳原本統計這些數字。

參考資料

系統級性能分析工具 — Perf：http://blog.csdn.net/zhangskd/article/details/37902159/

linux perf - 性能測試和優化工具：http://www.cnblogs.com/hushaojun/p/4848269.html

Linux下的內核測試工具——perf使用簡介：http://www.blogjava.net/qileilove/archive/2013/09/04/403646.html

Perf -- Linux下的系統性能調優工具，第 1 部分：http://www.ibm.com/developerworks/cn/linux/l-cn-perf1/

Perf -- Linux下的系統性能調優工具，第 2 部分：https://www.ibm.com/developerworks/cn/linux/l-cn-perf2/

性能分析工具---Perf簡介彙總整理：http://sanwen.net/a/nzsrvoo.html

聯繫方式:arnoldlu@qq.com

分類: Linux相關學習總結, 性能優化

系統級性能分析工具perf的介紹與使用[轉]