Perf是Linux kernel自帶的系統性能優化工具。Perf的優點在於與Linux Kernel的緊密結合,它能夠最早應用到加入Kernel的new feature。pef能夠用於查看熱點函數,查看cashe miss的比率,從而幫助開發者來優化程序性能。php
1.perf的安裝html
因爲咱們常常是在本身編譯的內核上進行開發工做,這裏須要有包含調式信息的內核啓動鏡像文件vmlinux,在自定義內核的基礎之上,進入linux內核源碼,linux/tools/perfjava
makenode
make installlinux
提示:性能優化
1)可能在編譯的時候,有報錯大概是因爲平臺問題,數據類型不匹配,致使全部的warning都被看成error對待:出現這問題的緣由是-Werror這個gcc編譯選項。只要在makefile中找到包含這個-Werror選項的句子,將-Werror刪除,或是註釋掉就好了app
2)安裝完畢,perf可執行程序每每位於當前目錄,可能不在系統的PATH路徑中,此時須要改變環境變量PATHide
2.perf的運行原理函數
性能調優工具如 perf,Oprofile 等的基本原理都是對被監測對象進行採樣,最簡單的情形是根據 tick 中斷進行採樣,即在 tick 中斷內觸發採樣點,在採樣點裏判斷程序當時的上下文。假如一個程序 90% 的時間都花費在函數 foo() 上,那麼 90% 的採樣點都應該落在函數 foo() 的上下文中。運氣不可捉摸,但我想只要採樣頻率足夠高,採樣時間足夠長,那麼以上推論就比較可靠。所以,經過 tick 觸發採樣,咱們即可以瞭解程序中哪些地方最耗時間,從而重點分析。工具
稍微擴展一下思路,就能夠發現改變採樣的觸發條件使得咱們能夠得到不一樣的統計數據:
以時間點 ( 如 tick) 做爲事件觸發採樣即可以獲知程序運行時間的分佈。
以 cache miss 事件觸發採樣即可以知道 cache miss 的分佈,即 cache 失效常常發生在哪些程序代碼中。如此等等。
所以讓咱們先來了解一下 perf 中可以觸發採樣的事件有哪些。
使用perf list(在root權限下運行),能夠列出全部的採樣事件
事件分爲如下三種:
1)Hardware Event 是由 PMU 硬件產生的事件,好比 cache 命中,當您須要瞭解程序對硬件特性的使用狀況時,便須要對這些事件進行採樣;
2)Software Event 是內核軟件產生的事件,好比進程切換,tick 數等 ;
3)Tracepoint event 是內核中的靜態 tracepoint 所觸發的事件,這些 tracepoint 用來判斷程序運行期間內核的行爲細節,好比 slab 分配器的分配次數等。
上述每個事件均可以用於採樣,並生成一項統計數據,時至今日,尚沒有文檔對每個 event 的含義進行詳細解釋。
3.perfstat——概覽程序的運行狀況
面對一個問題程序,最好採用自頂向下的策略。先總體看看該程序運行時各類統計事件的大概,再針對某些方向深刻細節。而不要一會兒扎進瑣碎細節,會一葉障目的。
有些程序慢是由於計算量太大,其多數時間都應該在使用CPU進行計算,這叫作CPUbound型;有些程序慢是由於過多的IO,這種時候其CPU利用率應該不高,這叫作IObound型;對於CPUbound程序的調優和IObound的調優是不一樣的。
若是您認同這些說法的話,Perfstat應該是您最早使用的一個工具。它經過歸納精簡的方式提供被調試程序運行的總體狀況和彙總數據。
本篇中,咱們將在之後使用這個例子test1.c:
測試用例:test1
<SPAN style="FONT-SIZE: 14px"> //test.c void foo2() void foo1() int main(void) |
將它編譯爲可執行文件 test1
gcc – o test1 – g test.c
注意:此處必定要加-g選項,加入調試和符號表信息。
下面演示了 perf stat 針對程序 test1 的輸出:
結果分析:
對 test1進行調優應該要找到熱點 ( 即最耗時的代碼片斷 ),再看看是否可以提升熱點代碼的效率。
缺省狀況下,除了 task-clock-msecs 以外,perf stat 還給出了其餘幾個最經常使用的統計信息:
Task-clock-msecs:CPU 利用率,該值高,說明程序的多數時間花費在 CPU 計算上而非 IO。
Context-switches:進程切換次數,記錄了程序運行過程當中發生了多少次進程切換,頻繁的進程切換是應該避免的。
Cache-misses:程序運行過程當中整體的 cache 利用狀況,若是該值太高,說明程序的 cache 利用很差
CPU-migrations:表示進程 t1 運行過程當中發生了多少次 CPU 遷移,即被調度器從一個 CPU 轉移到另一個 CPU 上運行。
Cycles:處理器時鐘,一條機器指令可能須要多個 cycles,
Instructions: 機器指令數目。
IPC:是 Instructions/Cycles 的比值,該值越大越好,說明程序充分利用了處理器的特性。
Cache-references: cache 命中的次數
Cache-misses: cache 失效的次數。
4.精確制導——定位程序瓶頸perf record && perf report
4.1查找時間上的熱點函數
perf record – e cpu-clock ./test1
perf report
3個問題:
1)perf未能定位本地符號表對應的symbol和地址的對應關係:0x000003d4對應的什麼函數?
2)採樣頻率不夠高,失去了一些函數的信息:顯然一些內核函數沒有顯示在上面的結果中,由於採樣頻率若是不夠高,那麼勢必會有一些函數中的採樣點沒有/
3)如何克服採樣的隨機性帶來的問題:爲了在測量更加逼近正確值,咱們採用屢次重複取平均值的方法來逼近真實值。(這裏能夠用-r來指定重複次數)
對於問題2),咱們能夠用perf record -F count 來指定採樣頻率加以解決:
<SPAN style="FONT-SIZE: 14px">root@hyk-linux:/home/hyk/program/cprogram# perf record -F 50000 -e cpu-clock ./test1 [ perf record: Woken up 3 times to write data ] [ perf record: Captured and wrote 0.532 MB perf.data (~23245 samples) ] root@hyk-linux:/home/hyk/program/cprogram# perf report # ======== # captured on: Mon Aug 26 09:54:45 2013 # hostname : hyk-linux # os release : 3.10.9 # perf version : 3.10.9 # arch : i686 # nrcpus online : 4 # nrcpus avail : 4 # cpudesc : Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz # cpuid : GenuineIntel,6,42,7 # total memory : 4084184 kB # cmdline : /media/usr/src/linux-3.10.9/tools/perf/perf record -F 50000 -e cpu-c # event : name = cpu-clock, type = 1, config = 0x0, config1 = 0x0, config2 = 0x0 # HEADER_CPU_TOPOLOGY info available, use -I to display # pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore # ======== # # Samples: 13K of event 'cpu-clock' # Event count (approx.): 273580000 # # Overhead Command Shared Object Symbol # ........ ....... ................. ............................... # 99.77% test1 test1 [.] 0x000003c3 0.07% test1 ld-2.15.so [.] 0x00004c99 0.02% test1 [kernel.kallsyms] [k] __wake_up_bit 0.01% test1 [kernel.kallsyms] [k] __kunmap_atomic 0.01% test1 [kernel.kallsyms] [k] load_elf_binary 0.01% test1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.01% test1 libc-2.15.so [.] 0x00097d8e 0.01% test1 [kernel.kallsyms] [k] exit_itimers 0.01% test1 [kernel.kallsyms] [k] profile_munmap 0.01% test1 [kernel.kallsyms] [k] get_page_from_freelist 0.01% test1 [kernel.kallsyms] [k] vma_interval_tree_remove 0.01% test1 [kernel.kallsyms] [k] change_protection 0.01% test1 [kernel.kallsyms] [k] link_path_walk 0.01% test1 [kernel.kallsyms] [k] prepend_path 0.01% test1 [kernel.kallsyms] [k] __inode_wait_for_writeback 0.01% test1 [kernel.kallsyms] [k] aa_free_task_context 0.01% test1 [kernel.kallsyms] [k] radix_tree_lookup_element 0.01% test1 [kernel.kallsyms] [k] _raw_spin_lock </SPAN> |
結果解釋:
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function. The second column reports the process from which the samples were collected. In per-thread/per-process mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary. The third column shows the name of the ELF image where the samples came from. If a program is dynamically linked, then this may show the name of a shared library. When the samples come from the kernel, then the pseudo ELF image name [kernel.kallsyms] is used. The fourth column indicates the privilege level at which the sample was taken, i.e. when the program was running when it was interrupted:
[.] : user level
[k]: kernel level
[g]: guest kernel level (virtualization)
[u]: guest os user space
[H]: hypervisor
The final column shows the symbol name.
代碼是很是複雜難說的,t1 程序中的 foo1() 也是一個潛在的調優對象,爲何要調用 100 次那個無聊的 longa() 函數呢?但咱們在上圖中沒法發現 foo1 和 foo2,更沒法瞭解他們的區別了。
我曾發現本身寫的一個程序竟然有近一半的時間花費在 string 類的幾個方法上,string 是 C++ 標準,我毫不可能寫出比 STL 更好的代碼了。所以我只有找到本身程序中過多使用 string 的地方。所以我很須要按照調用關係進行顯示的統計信息。
使用 perf 的 -g 選項即可以獲得須要的信息:
perf record -g -e cpu-clock ./test1
perf report
固然,這裏符號表沒有定位的問題有依然沒有解決!
perf record的其餘參數:
-f:強制覆蓋產生的.data數據
-c:事件每發生count次採樣一次
-p:指定進程
-t:指定線程
4.2 perf report的相關參數:
-k:指定未經壓縮的內核鏡像文件,從而得到內核相關信息
--report:cpu按照cpu列出負載
5.使用tracepoint
當 perf 根據 tick 時間點進行採樣後,人們便可以獲得內核代碼中的 hot spot。那何時須要使用 tracepoint 來採樣呢?
我想人們使用 tracepoint 的基本需求是對內核的運行時行爲的關心,如前所述,有些內核開發人員須要專一於特定的子系統,好比內存管理模塊。這便須要統計相關內核函數的運行狀況。另外,內核行爲對應用程序性能的影響也是不容忽視的:
以以前的遺憾爲例,假如時光倒流,我想我要作的是統計該應用程序運行期間究竟發生了多少次系統調用。在哪裏發生的?
下面我用 ls 命令來演示 sys_enter 這個 tracepoint 的使用:
[root@ovispoly /]# perf stat -e raw_syscalls:sys_enter ls bin dbg etc lib media opt root selinux sys usr boot dev home lost+found mnt proc sbin srv tmp var Performance counter stats for 'ls': 101 raw_syscalls:sys_enter 0.003434730 seconds time elapsed [root@ovispoly /]# perf record -e raw_syscalls:sys_enter ls [root@ovispoly /]# perf report Failed to open .lib/ld-2.12.so, continuing without symbols # Samples: 70 # # Overhead Command Shared Object Symbol # ........ ............... ............... ...... # 97.14% ls ld-2.12.so [.] 0x0000000001629d 2.86% ls [vdso] [.] 0x00000000421424 # # (For a higher level overview, try: perf report --sort comm,dso) # |
這個報告詳細說明了在 ls 運行期間發生了多少次系統調用 ( 上例中有 101 次 ),多數系統調用都發生在哪些地方 (97% 都發生在 ld-2.12.so 中 )。
有了這個報告,或許我可以發現更多能夠調優的地方。好比函數 foo() 中發生了過多的系統調用,那麼我就能夠思考是否有辦法減小其中有些沒必要要的系統調用。
您可能會說 strace 也能夠作一樣事情啊,的確,統計系統調用這件事徹底能夠用 strace 完成,但 perf 還能夠幹些別的,您所須要的就是修改 -e 選項後的字符串。
羅列 tracepoint 實在是不太地道,本文固然不會這麼作。但學習每個 tracepoint 是有意義的,相似背單詞之於學習英語同樣,是一項緩慢痛苦卻不得不作的事情。'
5.2一樣,咱們跟蹤一下wirteback子系統的相關狀況:
<SPAN style="FONT-SIZE: 14px">root@hyk-linux:/home/hyk/program/cprogram# perf record -e writeback:* lsa.out cscope.po.out perf.data.old t2.c test1 testperf cscope.in.out malloc.c t1 tags test1s testperf.c cscope.out perf.data t2 test test.img [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.013 MB perf.data (~548 samples) ] root@hyk-linux:/home/hyk/program/cprogram# perf report# ========# captured on: Mon Aug 26 08:59:58 2013# hostname : hyk-linux# os release : 3.10.9# perf version : 3.10.9# arch : i686# nrcpus online : 4# nrcpus avail : 4# cpudesc : Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz# cpuid : GenuineIntel,6,42,7# total memory : 4084184 kB# cmdline : /media/usr/src/linux-3.10.9/tools/perf/perf record -e writeback:* ls# event : name = writeback:writeback_dirty_page, type = 2, config = 0x291, confi# event : name = writeback:writeback_dirty_inode_start, type = 2, config = 0x290# event : name = writeback:writeback_dirty_inode, type = 2, config = 0x28f, conf# event : name = writeback:writeback_write_inode_start, type = 2, config = 0x28e# event : name = writeback:writeback_write_inode, type = 2, config = 0x28d, conf# event : name = writeback:writeback_queue, type = 2, config = 0x28c, config1 =# event : name = writeback:writeback_exec, type = 2, config = 0x28b, config1 = 0# event : name = writeback:writeback_start, type = 2, config = 0x28a, config1 =# event : name = writeback:writeback_written, type = 2, config = 0x289, config1# event : name = writeback:writeback_wait, type = 2, config = 0x288, config1 = 0# event : name = writeback:writeback_pages_written, type = 2, config = 0x287, co# event : name = writeback:writeback_nowork, type = 2, config = 0x286, config1 =# event : name = writeback:writeback_wake_background, type = 2, config = 0x285,# event : name = writeback:writeback_bdi_register, type = 2, config = 0x284, con# event : name = writeback:writeback_bdi_unregister, type = 2, config = 0x283, c# event : name = writeback:wbc_writepage, type = 2, config = 0x282, config1 = 0x# event : name = writeback:writeback_queue_io, type = 2, config = 0x281, config1# event : name = writeback:global_dirty_state, type = 2, config = 0x280, config1# event : name = writeback:bdi_dirty_ratelimit, type = 2, config = 0x27f, config# event : name = writeback:balance_dirty_pages, type = 2, config = 0x27e, config# event : name = writeback:writeback_sb_inodes_requeue, type = 2, config = 0x27d# event : name = writeback:writeback_congestion_wait, type = 2, config = 0x27c,# event : name = writeback:writeback_wait_iff_congested, type = 2, config = 0x27# event : name = writeback:writeback_single_inode_start, type = 2, config = 0x27# event : name = writeback:writeback_single_inode, type = 2, config = 0x279, con# HEADER_CPU_TOPOLOGY info available, use -I to display# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore# ========## Samples: 0 of event 'writeback:writeback_dirty_page'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 1 of event 'writeback:writeback_dirty_inode_start'# Event count (approx.): 1## Overhead Command Shared Object Symbol# ........ ....... ................. ......................#100.00% ls [kernel.kallsyms] [k] __mark_inode_dirty# Samples: 1 of event 'writeback:writeback_dirty_inode'# Event count (approx.): 1## Overhead Command Shared Object Symbol# ........ ....... ................. ......................#100.00% ls [kernel.kallsyms] [k] __mark_inode_dirty# Samples: 0 of event 'writeback:writeback_write_inode_start'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_write_inode'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_queue'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_exec'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_start'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_written'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_wait'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_pages_written'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_nowork'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_wake_background'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_bdi_register'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_bdi_unregister'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:wbc_writepage'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_queue_io'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:global_dirty_state'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:bdi_dirty_ratelimit'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:balance_dirty_pages'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_sb_inodes_requeue'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......## Samples: 0 of event 'writeback:writeback_congestion_wait'# Event count (approx.): 0## Overhead Command Shared Object Symbol# ........ ....... ............. ......</SPAN> |