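Perf is a system performance profiling tool that ships with the Linux kernel. Although its version is still only 0.0.2, Perf already shows impressive capability and can stand comparison with OProfile, currently the popular profiler on Linux.
Perf's advantage is its tight integration with the Linux kernel: it is the first tool able to use new features as they are added to the kernel, whereas tools such as OProfile and GProf usually lag a step behind. Perf works on the same basic principle as OProfile: it gets and sets performance counters in the CPU's PMU registers to obtain information such as instructions executed, cache misses suffered, and branches mispredicted. The Linux kernel provides a set of abstractions over these registers, so you can view sample information per process, per CPU, or per counter group.
The Perf workflow is very similar to OProfile's, so if you already know OProfile, Perf is easy to pick up. This is just a brief walk-through of the examples given in the Perf examples in [1]. I will keep updating it as I find out more.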
$ perf record -f -- git gc
Counting objects: 1283571, done.
Compressing objects: 100% (206724/206724), done.
Writing objects: 100% (1283571/1283571), done.
Total 1283571 (delta 1070675), reused 1281443 (delta 1068566)
[ perf record: Captured and wrote 31.054 MB perf.data (~1356768 samples) ]

$ perf report --sort comm,dso,symbol | head -10
# Samples: 1355726
#
# Overhead          Command                            Shared Object  Symbol
# ........  ...............  .......................................  ......
#
    31.53%              git  /usr/bin/git                             [.] 0x0000000009804f
    13.41%        git-prune  /usr/bin/git-prune                       [.] 0x000000000ad06d
    10.05%              git  /lib/tls/i686/cmov/libc-2.8.90.so        [.] _nl_make_l10nflist
     5.36%        git-prune  /usr/lib/libz.so.1.2.3.3                 [.] 0x00000000009d51
     4.48%              git  /lib/tls/i686/cmov/libc-2.8.90.so        [.] memcpy
perf record is roughly equivalent to opcontrol --start, and perf report is roughly equivalent to opreport.
To see all available counters, use 'perf list':
titan:~> perf list
  [...]
  kmem:kmalloc                             [Tracepoint event]
  kmem:kmem_cache_alloc                    [Tracepoint event]
  kmem:kmalloc_node                        [Tracepoint event]
  kmem:kmem_cache_alloc_node               [Tracepoint event]
  kmem:kfree                               [Tracepoint event]
  kmem:kmem_cache_free                     [Tracepoint event]
  kmem:mm_page_free_direct                 [Tracepoint event]
  kmem:mm_pagevec_free                     [Tracepoint event]
  kmem:mm_page_alloc                       [Tracepoint event]
  kmem:mm_page_alloc_zone_locked           [Tracepoint event]
  kmem:mm_page_pcpu_drain                  [Tracepoint event]
  kmem:mm_page_alloc_extfrag               [Tracepoint event]
You can run your test program with any combination of the counters above. For example, the following command shows the number of page allocations and frees while running hackbench:
titan:~> perf stat -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.575

 Performance counter stats for './hackbench 10':

          13857  kmem:mm_page_pcpu_drain
          27576  kmem:mm_page_alloc
           6025  kmem:mm_pagevec_free
          20934  kmem:mm_page_free_direct

    0.613972165  seconds time elapsed
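Counter output like this is easy to post-process. The following is a minimal Python sketch, not part of perf itself, that pulls the tracepoint counts out of the text above and compares allocations with frees. The regular expression and the choice to treat mm_pagevec_free plus mm_page_free_direct as the "free" side are assumptions made for illustration.

```python
import re

# Text copied from the perf stat run shown above.
PERF_STAT_OUTPUT = """\
 Performance counter stats for './hackbench 10':

          13857  kmem:mm_page_pcpu_drain
          27576  kmem:mm_page_alloc
           6025  kmem:mm_pagevec_free
          20934  kmem:mm_page_free_direct

    0.613972165  seconds time elapsed
"""

# Assumed line shape: an integer count followed by a "subsystem:event" name.
COUNTER_LINE = re.compile(r"^\s*(\d+)\s+([\w-]+:[\w-]+)")


def parse_counters(text):
    """Return {event_name: count} for every tracepoint counter line."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


counters = parse_counters(PERF_STAT_OUTPUT)
allocated = counters["kmem:mm_page_alloc"]
# Assumption of this sketch: these two tracepoints together represent frees.
freed = counters["kmem:mm_pagevec_free"] + counters["kmem:mm_page_free_direct"]
print(f"allocated={allocated} freed={freed} difference={allocated - freed}")
```

The "seconds time elapsed" line is skipped on purpose, because its number is not followed by an event name.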
Perf can also report how much the results fluctuate across N runs:
titan:~> perf stat --repeat 5 -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
Time: 0.627
Time: 0.644
Time: 0.564
Time: 0.559
Time: 0.626

 Performance counter stats for './hackbench 10' (5 runs):

          12920  kmem:mm_page_pcpu_drain    ( +-   3.359% )
          25035  kmem:mm_page_alloc         ( +-   3.783% )
           6104  kmem:mm_pagevec_free       ( +-   0.934% )
          18376  kmem:mm_page_free_direct   ( +-   4.941% )

    0.643954516  seconds time elapsed   ( +-   2.363% )
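The "( +- x% )" column summarizes run-to-run noise. As a rough illustration of that kind of summary, the Python sketch below computes the mean and two relative spreads for the five "Time:" values printed by hackbench above. Which formula perf uses for its own column is not stated here, so this is an assumption-laden sketch and not a reimplementation; note also that hackbench's own "Time:" differs from perf's "seconds time elapsed".

```python
import math
import statistics

# The five "Time:" values printed by hackbench in the run above.
times = [0.627, 0.644, 0.564, 0.559, 0.626]


def summarize(samples):
    """Return (mean, relative stdev in %, relative standard error in %)."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)         # sample standard deviation
    stderr = stdev / math.sqrt(len(samples))  # standard error of the mean
    return mean, 100.0 * stdev / mean, 100.0 * stderr / mean


mean, rel_stdev, rel_stderr = summarize(times)
print(f"mean={mean:.3f}s  stdev={rel_stdev:.2f}%  stderr={rel_stderr:.2f}%")
```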
With the statistics above, you can start sampling a tracepoint you care about (for example page allocations):
titan:~/git> perf record -f -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.267 MB perf.data (~11679 samples) ]
See which function caused the page allocations:
titan:~/git> perf report
# Samples: 10646
#
# Overhead          Command               Shared Object
# ........  ...............  ..........................
#
    23.57%       git-repack  /lib64/libc-2.5.so
    21.81%              git  /lib64/libc-2.5.so
    14.59%              git  ./git
    11.79%       git-repack  ./git
     7.12%              git  /lib64/ld-2.5.so
     3.16%       git-repack  /lib64/libpthread-2.5.so
     2.09%       git-repack  /bin/bash
     1.97%               rm  /lib64/libc-2.5.so
     1.39%               mv  /lib64/ld-2.5.so
     1.37%               mv  /lib64/libc-2.5.so
     1.12%       git-repack  /lib64/ld-2.5.so
     0.95%               rm  /lib64/ld-2.5.so
     0.90%  git-update-serv  /lib64/libc-2.5.so
     0.73%  git-update-serv  /lib64/ld-2.5.so
     0.68%             perf  /lib64/libpthread-2.5.so
     0.64%       git-repack  /usr/lib64/libz.so.1.2.3
A more detailed view:
titan:~/git> perf report --sort comm,dso,symbol
# Samples: 10646
#
# Overhead          Command               Shared Object  Symbol
# ........  ...............  ..........................  ......
#
     9.35%       git-repack  ./git                       [.] insert_obj_hash
     9.12%              git  ./git                       [.] insert_obj_hash
     7.31%              git  /lib64/libc-2.5.so          [.] memcpy
     6.34%       git-repack  /lib64/libc-2.5.so          [.] _int_malloc
     6.24%       git-repack  /lib64/libc-2.5.so          [.] memcpy
     5.82%       git-repack  /lib64/libc-2.5.so          [.] __GI___fork
     5.47%              git  /lib64/libc-2.5.so          [.] _int_malloc
     2.99%              git  /lib64/libc-2.5.so          [.] memset
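Because the report is sorted by command, shared object and symbol, the same function shows up once per command. A small Python sketch (an illustration only, assuming the five-column line layout seen above) can add the rows up per symbol to show which function dominates overall:

```python
from collections import defaultdict

# Data rows copied from the perf report output above.
REPORT_LINES = """\
9.35% git-repack ./git [.] insert_obj_hash
9.12% git ./git [.] insert_obj_hash
7.31% git /lib64/libc-2.5.so [.] memcpy
6.34% git-repack /lib64/libc-2.5.so [.] _int_malloc
6.24% git-repack /lib64/libc-2.5.so [.] memcpy
5.82% git-repack /lib64/libc-2.5.so [.] __GI___fork
5.47% git /lib64/libc-2.5.so [.] _int_malloc
2.99% git /lib64/libc-2.5.so [.] memset
"""


def overhead_by_symbol(text):
    """Sum the overhead percentage of each symbol across all commands."""
    totals = defaultdict(float)
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: overhead, command, shared object, [.]/[k], symbol.
        if len(fields) != 5 or not fields[0].endswith("%"):
            continue  # skip headers, comments and blank lines
        overhead, _command, _dso, _kind, symbol = fields
        totals[symbol] += float(overhead.rstrip("%"))
    return dict(totals)


totals = overhead_by_symbol(REPORT_LINES)
for symbol, pct in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{pct:6.2f}%  {symbol}")
```

Summed this way, insert_obj_hash accounts for the largest share of the sampled page allocations, ahead of memcpy and _int_malloc.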
The call graph can also be recorded, and it tells you the percentage attributable to each function:
titan:~/git> perf record -f -g -e kmem:mm_page_alloc -c 1 ./git gc
Counting objects: 1148, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (450/450), done.
Writing objects: 100% (1148/1148), done.
Total 1148 (delta 690), reused 1148 (delta 690)
[ perf record: Captured and wrote 0.963 MB perf.data (~42069 samples) ]

titan:~/git> perf report -g
# Samples: 10686
#
# Overhead          Command               Shared Object
# ........  ...............  ..........................
#
    23.25%       git-repack  /lib64/libc-2.5.so
                |
                |--50.00%-- _int_free
                |
                |--37.50%-- __GI___fork
                |          make_child
                |
                |--12.50%-- ptmalloc_unlock_all2
                |          make_child
                |
                 --6.25%-- __GI_strcpy
    21.61%              git  /lib64/libc-2.5.so
                |
                |--30.00%-- __GI_read
                |          |
                |           --83.33%-- git_config_from_file
                |                     git_config
                |                     |
   [...]
Use the following command to count page allocations across the whole system over 10 seconds:
titan:~/git> perf stat -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 10

 Performance counter stats for 'sleep 10':

         171585  kmem:mm_page_pcpu_drain
         322114  kmem:mm_page_alloc
          73623  kmem:mm_pagevec_free
         254115  kmem:mm_page_free_direct

   10.000591410  seconds time elapsed
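Raw totals over a 10-second window are easier to compare once they are turned into rates. The Python sketch below simply divides the counts above by the elapsed time; the helper name per_second is made up for this example.

```python
# Counts and elapsed time copied from the system-wide run above.
ELAPSED_SECONDS = 10.000591410
COUNTS = {
    "kmem:mm_page_pcpu_drain": 171585,
    "kmem:mm_page_alloc": 322114,
    "kmem:mm_pagevec_free": 73623,
    "kmem:mm_page_free_direct": 254115,
}


def per_second(counts, elapsed_seconds):
    """Convert raw event counts into events per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {name: count / elapsed_seconds for name, count in counts.items()}


rates = per_second(COUNTS, ELAPSED_SECONDS)
for name, rate in rates.items():
    print(f"{name:28s} {rate:10.1f} events/s")
```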
Use the following command to see how system-wide page allocation fluctuates from one 1-second interval to the next:
titan:~/git> perf stat --repeat 10 -a -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct sleep 1

 Performance counter stats for 'sleep 1' (10 runs):

          17254  kmem:mm_page_pcpu_drain    ( +-   3.709% )
          34394  kmem:mm_page_alloc         ( +-   4.617% )
           7509  kmem:mm_pagevec_free       ( +-   4.820% )
          25653  kmem:mm_page_free_direct   ( +-   3.672% )

    1.058135029  seconds time elapsed   ( +-   3.089% )
Disassembly often lets you find which line of code generated the instructions that cause the problem.
titan:~/git> perf annotate __GI___fork
------------------------------------------------
 Percent |      Source code & Disassembly of libc-2.5.so
------------------------------------------------
         :
         :
         :      Disassembly of section .plt:
         :      Disassembly of section .text:
         :
         :      00000031a2e95560 <__fork>:
 [...]
    0.00 :        31a2e95602:       b8 38 00 00 00          mov    $0x38,%eax
    0.00 :        31a2e95607:       0f 05                   syscall
   83.42 :        31a2e95609:       48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
    0.00 :        31a2e9560f:       0f 87 4d 01 00 00       ja     31a2e95762 <__fork+0x202>
    0.00 :        31a2e95615:       85 c0                   test   %eax,%eax
The result above shows that 83.42% of the time in __GI___fork is attributed to system call 0x38 (56 in decimal, which is clone on x86-64).
You may want to know whether it is worth optimizing a particular function in your program. A good example is the discussion on the git mailing list about optimizing the SHA1 hash algorithm: we can use perf to estimate the payoff of an optimization in advance. See Linus's reply [2] for details.
"perf report --sort comm,dso,symbol" profiling shows the following for 'git fsck --full' on the kernel repo, using the Mozilla SHA1:
    47.69%  git  /home/torvalds/git/git    [.] moz_SHA1_Update
    22.98%  git  /lib64/libz.so.1.2.3      [.] inflate_fast
     7.32%  git  /lib64/libc-2.10.1.so     [.] __GI_memcpy
     4.66%  git  /lib64/libz.so.1.2.3      [.] inflate
     3.76%  git  /lib64/libz.so.1.2.3      [.] adler32
     2.86%  git  /lib64/libz.so.1.2.3      [.] inflate_table
     2.41%  git  /home/torvalds/git/git    [.] lookup_object
     1.31%  git  /lib64/libc-2.10.1.so     [.] _int_malloc
     0.84%  git  /home/torvalds/git/git    [.] patch_delta
     0.78%  git  [kernel]                  [k] hpet_next_event
Clearly, the performance of the SHA1 hash algorithm is critical here.
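One common way to turn such a profile into a prediction is Amdahl's law: if a function accounts for a fraction p of the run and is made s times faster, the whole run speeds up by 1 / ((1 - p) + p / s). The Python sketch below applies that formula to the 47.69% share of moz_SHA1_Update. Treating the sample percentage as a share of run time is an assumption, and the speedup factors tried are hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the run
    time becomes `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Share of samples in moz_SHA1_Update, taken from the profile above.
SHA1_FRACTION = 0.4769

# Hypothetical speedups of the SHA1 routine itself.
for factor in (1.5, 2.0, 4.0):
    print(f"SHA1 {factor}x faster -> run {overall_speedup(SHA1_FRACTION, factor):.2f}x faster")

# Upper bound: even an infinitely fast SHA1 leaves the other 52.31% untouched.
upper_bound = 1.0 / (1.0 - SHA1_FRACTION)
print(f"upper bound: {upper_bound:.2f}x")
```

So a SHA1 routine that is twice as fast would make this 'git fsck --full' run about 1.3 times faster, and no SHA1 optimization could get past roughly 1.9 times.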
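How to measure latency
If you enabled the following options when building your kernel:
CONFIG_PERF_COUNTERS=y CONFIG_EVENT_TRACING=y
then on a -tip kernel (-tip is a kernel development tree, not a perf command-line option) you can use several new performance counters to measure scheduler latencies: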
perf stat -e sched:sched_stat_wait -e task-clock ./hackbench 20
The command above shows how much time was spent waiting for a CPU. You can repeat the measurement 10 times:
aldebaran:/home/mingo> perf stat --repeat 10 -e \
              sched:sched_stat_wait:r -e task-clock ./hackbench 20
Time: 0.251
Time: 0.214
Time: 0.254
Time: 0.278
Time: 0.245
Time: 0.308
Time: 0.242
Time: 0.222
Time: 0.268
Time: 0.244

 Performance counter stats for './hackbench 20' (10 runs):

          59826  sched:sched_stat_wait    #      0.026 M/sec   ( +-   5.540% )
    2280.099643  task-clock-msecs         #      7.525 CPUs    ( +-   1.620% )

    0.303013390  seconds time elapsed   ( +-   3.189% )
Reading the scheduler event counters
# perf list 2>&1 | grep sched:
  sched:sched_kthread_stop                 [Tracepoint event]
  sched:sched_kthread_stop_ret             [Tracepoint event]
  sched:sched_wait_task                    [Tracepoint event]
  sched:sched_wakeup                       [Tracepoint event]
  sched:sched_wakeup_new                   [Tracepoint event]
  sched:sched_switch                       [Tracepoint event]
  sched:sched_migrate_task                 [Tracepoint event]
  sched:sched_process_free                 [Tracepoint event]
  sched:sched_process_exit                 [Tracepoint event]
  sched:sched_process_wait                 [Tracepoint event]
  sched:sched_process_fork                 [Tracepoint event]
  sched:sched_signal_send                  [Tracepoint event]
  sched:sched_stat_wait                    [Tracepoint event]
  sched:sched_stat_sleep                   [Tracepoint event]
  sched:sched_stat_iowait                  [Tracepoint event]
For latency analysis, sched_stat_wait, sched_stat_sleep and sched_stat_iowait are the events worth watching. If you want to see all the delays and their min/max/avg, you can run:
perf record -e sched:sched_stat_wait:r -f -R -c 1 ./hackbench 20
perf trace
http://blog.csdn.net/bluebeach/article/details/5912062
Perf stats for "doing nothing"
I've recently discovered the perf Linux tool. I heard that oprofile was deprecated and that there is a new tool, and I noted down to try it sometime.
Updated: more languages, fixed typos, more details, some graphs. Apologies if this shows twice in your feed.
The problem with perf stats is that I hate bloat, or even perceived bloat. Even when it doesn't affect me in any way, the concept of wasted cycles makes me really sad.
You probably can guess where this is going… I said, well, let's see what perf says about a simple "null" program. Surely doing nothing should be just a small number of instructions, right?
Note: I think that perf also records kernel-side code, because the lowest I could get was about 50K instructions for starting a null program in assembler that doesn't use libc and just executes the syscall asm instruction. However, these ~50K instructions are noise the moment you start to use more high-level languages. Yes, this is expected, but I was still shocked. And there's lots of delta between languages I'd expected to behave somewhat identically.
Again, this is not important in the real world. At all. They are just numbers, and probably the noise (due to short runtime) has lots of influence on the resulting numbers. And I might have screwed up the measurements somehow.
Test setup
Each program was the equivalent of 'exit 0' in the appropriate form for the language. During the measurements, the machine was as idle as possible (single-user mode, measurements run at real-time priority, etc.). For compiled languages, -O2 was used. For scripts, a simple #!/path/to/interpreter (without options, except in the case of Python, see below) was used. Each program/script was run 500 times (perf's -r 500) and I've checked that the variations were small (±0.80% on the metrics I used).
You can find all the programs I've used at http://git.k1024.org/perf-null.git/, the current tests are for the tag version perf-null-0.1.
The raw data for the below tables/graphs is at log-4.
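For readers who want to repeat this kind of measurement, here is a minimal Python sketch of how the perf command line could be assembled. It is not taken from the perf-null repository: the binary name ./null-c is hypothetical, and asking for the cycles and instructions events explicitly is an assumption about how the numbers were collected. Only -r (repeat) and -e (event) are used, both standard perf stat options.

```python
import shlex


def perf_stat_command(program, repeats=500, events=("cycles", "instructions")):
    """Build a `perf stat` argument list that runs `program` `repeats` times.

    `program` is a list: the executable followed by its arguments.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    cmd = ["perf", "stat", "-r", str(repeats)]
    for event in events:
        cmd += ["-e", event]
    # "--" separates perf's own options from the measured command.
    cmd += ["--", *program]
    return cmd


# "./null-c" is a hypothetical name for the compiled null C program.
cmd = perf_stat_command(["./null-c"])
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a machine where perf is installed; the sketch stops short of that so it runs anywhere.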
Results
Compiled languages
Language            Cycles  Instructions
asm                    63K           51K
c-dietlibc             74K           57K
c-libc-static         177K          107K
c-libc-shared         506K          300K
c++-static            178K          107K
c++-dynamic         1,750K        1,675K
haskell-single      2,229K        1,338K
haskell-threaded    2,629K        1,522K
ocaml-bytecode      3,271K        2,741K
ocaml-native        1,042K          666K
Going from dietlibc to glibc doubles the number of instructions, and for libc going from static to dynamic linking again roughly doubles it. I didn't manage to compile a program dynamically-linked against dietlibc.
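The step-to-step factors can be checked directly from the table. This small Python sketch divides the rounded instruction counts (in thousands) of selected rows; the helper name ratio is only for illustration.

```python
# Instruction counts in thousands, copied from the table above.
INSTRUCTIONS_K = {
    "c-dietlibc": 57,
    "c-libc-static": 107,
    "c-libc-shared": 300,
    "c++-dynamic": 1675,
}


def ratio(numerator, denominator):
    """How many times more instructions `numerator` executes than `denominator`."""
    return INSTRUCTIONS_K[numerator] / INSTRUCTIONS_K[denominator]


steps = [
    ("c-libc-static", "c-dietlibc"),     # dietlibc -> glibc (static)
    ("c-libc-shared", "c-libc-static"),  # static -> dynamic linking
    ("c++-dynamic", "c-libc-shared"),    # C -> C++ (both dynamic)
]
for later, earlier in steps:
    print(f"{earlier} -> {later}: {ratio(later, earlier):.2f}x")
```

With these rounded figures the dietlibc-to-glibc step is about 1.9×, while the static-to-dynamic step comes out nearer 2.8× than 2×.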
C++ is interesting. Linked statically, it is in the same ballpark as C, but when linked dynamically, it executes an order of magnitude more instructions. I would guess that the initialisation of the standard C++ library is complex?
Haskell, which has a GC and quite a complex runtime, executes slightly fewer instructions than C++, but uses more cycles. Not bad, given the capabilities of the runtime. The two versions of the Haskell program are with the single-threaded runtime and with the multi-threaded one; not much difference. A fully statically-linked Haskell binary (not recommended usually) goes below 1M instructions, but not by much.
OCaml is a very nice surprise. The bytecode runtime is a bit slow to start up, but the (native) compiled version is quite fast to start: only about 2× the number of instructions and cycles compared to C, for an advanced language. And twice as fast as Haskell ☺. Nice!
Shells
Language        Cycles  Instructions
dash              766K          469K
bash            1,680K        1,044K
mksh            1,258K          942K
mksh-static       504K          322K
So, dash takes ~470K instructions to start, which is way below the C++ count and a bit higher than the C one. Hence, I'd guess that dash is implemented in C ☺.
Next, bash is indeed slower on startup than dash, and by slightly more than 2× (both instructions and cycles). So yes, switching /bin/sh from bash to dash makes sense.
I wasn't aware of mksh, so thanks for the comments. It is, in the static variant, more efficient than dash, by about 1.5×. However, the dynamically linked version doesn't look too great (dash is also dynamically linked; I would guess a statically-linked dash "beats" mksh-static).
Text processing
I've added perl here (even though it's a 'full' language) just for comparison; it's also in the next section.
Language    Cycles  Instructions
mawk          849K          514K
gawk        1,363K          980K
perl        2,946K        2,213K
A normal spread. I knew the reason why mawk is Priority: required is that it's faster than gawk, but I wouldn't have guessed it's almost twice as fast.
Interpreted languages
Here is where the fun starts…
Language           Cycles  Instructions
lua 5.1            1,947K        1,485K
lua 5.2            1,724K        1,335K
lua jit            1,209K          803K
perl               2,946K        2,213K
tcl 8.4            5,011K        4,552K
tcl 8.5            6,888K        6,022K
tcl 8.6            8,196K        7,236K
ruby 1.8           7,013K        6,128K
ruby 1.9.3        35,870K       35,022K
python 2.6 -S     11,752K       10,247K
python 2.7 -S     11,438K       10,198K
python 3.2 -S     29,003K       27,409K
pypy -S           21,106K       10,036K
python 2.6        25,143K       21,989K
python 2.7        47,325K       50,217K
python 2.7 -O     47,341K       50,185K
python 3.2       113,567K      124,133K
python 3.2 -O    113,424K      124,133K
pypy              90,779K       68,455K
The numbers here are not quite what I expected. There's a huge delta between the fastest (hi Lua!) and the slowest (bye Python!).
I wasn't familiar with Lua, so I tested it thanks to the comments. It is, I think, the only language which actually improves from one version to the next (bonus points), and where the JIT version also makes it faster. In context, lua jit starts faster than C++.
Perl is the one that goes above C++'s instruction count, but not by much. From the point of view of the system, a Perl 'hello world' is only about 1.3×-1.6× slower than a C++ one. Not bad, not bad.
Next category is composed of TCL and Ruby, both of which had older versions 2-3× slower than Perl, but whose most recent versions are even slower. TCL has an almost constant slowdown across versions (5M, 6.9M, 8.2M cycles), but Ruby seems to have taken a significant step backwards: 1.9.3 is 5× slower than 1.8. I wonder why? As for TCL, I didn't expect it to be slower to start up than Perl; good to know.
Last category is Python. Oh my. If you run perf stat python -c 'pass' you get some unbelievable numbers, like 50M instructions to do, well, nothing. Yes, it has a GC, yes, it does import modules at runtime, but still… On closer investigation, the site module and the imports it does do eat a lot of time. Running a simpler python -S brings it back to a more reasonable 10M instructions, which is in line with the other interpreted languages.
However, even with the -S taken into account, Python also slows down across versions: a tiny improvement from 2.6 to 2.7, but (like Ruby) a 3× slowdown from 2.7 to 3.2. Trying the "optimised" version (-O) doesn't help at all. Trying pypy, which was based on Python 2.7, makes it around 2× slower to start up (both with and without -S).
So in the interpreted languages, it seems only Lua is trying to improve, the rest of the languages are piling up bloat with every version. Note: I should have tried multiple perl versions too.
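The effect of -S is easy to try on whatever Python is at hand. The sketch below is a rough stand-in for the perf measurement: it times wall-clock startup of the current interpreter with and without -S using only the standard library. It measures seconds, not instructions or cycles, so the numbers are not comparable with the table, and the run count is an arbitrary choice.

```python
import subprocess
import sys
import time


def startup_seconds(extra_args, runs=10):
    """Best-of-`runs` wall-clock time to start the interpreter and run `pass`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, *extra_args, "-c", "pass"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


with_site = startup_seconds([])         # normal startup, imports the site module
without_site = startup_seconds(["-S"])  # -S skips the automatic "import site"
print(f"default: {with_site * 1000:.1f} ms, with -S: {without_site * 1000:.1f} ms")
```

Taking the best of several runs is a design choice to reduce noise from other processes; a mean with a spread would be closer to what perf's -r reports.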
Java
Java is in its own category; you guess why ☺, right?
GCJ was version 4.6, whereas by java below I mean OpenJDK Runtime Environment (IcedTea6 1.11) (6b24-1.11-4).
Language          Cycles  Instructions
null-gcj         97,156K       74,576K
java -jamvm      85,535K       80,102K
java -server    147,174K      136,803K
java -zero      132,967K      124,977K
java -cacao     229,799K      205,312K
Using gcj to compile to "native code" (not sure whether that's native-native or something else) results in a binary that uses less than 100M cycles to start, but the jamvm VM is faster than that (85M cycles). Not bad for java! Python 3.2 is slower to start up; yes, I think the world has gone crazy.
However, the other VMs are a few times slower: server (the default one) is ~150M cycles, and cacao is ~230M cycles. Wow.
The other thing about java is that it was the only one that couldn't be put nicely in a file that you just ‘exec’ (there is binfmt_misc indeed, but that doesn't allow different Java classes to use different Java VMs, so I don't count this), as opposed to every single other thing I tested here. Someone didn't grow up on Unix?
Comparative analysis
Since there are almost 4 orders of magnitude difference between all the things tested here, a graph of cycles or instructions is not really useful. However, cycles/instruction, branches percentage and mispredicted branches percentage can be. Hence first the cycles/instruction:
Pypy is jumping out of the graph here, with the top value of over 2 cycles/instruction. Lua JIT is also bigger than Lua non-JIT, so maybe there's something to this (mostly joking, two data points don't make a series). On the other hand, Python wins as best cycles/instruction (0.91). Lots of ILP, to get below 1?
Java gets, irrespective of VM, consistently near 1.0-1.1. C++ gets very different numbers between static linking (1.666) and dynamic linking (1.045), whereas C has basically identical numbers. mksh also has a difference between dynamic and static linking. Hmm…
Ruby, TCL and Python have consistent values across versions.
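The cycles/instruction values quoted here can be recomputed from the tables. This Python sketch does so for a few rows; because the tables are rounded to thousands, the results can differ slightly from figures computed on the raw data (for example 1.66 here versus the 1.666 quoted for static C++).

```python
# (cycles, instructions) in thousands, copied from the tables above.
COUNTS_K = {
    "c++-static": (178, 107),
    "c++-dynamic": (1750, 1675),
    "python 3.2": (113567, 124133),
    "pypy -S": (21106, 10036),
}


def cycles_per_instruction(cycles, instructions):
    """CPI: average number of CPU cycles spent per executed instruction."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


cpi = {name: cycles_per_instruction(*pair) for name, pair in COUNTS_K.items()}
for name, value in sorted(cpi.items(), key=lambda item: item[1]):
    print(f"{name:12s} {value:.3f}")
```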
And that's about what I can see from that graph. Next up, percentage of branches out of total instructions and percentage of branches missed:
Note that the two lines shouldn't really be on the same graph; for the branch %, the 100% is the total instructions count, but for the branch miss %, the 100% is the total branch count. Anyway.
There are two low-value outliers:
- dynamically-linked C++ has a low branch percentage (17.46%) and a very low branch miss percentage (only 4.32%)
- gcj-compiled java has a very low branch miss percentage (only 2.82%!!!), even though it has a "regular" branch percentage (20.85%)
So it seems the gcj libraries are well optimised? I'm not familiar enough with this topic, but on the graph it does indeed stand out.
On the other end, mksh-static has a high branch miss percentage: 11.60%, which jumps clearly ahead of all the others; this might be why it has a high cycles/instruction count, due to all the stalls in misprediction; one has to wonder why it confuses the branch predictor?
I find it interesting that the overall branch count is very similar across languages, both when most of the cost is in the kernel (e.g. asm) and when the user-space cost heavily over-weighs the kernel (e.g. Java). The average is 20.85%, minimum is 17.46%, max 22.93%, standard deviation (if I used gnumeric correctly) is just 0.01. This seems a bit suspicious to me ☺. On the other hand, the mispredicted branches percentage varies much more: from a measly 2.82% to 11.60% (roughly a 4× difference).
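Since the two percentages use different denominators, one way to put them on a common footing is to multiply them, giving mispredicted branches per instruction. The Python sketch below does this for the two outliers named above, using only the percentages quoted in the text; the per-1000-instructions unit is just a convenient choice for this example.

```python
def mispredicts_per_1000_instructions(branch_pct, miss_pct):
    """Combine branch % (of instructions) and miss % (of branches)."""
    return 1000.0 * (branch_pct / 100.0) * (miss_pct / 100.0)


# Percentages quoted in the text above.
cxx_dynamic = mispredicts_per_1000_instructions(17.46, 4.32)
gcj = mispredicts_per_1000_instructions(20.85, 2.82)
print(f"c++-dynamic: {cxx_dynamic:.2f} mispredicted branches per 1000 instructions")
print(f"gcj:         {gcj:.2f} mispredicted branches per 1000 instructions")

# Spread of the miss percentage between the lowest and highest program.
miss_spread = 11.60 / 2.82
print(f"miss % spread: {miss_spread:.1f}x")
```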
Summary
So to recap, counting just instructions:
- going from dietlibc to glibc: 2× increase
- going from statically-linked libc to dynamically-linked libc: doubles it again
- going from C to C++: 5× increase
- C++ to Perl: 1.3×
- Perl to Ruby: 3×
- Ruby to Python (-S): 1.6×
- Python -S to regular Python: 5×
- Python to Java: 1×-2×, depending on version/runtime
- branch percentage (per total instructions) is quite consistent across all of the programs
Overall, you get roughly three orders of magnitude slower startup between a plain C program using dietlibc and Python. And all, to do basically nothing.
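The "three orders of magnitude" figure can be sanity-checked by multiplying the rounded per-step factors from the recap and comparing the product with the direct ratio from the tables (c-dietlibc at 57K instructions, python 2.7 at 50,217K). The Python sketch below does that arithmetic; it uses only numbers already given above.

```python
import math

# Rounded per-step factors from the recap list (dietlibc C up to regular Python).
STEPS = [
    ("dietlibc -> glibc", 2.0),
    ("static -> dynamic libc", 2.0),
    ("C -> C++", 5.0),
    ("C++ -> Perl", 1.3),
    ("Perl -> Ruby", 3.0),
    ("Ruby -> Python -S", 1.6),
    ("Python -S -> Python", 5.0),
]

chained = math.prod(factor for _name, factor in STEPS)

# Direct ratio from the tables: python 2.7 vs c-dietlibc, in thousands.
direct = 50217 / 57

print(f"chained factors: {chained:.0f}x (10^{math.log10(chained):.2f})")
print(f"direct ratio:    {direct:.0f}x (10^{math.log10(direct):.2f})")
```

Both routes land just under 10^3; the chained product is lower mainly because each step in the recap is rounded.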
On the other hand, I learned some interesting things while doing it, so it wasn't quite for nothing ☺.