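In the previous post I described a way to add IACA markers, but it was still cumbersome to use. So I tried modifying the pony compiler to add IACA support directly; the code currently lives in the iaca branch.
Since I have not sent a PR upstream yet, you need to clone and build it yourself.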
    git clone https://github.com/oraoto/ponyc.git
    cd ponyc
    git checkout iaca
Then build it by following the official build steps; usually a single make is enough.

Wrap the code you want to mark with IACA.start() and IACA.stop(). Taking code from pony-websocket as an example:
    while (i + 4) < size do
      IACA.start()
      p(i)? = p(i)? xor m1
      p(i + 1)? = p(i + 1)? xor m2
      p(i + 2)? = p(i + 2)? xor m3
      p(i + 3)? = p(i + 3)? xor m4
      i = i + 4
    end
    IACA.stop()
After compiling, the binary can be analyzed with iaca:
    $ iaca ./echo-server.exe

    F:\build > iaca .\echo-server.exe
    Intel(R) Architecture Code Analyzer Version - v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
    Analyzed File - .\echo-server.exe
    Binary Format - 64Bit
    Architecture - SKL
    Analysis Type - Throughput

    Throughput Analysis Report
    --------------------------
    Block Throughput: 6.74 Cycles       Throughput Bottleneck: Dependency chains
    Loop Count: 22
    Port Binding In Cycles Per Iteration:
    -------------------------------------------------------------------------------------
    |  Port  |  0 - DV  |   1   |   2 - D   |   3 - D   |   4   |   5   |   6   |   7   |
    -------------------------------------------------------------------------------------
    | Cycles | 2.5  0.0 |  2.5  | 4.0   4.0 | 4.0   4.0 |  4.0  |  2.5  |  2.5  |  0.0  |
    -------------------------------------------------------------------------------------

    DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3)
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion occurred
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
    X - instruction not supported, was not accounted in Analysis

    | Num Of |                         Ports pressure in cycles                         |  |
    |  Uops  |  0 - DV  |   1   |   2 - D   |   3 - D   |   4   |   5   |   6   |   7   |
    -------------------------------------------------------------------------------------
    |   1*   |          |       |           |           |       |       |       |       | cmp rax, rbx
    |  0*F   |          |       |           |           |       |       |       |       | jbe 0x95
    |   4    |          |  0.5  |  1.0 1.0  |  1.0 1.0  |  1.0  |       |  0.5  |       | xor byte ptr [rdi+rbx*1], r8b
    |   1    |          |  0.5  |           |           |       |  0.5  |       |       | lea rsi, ptr [rbx+0x1]
    |   1*   |          |       |           |           |       |       |       |       | cmp rax, rsi
    |  0*F   |          |       |           |           |       |       |       |       | jbe 0x8b
    |   4    |   0.5    |       |  1.0 1.0  |  1.0 1.0  |  1.0  |       |  0.5  |       | xor byte ptr [rdi+rbx*1+0x1], r9b
    |   1    |   0.5    |       |           |           |       |  0.5  |       |       | add rsi, 0x1
    |   1*   |          |       |           |           |       |       |       |       | cmp rax, rsi
    |  0*F   |          |       |           |           |       |       |       |       | jbe 0x80
    |   4    |          |  0.5  |  1.0 1.0  |  1.0 1.0  |  1.0  |       |  0.5  |       | xor byte ptr [rdi+rbx*1+0x2], r10b
    |   1    |   0.5    |       |           |           |       |  0.5  |       |       | add rsi, 0x1
    |   1*   |          |       |           |           |       |       |       |       | cmp rax, rsi
    |  0*F   |          |       |           |           |       |       |       |       | jbe 0x75
    |   4    |          |  0.5  |  1.0 1.0  |  1.0 1.0  |  1.0  |       |  0.5  |       | xor byte ptr [rdi+rbx*1+0x3], r11b
    |   1    |          |  0.5  |           |           |       |  0.5  |       |       | lea rdx, ptr [rsi+0x5]
    |   1    |   0.5    |       |           |           |       |       |  0.5  |       | add rsi, 0x1
    |   1*   |          |       |           |           |       |       |       |       | cmp rdx, rax
    |  0*F   |          |       |           |           |       |       |       |       | jb 0xffffffffffffffab
    |   1    |   0.5    |       |           |           |       |  0.5  |       |       | add rbx, 0x4
    Total Num Of Uops: 27
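As a rough reading of this report (my own arithmetic, not part of the iaca output): the marked loop body XORs four bytes per iteration, so the reported block throughput can be converted to a per-byte cost. Ports 2, 3 and 4 are each busy for 4.0 cycles per iteration, so even without the dependency chains the loop would not go below about one cycle per byte.

```latex
\frac{6.74\ \text{cycles/iteration}}{4\ \text{bytes/iteration}} \approx 1.7\ \text{cycles/byte},
\qquad
\frac{4.0\ \text{cycles/iteration}}{4\ \text{bytes/iteration}} = 1.0\ \text{cycles/byte (port bound)}
```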
In pony's builtin package, some of the code looks like this:
    fun _apply(i: USize): this->A =>
      compile_intrinsic

    fun ref _update(i: USize, value: A!): A^ =>
      compile_intrinsic

    fun _offset(n: USize): this->Pointer[A] =>
      compile_intrinsic
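The function body is just compile_intrinsic: these functions are built into the compiler, which generates their code directly. So I simply added the following to the builtin package: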
    primitive IACA
      fun start(): None =>
        compile_intrinsic

      fun stop(): None =>
        compile_intrinsic
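At this point compilation fails, because the compiler does not yet know how to compile these two functions. They have to be "registered" in the compiler, and for that it is enough to follow how the Platform package is handled.
The function that actually generates the code: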
    static void iaca_start(compile_t* c, reach_type_t* t, token_id cap)
    {
      FIND_METHOD("start", cap);
      compile_type_t* t_result = (compile_type_t*)m->result->c_type;
      start_function(c, t, m, t_result->use_type, &c_t->use_type, 1);

      // Ask LLVM to always inline this function at its call sites.
      LLVMAddFunctionAttr(c_m->func, LLVMAlwaysInlineAttribute);

      // IACA start marker: mov ebx, 111 followed by the bytes 0x64 0x67 0x90.
      LLVMTypeRef void_fn = LLVMFunctionType(c->void_type, NULL, 0, false);
      LLVMValueRef asmstr = LLVMConstInlineAsm(void_fn,
        ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", "", true, false);
      LLVMValueRef call = LLVMBuildCall(c->builder, asmstr, NULL, 0, "");

      // Return the None instance, as the Pony signature requires.
      LLVMBuildRet(c->builder, t_result->instance);
      codegen_finishfun(c);
    }
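To make the `.byte` string less opaque, here is a small sketch of how those eight bytes are put together. It assumes the marker layout from Intel's iacaMarks.h: `mov ebx, <id>` (opcode 0xbb plus a little-endian 32-bit immediate) followed by 0x64 0x67 0x90, with id 111 for the start marker and 222 for the stop marker. The names `iaca_marker_bytes`, `IACA_START_ID` and `IACA_STOP_ID` are mine and are not part of ponyc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marker ids as used by Intel's iacaMarks.h (names here are hypothetical). */
enum { IACA_START_ID = 111, IACA_STOP_ID = 222 };

/* Fill `out` with the 8-byte IACA marker for `id`:
 *   0xbb imm32         -> mov ebx, id   (immediate stored little-endian)
 *   0x64 0x67 0x90     -> prefixed nop that completes the marker */
void iaca_marker_bytes(uint32_t id, uint8_t out[8])
{
    assert(out != NULL);
    out[0] = 0xbb;
    out[1] = (uint8_t)(id & 0xff);
    out[2] = (uint8_t)((id >> 8) & 0xff);
    out[3] = (uint8_t)((id >> 16) & 0xff);
    out[4] = (uint8_t)((id >> 24) & 0xff);
    out[5] = 0x64;
    out[6] = 0x67;
    out[7] = 0x90;
}
```

With `IACA_START_ID` this yields the same eight bytes as the inline asm string in iaca_start; with `IACA_STOP_ID` only the second byte changes, to 0xde, which is presumably the only difference in the stop counterpart.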
All it does is emit a single inline asm statement as LLVM IR. Without optimization, the generated IR looks like this:
    while_body:                                       ; preds = %invoke13, %while_init
      %28 = call fastcc %IACA* @IACA_val_create_o(%IACA* @IACA_Inst), !dbg !5828, !pony.newcall !3
      %29 = call fastcc %None* @IACA_val_start_o(%IACA* %28), !dbg !5830

    ; Function Attrs: alwaysinline
    define fastcc %None* @IACA_val_start_o(%IACA* noalias readonly dereferenceable(8)) unnamed_addr #7 !pony.abi !3 {
    entry:
      call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", ""()
      ret %None* @None_Inst
    }
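That's right: what gets generated is a function call, so we rely on the optimizer to inline the function at the call site. The optimized result is: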
    ; <label>:38:                                     ; preds = %35, %67
      %39 = phi i64 [ %71, %67 ], [ 0, %35 ]
      tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", ""() #2
      %40 = icmp ugt i64 %4, %39
      br i1 %40, label %43, label %41
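This is exactly what we want.
The current shortcoming is that code is still generated for the function itself, so iaca sometimes analyzes the wrong location and produces a result like this: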
    | Num Of |                         Ports pressure in cycles                         |  |
    |  Uops  |  0 - DV  |   1   |   2 - D   |   3 - D   |   4   |   5   |   6   |   7   |
    -------------------------------------------------------------------------------------
    |   1    |          |       |           |           |       |  1.0  |       |       | lea rax, ptr [rip+0x71a8]
    |  3^#   |          |       |  1.0 1.0  |           |       |       |  0.1  |       | ret
    |   1    |   0.4    |       |           |           |       |       |  0.6  |       | mov ebx, 0x6f
    |   1    |   0.6    |       |           |           |       |       |  0.4  |       | addr32 nop
    |   1    |          |  1.0  |           |           |       |       |       |       | lea rax, ptr [rip+0x71a8]
    |   3^   |          |       |           |  1.0 1.0  |       |       |       |       | ret
    Total Num Of Uops: 10
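So the compilation result is apparently not deterministic? When this happens, the only option for now is to recompile until the correct result shows up.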
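One way to check a build before handing it to iaca is to count how often the start marker occurs in the executable's bytes. This is only a sketch, and it rests on an assumption: that the wrong analysis comes from an extra copy of the marker left in the out-of-line body of IACA.start, which the `lea` / `ret` sequence in the bad output suggests but does not prove. `count_marker` and `IACA_START_MARKER` are hypothetical names, and reading the file into a buffer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The start marker bytes, copied from the inline asm string in iaca_start. */
const uint8_t IACA_START_MARKER[8] = {0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90};

/* Hypothetical helper: count non-overlapping occurrences of `marker`
 * (of length `marker_len`) inside the `len` bytes starting at `buf`. */
size_t count_marker(const uint8_t *buf, size_t len,
                    const uint8_t *marker, size_t marker_len)
{
    assert(buf != NULL && marker != NULL && marker_len > 0);
    size_t count = 0;
    size_t i = 0;
    while (i + marker_len <= len) {
        if (memcmp(buf + i, marker, marker_len) == 0) {
            count++;
            i += marker_len;   /* continue after this match */
        } else {
            i++;
        }
    }
    return count;
}
```

If the count for a single marked loop is greater than one, the executable contains a second copy of the marker somewhere, and a result like the one above would not be surprising.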