Adding IACA markers to Pony programs (part 2)

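The previous post described one way to add IACA markers, but it was still cumbersome to use. So I tried modifying the Pony compiler to add IACA support directly. The code currently lives in the iaca branch.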

Usage

I have not sent a PR upstream yet, so you have to clone and build it yourself.

git clone https://github.com/oraoto/ponyc.git
cd ponyc
git checkout iaca

Then follow the official build instructions. Usually a single make is enough.

Put IACA.start() at the top of the loop body you want to analyze and IACA.stop() right after the loop. Taking code from pony-websocket as an example:

while (i + 4) < size do
  IACA.start()
  p(i)?     = p(i)?     xor m1
  p(i + 1)? = p(i + 1)? xor m2
  p(i + 2)? = p(i + 2)? xor m3
  p(i + 3)? = p(i + 3)? xor m4
  i = i + 4
end
IACA.stop()
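For comparison, here is a minimal sketch of the same marker placement in C. The macro names MARK_START and MARK_END and the function mask_payload are hypothetical, made up for this illustration. The marker bytes follow the convention used by Intel's iacaMarks.h (mov ebx, 111 or 222, followed by 0x64 0x67 0x90). The tail loop is an addition so that the sketch masks every byte. On anything other than GCC/Clang on x86-64 the macros expand to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker macros: emitted only for GCC/Clang on x86-64. */
#if defined(__GNUC__) && defined(__x86_64__)
#define MARK_START __asm__ __volatile__( \
    "movl $111, %%ebx\n\t.byte 0x64, 0x67, 0x90" ::: "ebx", "memory")
#define MARK_END __asm__ __volatile__( \
    "movl $222, %%ebx\n\t.byte 0x64, 0x67, 0x90" ::: "ebx", "memory")
#else
#define MARK_START ((void)0)
#define MARK_END ((void)0)
#endif

/* XOR-mask a payload in place with a 4-byte key, four bytes per iteration. */
void mask_payload(uint8_t *p, size_t size, const uint8_t m[4])
{
  assert(p != NULL && m != NULL);
  size_t i = 0;

  while (i + 4 <= size) {
    MARK_START;              /* start marker: top of the loop body */
    p[i]     ^= m[0];
    p[i + 1] ^= m[1];
    p[i + 2] ^= m[2];
    p[i + 3] ^= m[3];
    i += 4;
  }
  MARK_END;                  /* stop marker: right after the loop */

  /* Tail bytes (not part of the analyzed region). */
  for (size_t k = 0; i < size; i++, k++)
    p[i] ^= m[k];
}
```

The start marker sits inside the loop and the stop marker after it, which is the same layout as the Pony snippet above.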

After compiling, you can analyze the binary with iaca:

$ iaca ./echo-server.exe

F:\build > iaca .\echo-server.exe
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;17:30:24
Analyzed File -  .\echo-server.exe
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 6.74 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  2.5     0.0  |  2.5  |  4.0     4.0  |  4.0     4.0  |  4.0  |  2.5  |  2.5  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | cmp rax, rbx
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x95
|   4      |             | 0.5  | 1.0     1.0 | 1.0     1.0 | 1.0  |      | 0.5  |      | xor byte ptr [rdi+rbx*1], r8b
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | lea rsi, ptr [rbx+0x1]
|   1*     |             |      |             |             |      |      |      |      | cmp rax, rsi
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x8b
|   4      | 0.5         |      | 1.0     1.0 | 1.0     1.0 | 1.0  |      | 0.5  |      | xor byte ptr [rdi+rbx*1+0x1], r9b
|   1      | 0.5         |      |             |             |      | 0.5  |      |      | add rsi, 0x1
|   1*     |             |      |             |             |      |      |      |      | cmp rax, rsi
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x80
|   4      |             | 0.5  | 1.0     1.0 | 1.0     1.0 | 1.0  |      | 0.5  |      | xor byte ptr [rdi+rbx*1+0x2], r10b
|   1      | 0.5         |      |             |             |      | 0.5  |      |      | add rsi, 0x1
|   1*     |             |      |             |             |      |      |      |      | cmp rax, rsi
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x75
|   4      |             | 0.5  | 1.0     1.0 | 1.0     1.0 | 1.0  |      | 0.5  |      | xor byte ptr [rdi+rbx*1+0x3], r11b
|   1      |             | 0.5  |             |             |      | 0.5  |      |      | lea rdx, ptr [rsi+0x5]
|   1      | 0.5         |      |             |             |      |      | 0.5  |      | add rsi, 0x1
|   1*     |             |      |             |             |      |      |      |      | cmp rdx, rax
|   0*F    |             |      |             |             |      |      |      |      | jb 0xffffffffffffffab
|   1      | 0.5         |      |             |             |      | 0.5  |      |      | add rbx, 0x4
Total Num Of Uops: 27

Implementation

In Pony's builtin package, some code looks like this:

fun _apply(i: USize): this->A =>
  compile_intrinsic

fun ref _update(i: USize, value: A!): A^ =>
  compile_intrinsic

fun _offset(n: USize): this->Pointer[A] =>
  compile_intrinsic

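The function body is just compile_intrinsic: these functions are built into the compiler, which generates their code directly. So I added the following to the builtin package: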

primitive IACA
  fun start(): None => compile_intrinsic
  fun stop(): None => compile_intrinsic

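At this point compilation fails, because the compiler does not yet know how to compile these two functions. They have to be "registered" in the compiler, and the way Platform is handled serves as a reference.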

The function that actually generates the code:

static void iaca_start(compile_t* c, reach_type_t* t, token_id cap)
{
  FIND_METHOD("start", cap);

  compile_type_t* t_result = (compile_type_t*)m->result->c_type;
  start_function(c, t, m, t_result->use_type, &c_t->use_type, 1);

  LLVMAddFunctionAttr(c_m->func, LLVMAlwaysInlineAttribute);

  LLVMTypeRef void_fn = LLVMFunctionType(c->void_type, NULL, 0, false);
  LLVMValueRef asmstr = LLVMConstInlineAsm(void_fn,
    ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", "", true, false);
  LLVMValueRef call = LLVMBuildCall(c->builder, asmstr, NULL, 0, "");

  LLVMBuildRet(c->builder, t_result->instance);

  codegen_finishfun(c);
}

It simply emits one inline asm call in LLVM IR. Without optimization, the generated IR looks like this:

while_body:                                       ; preds = %invoke13, %while_init
  %28 = call fastcc %IACA* @IACA_val_create_o(%IACA* @IACA_Inst), !dbg !5828, !pony.newcall !3
  %29 = call fastcc %None* @IACA_val_start_o(%IACA* %28), !dbg !5830

; Function Attrs: alwaysinline
define fastcc %None* @IACA_val_start_o(%IACA* noalias readonly dereferenceable(8)) unnamed_addr #7 !pony.abi !3 {
entry:
  call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", ""()
  ret %None* @None_Inst
}

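Indeed, what gets generated is a function call, so we rely on the optimizer to inline the function at the call site. The optimized result is: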

; <label>:38:                                     ; preds = %35, %67
  %39 = phi i64 [ %71, %67 ], [ 0, %35 ]
  tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90", ""() #2
  %40 = icmp ugt i64 %4, %39
  br i1 %40, label %43, label %41

This is exactly what we want.
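The byte string passed to the inline asm can be read as an x86 instruction sequence: 0xbb followed by a 32-bit little-endian immediate is mov ebx, imm32, and 0x64 0x67 0x90 is a prefixed nop. The sketch below rebuilds the marker from its ID. The function name marker_bytes is hypothetical. The value 222 for the stop marker is taken from Intel's iacaMarks.h convention and is an assumption here, since this post only shows the start bytes.

```c
#include <stdint.h>

/* Marker IDs as used by Intel's iacaMarks.h (stop value assumed, see text). */
enum { IACA_START_ID = 111, IACA_STOP_ID = 222 };

/* Fill out[0..7] with: mov ebx, id ; .byte 0x64, 0x67, 0x90 */
void marker_bytes(uint32_t id, uint8_t out[8])
{
  out[0] = 0xbb;                        /* opcode: mov ebx, imm32 */
  out[1] = (uint8_t)(id & 0xff);        /* imm32, little-endian   */
  out[2] = (uint8_t)((id >> 8) & 0xff);
  out[3] = (uint8_t)((id >> 16) & 0xff);
  out[4] = (uint8_t)((id >> 24) & 0xff);
  out[5] = 0x64;                        /* fs segment prefix      */
  out[6] = 0x67;                        /* address-size prefix    */
  out[7] = 0x90;                        /* nop                    */
}
```

With IACA_START_ID this reproduces the bytes 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90 seen in the IR above.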

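The current shortcoming is that the standalone function is still emitted alongside the inlined copy, so iaca sometimes analyzes the wrong location and produces a result like this: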

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [rip+0x71a8]
|   3^#    |             |      | 1.0     1.0 |             |      |      | 0.1  |      | ret
|   1      | 0.4         |      |             |             |      |      | 0.6  |      | mov ebx, 0x6f
|   1      | 0.6         |      |             |             |      |      | 0.4  |      | addr32 nop
|   1      |             | 1.0  |             |             |      |      |      |      | lea rax, ptr [rip+0x71a8]
|   3^     |             |      |             | 1.0     1.0 |      |      |      |      | ret
Total Num Of Uops: 10

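So the compiler output is not deterministic? When this happens, the only option for now is to recompile until the correct result appears.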
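One possible explanation, offered only as a guess: if iaca finds its region by scanning the binary for the marker bytes, then the out-of-line copy of the start function (the mov ebx, 0x6f / addr32 nop / ret sequence in the listing above) matches just as well as the inlined copy in the loop. The sketch below counts start markers in a byte buffer to make that concrete. The function name count_start_markers is hypothetical, and this is not a description of how iaca actually works.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count occurrences of the IACA start marker in a code buffer. */
size_t count_start_markers(const uint8_t *buf, size_t len)
{
  static const uint8_t start[8] =
    {0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90};
  size_t n = 0;

  /* Slide an 8-byte window over the buffer and compare at each offset. */
  for (size_t i = 0; i + sizeof start <= len; i++) {
    if (memcmp(buf + i, start, sizeof start) == 0)
      n++;
  }
  return n;
}
```

A count greater than the number of IACA.start() call sites would suggest that leftover function bodies are also carrying markers.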
