float32_t Sum_float(float32_t *data, const int count) { float32x4_t res = vdupq_n_f32(0.0f); for(int i = 0; i < (count & (~15)); i += 16) { #if 01 float32x4x4_t v0 = vld1q_f32_x4(data + i); float32x4_t v00 = v0.val[0]; float32x4_t v01 = v0.val[1]; float32x4_t v02 = v0.val[2]; float32x4_t v03 = v0.val[3]; #else float32x4_t v00 = vld1q_f32(data + i); float32x4_t v01 = vld1q_f32(data + i + 4); float32x4_t v02 = vld1q_f32(data + i + 8); float32x4_t v03 = vld1q_f32(data + i + 12); #endif v00 = vaddq_f32(v00, v02); v01 = vaddq_f32(v01, v03); res = vaddq_f32(res, vaddq_f32(v00, v01)); } float32x2_t res1 = vadd_f32(vget_low_f32(res), vget_high_f32(res)); float32_t v0[2]; vst1_f32(v0, res1); v0[0] += v0[1]; for(int i = count & (~15); i < count; ++i){ v0[0] += data[i]; } return v0[0]; }
首先,查閱了https://static.docs.arm.com/ihi0073/c/IHI0073C_arm_neon_intrinsics_ref.pdf,對於vld1q_f32_x4
這個指令,v7/A32/A64
都是支持的。android
不一樣編譯器版本結果:首先,對於全部的版本,若是使用#else
塊的代碼,都是能夠編譯成功的,對於使用#if 01
塊的代碼,結果以下:c++
armeabi-v7a with o1 | armeabi=v7a with o0 | arm64-v8a | |
---|---|---|---|
r20c | clang++: error: clang frontend command failed due to signal (use -v to see invocation) | ok | ok |
r19c | ok | ok | ok |
r15c | error: use of undeclared identifier 'vld1q_f32_x4' | error: use of undeclared identifier 'vld1q_f32_x4' | ok |
不單單vld1q_f32_x4
,對於vld1_u8_x2;vst1q_f32_x4
等相似指令都存在這樣的問題。git
測試代碼:github
int main() { const size_t len = 1024*1024 * 16; float32_t *data = new float32_t[len]; for(size_t i = 0; i < len; ++i) { data[i] = std::rand() / 100.0; } clock_t t0 = std::clock(); float32_t sum = Sum_float(data, len); printf("sum=%f , time cost=%f \n", sum, 1000.0 * (double)(std::clock() - t0) / CLOCKS_PER_SEC); return 0; }
測試了使用三種NDK版本編譯arm64-v8a
測試,同時使用r19c編譯了armeabi-v7a
,分別使用#if
和#else
分之,發現耗時都是在3.55ms左右,無明顯差異。shell
相似問題:https://github.com/mattgodbolt/compiler-explorer/issues/1906frontend
雖然使用r19c的版本編譯armeabi-v7a
成功,或者使用不優化的r20c也同樣,可是執行時發生了crash。緣由是執行vldN(q)_type_xN
指令時,地址不對齊致使的crash。ionic
而對於arm64-v8a
版本,把全部傳給vldN(q)_type_xN
的地址打印出來,一樣發現也有0x7350800001
這樣的地址,並且地址末位爲0到E的都有,可是卻沒有報錯。也即,對於該指令只有armeabi-v7a
有地址對齊要求,而arm64-v8a
卻沒有?ide
同時,常規的vldN(q)_type
指令則沒有地址對齊的要求,因此最好不要使用vldN(q)_type_xN
。性能
在代碼中由於地址對齊而致使的crash日誌:測試
libc : Fatal signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001 in tid 27659 (ClarityOpt), pid 27659 (ClarityOpt) crash_dump32: obtaining output fd from tombstoned, type: kDebuggerdTombstone crash_dump32: performing dump of process 27659 (target tid = 27659) DEBUG : Process name is /data/local/tmp/ClarityOpt, not key_process DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** DEBUG : Build fingerprint: 'OPPO/PCCM00/OP4A7A:10/QKQ1.191222.002/1584699103:user/release-keys' DEBUG : Revision: '0' DEBUG : ABI: 'arm' DEBUG : Timestamp: 2020-05-09 15:15:16+0800 DEBUG : pid: 27659, tid: 27659, name: ClarityOpt >>> /data/local/tmp/ClarityOpt <<< DEBUG : uid: 0 crash_dump32: type=1400 audit(0.0:27044): avc: denied { read } for name="ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1 crash_dump32: type=1400 audit(0.0:27045): avc: denied { open } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1 crash_dump32: type=1400 audit(0.0:27046): avc: denied { getattr } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1 crash_dump32: type=1400 audit(0.0:27047): avc: denied { map } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1 DEBUG : signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001 DEBUG : r0 00000043 r1 00000000 r2 a9a5ac6f r3 00000003 DEBUG : r4 f0900001 r5 ffcb0a00 r6 ffcb0a40 r7 ffcb0b60 DEBUG : r8 f0900007 r9 00000001 r10 f0900000 r11 f0900000 DEBUG : ip ffcb0500 sp ffcb09f0 lr 00000004 pc 021d265e DEBUG : DEBUG : backtrace: DEBUG : #00 pc 0000365e /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70) DEBUG : #01 pc 00004271 /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70) DEBUG : #02 pc 00004c9f /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70) DEBUG : #03 pc 00004dd3 /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70) DEBUG : #04 pc 000513bb /apex/com.android.runtime/lib/bionic/libc.so (__libc_init+66) (BuildId: 8e41d0dce7911ae25a51deb63aa9720c) DEBUG : #05 pc 00002a98 /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)