NEON Programming, Part 2: Handling Leftover Data

Date: 2019-12-20


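This article deals with a common problem: the input data is not a multiple of the vector length, so the leftover elements at the start or end of the array need separate handling. We look at how NEON can deal with this case.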

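Leftover Data

NEON code typically operates on vectors of 4 to 16 elements. You will often find that your array is not a multiple of that length, and the leftover elements have to be handled separately.

For example, suppose you want to load, process, and store 8 elements per iteration using NEON, but your array is 21 elements long. The first two iterations run normally, but the third has only 5 elements left to process. What should you do?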
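The arithmetic behind this example can be sketched in C. This is only an illustration: the names `VECTOR_LENGTH`, `full_vectors`, and `leftover_elements` are assumptions made here, not part of the original article.

```c
#include <stddef.h>

enum { VECTOR_LENGTH = 8 };  /* elements per vector, as in the example */

/* Number of iterations that can load a complete vector. */
size_t full_vectors(size_t length) {
    return length / VECTOR_LENGTH;
}

/* Number of elements left over after the complete vectors.
   The mask form is valid only because VECTOR_LENGTH is a power of two. */
size_t leftover_elements(size_t length) {
    return length & (VECTOR_LENGTH - 1);
}
```

For a 21-element array this gives 2 complete vectors and 5 leftover elements, matching the example above.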

Solutions

There are three ways to handle leftover data. They differ in their requirements, performance, and code size. They are listed below, fastest first.

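Larger Arrays

If you can change the size of the arrays you process, pad each array up to the next multiple of the vector size. This lets you read and write past the end of your data without touching adjacent storage.

In the example, growing the array to 24 elements lets the third iteration complete.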

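Notes

Allocating larger arrays consumes more memory. With many short arrays, the overhead adds up.
The padding elements at the end must be initialized, usually to a value that does not affect the result. For example, if you are summing the array, the new elements must be initialized to 0. If you are finding the minimum, they must be set to the largest value the element type can hold.
In some cases no such neutral padding value is easy to find, for example when computing the range of the array's values.

Code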

 @ r0 = input array pointer
 @ r1 = output array pointer
 @ r2 = length of data in array

 @ We can assume that the array length is greater than zero, is an integer
 @ number of vectors, and is greater than or equal to the length of data
 @ in the array.

     add     r2, r2, #7      @ add (vector length-1) to the data length
     lsr     r2, r2, #3      @ divide the length of the array by the length
                             @  of a vector, 8, to find the number of
                             @  vectors of data to be processed

 loop:
     subs    r2, r2, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0}, [r0]!     @ load eight elements from the array pointed to
                             @  by r0 into d0, and update r0 to point to the
                             @  next vector
     ...
     ...                     @ process the input in d0
     ...

     vst1.8  {d0}, [r1]!     @ write eight elements to the output array, and
                             @  update r1 to point to next vector
     bne     loop            @ if r2 is not equal to 0, loop
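The padding rule can also be modeled in portable C. The sketch below is a scalar stand-in for the vector loop, not NEON code. It assumes the operation is a byte sum, so the neutral padding value is 0, and the names `LANES`, `padded_length`, and `padded_sum` are hypothetical. The caller is assumed to supply a buffer with room for the padded length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };  /* elements per vector */

/* Round a length up to the next multiple of LANES. This mirrors the
   add #7 / lsr #3 pair above, multiplied back into an element count. */
size_t padded_length(size_t length) {
    return (length + LANES - 1) / LANES * LANES;
}

/* Sum `length` bytes held in a buffer whose capacity is at least
   padded_length(length). The padding is overwritten with zeros. */
uint32_t padded_sum(uint8_t *buf, size_t length) {
    size_t padded = padded_length(length);
    memset(buf + length, 0, padded - length);  /* 0 is neutral for addition */

    uint32_t total = 0;
    for (size_t i = 0; i < padded; i += LANES) {
        /* The inner loop stands in for one 8-lane vector operation. */
        for (size_t lane = 0; lane < LANES; lane++)
            total += buf[i + lane];
    }
    return total;
}
```

For a minimum search the same structure would apply, with the padding set to the largest representable value instead of 0.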

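Overlapping

If the operation allows it, leftover elements can be handled by overlapping vectors. This means some elements are processed twice.

In the example, the first iteration processes elements 0 to 7, the second processes elements 5 to 12, and the third processes elements 13 to 20. Note that elements 5 to 7 are processed twice.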

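Notes

Overlapping can be used only when the result does not depend on how many times an element is processed; the operation must be idempotent. For example, it works when finding the maximum element of an array. It does not work when summing an array, because the overlapped elements would be counted twice.
The array must contain at least enough elements to fill one complete vector.

Code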

 @ r0 = input array pointer
 @ r1 = output array pointer
 @ r2 = length of data in array

 @ We can assume that the operation is idempotent, and the array is greater
 @ than or equal to one vector long.

     ands    r3, r2, #7      @ calculate number of elements left over after
                             @  processing complete vectors using
                             @  data length & (vector length - 1)
     beq     loopsetup       @ if the result of the ands is zero, the length
                             @  of the data is an integer number of vectors,
                             @  so there is no overlap, and processing can begin
                             @  at the loop

                             @ handle the first vector separately
     vld1.8  {d0}, [r0], r3  @ load the first eight elements from the array,
                             @  and update the pointer by the number of elements
                             @  left over
     ...
     ...                     @ process the input in d0
     ...

     vst1.8  {d0}, [r1], r3  @ write eight elements to the output array, and
                             @  update the pointer

                             @ now, set up the vector processing loop
 loopsetup:
     lsr     r2, r2, #3      @ divide the length of the array by the length
                             @  of a vector, 8, to find the number of
                             @  vectors of data to be processed

                             @ the loop can now be executed as normal. the
                             @  first few elements of the first vector will
                             @  overlap with some of those processed above
 loop:
     subs    r2, r2, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0}, [r0]!     @ load eight elements from the array, and update
                             @  the pointer
     ...
     ...                     @ process the input in d0
     ...

     vst1.8  {d0}, [r1]!     @ write eight elements to the output array, and
                             @  update the pointer
     bne     loop            @ if r2 is not equal to 0, loop
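The same control flow can be followed in a scalar C model. The sketch assumes the operation is a maximum search, which is safe to apply twice to the same element. The names `LANES`, `max_of_vector`, and `overlapped_max` are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };  /* elements per vector */

/* Stands in for one vector-wide step: fold eight elements into the
   running maximum. */
uint8_t max_of_vector(const uint8_t *v, uint8_t running) {
    for (size_t lane = 0; lane < LANES; lane++)
        if (v[lane] > running)
            running = v[lane];
    return running;
}

/* Maximum of an array that holds at least LANES elements. */
uint8_t overlapped_max(const uint8_t *data, size_t length) {
    uint8_t best = 0;
    size_t leftover = length & (LANES - 1);

    if (leftover != 0) {
        best = max_of_vector(data, best);  /* first vector: elements 0..7 */
        data += leftover;                  /* advance by the leftover count only */
    }
    /* Remaining complete vectors; the first one overlaps the vector above. */
    for (size_t v = length / LANES; v > 0; v--) {
        best = max_of_vector(data, best);
        data += LANES;
    }
    return best;
}
```

With 21 elements the three vector steps start at offsets 0, 5, and 13, so elements 5 to 7 are visited twice without changing the maximum.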

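Single Elements

NEON provides load and store instructions that operate on a single element of a vector. With these, you can load a vector containing one element, operate on it, and write the result back to memory.

In the example, the first two iterations run normally, processing elements 0 to 7 and 8 to 15. The third iteration has only 5 elements to process; these are handled in a separate loop that processes one element per iteration.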

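Notes

This method is slower than the previous ones, because each element must be loaded, processed, and stored individually.
Handling leftovers this way needs two loops, one for vectors and one for single elements. This increases code size.
NEON single-element loads change only the destination element and leave the rest of the vector intact. If your vectorized calculation includes instructions that work across a vector, such as VPADD, the register must be initialized before the first single element is loaded into it.

Code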

 @ r0 = input array pointer
 @ r1 = output array pointer
 @ r2 = length of data in array

     lsrs    r3, r2, #3      @ calculate the number of complete vectors to be
                             @  processed and set flags
     beq     singlesetup     @ if there are zero complete vectors, branch to
                             @  the single element handling code

                             @ process vector loop
 vectors:
     subs    r3, r3, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0}, [r0]!     @ load eight elements from the array and update
                             @  the pointer
     ...
     ...                     @ process the input in d0
     ...

     vst1.8  {d0}, [r1]!     @ write eight elements to the output array, and
                             @  update the pointer
     bne     vectors         @ if r3 is not equal to zero, loop

 singlesetup:
     ands    r3, r2, #7      @ calculate the number of single elements to process
     beq     exit            @ if the number of single elements is zero, branch
                             @  to exit

                             @ process single element loop
 singles:
     subs    r3, r3, #1      @ decrement the loop counter, and set flags
     vld1.8  {d0[0]}, [r0]!  @ load single element into d0, and update the
                             @  pointer
     ...
     ...                     @ process the input in d0[0]
     ...

     vst1.8  {d0[0]}, [r1]!  @ write the single element to the output array,
                             @  and update the pointer
     bne     singles         @ if r3 is not equal to zero, loop

 exit:
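The two-loop structure can be mirrored in C as a rough model. The per-element operation is elided in the listing above, so `process_element` below is a placeholder chosen for illustration (it adds 1). `LANES` and `process_with_single_tail` are likewise hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };  /* elements per vector */

/* Placeholder for the elided "process" step; an assumption, not the
   article's operation. */
uint8_t process_element(uint8_t x) {
    return (uint8_t)(x + 1);
}

void process_with_single_tail(const uint8_t *in, uint8_t *out, size_t length) {
    size_t vectors = length / LANES;        /* complete vectors */
    size_t singles = length & (LANES - 1);  /* leftover elements */

    /* Vector loop: the inner loop stands in for one 8-lane operation. */
    for (; vectors > 0; vectors--) {
        for (size_t lane = 0; lane < LANES; lane++)
            out[lane] = process_element(in[lane]);
        in += LANES;
        out += LANES;
    }

    /* Single-element loop: one load, one operation, one store per pass. */
    for (; singles > 0; singles--)
        *out++ = process_element(*in++);
}
```

Unlike the padded and overlapping variants, this version never reads or writes outside the first `length` elements, at the cost of the slower tail loop.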

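Further Considerations

Beginning or End

The overlapping and single-element techniques can be applied at either the beginning or the end of the array. The code above can be adapted as needed.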
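As a sketch of moving the overlap to the end of the array, the C helper below lists where each vector would start when the final vector is pulled back to finish exactly on the last element. `LANES` and `tail_overlap_starts` are names assumed for this illustration, and the array is assumed to hold at least one full vector.

```c
#include <stddef.h>

enum { LANES = 8 };  /* elements per vector */

/* Writes the starting offset of each vector into `starts` and returns
   how many vectors are needed. Requires length >= LANES, and `starts`
   must have room for length / LANES + 1 entries. */
size_t tail_overlap_starts(size_t length, size_t *starts) {
    size_t count = 0;

    /* Complete vectors from the beginning of the array. */
    for (size_t offset = 0; offset + LANES <= length; offset += LANES)
        starts[count++] = offset;

    /* One extra vector ending on the last element, overlapping the
       previous vector when the length is not a multiple of LANES. */
    if (length % LANES != 0)
        starts[count++] = length - LANES;

    return count;
}
```

For 21 elements the vectors start at 0, 8, and 13, so elements 13 to 15 are processed twice instead of elements 5 to 7.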

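Alignment

Load and store addresses should be aligned to cache lines, which allows more efficient memory access.

On Cortex-A8, this requires at least 16-word (64-byte) alignment. If you cannot align the start of the input and output arrays, you must handle both the beginning of the array (the part before the first aligned address) and the end (the incomplete vector).

When accesses are aligned, remember to use the :64, :128, or :256 address qualifiers on the load and store instructions for best performance. You can compare the number of cycles needed for loads and stores using the data in the Technical Reference Manual.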
Here’s the relevant page in the Cortex-A8 TRM.
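When the array cannot be aligned, the number of leading bytes to handle before the first aligned address can be computed as in the C sketch below. The function name `bytes_to_alignment` is an assumption for illustration. With 8-bit elements, as in the listings above, the byte count equals the element count.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes from `addr` up to the next multiple of `alignment`, or 0 if
   `addr` is already aligned. `alignment` must be a power of two. */
size_t bytes_to_alignment(uintptr_t addr, size_t alignment) {
    size_t misalignment = (size_t)(addr & (alignment - 1));
    return (alignment - misalignment) & (alignment - 1);
}
```

These leading elements can then be handled with the overlapping or single-element technique before the aligned vector loop begins.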