In the Faster R-CNN paper, the RoI head crops the feature-map regions corresponding to 128 RoIs, then uses an RoI pooling layer to output a 7×7 feature map for each. In PyTorch this can be done with:
- torch.nn.functional.adaptive_max_pool2d(input, output_size, return_indices=False)
- torch.nn.AdaptiveMaxPool2d(output_size, return_indices=False)
These functions are convenient to call, but this implementation has one drawback: it is slow.
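As a minimal sketch of this naive approach (function and variable names are illustrative, and the rounding/slicing details differ between implementations), each RoI is cropped from the feature map and adaptively max-pooled to 7×7:

```python
import torch
import torch.nn.functional as F

def roi_pooling_naive(features, rois, output_size=(7, 7), spatial_scale=1.0):
    """Naive RoI pooling: crop each RoI and adaptively max-pool it to a
    fixed output size. rois has shape (N, 5): [batch_index, x1, y1, x2, y2]."""
    outputs = []
    for roi in rois:
        b = int(roi[0])
        x1, y1, x2, y2 = [int(round(v.item() * spatial_scale)) for v in roi[1:]]
        crop = features[b, :, y1:y2 + 1, x1:x2 + 1]       # (C, h, w)
        outputs.append(F.adaptive_max_pool2d(crop, output_size))
    return torch.stack(outputs)                           # (N, C, 7, 7)

feats = torch.randn(1, 512, 50, 50)
rois = torch.tensor([[0, 10, 10, 30, 30],
                     [0, 0, 0, 49, 49]], dtype=torch.float)
out = roi_pooling_naive(feats, rois)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```

The Python-level loop over RoIs is exactly what makes this version slow.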
There are therefore many alternative implementations. Borrowing from implementations on GitHub, this post runs a richer comparison experiment covering four methods in total:
Method 1: a C extension built with cffi and called from PyTorch. This requires separate C and CUDA source files plus an ahead-of-time compilation step; the process is cumbersome and the code structure somewhat messy. For simple CUDA extensions (little code, no complex library dependencies) this approach is not very friendly.
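For reference, the per-bin computation that such a C/CUDA extension parallelizes (over RoIs, channels, and output bins) can be sketched in pure Python; the bin quantization below follows the original RoI pooling scheme, and the function name is illustrative:

```python
import numpy as np

def roi_pool_bins(feat, roi, out_h=7, out_w=7, spatial_scale=1.0):
    """Per-bin max pooling over one RoI -- the loop a C/CUDA extension
    runs in parallel. feat is (C, H, W); roi is (x1, y1, x2, y2)."""
    C, H, W = feat.shape
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    out = np.full((C, out_h, out_w), -np.inf, dtype=feat.dtype)
    for ph in range(out_h):
        for pw in range(out_w):
            # Bin boundaries, following the original RoI pooling quantization.
            hstart = max(y1 + int(np.floor(ph * roi_h / out_h)), 0)
            hend = min(y1 + int(np.ceil((ph + 1) * roi_h / out_h)), H)
            wstart = max(x1 + int(np.floor(pw * roi_w / out_w)), 0)
            wend = min(x1 + int(np.ceil((pw + 1) * roi_w / out_w)), W)
            if hend > hstart and wend > wstart:
                out[:, ph, pw] = feat[:, hstart:hend, wstart:wend].max(axis=(1, 2))
    return out

feat = np.arange(2 * 10 * 10, dtype=np.float32).reshape(2, 10, 10)
pooled = roi_pool_bins(feat, (0, 0, 9, 9))
print(pooled.shape)  # (2, 7, 7)
```

Moving this triple loop into compiled CUDA code is what gives method 1 its speed.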
Method 2: online (just-in-time) compilation with CuPy, which provides CUDA extensions for PyTorch directly (pure C extensions are also possible). CuPy implements a NumPy-compatible multi-dimensional array on CUDA, giving GPU-accelerated matrix operations that NumPy itself does not have. CuPy has since been split out of Chainer into an independent library.
Method 3: implementation in Chainer. Compared with other deep learning frameworks, Chainer is not especially well known, but it is an excellent framework: pure Python, with a clean design and simple syntax. GPU acceleration in Chainer is also implemented via CuPy. Chainer additionally has companion packages such as ChainerCV, which includes implementations of networks like Faster R-CNN and SSD.
Image source: slides from the official Chainer website
Method 4: pure PyTorch, i.e. the two functions given at the beginning of this post.
From method 1 to method 4, the implementations become progressively simpler, and correspondingly slower.
Below are the results of a simple comparison experiment. The variables are the batch size, the image size (strictly speaking, the feature-map size), the number of RoIs, and whether backpropagation is included. Note that the output size is 7×7, as in the original Faster R-CNN paper; all runs use CUDA, and scale=1, i.e. the feature map has the same size as the original image.
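The timings below were presumably collected with a loop of roughly this shape; this harness is a hypothetical reconstruction (the function and parameter names are mine, not the benchmark's actual code), the key detail being the CUDA synchronization before reading the clock, since CUDA kernel launches are asynchronous:

```python
import time
import torch

def benchmark(fn, *args, n_iter=10, has_backward=False):
    """Average wall-clock time of fn(*args) over n_iter runs.
    Synchronizes CUDA so queued kernels are included in the measurement."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iter):
        out = fn(*args)
        if has_backward:
            out.sum().backward()  # fresh graph each iteration
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / n_iter

x = torch.randn(8, 16, 8, 8, requires_grad=True)
t = benchmark(lambda v: torch.nn.functional.adaptive_max_pool2d(v, (7, 7)), x,
              has_backward=True)
print(t > 0)  # True
```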
Comparison 1: forward and backward passes
use_cuda: True, has_backward: True (times in seconds)

| batch_size | size | num_rois | method1 | method2 | method3 | method4 |
|---|---|---|---|---|---|---|
| 8 | 8 | 10 | 0.001353292465209961 | 0.04485161781311035 | 0.06167919635772705 | 0.009436330795288085 |
| 8 | 8 | 100 | 0.0003777980804443359 | 0.001593632698059082 | 0.00210268497467041 | 0.061138014793396 |
| 64 | 64 | 100 | 0.001754002571105957 | 0.0047376775741577145 | 0.006129913330078125 | 0.06233139038085937 |
| 64 | 64 | 1000 | 0.0018497371673583984 | 0.010891580581665039 | 0.023005642890930177 | 0.5292188739776611 |
| 256 | 256 | 100 | 0.09110891819000244 | 0.4102628231048584 | 0.3902537250518799 | 0.6544218873977661 |
| 256 | 256 | 1000 | 0.09256606578826904 | 0.641594967842102 | 1.3756087446212768 | 4.076273036003113 |
Comparison 2: forward pass only
use_cuda: True, has_backward: False (times in seconds)

| batch_size | size | num_rois | method1 | method2 | method3 | method4 |
|---|---|---|---|---|---|---|
| 8 | 8 | 10 | 0.000156359672546386 | 0.009024391174316406 | 0.009477467536926269 | 0.002876405715942383 |
| 8 | 8 | 100 | 0.00017533779144287 | 0.00040388107299804 | 0.00085462093353271 | 0.02638674259185791 |
| 64 | 64 | 100 | 0.00018683433532714 | 0.00039398193359375 | 0.00234550476074218 | 0.02483976364135742 |
| 64 | 64 | 1000 | 0.0013917160034179 | 0.0010843658447265 | 0.0025740385055541 | 0.2577446269989014 |
| 256 | 256 | 100 | 0.0003826856613153 | 0.0004550600051874 | 0.2729876136779785 | 0.0269237756729125 |
| 256 | 256 | 1000 | 0.0008277797698974 | 0.0021707582473754 | 0.2724076747894287 | 0.2687232542037964 |
The last method is nearly always the slowest, because it iterates over every RoI one by one in Python, which is extremely inefficient.
Comparison 3: fix the batch to 1 (a single image), with size 50×50 (the feature-map size, so the original image is 800×800), 512 feature-map channels, and num_rois set to 300. This approximates the test-time setting of Faster R-CNN with batch size 1. The input feature map is thus (1, 512, 50, 50) and rois is (300, 5). The first column of rois is the batch index; since there is only one image in the batch, it is all zeros and plays no real role. The timings:
use_cuda: True, has_backward: True (times in seconds)

| batch_size | size | num_rois | method1 | method2 | method3 | method4 |
|---|---|---|---|---|---|---|
| 1 | 50 | 300 | 0.0344547653198242 | 0.1322056961059570 | 0.1307379817962646 | 0.2016681671142578 |
As the results show, method 2 and method 3 are almost equally fast, so the more concise Chainer version can be used. However, when training Faster R-CNN with multiple batches, method 1 is the better choice, as it is by far the fastest.
The code has been uploaded to GitHub.