In the Faster R-CNN paper, the RoI head crops the feature-map regions corresponding to 128 RoIs, then uses an RoI pooling layer to output a 7×7 feature map for each. In PyTorch this can be done with:
- torch.nn.functional.adaptive_max_pool2d(input, output_size, return_indices=False)
- torch.nn.AdaptiveMaxPool2d(output_size, return_indices=False)
These functions are convenient to call, but this implementation has one drawback: it is slow.
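As a minimal sketch of this naive approach (function and variable names are illustrative, and the rounding/slicing details differ between implementations), each RoI is cropped from the feature map and adaptively max-pooled to 7×7:

```python
import torch
import torch.nn.functional as F

def roi_pooling_naive(features, rois, output_size=(7, 7), spatial_scale=1.0):
    """Naive RoI pooling: crop each RoI and adaptively max-pool it to a
    fixed output size. rois has shape (N, 5): [batch_index, x1, y1, x2, y2]."""
    outputs = []
    for roi in rois:
        b = int(roi[0])
        x1, y1, x2, y2 = [int(round(v.item() * spatial_scale)) for v in roi[1:]]
        crop = features[b, :, y1:y2 + 1, x1:x2 + 1]       # (C, h, w)
        outputs.append(F.adaptive_max_pool2d(crop, output_size))
    return torch.stack(outputs)                           # (N, C, 7, 7)

feats = torch.randn(1, 512, 50, 50)
rois = torch.tensor([[0, 10, 10, 30, 30],
                     [0, 0, 0, 49, 49]], dtype=torch.float)
out = roi_pooling_naive(feats, rois)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```

The Python-level loop over RoIs is exactly what makes this version slow.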
There are therefore many alternative implementations. Borrowing from implementations on GitHub, this post runs a richer comparison experiment covering four methods in total:
Method 1: a C extension built with cffi and called from PyTorch. This requires separate C and CUDA source files plus an ahead-of-time compilation step; the process is cumbersome and the code structure somewhat messy. For simple CUDA extensions (little code, no complex library dependencies) this approach is not very friendly.
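For reference, the per-bin computation that such a C/CUDA extension parallelizes (over RoIs, channels, and output bins) can be sketched in pure Python; the bin quantization below follows the original RoI pooling scheme, and the function name is illustrative:

```python
import numpy as np

def roi_pool_bins(feat, roi, out_h=7, out_w=7, spatial_scale=1.0):
    """Per-bin max pooling over one RoI -- the loop a C/CUDA extension
    runs in parallel. feat is (C, H, W); roi is (x1, y1, x2, y2)."""
    C, H, W = feat.shape
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    out = np.full((C, out_h, out_w), -np.inf, dtype=feat.dtype)
    for ph in range(out_h):
        for pw in range(out_w):
            # Bin boundaries, following the original RoI pooling quantization.
            hstart = max(y1 + int(np.floor(ph * roi_h / out_h)), 0)
            hend = min(y1 + int(np.ceil((ph + 1) * roi_h / out_h)), H)
            wstart = max(x1 + int(np.floor(pw * roi_w / out_w)), 0)
            wend = min(x1 + int(np.ceil((pw + 1) * roi_w / out_w)), W)
            if hend > hstart and wend > wstart:
                out[:, ph, pw] = feat[:, hstart:hend, wstart:wend].max(axis=(1, 2))
    return out

feat = np.arange(2 * 10 * 10, dtype=np.float32).reshape(2, 10, 10)
pooled = roi_pool_bins(feat, (0, 0, 9, 9))
print(pooled.shape)  # (2, 7, 7)
```

Moving this triple loop into compiled CUDA code is what gives method 1 its speed.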
Method 2: online (just-in-time) compilation with CuPy, which provides CUDA extensions for PyTorch directly (pure C extensions are also possible). CuPy implements a NumPy-compatible multi-dimensional array on CUDA, giving GPU-accelerated matrix operations that NumPy itself does not have. CuPy has since been split out of Chainer into an independent library.
Method 3: implementation in Chainer. Compared with other deep learning frameworks, Chainer is not especially well known, but it is an excellent framework: pure Python, with a clean design and simple syntax. GPU acceleration in Chainer is also implemented via CuPy. Chainer additionally has companion packages such as ChainerCV, which includes implementations of networks like Faster R-CNN and SSD.
Image source: slides from the official Chainer website
Method 4: pure PyTorch, i.e. the two functions given at the beginning of this post.
From method 1 to method 4, the implementations become progressively simpler, and correspondingly slower.
Below are the results of a simple comparison experiment. The variables are the batch size, the image size (strictly speaking, the feature-map size), the number of RoIs, and whether backpropagation is included. Note that the output size is 7×7, as in the original Faster R-CNN paper; all runs use CUDA, and scale=1, i.e. the feature map has the same size as the original image.
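The timings below were presumably collected with a loop of roughly this shape; this harness is a hypothetical reconstruction (the function and parameter names are mine, not the benchmark's actual code), the key detail being the CUDA synchronization before reading the clock, since CUDA kernel launches are asynchronous:

```python
import time
import torch

def benchmark(fn, *args, n_iter=10, has_backward=False):
    """Average wall-clock time of fn(*args) over n_iter runs.
    Synchronizes CUDA so queued kernels are included in the measurement."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iter):
        out = fn(*args)
        if has_backward:
            out.sum().backward()  # fresh graph each iteration
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / n_iter

x = torch.randn(8, 16, 8, 8, requires_grad=True)
t = benchmark(lambda v: torch.nn.functional.adaptive_max_pool2d(v, (7, 7)), x,
              has_backward=True)
print(t > 0)  # True
```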
Comparison 1: forward and backward passes
use_cuda: True, has_backward: True (times in seconds)

| batch_size | size | num_rois | method1 | method2 | method3 | method4 |
|---|---|---|---|---|---|---|
| 8 | 8 | 10 | 0.001353292465209961 | 0.04485161781311035 | 0.06167919635772705 | 0.009436330795288085 |
| 8 | 8 | 100 | 0.0003777980804443359 | 0.001593632698059082 | 0.00210268497467041 | 0.061138014793396 |
| 64 | 64 | 100 | 0.001754002571105957 | 0.0047376775741577145 | 0.006129913330078125 | 0.06233139038085937 |
| 64 | 64 | 1000 | 0.0018497371673583984 | 0.010891580581665039 | 0.023005642890930177 | 0.5292188739776611 |
| 256 | 256 | 100 | 0.09110891819000244 | 0.4102628231048584 | 0.3902537250518799 | 0.6544218873977661 |
| 256 | 256 | 1000 | 0.09256606578826904 | 0.641594967842102 | 1.3756087446212768 | 4.076273036003113 |
Comparison 2: forward pass only
use_cuda: True, has_backward: False (times in seconds)

| batch_size | size | num_rois | method1 | method2 | method3 | method4 |
|---|---|---|---|---|---|---|
| 8 | 8 | 10 | 0.000156359672546386 | 0.009024391174316406 | 0.009477467536926269 | 0.002876405715942383 |
| 8 | 8 | 100 | 0.00017533779144287 | 0.00040388107299804 | 0.00085462093353271 | 0.02638674259185791 |
| 64 | 64 | 100 | 0.00018683433532714 | 0.00039398193359375 | 0.00234550476074218 | 0.02483976364135742 |
| 64 | 64 | 1000 | 0.0013917160034179 | 0.0010843658447265 | 0.0025740385055541 | 0.2577446269989014 |
| 256 | 256 | 100 | 0.0003826856613153 | 0.0004550600051874 | 0.2729876136779785 | 0.0269237756729125 |
| 256 | 256 | 1000 | 0.0008277797698974 | 0.0021707582473754 | 0.2724076747894287 | 0.2687232542037964 |
The last method is nearly always the slowest, because it iterates over every RoI one by one in Python, which is extremely inefficient.
Comparison 3: fix the batch to 1 (a single image), with size 50×50 (the feature-map size, so the original image is 800×800), 512 feature-map channels, and num_rois set to 300. This approximates the test-time setting of Faster R-CNN with batch size 1. The input feature map is thus (1, 512, 50, 50) and rois is (300, 5). The first column of rois is the batch index; since there is only one image in the batch, it is all zeros and plays no real role. The timings:
use_cuda: True, has_backward: True (times in seconds)

| batch_size | size | num_rois | method1 | method2 | method3 | method4 |
|---|---|---|---|---|---|---|
| 1 | 50 | 300 | 0.0344547653198242 | 0.1322056961059570 | 0.1307379817962646 | 0.2016681671142578 |
As the results show, method 2 and method 3 are almost equally fast, so the more concise Chainer version can be used. However, when training Faster R-CNN with multiple batches, method 1 is the better choice, as it is by far the fastest.
The code has been uploaded to GitHub.