Having previously compared several implementations of RoI pooling and found that the pure Python / PyTorch built-in utility functions are indeed quite slow, this post runs a similar simple comparison on the other speed bottleneck in Faster R-CNN: NMS.
Four experiments are run to see how different implementation strategies affect NMS speed.
Method 1: pure Python implementation: simple and convenient, but slow
Method 2: compile the code directly as a Cython module
Method 3: declare all variables with static types first, then compile as a Cython module
Method 4: on top of method 3, add a CUDA module and compile with Cython, i.e. GPU acceleration
1. A brief note on Cython:
Cython is a tool for quickly generating Python extension modules. Syntactically it is a hybrid of Python and C. When Python performance hits a bottleneck, Cython brings native C speed into the Python program without rewriting it in C, so the existing Python code can be reused with only small changes; both development efficiency and execution efficiency improve, and Cython takes care of all the glue code in between.
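As a tiny illustration (a made-up toy example, not part of the NMS code in this post), a .pyx file can mix Python syntax with C-style declarations and, once built, is imported like any ordinary Python module:

# fib.pyx -- hypothetical toy example: a typed loop that Cython compiles to C
def fib(int n):
    cdef int i
    cdef long a = 0, b = 1
    for i in range(n):
        a, b = b, a + b
    return a

After building it (for example with cythonize, exactly as done for the NMS code below), from fib import fib works like a normal import but runs the loop at C speed.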
2. A brief introduction to NMS:
Faster R-CNN uses NMS in two places. The first is during both training and inference, when ProposalCreator generates proposals: since only a subset of the proposals is needed, NMS is used to filter them. The second is at inference time, after the 300 classification and bounding-box regression results are obtained, where non-maximum suppression is applied to each class separately. One might ask why we don't simply keep the single highest-confidence box per class: a class may appear more than once in an image (for example several people), so keeping only the top-scoring box would detect just one of them, whereas NMS ideally leaves exactly one box per instance of each class.
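A rough sketch of that second use (per-class NMS at test time) might look like the following. The data layout and names here are illustrative only, not taken from any particular Faster R-CNN codebase; py_cpu_nms is the pure Python version defined in section 1 below.

import numpy as np
from nms.nums_py import py_cpu_nms   # defined in section 1 below

def per_class_nms(boxes, scores, thresh=0.3):
    """boxes: dict class_id -> (n,4) array; scores: dict class_id -> (n,) array."""
    results = {}
    for cls_id, cls_boxes in boxes.items():
        cls_scores = scores[cls_id]
        # py_cpu_nms expects (n,5) rows of [x1, y1, x2, y2, score]
        dets = np.hstack((cls_boxes, cls_scores[:, None])).astype(np.float32)
        keep = py_cpu_nms(dets, thresh)
        results[cls_id] = dets[keep]   # ideally one surviving box per object instance
    return results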
1. Pure Python implementation: nums_py.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon May 7 21:45:37 2018

@author: lps
"""
import numpy as np

boxes = np.array([[100, 100, 210, 210, 0.72],
                  [250, 250, 420, 420, 0.8],
                  [220, 220, 320, 330, 0.92],
                  [100, 100, 210, 210, 0.72],
                  [230, 240, 325, 330, 0.81],
                  [220, 230, 315, 340, 0.9]])


def py_cpu_nms(dets, thresh):
    # dets: (m, 5) array of [x1, y1, x2, y2, score]   thresh: scalar IoU threshold
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]

    areas = (y2 - y1 + 1) * (x2 - x1 + 1)
    scores = dets[:, 4]
    keep = []

    index = scores.argsort()[::-1]

    while index.size > 0:
        i = index[0]        # the first index always has the highest score: keep it
        keep.append(i)

        x11 = np.maximum(x1[i], x1[index[1:]])   # corners of the overlap regions
        y11 = np.maximum(y1[i], y1[index[1:]])
        x22 = np.minimum(x2[i], x2[index[1:]])
        y22 = np.minimum(y2[i], y2[index[1:]])

        w = np.maximum(0, x22 - x11 + 1)         # widths of the overlaps
        h = np.maximum(0, y22 - y11 + 1)         # heights of the overlaps

        overlaps = w * h
        ious = overlaps / (areas[i] + areas[index[1:]] - overlaps)

        idx = np.where(ious <= thresh)[0]
        index = index[idx + 1]   # shift by 1 because ious was computed over index[1:]

    return keep


import matplotlib.pyplot as plt

def plot_bbox(dets, c='k'):
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]

    plt.plot([x1, x2], [y1, y1], c)
    plt.plot([x1, x1], [y1, y2], c)
    plt.plot([x1, x2], [y2, y2], c)
    plt.plot([x2, x2], [y1, y2], c)
    plt.title("after nms")

plot_bbox(boxes, 'k')   # before nms

keep = py_cpu_nms(boxes, thresh=0.7)
plot_bbox(boxes[keep], 'r')   # after nms
The result looks roughly like this (black boxes before NMS, red boxes after):
Create an nms folder and put nums_py.py plus an empty __init__.py inside it so that it becomes an importable package. Then, outside the nms folder, create the timing script test_num.py:
import numpy as np
import time
from nms.nums_py import py_cpu_nms   # for cpu
#from nms.gpu_nms import gpu_nms     # for gpu

np.random.seed(1)   # keep fixed
num_rois = 6000
minxy = np.random.randint(50, 145, size=(num_rois, 2))
maxxy = np.random.randint(150, 200, size=(num_rois, 2))
score = 0.8 * np.random.random_sample((num_rois, 1)) + 0.2

boxes_new = np.concatenate((minxy, maxxy, score), axis=1).astype(np.float32)


def nms_test_time(boxes_new):
    thresh = [0.7, 0.8, 0.9]
    T = 50
    for i in range(len(thresh)):
        since = time.time()
        for t in range(T):
            keep = py_cpu_nms(boxes_new, thresh=thresh[i])   # for cpu
            # keep = gpu_nms(boxes_new, thresh=thresh[i])    # for gpu
        print("thresh={:.1f}, time wastes:{:.4f}".format(thresh[i], (time.time() - since) / T))
    return keep


if __name__ == "__main__":
    nms_test_time(boxes_new)
The test data are 6000 initial RoIs, with NMS thresholds from 0.7 to 0.9. The larger the threshold, the slower it runs, because fewer boxes get suppressed at each step (more IoUs fall below the threshold), so more loop iterations are needed. For each threshold, NMS is run 50 times and the time is averaged:
Running it directly gives the following timings:
thresh=0.7, time wastes:0.0287
thresh=0.8, time wastes:0.1057
thresh=0.9, time wastes:0.4204
2. Compiling directly as a Cython module: nums_py1.pyx
First copy nums_py.py and rename the copy nums_py1.pyx (.pyx is the Cython source extension). Then create setup1.py inside the nms folder:
from distutils.core import setup
from Cython.Build import cythonize

setup(
    name='nms_module',
    ext_modules=cythonize('nums_py1.pyx'),
)
Next, build the shared library. In a terminal run:
python3 setup1.py build
This generates nums_py1.c (the C source) in the current directory, and the shared library nums_py1.cpython-35m-x86_64-linux-gnu.so under nms/build/lib.linux-x86_64-3.5. Copy the .so into the nms folder, and it can be used in the test script: simply change from nms.nums_py import py_cpu_nms to from nms.nums_py1 import py_cpu_nms. A .pyx file cannot be imported directly; it is only usable after being built.
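To make sure the test script really picks up the compiled extension rather than some stale .py file, a quick check of the loaded module's path can help (a small sanity-check snippet, not part of the original post):

import nms.nums_py1 as m
print(m.__file__)   # should point at the copied .so, e.g. nms/nums_py1.cpython-35m-x86_64-linux-gnu.so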
Running the test script gives:
thresh=0.7, time wastes:0.0272
thresh=0.8, time wastes:0.1038
thresh=0.9, time wastes:0.4184
This is only marginally faster than pure Python, so let's move on to method 3.
3. Compiling with Cython after adding static type declarations: nums_py2.pyx
import numpy as np
cimport numpy as np

# boxes = np.array([[100,100,210,210,0.72],
#                   [250,250,420,420,0.8],
#                   [220,220,320,330,0.92],
#                   [100,100,210,210,0.72],
#                   [230,240,325,330,0.81],
#                   [220,230,315,340,0.9]])

cdef inline np.float32_t max(np.float32_t a, np.float32_t b):
    return a if a >= b else b

cdef inline np.float32_t min(np.float32_t a, np.float32_t b):
    return a if a <= b else b

def py_cpu_nms(np.ndarray[np.float32_t, ndim=2] dets, np.float thresh):
    # dets: (m, 5) array of [x1, y1, x2, y2, score]   thresh: scalar IoU threshold
    cdef np.ndarray[np.float32_t, ndim=1] x1 = dets[:, 0]
    cdef np.ndarray[np.float32_t, ndim=1] y1 = dets[:, 1]
    cdef np.ndarray[np.float32_t, ndim=1] x2 = dets[:, 2]
    cdef np.ndarray[np.float32_t, ndim=1] y2 = dets[:, 3]

    cdef np.ndarray[np.float32_t, ndim=1] scores = dets[:, 4]
    cdef np.ndarray[np.float32_t, ndim=1] areas = (y2 - y1 + 1) * (x2 - x1 + 1)
    cdef np.ndarray[np.int_t, ndim=1] index = scores.argsort()[::-1]
    keep = []

    cdef int ndets = dets.shape[0]
    cdef np.ndarray[np.int_t, ndim=1] suppressed = np.zeros(ndets, dtype=np.int)

    cdef int _i, _j
    cdef int i, j
    cdef np.float32_t ix1, iy1, ix2, iy2, iarea
    cdef np.float32_t xx1, yy1, xx2, yy2
    cdef np.float32_t w, h
    cdef np.float32_t overlap, ious
    j = 0

    for _i in range(ndets):
        i = index[_i]
        if suppressed[i] == 1:
            continue
        keep.append(i)

        ix1 = x1[i]
        iy1 = y1[i]
        ix2 = x2[i]
        iy2 = y2[i]
        iarea = areas[i]

        for _j in range(_i + 1, ndets):
            j = index[_j]
            if suppressed[j] == 1:
                continue
            # corners of the intersection of box i and box j
            xx1 = max(ix1, x1[j])
            yy1 = max(iy1, y1[j])
            xx2 = min(ix2, x2[j])
            yy2 = min(iy2, y2[j])
            w = max(0.0, xx2 - xx1 + 1)
            h = max(0.0, yy2 - yy1 + 1)

            overlap = w * h
            ious = overlap / (iarea + areas[j] - overlap)
            if ious > thresh:
                suppressed[j] = 1

    return keep


import matplotlib.pyplot as plt

def plot_bbox(dets, c='k'):
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]

    plt.plot([x1, x2], [y1, y1], c)
    plt.plot([x1, x1], [y1, y2], c)
    plt.plot([x1, x2], [y2, y2], c)
    plt.plot([x2, x2], [y1, y2], c)
Static typing of the variables yields a large speedup, because the inner loops operate almost entirely on these variables; the main change is declaring them with cdef so that Cython can compile the arithmetic and array indexing down to plain C instead of going through Python objects.
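One convenient way to see how much of the file still falls back to Python objects is Cython's annotation output (an optional step, not used in the original timing runs):

cython -a nums_py2.pyx

This writes nums_py2.html, where lines highlighted in yellow still interact with the Python C API; the goal of the cdef declarations above is to make the hot inner loop plain C.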
Then, as before, create setup2.py:
from distutils.core import setup
from Cython.Build import cythonize

setup(
    name='nms_module',
    ext_modules=cythonize('nums_py2.pyx'),
)
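It is built the same way as before:

python3 setup2.py build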
After building, copy the .so into the nms folder, modify the import in the test script as before, and run it:
thresh=0.7, time wastes:0.0019
thresh=0.8, time wastes:0.0028
thresh=0.9, time wastes:0.0036
Compared with pure Python, this is roughly 15x, 38x and 118x faster for the three thresholds!
4. Adding GPU acceleration on top of method 3: gpu_nms.pyx
import numpy as np
cimport numpy as np

assert sizeof(int) == sizeof(np.int32_t)

cdef extern from "gpu_nms.hpp":
    void _nms(np.int32_t*, int*, np.float32_t*, int, int, float, int)

def gpu_nms(np.ndarray[np.float32_t, ndim=2] dets, np.float thresh,
            np.int32_t device_id=0):
    cdef int boxes_num = dets.shape[0]
    cdef int boxes_dim = dets.shape[1]
    cdef int num_out
    cdef np.ndarray[np.int32_t, ndim=1] \
        keep = np.zeros(boxes_num, dtype=np.int32)
    cdef np.ndarray[np.float32_t, ndim=1] \
        scores = dets[:, 4]
    cdef np.ndarray[np.int_t, ndim=1] \
        order = scores.argsort()[::-1]
    cdef np.ndarray[np.float32_t, ndim=2] \
        sorted_dets = dets[order, :]
    _nms(&keep[0], &num_out, &sorted_dets[0, 0], boxes_num, boxes_dim, thresh, device_id)
    keep = keep[:num_out]
    return list(order[keep])
Next, create the header file gpu_nms.hpp:
void _nms(int* keep_out, int* num_out, const float* boxes_host, int boxes_num, int boxes_dim, float nms_overlap_thresh, int device_id);
and the CUDA source nms_kernel.cu:
#include "gpu_nms.hpp" #include <vector> #include <iostream> #define CUDA_CHECK(condition) \ /* Code block avoids redefinition of cudaError_t error */ \ do { \ cudaError_t error = condition; \ if (error != cudaSuccess) { \ std::cout << cudaGetErrorString(error) << std::endl; \ } \ } while (0) #define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0)) int const threadsPerBlock = sizeof(unsigned long long) * 8; __device__ inline float devIoU(float const * const a, float const * const b) { float left = max(a[0], b[0]), right = min(a[2], b[2]); float top = max(a[1], b[1]), bottom = min(a[3], b[3]); float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f); float interS = width * height; float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1); float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1); return interS / (Sa + Sb - interS); } __global__ void nms_kernel(const int n_boxes, const float nms_overlap_thresh, const float *dev_boxes, unsigned long long *dev_mask) { const int row_start = blockIdx.y; const int col_start = blockIdx.x; // if (row_start > col_start) return; const int row_size = min(n_boxes - row_start * threadsPerBlock, threadsPerBlock); const int col_size = min(n_boxes - col_start * threadsPerBlock, threadsPerBlock); __shared__ float block_boxes[threadsPerBlock * 5]; if (threadIdx.x < col_size) { block_boxes[threadIdx.x * 5 + 0] = dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 0]; block_boxes[threadIdx.x * 5 + 1] = dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 1]; block_boxes[threadIdx.x * 5 + 2] = dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 2]; block_boxes[threadIdx.x * 5 + 3] = dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 3]; block_boxes[threadIdx.x * 5 + 4] = dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 4]; } __syncthreads(); if (threadIdx.x < row_size) { const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x; const float *cur_box = dev_boxes + cur_box_idx * 5; int i = 0; unsigned long long t = 0; int start = 0; if (row_start == col_start) { start = threadIdx.x + 1; } for (i = start; i < col_size; i++) { if (devIoU(cur_box, block_boxes + i * 5) > nms_overlap_thresh) { t |= 1ULL << i; } } const int col_blocks = DIVUP(n_boxes, threadsPerBlock); dev_mask[cur_box_idx * col_blocks + col_start] = t; } } void _set_device(int device_id) { int current_device; CUDA_CHECK(cudaGetDevice(¤t_device)); if (current_device == device_id) { return; } // The call to cudaSetDevice must come before any calls to Get, which // may perform initialization using the GPU. 
CUDA_CHECK(cudaSetDevice(device_id)); } void _nms(int* keep_out, int* num_out, const float* boxes_host, int boxes_num, int boxes_dim, float nms_overlap_thresh, int device_id) { _set_device(device_id); float* boxes_dev = NULL; unsigned long long* mask_dev = NULL; const int col_blocks = DIVUP(boxes_num, threadsPerBlock); CUDA_CHECK(cudaMalloc(&boxes_dev, boxes_num * boxes_dim * sizeof(float))); CUDA_CHECK(cudaMemcpy(boxes_dev, boxes_host, boxes_num * boxes_dim * sizeof(float), cudaMemcpyHostToDevice)); CUDA_CHECK(cudaMalloc(&mask_dev, boxes_num * col_blocks * sizeof(unsigned long long))); dim3 blocks(DIVUP(boxes_num, threadsPerBlock), DIVUP(boxes_num, threadsPerBlock)); dim3 threads(threadsPerBlock); nms_kernel<<<blocks, threads>>>(boxes_num, nms_overlap_thresh, boxes_dev, mask_dev); std::vector<unsigned long long> mask_host(boxes_num * col_blocks); CUDA_CHECK(cudaMemcpy(&mask_host[0], mask_dev, sizeof(unsigned long long) * boxes_num * col_blocks, cudaMemcpyDeviceToHost)); std::vector<unsigned long long> remv(col_blocks); memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks); int num_to_keep = 0; for (int i = 0; i < boxes_num; i++) { int nblock = i / threadsPerBlock; int inblock = i % threadsPerBlock; if (!(remv[nblock] & (1ULL << inblock))) { keep_out[num_to_keep++] = i; unsigned long long *p = &mask_host[0] + i * col_blocks; for (int j = nblock; j < col_blocks; j++) { remv[j] |= p[j]; } } } *num_out = num_to_keep; CUDA_CHECK(cudaFree(boxes_dev)); CUDA_CHECK(cudaFree(mask_dev)); }
Then, outside the nms folder, create setup3.py:
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext
import subprocess
import numpy as np
import os
from os.path import join as pjoin


def find_in_path(name, path):
    "Find a file in a search path"
    # Adapted from
    # http://code.activestate.com/recipes/52224-find-a-file-given-a-search-path/
    for dir in path.split(os.pathsep):
        binpath = pjoin(dir, name)
        if os.path.exists(binpath):
            return os.path.abspath(binpath)
    return None


def locate_cuda():
    """Locate the CUDA environment on the system

    Returns a dict with keys 'home', 'nvcc', 'include', and 'lib64'
    and values giving the absolute path to each directory.

    Starts by looking for the CUDAHOME env variable. If not found, everything
    is based on finding 'nvcc' in the PATH.
    """
    # first check if the CUDAHOME env variable is in use
    if 'CUDAHOME' in os.environ:
        home = os.environ['CUDAHOME']
        nvcc = pjoin(home, 'bin', 'nvcc')
    else:
        # otherwise, search the PATH for NVCC
        default_path = pjoin(os.sep, 'usr', 'local', 'cuda', 'bin')
        nvcc = find_in_path('nvcc', os.environ['PATH'] + os.pathsep + default_path)
        if nvcc is None:
            raise EnvironmentError('The nvcc binary could not be '
                'located in your $PATH. Either add it to your path, or set $CUDAHOME')
        home = os.path.dirname(os.path.dirname(nvcc))

    cudaconfig = {'home': home, 'nvcc': nvcc,
                  'include': pjoin(home, 'include'),
                  'lib64': pjoin(home, 'lib64')}
    for k, v in cudaconfig.items():
        if not os.path.exists(v):
            raise EnvironmentError('The CUDA %s path could not be located in %s' % (k, v))

    return cudaconfig


CUDA = locate_cuda()

try:
    numpy_include = np.get_include()
except AttributeError:
    numpy_include = np.get_numpy_include()


def customize_compiler_for_nvcc(self):
    """inject deep into distutils to customize how the dispatch
    to gcc/nvcc works.

    If you subclass UnixCCompiler, it's not trivial to get your subclass
    injected in, and still have the right customizations (i.e.
    distutils.sysconfig.customize_compiler) run on it. So instead of going
    the OO route, I have this. Note, it's kind of like a weird functional
    subclassing going on."""

    # tell the compiler it can process .cu
    self.src_extensions.append('.cu')

    # save references to the default compiler_so and _compile methods
    default_compiler_so = self.compiler_so
    super = self._compile

    # now redefine the _compile method. This gets executed for each
    # object but distutils doesn't have the ability to change compilers
    # based on source extension: we add it.
    def _compile(obj, src, ext, cc_args, extra_postargs, pp_opts):
        if os.path.splitext(src)[1] == '.cu':
            # use the cuda for .cu files
            self.set_executable('compiler_so', CUDA['nvcc'])
            # use only a subset of the extra_postargs, which are 1-1 translated
            # from the extra_compile_args in the Extension class
            postargs = extra_postargs['nvcc']
        else:
            postargs = extra_postargs['gcc']

        super(obj, src, ext, cc_args, postargs, pp_opts)
        # reset the default compiler_so, which we might have changed for cuda
        self.compiler_so = default_compiler_so

    # inject our redefined _compile method into the class
    self._compile = _compile


# run the customize_compiler
class custom_build_ext(build_ext):
    def build_extensions(self):
        customize_compiler_for_nvcc(self.compiler)
        build_ext.build_extensions(self)


ext_modules = [
    Extension('nms.gpu_nms',
              ['nms/nms_kernel.cu', 'nms/gpu_nms.pyx'],
              library_dirs=[CUDA['lib64']],
              libraries=['cudart'],
              language='c++',
              runtime_library_dirs=[CUDA['lib64']],
              # this syntax is specific to this build system
              # we're only going to use certain compiler args with nvcc and not with
              # gcc the implementation of this trick is in customize_compiler() below
              extra_compile_args={'gcc': ["-Wno-unused-function"],
                                  'nvcc': ['-arch=sm_35',
                                           '--ptxas-options=-v',
                                           '-c',
                                           '--compiler-options',
                                           "'-fPIC'"]},
              include_dirs=[numpy_include, CUDA['include']])
]

setup(
    name='fast_rcnn',
    ext_modules=ext_modules,
    # inject our custom trigger
    cmdclass={'build_ext': custom_build_ext},
)
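The original post does not spell out the build step here. Assuming the same pattern as before, something along these lines (run from the directory containing setup3.py and the nms folder) should compile the extension and place nms/gpu_nms.*.so where the test script expects it:

python3 setup3.py build_ext --inplace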
Then modify the test script as before (switching the import to gpu_nms) and run it:
import numpy as np
import time
#from nms.nums_py2 import py_cpu_nms   # for cpu
from nms.gpu_nms import gpu_nms        # for gpu

np.random.seed(1)   # keep fixed
num_rois = 6000
minxy = np.random.randint(50, 145, size=(num_rois, 2))
maxxy = np.random.randint(150, 200, size=(num_rois, 2))
score = 0.8 * np.random.random_sample((num_rois, 1)) + 0.2

boxes_new = np.concatenate((minxy, maxxy, score), axis=1).astype(np.float32)


def nms_test_time(boxes_new):
    thresh = [0.7, 0.8, 0.9]
    T = 50
    for i in range(len(thresh)):
        since = time.time()
        for t in range(T):
            # keep = py_cpu_nms(boxes_new, thresh=thresh[i])   # for cpu
            keep = gpu_nms(boxes_new, thresh=thresh[i])        # for gpu
        print("thresh={:.1f}, time wastes:{:.4f}".format(thresh[i], (time.time() - since) / T))
    return keep


if __name__ == "__main__":
    nms_test_time(boxes_new)
Result:
thresh=0.7, time wastes:0.0120
thresh=0.8, time wastes:0.0063
thresh=0.9, time wastes:0.0071
This is actually a bit slower than method 3, presumably because the workload here is small and most of the time goes into invoking the GPU (host-device transfers and kernel launch overhead). Inside Faster R-CNN, method 4 would surely be the fastest, after all it is rbg's implementation, but for now method 3 is good enough (I don't know CUDA).
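To check the launch-overhead explanation, one could rerun the benchmark while growing the number of boxes; a rough sketch reusing the functions from this post (timings not measured here, results will depend on the hardware):

import time
import numpy as np
from nms.nums_py2 import py_cpu_nms   # method 3 (Cython, static types)
from nms.gpu_nms import gpu_nms       # method 4 (CUDA)

def compare(num_rois, thresh=0.7, T=20):
    np.random.seed(1)
    minxy = np.random.randint(50, 145, size=(num_rois, 2))
    maxxy = np.random.randint(150, 200, size=(num_rois, 2))
    score = 0.8 * np.random.random_sample((num_rois, 1)) + 0.2
    dets = np.concatenate((minxy, maxxy, score), axis=1).astype(np.float32)
    for name, fn in [("cython-cpu", py_cpu_nms), ("cuda-gpu", gpu_nms)]:
        since = time.time()
        for _ in range(T):
            fn(dets, thresh)
        print("{}: n={}, {:.4f}s".format(name, num_rois, (time.time() - since) / T))

for n in (6000, 20000, 60000):
    compare(n)

As the box count grows, the fixed cost of the memory copies and kernel launch is amortized, and the GPU version should eventually pull ahead.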
The complete code is available on GitHub.
References:
Basic Cython usage: quickly generating C code with Cython
py-faster-rcnn: rbg's NMS implementation