[PaddlePaddle (飛槳) Developer Says] Wu Binghong, engineer at a leading Chinese internet company, computer vision enthusiast; research interests: object detection and medical imaging.
Overview
Download and installation commands:
## CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle
## GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu
EfficientDet, proposed by Google Brain at the end of 2019, is the undisputed new SOTA in object detection and was accepted to CVPR 2020. This article analyzes the EfficientDet algorithm in detail and describes how the model was reproduced with PaddleDetection, the official object detection development kit.
EfficientDet comes from a CVPR 2020 paper (https://arxiv.org/abs/1911.09070, official code: https://github.com/google/automl/tree/master/efficientdet). Its core idea is to start from EfficientNet, a backbone obtained through network architecture search, perform further multi-scale feature fusion with the newly designed BiFPN, and finally produce detection boxes through classification/regression branches, thereby extending an efficient classifier into an efficient detector. In overall structure, EfficientDet does not differ significantly from RetinaNet and other anchor-based one-stage detectors, but every individual module is pushed to maximize accuracy under limited compute and memory budgets.
Performance comparison between EfficientDet and other mainstream models:
EfficientDet network architecture:
As shown above from left to right, EfficientDet consists of three parts: the feature extraction module (Backbone), EfficientNet; the multi-scale feature fusion module (Neck), BiFPN; and the classification/regression prediction module (Head), the Class/Box prediction nets. The model definition code is as follows:
class EfficientDet(object):
    """
    EfficientDet architecture, see https://arxiv.org/abs/1911.09070

    Args:
        backbone (object): backbone instance
        fpn (object): feature pyramid network instance
        retina_head (object): `RetinaHead` instance
    """
    __category__ = 'architecture'
    __inject__ = ['backbone', 'fpn', 'efficient_head', 'anchor_grid']

    def __init__(self,
                 backbone,
                 fpn,
                 efficient_head,
                 anchor_grid,
                 box_loss_weight=50.):
        super(EfficientDet, self).__init__()
        self.backbone = backbone
        self.fpn = fpn
        self.efficient_head = efficient_head
        self.anchor_grid = anchor_grid
        self.box_loss_weight = box_loss_weight

    def build(self, feed_vars, mode='train'):
        im = feed_vars['image']
        if mode == 'train':
            gt_labels = feed_vars['gt_label']
            gt_targets = feed_vars['gt_target']
            fg_num = feed_vars['fg_num']
        else:
            im_info = feed_vars['im_info']

        mixed_precision_enabled = mixed_precision_global_state() is not None
        if mixed_precision_enabled:
            im = fluid.layers.cast(im, 'float16')
        body_feats = self.backbone(im)
        if mixed_precision_enabled:
            body_feats = [fluid.layers.cast(f, 'float32') for f in body_feats]
        body_feats = self.fpn(body_feats)
        anchors = self.anchor_grid()

        if mode == 'train':
            loss = self.efficient_head.get_loss(body_feats, gt_labels,
                                                gt_targets, fg_num)
            loss_cls = loss['loss_cls']
            loss_bbox = loss['loss_bbox']
            total_loss = loss_cls + self.box_loss_weight * loss_bbox
            loss.update({'loss': total_loss})
            return loss
        else:
            pred = self.efficient_head.get_prediction(body_feats, anchors,
                                                      im_info)
            return pred
For the overall model, EfficientDet offers a range of configurations from lightweight to heavy to trade off speed against accuracy. Taking EfficientDet-D0 as an example, the model and training configuration (in YML format) is:
architecture: EfficientDet
…
pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_pretrained.tar
weights: output/efficientdet_d0/model_final
…

EfficientDet:
  backbone: EfficientNet
  fpn: BiFPN
  efficient_head: EfficientHead
  anchor_grid: AnchorGrid
  box_loss_weight: 50.

EfficientNet:
  norm_type: sync_bn
  scale: b0
  use_se: true

BiFPN:
  num_chan: 64
  repeat: 3
  levels: 5

EfficientHead:
  repeat: 3
  num_chan: 64
  prior_prob: 0.01
  num_anchors: 9
  gamma: 1.5
  alpha: 0.25
  delta: 0.1
  output_decoder:
    score_thresh: 0.05   # originally 0.
    nms_thresh: 0.5
    pre_nms_top_n: 1000  # originally 5000
    detections_per_im: 100
    nms_eta: 1.0

AnchorGrid:
  anchor_base_scale: 4
  num_scales: 3
  aspect_ratios: [[1, 1], [1.4, 0.7], [0.7, 1.4]]
…
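To make the mapping between these YML keys and the Python modules more concrete, here is a minimal wiring sketch (my own, not part of PaddleDetection) that instantiates the two modules walked through later in this article, EfficientNet and BiFPN, with the D0 settings above; EfficientHead and AnchorGrid are omitted for brevity, and sync_bn is replaced with plain bn for a single-card run:

import paddle.fluid as fluid

# Assumes the EfficientNet and BiFPN classes shown later in this article are
# importable; this only sketches how the YML sections map to constructor
# arguments, it is not the PaddleDetection training entry point.
image = fluid.data(name='image', shape=[None, 3, 512, 512], dtype='float32')

backbone = EfficientNet(scale='b0', use_se=True, norm_type='bn')  # EfficientNet section
fpn = BiFPN(num_chan=64, repeat=3, levels=5)                      # BiFPN section

body_feats = backbone(image)  # three feature maps at strides 8/16/32
fpn_feats = fpn(body_feats)   # five fused pyramid levels
# fpn_feats would then be consumed by EfficientHead / AnchorGrid (not shown here).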
Backbone: EfficientNet

EfficientNet is a classification network published at ICML 2019 by the same first author, Mingxing Tan. Its aim is to find, under a limited compute budget, how to compose a network so that it reaches the highest possible classification accuracy. The design considers three dimensions that drive both accuracy and resource consumption: network depth, network width, and input image resolution. In the architecture-search setting, the authors formulate the optimization objective as:

\max_{d,w,r} \; \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big)
\text{s.t.}\quad \mathcal{N}(d, w, r) = \bigodot_{i=1 \dots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\big(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\big),
\quad \mathrm{Memory}(\mathcal{N}) \le \text{target\_memory},
\quad \mathrm{FLOPS}(\mathcal{N}) \le \text{target\_flops}
During the search, network depth (d), network width (w) and input resolution (r) are the free variables. To model how the three are coupled under a fixed compute budget, the authors propose the following constraint:

d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}
\text{s.t.}\quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1
Unlike other network-design work, EfficientNet scales the searched architecture with a compound scaling method, which proceeds in two steps:

1. With the compute budget fixed, run a small grid search to obtain the base depth/width/resolution ratios;

2. Scale depth, width and resolution simultaneously with the compound coefficient φ, which yields the family EfficientNet-B0 through B7 (a minimal sketch of this scaling follows below).
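As a rough illustration (my own sketch, not code from the paper), applying the compound scaling rule with the base coefficients reported in the EfficientNet paper, α = 1.2, β = 1.1, γ = 1.15, shows how the three multipliers grow with φ; note that the released B0-B7 models use hand-rounded coefficients, which is why the implementation below simply stores a (width, depth) pair per scale:

# Sketch of compound scaling; alpha/beta/gamma are the grid-searched base
# coefficients reported in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi        # layer-count multiplier
    width = BETA ** phi         # channel-count multiplier
    resolution = GAMMA ** phi   # input-resolution multiplier
    return depth, width, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    # total FLOPS grow roughly as (alpha * beta**2 * gamma**2) ** phi ~ 2 ** phi
    print('phi=%d  depth x%.2f  width x%.2f  resolution x%.2f' % (phi, d, w, r))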
The EfficientNet implementation in PaddleDetection is as follows:
from __future__ import absolute_import
from __future__ import division

import collections
import math
import re

from paddle import fluid
from paddle.fluid.regularizer import L2Decay

from ppdet.core.workspace import register

__all__ = ['EfficientNet']

GlobalParams = collections.namedtuple('GlobalParams', [
    'batch_norm_momentum', 'batch_norm_epsilon', 'width_coefficient',
    'depth_coefficient', 'depth_divisor'
])

BlockArgs = collections.namedtuple('BlockArgs', [
    'kernel_size', 'num_repeat', 'input_filters', 'output_filters',
    'expand_ratio', 'stride', 'se_ratio'
])

GlobalParams.__new__.__defaults__ = (None, ) * len(GlobalParams._fields)
BlockArgs.__new__.__defaults__ = (None, ) * len(BlockArgs._fields)


def _decode_block_string(block_string):
    assert isinstance(block_string, str)
    ops = block_string.split('_')
    options = {}
    for op in ops:
        splits = re.split(r'(\d.*)', op)
        if len(splits) >= 2:
            key, value = splits[:2]
            options[key] = value

    assert (('s' in options and len(options['s']) == 1) or
            (len(options['s']) == 2 and options['s'][0] == options['s'][1]))

    return BlockArgs(
        kernel_size=int(options['k']),
        num_repeat=int(options['r']),
        input_filters=int(options['i']),
        output_filters=int(options['o']),
        expand_ratio=int(options['e']),
        se_ratio=float(options['se']) if 'se' in options else None,
        stride=int(options['s'][0]))


def get_model_params(scale):
    block_strings = [
        'r1_k3_s11_e1_i32_o16_se0.25',
        'r2_k3_s22_e6_i16_o24_se0.25',
        'r2_k5_s22_e6_i24_o40_se0.25',
        'r3_k3_s22_e6_i40_o80_se0.25',
        'r3_k5_s11_e6_i80_o112_se0.25',
        'r4_k5_s22_e6_i112_o192_se0.25',
        'r1_k3_s11_e6_i192_o320_se0.25',
    ]
    block_args = []
    for block_string in block_strings:
        block_args.append(_decode_block_string(block_string))

    params_dict = {
        # width, depth
        'b0': (1.0, 1.0),
        'b1': (1.0, 1.1),
        'b2': (1.1, 1.2),
        'b3': (1.2, 1.4),
        'b4': (1.4, 1.8),
        'b5': (1.6, 2.2),
        'b6': (1.8, 2.6),
        'b7': (2.0, 3.1),
    }

    w, d = params_dict[scale]

    global_params = GlobalParams(
        batch_norm_momentum=0.99,
        batch_norm_epsilon=1e-3,
        width_coefficient=w,
        depth_coefficient=d,
        depth_divisor=8)

    return block_args, global_params


def round_filters(filters, global_params):
    multiplier = global_params.width_coefficient
    if not multiplier:
        return filters
    divisor = global_params.depth_divisor
    filters *= multiplier
    min_depth = divisor
    new_filters = max(min_depth, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # prevent rounding by more than 10%
        new_filters += divisor
    return int(new_filters)


def round_repeats(repeats, global_params):
    multiplier = global_params.depth_coefficient
    if not multiplier:
        return repeats
    return int(math.ceil(multiplier * repeats))


def conv2d(inputs,
           num_filters,
           filter_size,
           stride=1,
           padding='SAME',
           groups=1,
           use_bias=False,
           name='conv2d'):
    param_attr = fluid.ParamAttr(name=name + '_weights')
    bias_attr = False
    if use_bias:
        bias_attr = fluid.ParamAttr(
            name=name + '_offset', regularizer=L2Decay(0.))
    feats = fluid.layers.conv2d(
        inputs,
        num_filters,
        filter_size,
        groups=groups,
        name=name,
        stride=stride,
        padding=padding,
        param_attr=param_attr,
        bias_attr=bias_attr)
    return feats


def batch_norm(inputs, momentum, eps, name=None):
    param_attr = fluid.ParamAttr(name=name + '_scale', regularizer=L2Decay(0.))
    bias_attr = fluid.ParamAttr(name=name + '_offset', regularizer=L2Decay(0.))
    return fluid.layers.batch_norm(
        input=inputs,
        momentum=momentum,
        epsilon=eps,
        name=name,
        moving_mean_name=name + '_mean',
        moving_variance_name=name + '_variance',
        param_attr=param_attr,
        bias_attr=bias_attr)


def mb_conv_block(inputs,
                  input_filters,
                  output_filters,
                  expand_ratio,
                  kernel_size,
                  stride,
                  momentum,
                  eps,
                  se_ratio=None,
                  name=None):
    feats = inputs
    num_filters = input_filters * expand_ratio

    # expansion phase (1x1 conv), skipped when expand_ratio == 1
    if expand_ratio != 1:
        feats = conv2d(feats, num_filters, 1, name=name + '_expand_conv')
        feats = batch_norm(feats, momentum, eps, name=name + '_bn0')
        feats = fluid.layers.swish(feats)

    # depthwise convolution
    feats = conv2d(
        feats,
        num_filters,
        kernel_size,
        stride,
        groups=num_filters,
        name=name + '_depthwise_conv')
    feats = batch_norm(feats, momentum, eps, name=name + '_bn1')
    feats = fluid.layers.swish(feats)

    # squeeze-and-excitation
    if se_ratio is not None:
        filter_squeezed = max(1, int(input_filters * se_ratio))
        squeezed = fluid.layers.pool2d(
            feats, pool_type='avg', global_pooling=True)
        squeezed = conv2d(
            squeezed,
            filter_squeezed,
            1,
            use_bias=True,
            name=name + '_se_reduce')
        squeezed = fluid.layers.swish(squeezed)
        squeezed = conv2d(
            squeezed, num_filters, 1, use_bias=True, name=name + '_se_expand')
        feats = feats * fluid.layers.sigmoid(squeezed)

    # projection phase (1x1 conv)
    feats = conv2d(feats, output_filters, 1, name=name + '_project_conv')
    feats = batch_norm(feats, momentum, eps, name=name + '_bn2')

    # skip connection when spatial size and channel count are preserved
    if stride == 1 and input_filters == output_filters:
        feats = fluid.layers.elementwise_add(feats, inputs)

    return feats


@register
class EfficientNet(object):
    """
    EfficientNet, see https://arxiv.org/abs/1905.11946

    Args:
        scale (str): compounding scale factor, 'b0' - 'b7'.
        use_se (bool): use squeeze and excite module.
        norm_type (str): normalization type, 'bn' and 'sync_bn' are supported
    """
    __shared__ = ['norm_type']

    def __init__(self, scale='b0', use_se=True, norm_type='bn'):
        assert scale in ['b' + str(i) for i in range(8)], \
            "valid scales are b0 - b7"
        assert norm_type in ['bn', 'sync_bn'], \
            "only 'bn' and 'sync_bn' are supported"

        super(EfficientNet, self).__init__()
        self.norm_type = norm_type
        self.scale = scale
        self.use_se = use_se

    def __call__(self, inputs):
        blocks_args, global_params = get_model_params(self.scale)
        momentum = global_params.batch_norm_momentum
        eps = global_params.batch_norm_epsilon

        # stem: 3x3 conv with stride 2
        num_filters = round_filters(32, global_params)
        feats = conv2d(
            inputs,
            num_filters=num_filters,
            filter_size=3,
            stride=2,
            name='_conv_stem')
        feats = batch_norm(feats, momentum=momentum, eps=eps, name='_bn0')
        feats = fluid.layers.swish(feats)

        layer_count = 0
        feature_maps = []

        for b, block_arg in enumerate(blocks_args):
            for r in range(block_arg.num_repeat):
                input_filters = round_filters(block_arg.input_filters,
                                              global_params)
                output_filters = round_filters(block_arg.output_filters,
                                               global_params)
                kernel_size = block_arg.kernel_size
                stride = block_arg.stride
                se_ratio = None
                if self.use_se:
                    se_ratio = block_arg.se_ratio

                if r > 0:
                    input_filters = output_filters
                    stride = 1

                feats = mb_conv_block(
                    feats,
                    input_filters,
                    output_filters,
                    block_arg.expand_ratio,
                    kernel_size,
                    stride,
                    momentum,
                    eps,
                    se_ratio=se_ratio,
                    name='_blocks.{}.'.format(layer_count))

                layer_count += 1
            feature_maps.append(feats)

        return list(feature_maps[i] for i in [2, 4, 6])
The scale argument of EfficientNet corresponds to the compound coefficient in the paper and can be set to b0 - b7. During training and inference, the feature maps of three block groups (indices 2, 4 and 6, i.e. strides 8, 16 and 32) are returned and fed into BiFPN for further multi-scale feature fusion.
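As a quick illustration of what the scale parameter changes, the snippet below (illustrative only, reusing the helper functions from the listing above) compares two scales:

# Compare how the scaling helpers defined above behave for two scales.
for scale in ('b0', 'b4'):
    block_args, global_params = get_model_params(scale)
    stem_channels = round_filters(32, global_params)
    first_block_repeats = round_repeats(block_args[0].num_repeat, global_params)
    print(scale,
          'width_coef=%.1f' % global_params.width_coefficient,
          'depth_coef=%.1f' % global_params.depth_coefficient,
          'stem_channels=%d' % stem_channels,
          'first_block_repeats=%d' % first_block_repeats)
# b0 keeps the 32-channel stem, while b4 (width coefficient 1.4) rounds it up to 48.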
Neck: BiFPN
As the main novelty of EfficientDet, BiFPN fuses features across scales by stacking multiple BiFPN layers. The figure below compares a BiFPN layer with several FPN variants. Compared with the basic FPN, BiFPN adds a second, bottom-up fusion path on top of the top-down connections. Compared with PANet, which also has a bottom-up pass, BiFPN additionally has same-scale skip connections across layers (purple arrows), and every fusion node in a BiFPN layer carries its own set of attention weights: to compute a node's feature map, the feature maps at the tails of the arrows pointing into that node are first resized to the node's scale, then multiplied by normalized weights and summed.
Taking node P6 as an example, the features are fused as follows, where the first equation gives the intermediate (middle-column) node P6td and the second gives the output (last-column) node P6out:

P_6^{td} = \mathrm{Conv}\!\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}(P_7^{in})}{w_1 + w_2 + \epsilon}\right)

P_6^{out} = \mathrm{Conv}\!\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}(P_5^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)
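For intuition, here is a tiny standalone sketch (NumPy, not the PaddleDetection code) of this fast normalized fusion for the two-input case:

import numpy as np

# Toy sketch of fast normalized fusion for two inputs.
eps = 1e-4
w = np.maximum(np.array([1.0, 1.0]), 0.0)  # learnable gates, clipped to be non-negative
w = w / (w.sum() + eps)                    # normalize so the weights sum to ~1

p6_in = np.random.rand(1, 64, 16, 16).astype('float32')
p7_up = np.random.rand(1, 64, 16, 16).astype('float32')  # P7 after 2x nearest-neighbor upsampling

p6_td = w[0] * p6_in + w[1] * p7_up
# In BiFPN this weighted sum then goes through a depthwise-separable conv +
# BN + swish (the FusionConv class in the listing below).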
The concrete implementation of a BiFPN layer is shown below (the BiFPNCell class):
from __future__ import absolute_import
from __future__ import division

from paddle import fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.regularizer import L2Decay
from paddle.fluid.initializer import Constant, Xavier

from ppdet.core.workspace import register

__all__ = ['BiFPN']


class FusionConv(object):
    def __init__(self, num_chan):
        super(FusionConv, self).__init__()
        self.num_chan = num_chan

    def __call__(self, inputs, name=''):
        x = fluid.layers.swish(inputs)
        # depthwise
        x = fluid.layers.conv2d(
            x,
            self.num_chan,
            filter_size=3,
            padding='SAME',
            groups=self.num_chan,
            param_attr=ParamAttr(
                initializer=Xavier(), name=name + '_dw_w'),
            bias_attr=False)
        # pointwise
        x = fluid.layers.conv2d(
            x,
            self.num_chan,
            filter_size=1,
            param_attr=ParamAttr(
                initializer=Xavier(), name=name + '_pw_w'),
            bias_attr=ParamAttr(
                regularizer=L2Decay(0.), name=name + '_pw_b'))
        # bn + act
        x = fluid.layers.batch_norm(
            x,
            momentum=0.997,
            epsilon=1e-04,
            param_attr=ParamAttr(
                initializer=Constant(1.0),
                regularizer=L2Decay(0.),
                name=name + '_bn_w'),
            bias_attr=ParamAttr(
                regularizer=L2Decay(0.), name=name + '_bn_b'))
        return x


class BiFPNCell(object):
    def __init__(self, num_chan, levels=5):
        super(BiFPNCell, self).__init__()
        self.levels = levels
        self.num_chan = num_chan
        num_trigates = levels - 2
        num_bigates = levels
        self.trigates = fluid.layers.create_parameter(
            shape=[num_trigates, 3],
            dtype='float32',
            default_initializer=fluid.initializer.Constant(1.))
        self.bigates = fluid.layers.create_parameter(
            shape=[num_bigates, 2],
            dtype='float32',
            default_initializer=fluid.initializer.Constant(1.))
        self.eps = 1e-4

    def __call__(self, inputs, cell_name=''):
        assert len(inputs) == self.levels

        def upsample(feat):
            return fluid.layers.resize_nearest(feat, scale=2.)

        def downsample(feat):
            return fluid.layers.pool2d(
                feat,
                pool_type='max',
                pool_size=3,
                pool_stride=2,
                pool_padding='SAME')

        fuse_conv = FusionConv(self.num_chan)

        # normalize weight
        trigates = fluid.layers.relu(self.trigates)
        bigates = fluid.layers.relu(self.bigates)
        trigates /= fluid.layers.reduce_sum(
            trigates, dim=1, keep_dim=True) + self.eps
        bigates /= fluid.layers.reduce_sum(
            bigates, dim=1, keep_dim=True) + self.eps

        feature_maps = list(inputs)  # make a copy
        # top down path
        for l in range(self.levels - 1):
            p = self.levels - l - 2
            w1 = fluid.layers.slice(
                bigates, axes=[0, 1], starts=[l, 0], ends=[l + 1, 1])
            w2 = fluid.layers.slice(
                bigates, axes=[0, 1], starts=[l, 1], ends=[l + 1, 2])
            above = upsample(feature_maps[p + 1])
            feature_maps[p] = fuse_conv(
                w1 * above + w2 * inputs[p],
                name='{}_tb_{}'.format(cell_name, l))
        # bottom up path
        for l in range(1, self.levels):
            p = l
            name = '{}_bt_{}'.format(cell_name, l)
            below = downsample(feature_maps[p - 1])
            if p == self.levels - 1:
                # handle P7
                w1 = fluid.layers.slice(
                    bigates, axes=[0, 1], starts=[p, 0], ends=[p + 1, 1])
                w2 = fluid.layers.slice(
                    bigates, axes=[0, 1], starts=[p, 1], ends=[p + 1, 2])
                feature_maps[p] = fuse_conv(
                    w1 * below + w2 * inputs[p], name=name)
            else:
                w1 = fluid.layers.slice(
                    trigates, axes=[0, 1], starts=[p - 1, 0], ends=[p, 1])
                w2 = fluid.layers.slice(
                    trigates, axes=[0, 1], starts=[p - 1, 1], ends=[p, 2])
                w3 = fluid.layers.slice(
                    trigates, axes=[0, 1], starts=[p - 1, 2], ends=[p, 3])
                feature_maps[p] = fuse_conv(
                    w1 * feature_maps[p] + w2 * below + w3 * inputs[p],
                    name=name)
        return feature_maps


@register
class BiFPN(object):
    """
    Bidirectional Feature Pyramid Network, see https://arxiv.org/abs/1911.09070

    Args:
        num_chan (int): number of feature channels
        repeat (int): number of repeats of the BiFPN module
        level (int): number of FPN levels, default: 5
    """

    def __init__(self, num_chan, repeat=3, levels=5):
        super(BiFPN, self).__init__()
        self.num_chan = num_chan
        self.repeat = repeat
        self.levels = levels

    def __call__(self, inputs):
        feats = []
        # NOTE add two extra levels
        for idx in range(self.levels):
            if idx <= len(inputs):
                if idx == len(inputs):
                    feat = inputs[-1]
                else:
                    feat = inputs[idx]

                if feat.shape[1] != self.num_chan:
                    feat = fluid.layers.conv2d(
                        feat,
                        self.num_chan,
                        filter_size=1,
                        padding='SAME',
                        param_attr=ParamAttr(initializer=Xavier()),
                        bias_attr=ParamAttr(regularizer=L2Decay(0.)))
                    feat = fluid.layers.batch_norm(
                        feat,
                        momentum=0.997,
                        epsilon=1e-04,
                        param_attr=ParamAttr(
                            initializer=Constant(1.0),
                            regularizer=L2Decay(0.)),
                        bias_attr=ParamAttr(regularizer=L2Decay(0.)))

            if idx >= len(inputs):
                feat = fluid.layers.pool2d(
                    feat,
                    pool_type='max',
                    pool_size=3,
                    pool_stride=2,
                    pool_padding='SAME')

            feats.append(feat)

        biFPN = BiFPNCell(self.num_chan, self.levels)
        for r in range(self.repeat):
            feats = biFPN(feats, 'bifpn_{}'.format(r))
        return feats
On top of this BiFPN layer, the full BiFPN is built by stacking several such layers. As the backbone switches to larger EfficientNet variants, BiFPN follows the same design idea as EfficientNet and uses a similar compound coefficient φ to control its own width and depth:

W_{bifpn} = 64 \cdot \left(1.35^{\phi}\right), \qquad D_{bifpn} = 3 + \phi
With this compound-coefficient control, the backbone and BiFPN grow in lockstep, as listed in the scaling-config table of the paper; for D0 (φ = 0) this gives a BiFPN with 64 channels ("#channels") and 3 layers ("#layers").
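A short sanity check of the two formulas (my own sketch; the officially released configs round W_bifpn to hand-picked channel counts, so treat the printed widths as approximate):

# Evaluate the BiFPN compound scaling rules from the paper.
def bifpn_scaling(phi):
    w_bifpn = 64 * (1.35 ** phi)  # BiFPN width  (#channels), before rounding
    d_bifpn = 3 + phi             # BiFPN depth  (#layers)
    return w_bifpn, d_bifpn

for phi in range(4):
    w, d = bifpn_scaling(phi)
    print('phi=%d  ~%d channels, %d layers' % (phi, round(w), d))
# phi=0 gives 64 channels and 3 layers, matching num_chan=64 / repeat=3 in
# the efficientdet_d0.yml shown earlier.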
In the PaddleDetection implementation, BiFPN is implemented as in the code above, and the corresponding knobs live in the BiFPN section of the efficientdet_d0.yml config shown earlier: the repeat parameter maps to the paper's BiFPN "#layers", and num_chan maps to "#channels". Overall, as the EfficientDet backbone becomes deeper and more complex, more BiFPN layers are stacked and the channel count grows with them.
Head: Class prediction net & Box prediction net
As an anchor-based detector, EfficientDet's head is largely the same as in other SOTA detectors: it classifies and regresses detection boxes on each of the five feature maps produced by BiFPN. Both the class prediction net and the box prediction net are built from depthwise separable convolution layers, and the number of stacked layers again depends on the backbone size. What differs from other anchor-based detectors is that EfficientDet shares the convolution parameters of the prediction nets across pyramid levels to reduce the parameter count: most convolution layers of the classification and regression branches use the same kernels regardless of level, while the batch-norm layers remain independent per level. As the code below shows, the conv layer names do not depend on the level argument, whereas the BN names do:
def subnet(inputs, prefix, level):
    feat = inputs
    for i in range(self.repeat):
        # NOTE share weight across FPN levels
        conv_name = '{}_pred_conv_{}'.format(prefix, i)
        feat = separable_conv(feat, self.num_chan, name=conv_name)
        # NOTE batch norm params are not shared
        bn_name = '{}_pred_bn_{}_{}'.format(prefix, level, i)
        feat = fluid.layers.batch_norm(
            input=feat,
            act='swish',
            momentum=0.997,
            epsilon=1e-4,
            moving_mean_name=bn_name + '_mean',
            moving_variance_name=bn_name + '_variance',
            param_attr=ParamAttr(
                name=bn_name + '_w',
                initializer=Constant(value=1.),
                regularizer=L2Decay(0.)),
            bias_attr=ParamAttr(
                name=bn_name + '_b', regularizer=L2Decay(0.)))
    return feat
Training
As noted in the EfficientDet paper, D0-D7 were trained with a batch size of 128 on 32 TPUv3 cores for 300 epochs (600 epochs for D7/D7x). Training with our EfficientDet-D0 configuration reproduced in PaddleDetection, the model converges at epoch 216. Evaluated on COCO minival it gives the results below, within 0.02 mAP of the number reported in the paper:
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.341
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.523
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.360
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.134
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.401
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.525
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.289
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.445
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.196
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.559
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.690
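For reference, a summary table in exactly this format can be produced with pycocotools from COCO-format ground-truth annotations and a detection-result JSON (the file paths below are placeholders):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: the minival annotation file and the bbox results dumped
# by the evaluation script.
coco_gt = COCO('annotations/instances_minival2014.json')
coco_dt = coco_gt.loadRes('bbox_detections.json')

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints the AP/AR table in the format shown above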
The reproduced EfficientDet-D0 therefore matches the results reported in the paper. The full reproduction code and models will be merged into the official PaddleDetection repository soon, and COCO-pretrained models for the larger configurations will be added over time. Feedback and usage are very welcome:
https://github.com/PaddlePaddle/PaddleDetection