Deep Learning Notes (7): SSD Paper Reading Notes

Date: 2019-11-12


Source code: https://github.com/weiliu89/caffe/tree/ssd

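I. Algorithm overview

The SSD algorithm proposed in the paper is a multi-object detection algorithm that directly predicts object classes and bounding boxes. Unlike Faster R-CNN, it has no proposal generation stage, which greatly increases detection speed. To detect objects of different sizes, the traditional approach is to rescale the image to several sizes (an image pyramid), run detection on each size, and then merge the results (NMS). SSD achieves the same effect by combining feature maps from different convolutional layers. One of the core ideas of the paper is to use both lower and upper feature maps for detection.

Fig.1 SSD framework

The backbone is VGG16. Its last two fully connected layers are converted to convolutional layers, and four extra groups of convolutional layers (conv6 to conv9) are appended. The outputs (feature maps) of 6 layers of different sizes are each convolved with two separate sets of 3×3 kernels: one outputs the classification confidences, 21 class confidences per default box; the other outputs the localization regression, 4 coordinate values (x, y, w, h) per default box. Each of these 6 feature maps also passes through a PriorBox layer, which generates the prior boxes (as coordinates). The number of default boxes on each of the 6 feature maps is fixed, 8732 in total. Finally the three kinds of results are each concatenated and passed to the loss layer.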

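II. Default box

One of the core ideas of the paper is that the authors use both lower and upper feature maps for detection. As shown in Fig.2, suppose there are two feature maps of sizes 8×8 and 4×4. The first concept is the feature map cell: each small grid square of a feature map, so the two maps in the figure have 64 and 16 cells. The second concept is the default box: a set of fixed-size boxes attached to every cell of a feature map. The figure shows 4 of them (the dashed boxes; look closely and there is one box even smaller than the cell in the middle of the cell).

Suppose each feature map cell has k default boxes, and for each default box we predict c class scores and 4 offsets. For a feature map of size m×n, i.e. m*n cells, the feature map then produces (c+4)*k*m*n outputs in total. Put differently, the 3×3 convolution applied to this feature map uses (c+4)*k kernels, each yielding one m×n map. The outputs split into two parts: c*k*m*n confidence outputs, giving the confidence of each default box for each class, and 4*k*m*n localization outputs, giving the regressed coordinates of each default box.

Training involves one more term: the prior box, which is a default box actually chosen in practice (think of the default box as an abstract class and the prior box as a concrete instance). For each training image, the prior boxes are first matched against the ground truth boxes. A successful match means the prior box contains an object but is still some distance away from the full ground truth box. The goal of training is to keep the classification confidence of the prior box correct while regressing it as close to the ground truth box as possible. For example, suppose a training sample has 2 ground truth boxes and all feature maps together yield 8732 prior boxes. Then perhaps 10 and 20 prior boxes will match these 2 ground truth boxes respectively. The training loss consists of a classification loss and a regression loss.

Fig.2 default boxes

The authors' experiments show that the more default box shapes are used, the better the results (at the cost of more computation time).

The default boxes used here are much like the anchors in Faster R-CNN. In Faster R-CNN the anchors are used only on the last convolutional layer, whereas here default boxes are applied to feature maps from several different layers.

How are the scale and aspect ratio of the default boxes chosen? Suppose m feature maps are used for prediction. The default box scale of the k-th feature map is then computed as: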

$s_k = s_{min} + \frac{s_{max} - s_{min}}{m-1}(k-1), \quad k \in [1, m]$

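Here $s_{min}$ is 0.2, meaning the scale of the lowest layer is 0.2, and $s_{max}$ is 0.9, meaning the scale of the highest layer is 0.9.

The aspect ratio is denoted $a_r$ and takes 5 values in total:

$a_r = \{1, 2, 3, 1/2, 1/3\}$

The width of each default box is therefore:

$w_k^a=s_k\sqrt{a_r}$

and the height is (clearly the product of width and height is the square of the scale):

$h_k^a=s_k/\sqrt{a_r}$

In addition, for aspect ratio 1 the authors add one more default box with scale:

$s_k^{'}=\sqrt{s_{k}s_{k+1}}$

So each feature map cell can have 6 default boxes in total.

These default boxes have different scales on different feature layers and different aspect ratios within the same feature layer, so they can cover objects of most shapes and sizes in the input image.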
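The formulas above can be sketched in a few lines of Python. This is only an illustration of the paper's formulas, not the Caffe implementation; the function names and the choice m = 6 are assumptions made for the example.

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k for k = 1..m, linearly spaced between s_min and s_max."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

def default_box_shapes(s_k, s_k_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) pairs, relative to the image size, for one cell."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra box for aspect ratio 1 with scale sqrt(s_k * s_{k+1}).
    s_extra = math.sqrt(s_k * s_k_next)
    shapes.append((s_extra, s_extra))
    return shapes

scales = default_box_scales(6)          # m = 6 is an assumption for this example
shapes = default_box_shapes(scales[0], scales[1])
print([round(s, 2) for s in scales])    # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(len(shapes))                      # 6
```

For the five boxes built from $a_r$, width times height always equals $s_k^2$, which is an easy sanity check on the formulas.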

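(When training on your own data, you can check after FindMatches() whether every ground truth box is covered. In practice they all are, because at least one best match is found for each.)

In the code, ssd_pascal.py sets this up as follows. It differs slightly from the formula in the paper: conv4_3 is given its own scale of 10% of the input size, and the remaining five layers are spaced between 20% and 90%.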

mbox_source_layers = ['conv4_3', 'fc7', 'conv6_2', 'conv7_2', 'conv8_2', 'conv9_2']
# in percent %
min_ratio = 20
max_ratio = 90
step = int(math.floor((max_ratio - min_ratio) / (len(mbox_source_layers) - 2)))
min_sizes = []
max_sizes = []
for ratio in xrange(min_ratio, max_ratio + 1, step):
  min_sizes.append(min_dim * ratio / 100.)
  max_sizes.append(min_dim * (ratio + step) / 100.)
min_sizes = [min_dim * 10 / 100.] + min_sizes
max_sizes = [min_dim * 20 / 100.] + max_sizes
steps = [8, 16, 32, 64, 100, 300]
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
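The excerpt above is Python 2 and relies on names defined elsewhere in the script. As a rough check, the same arithmetic can be run on its own in Python 3. The wrapper function `prior_box_sizes` and the value `min_dim = 300` are assumptions for this sketch (300 matches the 300×300 input used in these notes).

```python
import math

def prior_box_sizes(min_dim=300, num_layers=6, min_ratio=20, max_ratio=90):
    """Reproduce the min_sizes / max_sizes arithmetic from the excerpt above."""
    step = int(math.floor((max_ratio - min_ratio) / (num_layers - 2)))
    min_sizes, max_sizes = [], []
    for ratio in range(min_ratio, max_ratio + 1, step):
        min_sizes.append(min_dim * ratio / 100.0)
        max_sizes.append(min_dim * (ratio + step) / 100.0)
    # The first source layer (conv4_3) gets its own, smaller scale: 10% to 20%.
    min_sizes = [min_dim * 10 / 100.0] + min_sizes
    max_sizes = [min_dim * 20 / 100.0] + max_sizes
    return min_sizes, max_sizes

min_sizes, max_sizes = prior_box_sizes()
print(min_sizes)  # [30.0, 60.0, 111.0, 162.0, 213.0, 264.0]
print(max_sizes)  # [60.0, 111.0, 162.0, 213.0, 264.0, 315.0]
```

Because the step is floored to 17, the largest min_size is 88% of the input (264) rather than the 90% the paper formula would give.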

In the Caffe source, prior_box_layer.cpp generates the prior boxes like this:

 for (int h = 0; h < layer_height; ++h) {
    for (int w = 0; w < layer_width; ++w) {
      float center_x = (w + offset_) * step_w;
      float center_y = (h + offset_) * step_h;
      float box_width, box_height;
      for (int s = 0; s < min_sizes_.size(); ++s) {
        int min_size_ = min_sizes_[s];
        // first prior: aspect_ratio = 1, size = min_size
        box_width = box_height = min_size_;
        // xmin
        top_data[idx++] = (center_x - box_width / 2.) / img_width;
        // ymin
        top_data[idx++] = (center_y - box_height / 2.) / img_height;
        // xmax
        top_data[idx++] = (center_x + box_width / 2.) / img_width;
        // ymax
        top_data[idx++] = (center_y + box_height / 2.) / img_height;

        if (max_sizes_.size() > 0) {
          CHECK_EQ(min_sizes_.size(), max_sizes_.size());
          int max_size_ = max_sizes_[s];
          // second prior: aspect_ratio = 1, size = sqrt(min_size * max_size)
          box_width = box_height = sqrt(min_size_ * max_size_);
          // xmin
          top_data[idx++] = (center_x - box_width / 2.) / img_width;
          // ymin
          top_data[idx++] = (center_y - box_height / 2.) / img_height;
          // xmax
          top_data[idx++] = (center_x + box_width / 2.) / img_width;
          // ymax
          top_data[idx++] = (center_y + box_height / 2.) / img_height;
        }

        // rest of priors
        for (int r = 0; r < aspect_ratios_.size(); ++r) {
          float ar = aspect_ratios_[r];
          if (fabs(ar - 1.) < 1e-6) {
            continue;
          }
          box_width = min_size_ * sqrt(ar);
          box_height = min_size_ / sqrt(ar);
          // xmin
          top_data[idx++] = (center_x - box_width / 2.) / img_width;
          // ymin
          top_data[idx++] = (center_y - box_height / 2.) / img_height;
          // xmax
          top_data[idx++] = (center_x + box_width / 2.) / img_width;
          // ymax
          top_data[idx++] = (center_y + box_height / 2.) / img_height;
        }
      }
    }
  }
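For readers who prefer Python, here is a simplified port of the loop above. It is a sketch under assumptions: it handles a single min_size/max_size pair per layer, and it builds the aspect ratio list as 1, then each ratio followed by its reciprocal when `flip` is set, which is how I understand the layer setup (that part is not shown in the excerpt). The function name `prior_boxes` is hypothetical.

```python
import math

def prior_boxes(layer_w, layer_h, img_w, img_h, min_size, max_size,
                aspect_ratios, step, offset=0.5, flip=True):
    """Return normalized (xmin, ymin, xmax, ymax) boxes for one feature map."""
    # Assumed expansion of the aspect ratio list: 1, then ar and 1/ar.
    ars = [1.0]
    for ar in aspect_ratios:
        ars.append(float(ar))
        if flip:
            ars.append(1.0 / ar)
    boxes = []
    for h in range(layer_h):
        for w in range(layer_w):
            cx = (w + offset) * step
            cy = (h + offset) * step
            # First prior: square box of side min_size.
            sizes = [(min_size, min_size)]
            # Second prior: square box of side sqrt(min_size * max_size).
            if max_size is not None:
                s = math.sqrt(min_size * max_size)
                sizes.append((s, s))
            # Remaining priors: one per aspect ratio other than 1.
            for ar in ars:
                if abs(ar - 1.0) < 1e-6:
                    continue
                sizes.append((min_size * math.sqrt(ar), min_size / math.sqrt(ar)))
            for bw, bh in sizes:
                boxes.append(((cx - bw / 2) / img_w, (cy - bh / 2) / img_h,
                              (cx + bw / 2) / img_w, (cy + bh / 2) / img_h))
    return boxes

# conv6_2-like settings: 10x10 map, min 111, max 162, ratios 2 and 3, step 32.
boxes = prior_boxes(10, 10, 300, 300, 111.0, 162.0, [2, 3], step=32)
print(len(boxes))  # 600
```

Note that boxes near the border can have negative or greater-than-one coordinates, since no clipping is applied here.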


When prior boxes are generated on a specific feature map, a subset of these 6 types is chosen. As the table and figure below show, this yields (38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4) = 8732 prior boxes in total.

feature map	feature map size	min_size ($s_k$)	max_size ($s_{k+1}$)	aspect_ratio	step	offset	variance
conv4_3	38×38	30	60	1, 2	8	0.50	0.1, 0.1, 0.2, 0.2
fc7	19×19	60	111	1, 2, 3	16	0.50	0.1, 0.1, 0.2, 0.2
conv6_2	10×10	111	162	1, 2, 3	32	0.50	0.1, 0.1, 0.2, 0.2
conv7_2	5×5	162	213	1, 2, 3	64	0.50	0.1, 0.1, 0.2, 0.2
conv8_2	3×3	213	264	1, 2	100	0.50	0.1, 0.1, 0.2, 0.2
conv9_2	1×1	264	315	1, 2	300	0.50	0.1, 0.1, 0.2, 0.2
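The total of 8732 can be re-derived from the feature map sizes and aspect ratios. This is a small sketch; it assumes each cell gets two square boxes (min_size and sqrt(min_size * max_size)) plus two boxes per extra aspect ratio (the ratio and its reciprocal).

```python
# feature map -> (side length, extra aspect ratios besides 1)
layers = {
    "conv4_3": (38, [2]),
    "fc7": (19, [2, 3]),
    "conv6_2": (10, [2, 3]),
    "conv7_2": (5, [2, 3]),
    "conv8_2": (3, [2]),
    "conv9_2": (1, [2]),
}

def boxes_per_cell(extra_ratios):
    # Two square boxes, plus each extra ratio and its reciprocal.
    return 2 + 2 * len(extra_ratios)

per_layer = {name: side * side * boxes_per_cell(ars)
             for name, (side, ars) in layers.items()}
total = sum(per_layer.values())
print(per_layer["conv4_3"])  # 5776
print(total)                 # 8732
```

About two thirds of all prior boxes come from conv4_3 alone, which is why that layer dominates the flattened prediction sizes later on.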

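III. Positive and negative samples

Prior boxes are matched to ground truth boxes by IoU (JaccardOverlap). A prior box that matches is a positive example; one that does not is a negative example. Negatives produced this way obviously far outnumber positives. Hard example mining is applied by default: all unmatched negative prior boxes are sorted by their forward loss, and the num_sel boxes with the highest loss are kept as the final negative set. num_sel thus controls the final positive-to-negative ratio, which is kept at about 1:3.

Fig.3 positive and negative samples vs. ground truth box

1. Obtaining positive samples

With the prior boxes laid out on the image and the ground truth available, the next step is to match prior boxes to ground truth boxes. This is done in FindMatches and its helper MatchBBox in src/caffe/util/bbox_util.cpp. Specifically: first, starting from the ground truth boxes, find the best-matching prior box for each ground truth box and add it to the candidate positive set. Then, for each remaining prior box, look for its best match among the ground truth boxes with $IOU>0.5$, and if one is found, add it to the candidate positive set. Doing it this way guarantees that every ground truth box has at least one matched positive sample.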

void FindMatches(const vector<LabelBBox>& all_loc_preds,
      const map<int, vector<NormalizedBBox> >& all_gt_bboxes,        // all ground truth boxes
      const vector<NormalizedBBox>& prior_bboxes,                                // all default boxes, 8732 of them
      const vector<vector<float> >& prior_variances,
      const MultiBoxLossParameter& multibox_loss_param,
      vector<map<int, vector<float> > >* all_match_overlaps,        // Jaccard overlaps of the matched default boxes
      vector<map<int, vector<int> > >* all_match_indices) {            // match indices of the default boxes

  const int num_classes = multibox_loss_param.num_classes(); // number of classes = 21
  const bool share_location = multibox_loss_param.share_location(); // share location predictions across classes: true
  const int loc_classes = share_location ? 1 : num_classes; // 1
  const MatchType match_type = multibox_loss_param.match_type(); // MultiBoxLossParameter_MatchType_PER_PREDICTION
  const float overlap_threshold = multibox_loss_param.overlap_threshold(); // jaccard overlap = 0.5
  const bool use_prior_for_matching =multibox_loss_param.use_prior_for_matching(); // true
  const int background_label_id = multibox_loss_param.background_label_id(); 
  const CodeType code_type = multibox_loss_param.code_type();
  const bool encode_variance_in_target =
      multibox_loss_param.encode_variance_in_target();
  const bool ignore_cross_boundary_bbox =
      multibox_loss_param.ignore_cross_boundary_bbox();
  // Find the matches.
  int num = all_loc_preds.size();
  for (int i = 0; i < num; ++i) {
    map<int, vector<int> > match_indices; // match indices of the default boxes
    map<int, vector<float> > match_overlaps; // Jaccard overlaps of the matched default boxes
    // Check if there is ground truth for current image.
    if (all_gt_bboxes.find(i) == all_gt_bboxes.end()) {
      // There is no gt for current image. All predictions are negative.
      all_match_indices->push_back(match_indices);
      all_match_overlaps->push_back(match_overlaps);
      continue;
    }
    // Find match between predictions and ground truth.
    const vector<NormalizedBBox>& gt_bboxes = all_gt_bboxes.find(i)->second; // the N ground truth boxes of this image
    if (!use_prior_for_matching) {
      for (int c = 0; c < loc_classes; ++c) {
        int label = share_location ? -1 : c;
        if (!share_location && label == background_label_id) {
          // Ignore background loc predictions.
          continue;
        }
        // Decode the prediction into bbox first.
        vector<NormalizedBBox> loc_bboxes;
        bool clip_bbox = false;
        DecodeBBoxes(prior_bboxes, prior_variances,
                     code_type, encode_variance_in_target, clip_bbox,
                     all_loc_preds[i].find(label)->second, &loc_bboxes);
        MatchBBox(gt_bboxes, loc_bboxes, label, match_type,
                  overlap_threshold, ignore_cross_boundary_bbox,
                  &match_indices[label], &match_overlaps[label]);
      }
    } else {
      // Use prior bboxes to match against all ground truth.
      vector<int> temp_match_indices;
      vector<float> temp_match_overlaps;
      const int label = -1;
      MatchBBox(gt_bboxes, prior_bboxes, label, match_type, overlap_threshold,
                ignore_cross_boundary_bbox, &temp_match_indices,
                &temp_match_overlaps);
      if (share_location) {
        match_indices[label] = temp_match_indices;
        match_overlaps[label] = temp_match_overlaps;
      } else {
        // Get ground truth label for each ground truth bbox.
        vector<int> gt_labels;
        for (int g = 0; g < gt_bboxes.size(); ++g) {
          gt_labels.push_back(gt_bboxes[g].label());
        }
        // Distribute the matching results to different loc_class.
        for (int c = 0; c < loc_classes; ++c) {
          if (c == background_label_id) {
            // Ignore background loc predictions.
            continue;
          }
          match_indices[c].resize(temp_match_indices.size(), -1);
          match_overlaps[c] = temp_match_overlaps;
          for (int m = 0; m < temp_match_indices.size(); ++m) {
            if (temp_match_indices[m] > -1) {
              const int gt_idx = temp_match_indices[m];
              CHECK_LT(gt_idx, gt_labels.size());
              if (c == gt_labels[gt_idx]) {
                match_indices[c][m] = gt_idx;
              }
            }
          }
        }
      }
    }
    all_match_indices->push_back(match_indices);
    all_match_overlaps->push_back(match_overlaps);
  }
}


void MatchBBox(const vector<NormalizedBBox>& gt_bboxes,
    const vector<NormalizedBBox>& pred_bboxes, const int label,
    const MatchType match_type, const float overlap_threshold,
    const bool ignore_cross_boundary_bbox,
    vector<int>* match_indices, vector<float>* match_overlaps) {
  int num_pred = pred_bboxes.size();
  match_indices->clear();
  match_indices->resize(num_pred, -1);
  match_overlaps->clear();
  match_overlaps->resize(num_pred, 0.);

  int num_gt = 0;
  vector<int> gt_indices;
  if (label == -1) {
    // label -1 means comparing against all ground truth.
    num_gt = gt_bboxes.size();
    for (int i = 0; i < num_gt; ++i) {
      gt_indices.push_back(i);
    }
  } else {
    // Count number of ground truth boxes which has the desired label.
    for (int i = 0; i < gt_bboxes.size(); ++i) {
      if (gt_bboxes[i].label() == label) {
        num_gt++;
        gt_indices.push_back(i);
      }
    }
  }
  if (num_gt == 0) {
    return;
  }

  // Store the positive overlap between predictions and ground truth.
  map<int, map<int, float> > overlaps;
  for (int i = 0; i < num_pred; ++i) {
    if (ignore_cross_boundary_bbox && IsCrossBoundaryBBox(pred_bboxes[i])) {
      (*match_indices)[i] = -2;
      continue;
    }
    for (int j = 0; j < num_gt; ++j) {
      float overlap = JaccardOverlap(pred_bboxes[i], gt_bboxes[gt_indices[j]]);
      if (overlap > 1e-6) {
        (*match_overlaps)[i] = std::max((*match_overlaps)[i], overlap);
        overlaps[i][j] = overlap;
      }
    }
  }

  // Bipartite matching.
  vector<int> gt_pool;
  for (int i = 0; i < num_gt; ++i) {
    gt_pool.push_back(i);
  }
  while (gt_pool.size() > 0) {
    // Find the most overlapped gt and cooresponding predictions.
    int max_idx = -1;
    int max_gt_idx = -1;
    float max_overlap = -1;
    for (map<int, map<int, float> >::iterator it = overlaps.begin();
         it != overlaps.end(); ++it) {
      int i = it->first;
      if ((*match_indices)[i] != -1) {
        // The prediction already has matched ground truth or is ignored.
        continue;
      }
      for (int p = 0; p < gt_pool.size(); ++p) {
        int j = gt_pool[p];
        if (it->second.find(j) == it->second.end()) {
          // No overlap between the i-th prediction and j-th ground truth.
          continue;
        }
        // Find the maximum overlapped pair.
        if (it->second[j] > max_overlap) {
          // If the prediction has not been matched to any ground truth,
          // and the overlap is larger than maximum overlap, update.
          max_idx = i;
          max_gt_idx = j;
          max_overlap = it->second[j];
        }
      }
    }
    if (max_idx == -1) {
      // Cannot find good match.
      break;
    } else {
      CHECK_EQ((*match_indices)[max_idx], -1);
      (*match_indices)[max_idx] = gt_indices[max_gt_idx];
      (*match_overlaps)[max_idx] = max_overlap;
      // Erase the ground truth.
      gt_pool.erase(std::find(gt_pool.begin(), gt_pool.end(), max_gt_idx));
    }
  }

  switch (match_type) {
    case MultiBoxLossParameter_MatchType_BIPARTITE:
      // Already done.
      break;
    case MultiBoxLossParameter_MatchType_PER_PREDICTION:
      // Get most overlaped for the rest prediction bboxes.
      for (map<int, map<int, float> >::iterator it = overlaps.begin();
           it != overlaps.end(); ++it) {
        int i = it->first;
        if ((*match_indices)[i] != -1) {
          // The prediction already has matched ground truth or is ignored.
          continue;
        }
        int max_gt_idx = -1;
        float max_overlap = -1;
        for (int j = 0; j < num_gt; ++j) {
          if (it->second.find(j) == it->second.end()) {
            // No overlap between the i-th prediction and j-th ground truth.
            continue;
          }
          // Find the maximum overlapped pair.
          float overlap = it->second[j];
          if (overlap >= overlap_threshold && overlap > max_overlap) {
            // If the prediction has not been matched to any ground truth,
            // and the overlap is larger than maximum overlap, update.
            max_gt_idx = j;
            max_overlap = overlap;
          }
        }
        if (max_gt_idx != -1) {
          // Found a matched ground truth.
          CHECK_EQ((*match_indices)[i], -1);
          (*match_indices)[i] = gt_indices[max_gt_idx];
          (*match_overlaps)[i] = max_overlap;
        }
      }
      break;
    default:
      LOG(FATAL) << "Unknown matching type.";
      break;
  }

  return;
}


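Match every prior box against every ground truth box to obtain the pending matches map<int, map<int, float> > overlaps (fewer than 8732 entries). Only prior boxes with JaccardOverlap > 0 are kept; the rest are dropped. One ground truth box may match several prior boxes.
From the pending matches, find the best-matching pair for each ground truth box and put it into the candidate positive set vector<int>* match_indices, vector<float>* match_overlaps.
In the remaining pending matches one ground truth box may still match several prior boxes, so for each remaining prior box we look for its best match with a ground truth box satisfying JaccardOverlap > 0.5 and put it into the candidate positive set vector<int>* match_indices, vector<float>* match_overlaps.

As a result, one ground truth box may yield several candidate positive samples.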
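The two-stage matching can be condensed into a short Python sketch. It is a simplified illustration of the logic in MatchBBox, not a port: it ignores labels and cross-boundary boxes, and the names `iou` and `match_priors` are made up for this example.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_priors(gt_boxes, priors, threshold=0.5):
    """Return, per prior, the index of its matched ground truth box or -1."""
    match = [-1] * len(priors)
    if not gt_boxes:
        return match
    overlaps = [[iou(p, g) for g in gt_boxes] for p in priors]
    # Stage 1 (bipartite): repeatedly take the best remaining (prior, gt) pair,
    # so every ground truth box gets one prior even if the overlap is small.
    gt_pool = set(range(len(gt_boxes)))
    while gt_pool:
        best = max(
            ((overlaps[i][j], i, j)
             for i in range(len(priors)) if match[i] == -1
             for j in gt_pool if overlaps[i][j] > 1e-6),
            default=None,
        )
        if best is None:
            break
        _, i, j = best
        match[i] = j
        gt_pool.remove(j)
    # Stage 2 (per prediction): each leftover prior is matched to its best
    # ground truth box if the overlap reaches the threshold.
    for i in range(len(priors)):
        if match[i] != -1:
            continue
        j = max(range(len(gt_boxes)), key=lambda g: overlaps[i][g])
        if overlaps[i][j] >= threshold:
            match[i] = j
    return match

gt = [(0.0, 0.0, 0.5, 0.5)]
priors = [(0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.4),
          (0.4, 0.4, 0.9, 0.9), (0.6, 0.6, 1.0, 1.0)]
match = match_priors(gt, priors)
print(match)  # [0, 0, -1, -1]
```

The first prior is picked in stage 1, the second passes the 0.5 threshold in stage 2, and the last two stay unmatched and become candidate negatives.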

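2. Obtaining negative samples

After the prior boxes are generated, many of them match a ground truth box and become positive boxes (the candidate positive set), but far more do not, so the negative boxes (the candidate negative set) greatly outnumber the positive boxes. This imbalance between negative and positive boxes makes training hard to converge.

The paper therefore first sorts the predictions (prior boxes) by loss. For the candidate positive set: take the prior boxes with the highest loss and match them against the candidate positive set (a match means the box index is in both sets). A positive that fails to match is removed (it is not among the hard examples, so it is already close to the ground truth box and needs no further training). For the candidate negative set: take the prior boxes with the highest loss and match them against the candidate negative set; those that match become negatives. This is hard example mining. For example, suppose that among the 8732 prior boxes FindMatches yields $P$ candidate positives, leaving $8732-P$ candidate negatives. Sort the prior boxes by prediction loss in descending order and take the top $M$. If $a$ of the $P$ candidate positives are not among these $M$ prior boxes, remove those $a$ boxes from the candidate positive set. If $b$ of the $8732-P$ candidate negatives are among these $M$ prior boxes, those $b$ boxes become the final negatives. In short, the hard examples with high loss are selected as the final negatives that take part in loss backpropagation. This is how SSD keeps the ratio of positives to negatives under control.

With HARD_EXAMPLE mode (from the paper Training Region-based Object Detectors with Online Hard Example Mining), $M = 64$ by default. Because the number of positives cannot be controlled, this mode behaves somewhat like alternately training classification and regression with different weights.
With MAX_NEGATIVE mode, $M = P*neg\_pos\_ratio$. With $neg\_pos\_ratio = 3$ this gives the 1:3 positive-to-negative ratio from the paper.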

enum MultiBoxLossParameter_MiningType {
  MultiBoxLossParameter_MiningType_NONE = 0,
  MultiBoxLossParameter_MiningType_MAX_NEGATIVE = 1,
  MultiBoxLossParameter_MiningType_HARD_EXAMPLE = 2
};
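A minimal sketch of the MAX_NEGATIVE idea in Python follows. It is an illustration only: the function name `mine_hard_negatives` is hypothetical, the per-prior confidence losses are assumed to be given, and the extra overlap filter that the real layer can apply to negatives is left out.

```python
def mine_hard_negatives(match, conf_loss, neg_pos_ratio=3.0):
    """Pick the unmatched priors with the highest confidence loss.

    match[i] is the matched ground truth index of prior i, or -1 if unmatched.
    conf_loss[i] is the confidence loss of prior i.
    """
    num_pos = sum(1 for m in match if m != -1)
    candidates = [i for i, m in enumerate(match) if m == -1]
    # Keep at most neg_pos_ratio negatives per positive.
    num_neg = min(int(num_pos * neg_pos_ratio), len(candidates))
    candidates.sort(key=lambda i: conf_loss[i], reverse=True)
    return sorted(candidates[:num_neg])

match = [0, -1, -1, -1, -1, -1]              # one positive, five candidates
conf_loss = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
negatives = mine_hard_negatives(match, conf_loss)
print(negatives)  # [2, 3, 4]
```

With one positive and a ratio of 3, the three unmatched priors with the largest loss (indices 2, 4 and 3) are kept and the two easy negatives are ignored.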

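3. Regression

With the prior box as reference, the regression target in SSD is not simply the center offset and the width/height scaling, because an encoding step is involved. Here is a brief description of the default decoding process; encoding is the inverse:

Input: predefined prior box = [prior_center_x, prior_center_y, prior_width, prior_height]

Predicted output: predict box = [bbox.xmin(), bbox.ymin(), bbox.xmax(), bbox.ymax()]

Encoding coefficients: prior_variance = [0.1, 0.1, 0.2, 0.2]

Output: decode_bbox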

decode_bbox_center_x = prior_variance[0] * bbox.xmin() * prior_width + prior_center_x;
decode_bbox_center_y = prior_variance[1] * bbox.ymin() * prior_height + prior_center_y;
decode_bbox_width = exp(prior_variance[2] * bbox.xmax()) * prior_width;
decode_bbox_height = exp(prior_variance[3] * bbox.ymax()) * prior_height;

decode_bbox->set_xmin(decode_bbox_center_x - decode_bbox_width / 2.);
decode_bbox->set_ymin(decode_bbox_center_y - decode_bbox_height / 2.);
decode_bbox->set_xmax(decode_bbox_center_x + decode_bbox_width / 2.);
decode_bbox->set_ymax(decode_bbox_center_y + decode_bbox_height / 2.);
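To make the "encoding is the inverse" remark concrete, here is a small Python round trip. The `decode` function mirrors the listing above; `encode` is my own inversion of it and the helper `to_center` is hypothetical, so treat this as a sketch rather than the library's code.

```python
import math

VARIANCE = (0.1, 0.1, 0.2, 0.2)

def to_center(box):
    """(xmin, ymin, xmax, ymax) -> (center_x, center_y, width, height)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)

def encode(prior, gt, var=VARIANCE):
    """Regression target for a ground truth box relative to a prior box."""
    pcx, pcy, pw, ph = to_center(prior)
    gcx, gcy, gw, gh = to_center(gt)
    return ((gcx - pcx) / pw / var[0],
            (gcy - pcy) / ph / var[1],
            math.log(gw / pw) / var[2],
            math.log(gh / ph) / var[3])

def decode(prior, code, var=VARIANCE):
    """Turn a predicted code back into (xmin, ymin, xmax, ymax)."""
    pcx, pcy, pw, ph = to_center(prior)
    cx = var[0] * code[0] * pw + pcx
    cy = var[1] * code[1] * ph + pcy
    w = math.exp(var[2] * code[2]) * pw
    h = math.exp(var[3] * code[3]) * ph
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

prior = (0.2, 0.2, 0.6, 0.6)
gt = (0.25, 0.3, 0.7, 0.65)
code = encode(prior, gt)
decoded = decode(prior, code)
print(all(abs(a - b) < 1e-9 for a, b in zip(decoded, gt)))  # True
```

Dividing by the variances makes the targets larger in magnitude (by 10x for the center, 5x for the size), which is one way to read their role as a scaling factor.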

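4. Data augmentation

The paper also applies data augmentation to the training data.

For each training image, one of the following options is chosen at random:

Use the original image
Randomly sample several patches (CropImage) whose minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7 or 0.9 (the min_jaccard_overlap values in the batch_sampler settings below)

Each sampled patch covers a fraction of the original image size between min_scale and max_scale (0.3 to 1.0 in the settings below), with an aspect ratio between 0.5 and 2.0

A CropImage is kept when the center of a ground truth box lies inside the sampled patch and the area of the ground truth box inside the patch is greater than 0.

After these sampling steps, each sampled patch is resized to a fixed size and randomly mirrored horizontally (mirror: true in the settings below)

Sampling one image with the various batch_sampler entries thus produces several candidate samples, from which one is picked at random and fed into the network for training.

Parameters of a sampler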

// Sample a bbox in the normalized space [0, 1] with provided constraints.
message Sampler {
// minimum and maximum scale
optional float min_scale = 1 [default = 1.];
optional float max_scale = 2 [default = 1.];
// minimum and maximum aspect ratio; the actual aspect ratio is drawn from between these two values
optional float min_aspect_ratio = 3 [default = 1.];
optional float max_aspect_ratio = 4 [default = 1.];
}

Constraints on the selected sample_box

// Constraints for selecting sampled bbox.
message SampleConstraint {
  // Minimum Jaccard overlap between sampled bbox and all bboxes in
  // AnnotationGroup.
  optional float min_jaccard_overlap = 1;
  // Maximum Jaccard overlap between sampled bbox and all bboxes in
  // AnnotationGroup.
  optional float max_jaccard_overlap = 2;
  // Minimum coverage of sampled bbox by all bboxes in AnnotationGroup.
  optional float min_sample_coverage = 3;
  // Maximum coverage of sampled bbox by all bboxes in AnnotationGroup.
  optional float max_sample_coverage = 4;
  // Minimum coverage of all bboxes in AnnotationGroup by sampled bbox.
  optional float min_object_coverage = 5;
  // Maximum coverage of all bboxes in AnnotationGroup by sampled bbox.
  optional float max_object_coverage = 6;
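}

In practice usually only the Jaccard overlap constraints are used (min_jaccard_overlap in most of the samplers below, max_jaccard_overlap in the last one).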

Parameter settings for sampling a batch

// Sample a batch of bboxes with provided constraints.
message BatchSampler {
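  // whether to use the original image
  optional bool use_original_image = 1 [default = true];
  // sampler parameters
  optional Sampler sampler = 2;
  // constraints on the sampled box, deciding whether a sample is positive or negative
  optional SampleConstraint sample_constraint = 3;
  // stop sampling once this many samples satisfy the constraints
  optional uint32 max_sample = 4;
  // maximum number of sampling trials, to avoid an infinite loop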
  optional uint32 max_trials = 5 [default = 100];
}

Parameters for transforming data in the data layer

message TransformationParameter {
  // For data preprocessing we can simply scale and subtract a provided mean value.
  // Note that the mean is subtracted before scaling.
  optional float scale = 1 [default = 1];
  // whether to mirror randomly
  optional bool mirror = 2 [default = false];
  // whether to crop randomly
  optional uint32 crop_size = 3 [default = 0];
  optional uint32 crop_h = 11 [default = 0];
  optional uint32 crop_w = 12 [default = 0];
  // path to the mean_file; must not be given together with mean_value
  // if specified can be repeated once (would substract it from all the 
  // channels) or can be repeated the same number of times as channels
  // (would subtract them from the corresponding channel)
  optional string mean_file = 4;
  repeated float mean_value = 5;
  // Force the decoded image to have 3 color channels.
  optional bool force_color = 6 [default = false];
  // Force the decoded image to have 1 color channels.
  optional bool force_gray = 7 [default = false];
  // Resize policy
  optional ResizeParameter resize_param = 8;
  // Noise policy
  optional NoiseParameter noise_param = 9;
  // Distortion policy
  optional DistortionParameter distort_param = 13;
  // Expand policy
  optional ExpansionParameter expand_param = 14;
  // Constraint for emitting the annotation after transformation.
  optional EmitConstraint emit_constraint = 10;
}

Data transformation and sampling settings in SSD

transform_param {
    mirror: true
    mean_value: 104
    mean_value: 117
    mean_value: 123
    resize_param {
      prob: 1
      resize_mode: WARP
      height: 300
      width: 300
      interp_mode: LINEAR
      interp_mode: AREA
      interp_mode: NEAREST
      interp_mode: CUBIC
      interp_mode: LANCZOS4
    }
    emit_constraint {
      emit_type: CENTER
    }
    distort_param {
      brightness_prob: 0.5
      brightness_delta: 32
      contrast_prob: 0.5
      contrast_lower: 0.5
      contrast_upper: 1.5
      hue_prob: 0.5
      hue_delta: 18
      saturation_prob: 0.5
      saturation_lower: 0.5
      saturation_upper: 1.5
      random_order_prob: 0.0
    }
    expand_param {
      prob: 0.5
      max_expand_ratio: 4.0
    }
  }

 annotated_data_param {
    batch_sampler {
      max_sample: 1
      max_trials: 1
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1.0
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2.0
      }
      sample_constraint {
        min_jaccard_overlap: 0.1
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1.0
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2.0
      }
      sample_constraint {
        min_jaccard_overlap: 0.3
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1.0
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2.0
      }
      sample_constraint {
        min_jaccard_overlap: 0.5
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1.0
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2.0
      }
      sample_constraint {
        min_jaccard_overlap: 0.7
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1.0
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2.0
      }
      sample_constraint {
        min_jaccard_overlap: 0.9
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1.0
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2.0
      }
      sample_constraint {
        max_jaccard_overlap: 1.0
      }
      max_sample: 1
      max_trials: 50
    }
    label_map_file: "E:/tyang/caffe-master_/data/VOC0712/labelmap_voc.prototxt"
  }


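Fig.4 SSD data augmentation

IV. Network structure

SSD is built on VGG16 with modifications. During training the layers are, in order: conv1_1, conv1_2, conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2, conv5_3 (512), fc6 as a 3*3*1024 convolution (fc6 is a fully connected layer in the original VGG16 and becomes a convolutional layer here; likewise fc7 below), fc7 as a 1*1*1024 convolution, conv6_1, conv6_2 (conv8_2 in the figure above), conv7_1, conv7_2, conv8_1, conv8_2, conv9_1, conv9_2, loss.

Then each of conv4_3 (4), fc7 (6), conv6_2 (6), conv7_2 (6), conv8_2 (4) and conv9_2 (4) is convolved with two parallel sets of 3*3 kernels. The numbers in parentheses are the number of prior boxes per cell (see the Caffe code), so the 8732 in the second-to-last column of the SSD structure in the figure above is the total number of prior boxes: 38*38*4+19*19*6+10*10*6+5*5*6+3*3*4+1*1*4=8732. One set of 3*3 kernels does localization (regression; with 6 prior boxes there are 6*4=24 such kernels, and because pad=1 the output map has the same size as the input, likewise below). The other set does confidence (classification; with 6 prior boxes and the 20 VOC object classes there are 6*(20+1)=126 such kernels).

The figure below shows the 3*3 localization convolution on conv6_2, with 24 kernels (6*4=24; since pad=1 the output map size is unchanged, likewise below). The permute layer simply reorders dimensions: for example, if the output of the convolution is 32×24×19×19, it becomes 32×19×19×24 after the permute layer. The flatten layer then turns 32×19×19×24 into 32*8664, where 32 is the batch size.

Fig.5 SSD pipeline

The SSD input image is 3×300×300. After the pool5 layer the output is 512×19×19, which then goes through fc6 (converted to a convolutional layer):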

layer {
  name: "fc6"
  type: "Convolution"
  bottom: "pool5"
  top: "fc6"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 1024
    pad: 6
    kernel_size: 3
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
    dilation: 6
  }
}

A brief note on how convolution and pooling layers change the size:

The formula is as follows, where input is the height or width; dilation defaults to 1, so by default kernel_extent = kernel:
$output = \frac{input + 2 \times pad - kernel\_extent}{stride} + 1$
$kernel\_extent = dilation \times (kernel - 1) + 1$

Note: when the division is not exact, Convolution rounds down and Pooling rounds up.

This gives the output dimensions: fc6: 1024×19×19, fc7: 1024×19×19, conv6_1: 256×19×19, conv6_2: 512×10×10, conv7_1: 128×10×10, conv7_2: 256×5×5, conv8_1: 128×5×5, conv8_2: 256×3×3, conv9_1: 128×3×3, conv9_2: 256×1×1.
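The spatial sizes can be traced with the formula in a few lines of Python. The kernel, stride and pad values below are assumptions taken from my reading of the SSD300 prototxt and are not stated in these notes: 2×2 stride-2 pooling for pool1 to pool4, a 3×3 stride-1 pad-1 pool5, stride 2 with pad 1 for conv6_2 and conv7_2, and stride 1 without padding for conv8_2 and conv9_2.

```python
import math

def conv_out(size, kernel, stride=1, pad=0, dilation=1):
    """Convolution output size; a non-exact division rounds down."""
    extent = dilation * (kernel - 1) + 1
    return (size + 2 * pad - extent) // stride + 1

def pool_out(size, kernel, stride=1, pad=0):
    """Pooling output size; a non-exact division rounds up."""
    return int(math.ceil((size + 2 * pad - kernel) / stride)) + 1

size = 300
for _ in range(3):                                # pool1, pool2, pool3
    size = pool_out(size, kernel=2, stride=2)     # 3x3 pad-1 convs keep the size
conv4_3 = size
size = pool_out(size, kernel=2, stride=2)         # pool4
size = pool_out(size, kernel=3, stride=1, pad=1)  # pool5
fc7 = conv_out(size, 3, pad=6, dilation=6)        # fc6; fc7 is 1x1 and keeps the size
conv6_2 = conv_out(fc7, 3, stride=2, pad=1)
conv7_2 = conv_out(conv6_2, 3, stride=2, pad=1)
conv8_2 = conv_out(conv7_2, 3)
conv9_2 = conv_out(conv8_2, 3)

sizes = [conv4_3, fc7, conv6_2, conv7_2, conv8_2, conv9_2]
print(sizes)  # [38, 19, 10, 5, 3, 1]
```

The rounding rule matters at pool3: 75 is odd, and rounding up gives 38 rather than 37, which is where the 38×38 conv4_3 map comes from.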

Next, the dimensions of the 6 feature maps used for detection:

feature map	conv4_3	fc7	conv6_2	conv7_2	conv8_2	conv9_2
size	512×38×38	1024×19×19	512×10×10	256×5×5	256×3×3	256×1×1

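On each feature layer used for detection, a set of convolutional filters produces a fixed set of predictions. For a feature map of size $m×n$ with $p$ channels, the convolutional filters are $3×3×p$ kernels. Each prediction is either a score for a class or a shape offset relative to the prior box coordinates.

For the scores, the output of the convolutional predictor has dimensions (c*k)×m×n, where c is the total number of classes and k is the number of default boxes per cell on that layer (k differs by layer: 4, 6, 6, 6, 4, 4):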

layer {
  name: "conv6_2_mbox_conf"
  type: "Convolution"
  bottom: "conv6_2"
  top: "conv6_2_mbox_conf"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 126
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}

The dimensions become:

layer	conv4_3_norm_mbox_conf	fc7_mbox_conf	conv6_2_mbox_conf	conv7_2_mbox_conf	conv8_2_mbox_conf	conv9_2_mbox_conf
size	84×38×38	126×19×19	126×10×10	126×5×5	84×3×3	84×1×1

Then the permute layer reorders the dimensions:

layer	conv4_3_norm_mbox_conf_perm	fc7_mbox_conf_perm	conv6_2_mbox_conf_perm	conv7_2_mbox_conf_perm	conv8_2_mbox_conf_perm	conv9_2_mbox_conf_perm
size	38×38×84	19×19×126	10×10×126	5×5×126	3×3×84	1×1×84

Finally the flatten layer flattens them:

layer	conv4_3_norm_mbox_conf_flat	fc7_mbox_conf_flat	conv6_2_mbox_conf_flat	conv7_2_mbox_conf_flat	conv8_2_mbox_conf_flat	conv9_2_mbox_conf_flat
size	121296	45486	12600	3150	756	84

For the offsets, the output of the convolutional predictor is (4*k)×m×n, where 4 stands for (x, y, w, h) and k is the number of prior boxes per cell on that layer (k differs by layer: 4, 6, 6, 6, 4, 4):

layer {
  name: "conv6_2_mbox_loc"
  type: "Convolution"
  bottom: "conv6_2"
  top: "conv6_2_mbox_loc"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 24
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}

The dimensions become:

layer	conv4_3_norm_mbox_loc	fc7_mbox_loc	conv6_2_mbox_loc	conv7_2_mbox_loc	conv8_2_mbox_loc	conv9_2_mbox_loc
size	16×38×38	24×19×19	24×10×10	24×5×5	16×3×3	16×1×1

Then the permute layer reorders the dimensions:

layer	conv4_3_norm_mbox_loc_perm	fc7_mbox_loc_perm	conv6_2_mbox_loc_perm	conv7_2_mbox_loc_perm	conv8_2_mbox_loc_perm	conv9_2_mbox_loc_perm
size	38×38×16	19×19×24	10×10×24	5×5×24	3×3×16	1×1×16

Finally the flatten layer flattens them:

layer	conv4_3_norm_mbox_loc_flat	fc7_mbox_loc_flat	conv6_2_mbox_loc_flat	conv7_2_mbox_loc_flat	conv8_2_mbox_loc_flat	conv9_2_mbox_loc_flat
size	23104	8664	2400	600	144	16

At the same time, each feature map passes through a PriorBox layer to generate prior boxes.

Generating prior boxes: they are generated from the minimum size (scale) and the aspect ratios, at intervals of step. step is the size in the input image that one pixel of this layer corresponds to, i.e. the layer's cumulative stride relative to the input. In the source it is obtained by dividing the size of the original input image by the size of the layer's feature map. variance is a scaling factor: the four coordinates here are the center coordinates plus width and height, and when computing the loss the center loss and the width/height loss may need to be balanced, hence the variance. If the four-corner encoding of a box is used instead, all variances default to 0.1, i.e. there is no relative weighting between them.

It is used to encode the ground truth box w.r.t. the prior box. You can check this function. Note that it is used in the original MultiBox paper by Erhan et al. It is also used in Faster R-CNN as well. I think the major goal of including the variance is to scale the gradient. Of course you can also think of it as approximate a gaussian distribution with variance of 0.1 around the box coordinates.

layer {
  name: "conv6_2_mbox_priorbox"
  type: "PriorBox"
  bottom: "conv6_2"
  bottom: "data"
  top: "conv6_2_mbox_priorbox"
  prior_box_param {
    min_size: 111.0
    max_size: 162.0
    aspect_ratio: 2.0
    aspect_ratio: 3.0
    flip: true
    clip: false
    variance: 0.10000000149
    variance: 0.10000000149
    variance: 0.20000000298
    variance: 0.20000000298
    step: 32.0
    offset: 0.5
  }
}

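The dimensions become:

layer	conv4_3_norm_mbox_priorbox	fc7_mbox_priorbox	conv6_2_mbox_priorbox	conv7_2_mbox_priorbox	conv8_2_mbox_priorbox	conv9_2_mbox_priorbox
size	2×23104	2×8664	2×2400	2×600	2×144	2×16

After these three operations, the processing of each feature layer is done.

After applying the operations above to the outputs of the 6 layers listed earlier, the results are merged with Concat, similar to the Inception module in GoogLeNet: a concatenation along an axis rather than an element-wise sum.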

layer {
  name: "mbox_loc"
  type: "Concat"
  bottom: "conv4_3_norm_mbox_loc_flat"
  bottom: "fc7_mbox_loc_flat"
  bottom: "conv6_2_mbox_loc_flat"
  bottom: "conv7_2_mbox_loc_flat"
  bottom: "conv8_2_mbox_loc_flat"
  bottom: "conv9_2_mbox_loc_flat"
  top: "mbox_loc"
  concat_param {
    axis: 1
  }
}
layer {
  name: "mbox_conf"
  type: "Concat"
  bottom: "conv4_3_norm_mbox_conf_flat"
  bottom: "fc7_mbox_conf_flat"
  bottom: "conv6_2_mbox_conf_flat"
  bottom: "conv7_2_mbox_conf_flat"
  bottom: "conv8_2_mbox_conf_flat"
  bottom: "conv9_2_mbox_conf_flat"
  top: "mbox_conf"
  concat_param {
    axis: 1
  }
}
layer {
  name: "mbox_priorbox"
  type: "Concat"
  bottom: "conv4_3_norm_mbox_priorbox"
  bottom: "fc7_mbox_priorbox"
  bottom: "conv6_2_mbox_priorbox"
  bottom: "conv7_2_mbox_priorbox"
  bottom: "conv8_2_mbox_priorbox"
  bottom: "conv9_2_mbox_priorbox"
  top: "mbox_priorbox"
  concat_param {
    axis: 2
  }
}

Dimensions after concatenation:

layer	mbox_loc	mbox_conf	mbox_priorbox
size	34928 (8732*4)	183372 (8732*21)	2×34928 (8732*4)
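The flattened and concatenated sizes in the tables above follow directly from the per-layer map sizes and box counts. A short check in Python (the variable names are only for this example):

```python
num_classes = 21
# (feature map side length, prior boxes per cell) for the six source layers
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

loc_flat = [side * side * k * 4 for side, k in layers]
conf_flat = [side * side * k * num_classes for side, k in layers]

print(loc_flat)        # [23104, 8664, 2400, 600, 144, 16]
print(conf_flat)       # [121296, 45486, 12600, 3150, 756, 84]
print(sum(loc_flat))   # 34928  = 8732 * 4
print(sum(conf_flat))  # 183372 = 8732 * 21
```

Concatenating along axis 1 simply adds these lengths, so every prior box ends up with 4 location values and 21 class scores at matching positions.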

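Last comes the authors' custom loss layer. Here overlap_threshold means that a prior box whose overlap with a ground truth box exceeds this threshold is a positive sample. Which prior boxes are positive and which are negative is, I believe, worked out inside the loss layer, but this detail matters little for the algorithm itself: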

layer {
  name: "mbox_loss"
  type: "MultiBoxLoss"
  bottom: "mbox_loc"
  bottom: "mbox_conf"
  bottom: "mbox_priorbox"
  bottom: "label"
  top: "mbox_loss"
  include {
    phase: TRAIN
  }
  propagate_down: true
  propagate_down: true
  propagate_down: false
  propagate_down: false
  loss_param {
    normalization: VALID
  }
  multibox_loss_param {
    loc_loss_type: SMOOTH_L1
    conf_loss_type: SOFTMAX
    loc_weight: 1.0
    num_classes: 21
    share_location: true
    match_type: PER_PREDICTION
    overlap_threshold: 0.5
    use_prior_for_matching: true
    background_label_id: 0
    use_difficult_gt: true
    neg_pos_ratio: 3.0
    neg_overlap: 0.5
    code_type: CENTER_SIZE
    ignore_cross_boundary_bbox: false
    mining_type: MAX_NEGATIVE
  }
}

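Loss function: it is basically the same as in Faster R-CNN, consisting of a classification part and a regression part; see Faster R-CNN for details, which are not repeated here. In short, the regression loss pushes the offset between the predicted box and the prior box to be as close as possible to the offset between the ground truth and the prior box, so that the predicted box ends up as close as possible to the ground truth.

After filtering by Jaccard overlap, only a few of the 8732 boxes above remain as matches: boxes that fail the threshold are marked as negatives and the remaining ones as positives. Then:

During training, prior boxes are matched to ground truth boxes. The basic idea is to make every matched prior box regress to its ground truth box. This is steered by the loss layer, which computes the error between the predicted and the true values and thereby guides learning.

The SSD training objective is derived from the MultiBox objective, extended to handle multiple object classes. Concretely, the Jaccard coefficient between each prior box and the ground truth boxes is computed, and only boxes above the 0.5 threshold become candidates. Suppose N boxes are matched this way. Let i index the prior boxes, j the ground truth boxes, and p the classes. Then $x_{ij}^p=1$ means the i-th prior box is matched to the j-th ground truth box of class p, and $x_{ij}^p=0$ otherwise. The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):

localization loss (loc) is the Smooth L1 loss from Fast R-CNN, computed between the parameters of the predicted box and of the ground truth box (i.e. center position, width and height); it regresses the center position and the width and height of the bounding boxes

confidence loss (conf) is the Softmax loss, whose input is the confidence for each class

Weight term: loc_weight in the multibox_loss_param above, set to 1.0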
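For reference, the objective can be written out as below. This is my transcription of the formulas as I understand them from the SSD paper, so check it against the paper before relying on it. Here $N$ is the number of matched prior boxes, $\alpha$ is the weight term, $l$ is the predicted box, $\hat{g}$ is the ground truth box encoded relative to the prior box, and $c$ are the class confidences (class 0 is background).

```latex
L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}}
    x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right)
    - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right),
\qquad
\hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}
```

When $N = 0$ the loss is set to 0, and the negatives in $L_{conf}$ are the ones selected by hard negative mining.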

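V. Code

There is a lot of code. To understand SSD quickly, it is enough to study three aspects in detail: sample augmentation, how positive and negative samples are obtained, and the loss function. The main files are annotated_data_layer.hpp, detection_evaluate_layer.hpp, detection_output_layer.hpp, multibox_loss_layer.hpp and prior_box_layer.hpp under include/caffe/, the corresponding cpp and cu files under src/caffe/layers/, and bbox_util.cpp under src/caffe/util/. The names are self-explanatory: annotated_data_layer provides the data, detection_evaluate_layer evaluates the model, detection_output_layer outputs the detection results, multibox_loss_layer is the loss, and prior_box_layer computes the prior bboxes.

annotated_data_layer.cpp  sampler.cpp  data_transformer.cpp

This part of the code handles sample loading and data augmentation, and reads out the ground truth bboxes of each image and passes them to the next layer. Starting from load_batch(), it has four main parts (see the figure above):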

① DistortImage();
② ExpandImage();
③ GenerateBatchSamples();
④ CropImage();

prior_box_layer.cpp

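This layer generates prior boxes on a given feature map; start reading from Forward_cpu(). SSD's approach is interesting: for a feature map of size W×H there are W×H prior box centers, spread evenly over the whole image, as illustrated in the figure below. At each center several prior boxes with different aspect ratios can be generated, e.g. [1/3, 1/2, 1, 2, 3]. So the total number of prior boxes on one feature map is W×H×length_of_aspect_ratio; for a large feature map such as VGG's conv4_3 this reaches several thousand. Boxes on the border need some handling so that they do not extend beyond the image, but that is a detail.

Note that although prior boxes are positioned on the W×H grid, their size is not the size of a grid cell but is specified by hand. In the original paper it varies evenly from 0.2 to 0.9 going from the lower to the higher feature maps.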

multibox_loss_layer.cpp

FindMatches()

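With the prior boxes laid out on the image and the ground truth available, the next step is to match prior boxes to ground truth boxes. This is done in the FindMatches function in src/caffe/util/bbox_util.cpp.

Note that this does not only find the best-matching prior box for each ground truth box; it also finds a matching ground truth box for every prior box (if there is one), which clearly increases the number of positive samples considerably.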

MineHardExamples()

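Once every prior box has a match (object or background), it might seem that we could define a loss function, give each prior box a label, and just train on all of them.

But in any image the negatives vastly outnumber the positives, and such heavily imbalanced data badly hurts model performance, so the negatives have to be selected.

Briefly: suppose the previous step produced N negative samples. We then sort by the loc_pred loss and keep the N largest loc_pred entries. Only the negatives whose indices are among these loc_pred entries are kept.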

EncodeLocPrediction() && EncodeConfPrediction ()

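Because we select among the prior boxes, the shape of the data is no longer regular at this point, and a loss cannot be attached directly (frameworks such as Caffe require each layer's input to be a four-dimensional tensor), so the selected data has to be reorganized. This is done in the EncodeLocPrediction and EncodeConfPrediction functions in src/caffe/util/bbox_util.cpp.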