[Computer Vision] A Summary of the YOLO Series

Date: 2019-11-05



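The network details are well documented elsewhere, so I will not repeat them here. This post mainly summarizes how the design evolved and which problems each version set out to solve.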

1. YOLO

I. Network overview

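The YOLO network consists of 24 convolutional layers followed by 2 fully connected layers. The network input is 448x448 (416x416 in v2), and every image is resized before it enters the network. The output has the shape: $S \times S \times (B \times 5 + C)$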

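Here S is the number of grid cells along each side, B is the number of boxes each cell is responsible for, and C is the number of classes. Each cell predicts B candidate boxes; the 5 stands for the four coordinates and one confidence score of each box; the C class scores are predicted once per cell, which also means that the B boxes of a cell can only belong to the same class.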
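As a quick check of the output shape, the sketch below plugs in S = 7, B = 2 and C = 20. These are the values commonly quoted for YOLO v1 on Pascal VOC; they are an assumption here, since the text above does not state them, and the function name is purely illustrative.

```python
def yolo_v1_output_shape(S, B, C):
    """Return the output tensor shape S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)

# Assumed values (not stated in the text): 7x7 grid, 2 boxes per cell, 20 classes.
shape = yolo_v1_output_shape(S=7, B=2, C=20)
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)
```

With these values each cell carries 30 numbers: two boxes of five values each, plus one shared set of 20 class scores.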

II. Loss function

The loss function consists of four parts:

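The indicator symbols (circled in red in the original figure) act as on/off switches. For example, the first one is 1 if box j of cell i contains an object, and 0 otherwise.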

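The width and height terms in the first part of the loss are taken under a square root. The reason: suppose a 100x100 object and a 10x10 object are both predicted 10 pixels too large, giving predicted boxes of 110 x 110 and 20 x 20. The first case is clearly still acceptable, but in the second case the box is twice the size of the object. Without the square root the loss is identical in both cases, namely 200; with the square root the two cases produce different losses.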
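The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch of the width/height term only (no weighting factor), and the helper name `wh_loss` is made up for illustration.

```python
import math

def wh_loss(true_wh, pred_wh, use_sqrt):
    """Sum of squared errors over (w, h), optionally on square roots."""
    f = math.sqrt if use_sqrt else (lambda v: v)
    return sum((f(t) - f(p)) ** 2 for t, p in zip(true_wh, pred_wh))

# Both objects are predicted 10 pixels too large in each dimension.
large_plain = wh_loss((100, 100), (110, 110), use_sqrt=False)
small_plain = wh_loss((10, 10), (20, 20), use_sqrt=False)
large_sqrt = wh_loss((100, 100), (110, 110), use_sqrt=True)
small_sqrt = wh_loss((10, 10), (20, 20), use_sqrt=True)

print(large_plain, small_plain)   # identical without the square root
print(large_sqrt, small_sqrt)     # the small object is penalized more
```

Without the square root both losses are 200. With it, the large object's loss is roughly 0.48 and the small object's roughly 3.43, so the error on the small box weighs about seven times more.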

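C_i is the confidence that box i contains an object, similar in spirit to the binary object/background classification in RCNN-style detectors. Because most boxes contain no object, the no-object part of this term is given a small weight of 0.5 to keep the loss balanced.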

In the classification term, the target value is 1 if c is the correct class and 0 otherwise.
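For reference, the loss as it is usually written in the YOLO v1 paper is reproduced below. This is a reconstruction from the paper rather than from the text above, which only describes the terms in words. The two coordinate sums together form the first part; in the paper $\lambda_{coord}=5$ and $\lambda_{noobj}=0.5$.

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{obj}(C_i-\hat{C}_i)^2 \\
&+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbf{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2}\mathbf{1}_{i}^{obj}\sum_{c\in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$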

III. Shortcomings

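(1) Poor detection of small objects and of objects that are close together: results degrade when more than two small objects fall into one cell, or when one cell contains objects of different classes. Reason: B is the number of bounding boxes predicted per cell, and YOLO assumes that all boxes in the same cell belong to the same class.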

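(2) Every image is resized to 448 x 448 before entering the network, which slows detection down (about 10 ms out of 25 ms). Training directly at the target size would leave room for a speedup.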

(3) The backbone network is computationally heavy.

2. YOLO_v2

v2 has no single headline innovation. It folds a collection of tricks into v1 to obtain a better network.

Choosing anchors by clustering

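The authors do not pick anchors by hand. Instead they run k-means clustering on the bounding boxes of the training set to find the anchors automatically. If the standard Euclidean distance were used as the metric, large boxes would produce more error than small boxes, so a different distance metric is used. The goal of the clustering is for the anchor boxes to have a high IOU with nearby ground truth boxes, which has no direct relationship to the size of the anchor box. The custom distance metric is:
$d(\text{box}, \text{centroid}) = 1 - IOU(\text{box}, \text{centroid})$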

A smaller distance to the cluster center is better, whereas a larger IOU is better, so 1 - IOU is used: the smaller the distance, the larger the IOU.
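A small numeric comparison makes the point concrete. In the sketch below two boxes are each 10% larger than their cluster center; the box sizes and the helper `iou_wh` are made up for illustration, and both boxes are assumed to share the same center point.

```python
import math

def iou_wh(a, b):
    """IoU of two boxes given as (w, h) that share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

small_box, small_centroid = (11, 11), (10, 10)
large_box, large_centroid = (110, 110), (100, 100)

euclid_small = euclidean(small_box, small_centroid)
euclid_large = euclidean(large_box, large_centroid)
d_small = 1 - iou_wh(small_box, small_centroid)
d_large = 1 - iou_wh(large_box, large_centroid)

print(euclid_small, euclid_large)  # the large pair is ten times "further apart"
print(d_small, d_large)            # the 1 - IOU distance is the same for both
```

The Euclidean distance grows with box size even though the relative mismatch is identical, while 1 - IOU treats both pairs the same.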

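$\Large{\textcircled{\small{1}}}$ The raw data for clustering is a detection dataset that contains only annotated boxes. YOLOv2 and v3 both generate a TXT file with the position and class of every annotated box, where each line contains $(x_j,y_j,w_j,h_j),j\in\{1,2,...,N\}$ , the coordinates of the ground truth box relative to the original image: $(x_j,y_j)$ is the center of the box, $(w_j,h_j)$ is its width and height, and N is the total number of annotated boxes;
$\Large{\textcircled{\small{2}}}$ First choose k cluster centers $(W_i,H_i),i\in\{1,2,...,k\}$ , where $(W_i,H_i)$ is the width and height of an anchor box. Because the position of an anchor box is not fixed, there are no (x,y) coordinates, only width and height;
$\Large{\textcircled{\small{3}}}$ Compute the distance between every annotated box and every cluster center, d=1-IOU(annotated box, cluster center). For this computation the center of each annotated box is made to coincide with the cluster center, otherwise the IOU cannot be computed, i.e. $d=1-IOU\left [ (x_j,y_j,w_j,h_j),(x_j,y_j,W_i,H_i) \right ],j\in\{1,2,...,N\},i\in\{1,2,...,k\}$ . Assign each annotated box to the cluster center with the smallest "distance";
$\Large{\textcircled{\small{4}}}$ Once all annotated boxes have been assigned, recompute the center of every cluster as $W_i^{'}=\frac{1}{N_i}\sum_{j\in \text{cluster } i} w_j,H_i^{'}=\frac{1}{N_i}\sum_{j\in \text{cluster } i} h_j$ , where $N_i$ is the number of annotated boxes in cluster i. In other words, take the mean width and height of all annotated boxes in the cluster.
Repeat steps 3 and 4 until the cluster centers barely change.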

The authors ran k-means for a range of k values and plotted the result as a curve:

They finally chose k=5, a trade-off between model complexity and high recall.

from os import listdir
from os.path import isfile, join
import argparse
#import cv2
import numpy as np
import sys
import os
import shutil
import random 
import math
 
def IOU(x,centroids):
    '''
    :param x: (w, h) of a single ground truth box
    :param centroids: the k anchor sizes, [(w,h),(),...]
    :return: array of IoU values between this ground truth box and each of the k anchors
    '''
    IoUs = []
    w, h = x  # w, h of the ground truth box
    for centroid in centroids:
        c_w,c_h = centroid   # w, h of the anchor
        if c_w>=w and c_h>=h:   # anchor encloses the ground truth box
            iou = w*h/(c_w*c_h)
        elif c_w>=w and c_h<=h:    # anchor is wider and shorter
            iou = w*c_h/(w*h + (c_w-w)*c_h)
        elif c_w<=w and c_h>=h:    # anchor is narrower and taller
            iou = c_w*h/(w*h + c_w*(c_h-h))
        else: # ground truth box encloses the anchor: w, h are both bigger than c_w, c_h
            iou = (c_w*c_h)/(w*h)
        IoUs.append(iou) # will become (k,) shape
    return np.array(IoUs)
 
def avg_IOU(X,centroids):
    '''
    :param X: the (w,h) of all ground truth boxes, [(w,h),(),...]
    :param centroids: the k anchor sizes, [(w,h),(),...]
    :return: mean over all ground truth boxes of the best IoU with any anchor
    '''
    n,d = X.shape
    total = 0.
    for i in range(n):
        total += max(IOU(X[i],centroids))  # best IoU between this ground truth box and any anchor
    return total/n    # average over all ground truth boxes
 
def write_anchors_to_file(centroids,X,anchor_file,input_shape,yolo_version):
    '''
    :param centroids: the k anchor sizes, [(w,h),(),...]
    :param X: the (w,h) of all ground truth boxes, [(w,h),(),...]
    :param anchor_file: output path for the anchors and the average IoU
    '''
    anchors = centroids.copy()
    print(anchors.shape)
 
    if yolo_version=='yolov2':
        for i in range(anchors.shape[0]):
            # yolo downsamples the image by a factor of 32, hence the division by 32;
            # if the network architecture changes, use its actual downsampling factor.
            # This gives the anchor size relative to the 32x downsampled feature map (yolov2).
            anchors[i][0]*=input_shape/32.
            anchors[i][1]*=input_shape/32.
    elif yolo_version=='yolov3':
        for i in range(anchors.shape[0]):
            # for yolov3 the anchor size is relative to the original image
            anchors[i][0]*=input_shape
            anchors[i][1]*=input_shape
    else:
        print("the yolo version is not right!")
        sys.exit(-1)
 
    widths = anchors[:,0]
    sorted_indices = np.argsort(widths)
 
    print('Anchors = ', anchors[sorted_indices])
 
    # open the file only after validation and close it automatically
    with open(anchor_file,'w') as f:
        for i in sorted_indices[:-1]:
            f.write('%0.2f,%0.2f, '%(anchors[i,0],anchors[i,1]))
 
        # there should not be a comma after the last anchor
        last = sorted_indices[-1]   # scalar index, so '%f' receives plain numbers
        f.write('%0.2f,%0.2f\n'%(anchors[last,0],anchors[last,1]))
 
        f.write('%f\n'%(avg_IOU(X,centroids)))
    print()
 
def kmeans(X,centroids,eps,anchor_file,input_shape,yolo_version):
    
    N = X.shape[0] # number of ground truth boxes
    print("centroids.shape",centroids.shape)
    k,dim = centroids.shape  # k anchors, each with (w,h), so dim is 2
    prev_assignments = np.ones(N)*(-1)    # initial cluster label for every ground truth box
    iteration = 0
    old_D = np.zeros((N,k))  # distances from the previous iteration
 
    while True:
        D = []
        iteration+=1
        for i in range(N):
            d = 1 - IOU(X[i],centroids)
            D.append(d)
        D = np.array(D) # D.shape = (N,k): distance (1 - IoU) of every ground truth box to every anchor
        
        print("iter {}: dists = {}".format(iteration,np.sum(np.abs(old_D-D))))  # total change in distance since the previous iteration
            
        #assign samples to centroids 
        assignments = np.argmin(D,axis=1)  # index of the closest anchor for every ground truth box
        
        if (assignments == prev_assignments).all() :  # assignments unchanged: write out the anchors and the average IoU
            print("Centroids = ",centroids)
            write_anchors_to_file(centroids,X,anchor_file,input_shape,yolo_version)
            return
 
        #calculate new centroids
        centroid_sums=np.zeros((k,dim),float)   # np.float was removed from NumPy; the builtin float is equivalent
        for i in range(N):
            centroid_sums[assignments[i]]+=X[i]         # accumulate w and h of the boxes in each cluster
        for j in range(k):            # mean w,h of each cluster
            count = np.sum(assignments==j)
            if count > 0:             # an empty cluster keeps its previous centroid
                centroids[j] = centroid_sums[j]/count
        
        prev_assignments = assignments.copy()     
        old_D = D.copy()  
 
def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('-filelist', default = r'E:\BaiduNetdiskDownload\darknetHG8245\scripts\train.txt',
                        help='path to filelist\n' )
    parser.add_argument('-output_dir', default = r'E:\BaiduNetdiskDownload\darknetHG8245', type = str,
                        help='Output anchor directory\n' )
    parser.add_argument('-num_clusters', default = 0, type = int, 
                        help='number of clusters\n' )
    # Note: the yolov2 anchors are small because they are relative to the feature map,
    # while the yolov3 anchors are larger because they are relative to the original image,
    # so the outputs for yolov2 and yolov3 differ.
    parser.add_argument('-yolo_version', default='yolov2', type=str,
                        help='yolov2 or yolov3\n')
    parser.add_argument('-yolo_input_shape', default=416, type=int,
                        help='input image size, a multiple of 32, e.g. 416 for 416*416\n')
    args = parser.parse_args()
    
    if not os.path.exists(args.output_dir):
        os.mkdir(args.output_dir)
 
    with open(args.filelist) as f:
        lines = [line.rstrip('\n') for line in f.readlines()]
    
    annotation_dims = []
 
    for line in lines:
        # derive the label file path from the image path
        label_path = line.replace('JPEGImages','labels')
        label_path = label_path.replace('.jpg','.txt')
        label_path = label_path.replace('.png','.txt')
        print(label_path)
        with open(label_path) as f2:
            for row in f2.readlines():
                row = row.rstrip('\n')
                w,h = row.split(' ')[3:]   # each row is: class x y w h
                annotation_dims.append((float(w),float(h)))
    annotation_dims = np.array(annotation_dims) # (w,h) of every ground truth box
  
    eps = 0.005
 
    if args.num_clusters == 0:
        for num_clusters in range(1,11): #we make 1 through 10 clusters 
            anchor_file = join( args.output_dir,'anchors%d.txt'%(num_clusters))
 
            indices = [ random.randrange(annotation_dims.shape[0]) for i in range(num_clusters)]
            centroids = annotation_dims[indices]
            kmeans(annotation_dims,centroids,eps,anchor_file,args.yolo_input_shape,args.yolo_version)
            print('centroids.shape', centroids.shape)
    else:
        anchor_file = join( args.output_dir,'anchors%d.txt'%(args.num_clusters))
        indices = [ random.randrange(annotation_dims.shape[0]) for i in range(args.num_clusters)]
        centroids = annotation_dims[indices]
        kmeans(annotation_dims,centroids,eps,anchor_file,args.yolo_input_shape,args.yolo_version)
        print('centroids.shape', centroids.shape)
 
if __name__=="__main__":
    main(sys.argv)

Summary of innovations

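Batch normalization: adding BN layers to YOLOv2 raises mAP by 2%.
YOLOv1 is also fine-tuned from an ImageNet-pretrained model, but pretraining uses a 224 x 224 input while fine-tuning uses 448 x 448, so the pretrained network and the detection network see images at incompatible resolutions. YOLOv2 pretrains directly with a 448 x 448 input and only then trains on the detection task, which improves results by 3.7%.
To improve small-object detection, YOLOv2 reduces the number of pooling layers so that the final feature map is larger. With a 416 x 416 input the output is 13 x 13 x 125, where 13 x 13 is the final feature map, i.e. the number of grid cells the image is divided into, and 125 is the set of bounding boxes in each cell (5 x (classes + 5)). Note that the feature map size depends on the input size, but it must be odd, which guarantees that a single central cell covers an object at the center of the image.
Predicting offsets instead of coordinate values simplifies the problem and makes it easier for the network to learn, which is the advantage of using anchors. As for how many anchors to place in each cell (the value of k), the authors ran k-means offline on the shapes and scales of the objects in the VOC and COCO datasets. They found that with k = 5, using five fixed aspect ratios, the anchor shapes and scales come closest to the object shapes in VOC and COCO. (Introducing anchors, and using k-means to determine their number and shapes, are two separate innovations.)
New backbone network: mAP does not improve significantly, but the amount of computation is reduced:
Fine-grained features are strengthened (the passthrough layer); my personal understanding is that this works like the skip connections in ResNet.
The Darknet-19 network used in YOLOv2 contains only convolutional and pooling layers, so it places no restriction on the input image size. YOLOv2 is trained with multi-scale input: every 10 batches a new input size is chosen at random. Because the total downsampling stride of Darknet-19 is 32, the input size is chosen from multiples of 32, {320,352,…,608}. With Multi-Scale Training the network adapts to inputs of different sizes: low-resolution input gives a slightly lower mAP but runs faster, while high-resolution input gives a higher mAP at lower speed.
The paper proposes a better scheme for anchor regression. This part is fairly involved, so I refer to a separate, very thorough article for the details. The idea is that the anchor regression in Faster R-CNN is unconstrained, so an anchor may end up predicting a target box far away from it, which is inefficient. The authors argue that each anchor should only be responsible for target boxes within about one unit around it.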
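The constraint described in the last item corresponds to what the YOLOv2 paper calls direct location prediction: $b_x=\sigma(t_x)+c_x$, $b_y=\sigma(t_y)+c_y$, $b_w=p_w e^{t_w}$, $b_h=p_h e^{t_h}$, where $(c_x,c_y)$ is the top-left corner of the grid cell and $(p_w,p_h)$ is the anchor size. The sketch below decodes one box under those formulas. The function name, the cell position and the anchor size are made up for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(t, cell, anchor):
    """Decode raw outputs (tx, ty, tw, th) into a box in grid-cell units.

    cell   = (cx, cy): top-left corner of the grid cell
    anchor = (pw, ph): prior width and height, also in grid-cell units
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx      # center x stays inside [cx, cx + 1]
    by = sigmoid(ty) + cy      # center y stays inside [cy, cy + 1]
    bw = pw * math.exp(tw)     # width is a scaling of the anchor width
    bh = ph * math.exp(th)     # height is a scaling of the anchor height
    return bx, by, bw, bh

grid = 416 // 32               # 13 cells per side for a 416 x 416 input
center_cell = (grid // 2, grid // 2)

# Zero offsets: the box sits in the middle of the cell with the anchor's size.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell=center_cell, anchor=(3.0, 4.0))

# Even extreme raw values cannot move the center out of its cell.
far = decode_box((50.0, -50.0, 0.0, 0.0), cell=center_cell, anchor=(3.0, 4.0))
print(box, far)
```

Because the sigmoid bounds the offset to the unit interval, each anchor only ever predicts centers inside its own cell, which is what keeps the regression local.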

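Beyond this, YOLO9000, an instance of YOLO_v2, also makes practical innovations in classification over a very large number of classes (9000). It is worth a look if you are interested.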

3. YOLO_v3

Two project links:

https://github.com/qqwweee/keras-yolo3
https://github.com/wizyoung/YOLOv3_TensorFlow

It continues the approach of v2 and keeps patching and refining:

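A brief word on the network structure: as in v2, the backbone of yolo_v3 shrinks the output feature map to 1/32 of the input, so the input image size is generally required to be a multiple of 32. This can be seen by comparing the backbones of v2 and v3 (DarkNet-19 and DarkNet-53).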

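The outputs y1, y2 and y3 all have a depth of 255, and their side lengths follow the pattern 13:26:52. COCO has 80 classes, so each box must output one probability per class. YOLO v3 predicts 3 boxes per grid cell, and each box needs the five basic parameters (x, y, w, h, confidence) plus the 80 class probabilities. Hence 3*(5 + 80) = 255.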
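The three output shapes can be derived in a couple of lines. The sketch assumes a 416 x 416 input and downsampling strides of 32, 16 and 8 for y1, y2 and y3; neither is stated explicitly above, but both are implied by the 13:26:52 side lengths. The function name is illustrative.

```python
def yolo_v3_output_shapes(input_size=416, num_classes=80, boxes_per_cell=3):
    """Return the (height, width, depth) of the three YOLO v3 outputs."""
    depth = boxes_per_cell * (5 + num_classes)
    strides = (32, 16, 8)  # assumed strides for y1, y2, y3
    return [(input_size // s, input_size // s, depth) for s in strides]

shapes = yolo_v3_output_shapes()
print(shapes)
```

Changing `num_classes` shows how the depth follows the dataset: a 20-class dataset would give 3*(5 + 20) = 75 channels.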

9 is the number of prior boxes suggested by the authors' clustering. When tiny-darknet is used instead, it is changed to 6.

Loss function

xy_loss = object_mask * box_loss_scale * K.binary_crossentropy(raw_true_xy, raw_pred[..., 0:2], from_logits=True) 
wh_loss = object_mask * box_loss_scale * 0.5 * K.square(raw_true_wh - raw_pred[..., 2:4]) 
confidence_loss = (object_mask * K.binary_crossentropy(object_mask, raw_pred[..., 4:5], from_logits=True)
                   + (1 - object_mask) * K.binary_crossentropy(object_mask, raw_pred[..., 4:5], from_logits=True) * ignore_mask)
class_loss = object_mask * K.binary_crossentropy(true_class_probs, raw_pred[..., 5:], from_logits=True) 
xy_loss = K.sum(xy_loss) / mf 
wh_loss = K.sum(wh_loss) / mf 
confidence_loss = K.sum(confidence_loss) / mf 
class_loss = K.sum(class_loss) / mf 
loss += xy_loss + wh_loss + confidence_loss + class_loss
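To make the confidence term easier to follow, here is a NumPy sketch of the same computation on three cells. It assumes that `K.binary_crossentropy(..., from_logits=True)` is the element-wise sigmoid cross-entropy, and that `ignore_mask` is 0 for no-object predictions that already overlap a ground truth box well and 1 otherwise. Both are my reading of the code above, and the sample values are made up.

```python
import numpy as np

def bce_from_logits(target, logits):
    """Element-wise sigmoid cross-entropy in its numerically stable form."""
    return np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits)))

def confidence_loss(object_mask, ignore_mask, conf_logits):
    bce = bce_from_logits(object_mask, conf_logits)
    # Cells with an object are always penalized; cells without an object
    # are penalized only where ignore_mask is 1.
    return object_mask * bce + (1 - object_mask) * bce * ignore_mask

object_mask = np.array([1.0, 0.0, 0.0])   # cell 0 holds an object
ignore_mask = np.array([0.0, 1.0, 0.0])   # cell 2 is excluded from the loss
conf_logits = np.array([2.0, -2.0, 3.0])  # raw confidence predictions

loss = confidence_loss(object_mask, ignore_mask, conf_logits)
print(loss)
```

Cell 2 predicts an object confidently where none is labelled, yet contributes nothing to the loss because it is masked out.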

Reference for the v3 section: http://www.javashuo.com/article/p-cykvmldc-gc.html

YOLO v1 output tensor shape: $S \times S \times (B \times 5 + C)$