Object detection models fall into two categories, two-stage and one-stage; the main one-stage representatives are the YOLO series and SSD. These are brief notes from studying the YOLO series.
1. Yolo V1
Yolo V1 was proposed in the 2015 paper You Only Look Once: Unified, Real-Time Object Detection and is the pioneering work of one-stage object detection. Its network architecture is as follows: 24 convolutional layers followed by two fully connected layers; note that the last fully connected layer can be understood as a linear transformation from 1*4096 to 1*1470 (7*7*30).
Understanding Yolo V1 comes down to three points:
1.1 Grid division: the input image is 448*448, and YOLO divides it into 49 (7*7) cells. Each cell is responsible for predicting only one object box: if an object's center point falls inside a cell, that cell is responsible for predicting that object.
1.2 Prediction output: the final network output is 7*7*30, which can be viewed as 49 vectors of size 1*30, each composed as (x, y, w, h, confidence) * 2 + 20; that is, each vector predicts two bounding boxes with their confidences, plus the probabilities of the object belonging to each of 20 classes (the VOC dataset has 20 classes).
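As an illustration (not from any official implementation), a minimal sketch of slicing one 7*7*30 output tensor into its components, assuming the per-cell layout [x, y, w, h, c, x, y, w, h, c, 20 class probabilities]:

import torch

def split_v1_output(pred):
    """Slice a YOLO v1 output grid (7, 7, 30) into boxes, confidences, class probs."""
    boxes = pred[..., :10].reshape(7, 7, 2, 5)  # two (x, y, w, h, conf) tuples per cell
    box_xywh = boxes[..., :4]                   # (7, 7, 2, 4)
    box_conf = boxes[..., 4]                    # (7, 7, 2)
    class_probs = pred[..., 10:]                # (7, 7, 20)
    return box_xywh, box_conf, class_probs

xywh, conf, cls_probs = split_v1_output(torch.rand(7, 7, 30))
print(xywh.shape, conf.shape, cls_probs.shape)  # (7,7,2,4) (7,7,2) (7,7,20)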
1.3 Understanding the loss function: the loss function is shown in the figure below; the following concepts need to be clear:
S^2: the final network output is 7*7*30, hence 49 cells (S=7);
B: each cell (1*30) predicts two bboxes, so B=2; only the bbox with the largest IOU against the ground truth participates in the computation;
The 7*7 positive mask 𝟙_ij^obj: when the grid is divided, a cell containing a ground-truth center point gets value 1; only cells with value 1 participate in the computation;
The 7*7 negative mask 𝟙_ij^noobj: the complement of the positive mask.
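Since the figure does not reproduce here, the full loss from the YOLO v1 paper, in the symbols above, is:

\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}

with \lambda_{coord}=5 and \lambda_{noobj}=0.5. The three parts below walk through it term by term.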
(1) Coordinate loss: the first part of the loss is the loss on the predicted bbox coordinates, as shown below. Two points to note: first, the square roots of the width and height are used, which damps the loss of large objects and balances the loss between small and large objects; second, a weight factor of 5 is applied, because very few positive samples participate in the computation (e.g., only three cells' coordinates in the 7*7 mask above), so their weight is increased.
(2) Confidence loss: the second part is the confidence loss of the positive- and negative-sample bboxes, as shown below. Note the ground-truth confidence: for a positive sample it is IOU(pred, gt) * 1; for a negative sample it is IOU * 0 = 0. Also, since negatives far outnumber positives, the negatives are given a weight factor of 0.5.
(3) Classification loss: the third part is the loss on the predicted class, as shown below. The predicted values are the 20 class scores in the network output, and the ground truth is the one-hot encoding of the labeled class (a 20-way classification: class five, for instance, is encoded as 00001000000000000000).
Main characteristics of Yolo V1
Advantages:
(1) one-stage, fast
Disadvantages:
(1) Cannot detect crowded objects (when the grid is divided, each cell predicts only one object)
(2) Detects small objects poorly, and generalizes poorly to objects with unseen aspect ratios
(3) The network does not use batch normalization
Below is a PyTorch implementation of the Yolo V1 network and loss computation (untested, for understanding only):
import torch
import torch.nn as nn
from torch.nn import functional
from torch.autograd import Variable
import torchvision.models as models
import deepdish as dd  # used below to load the .h5 bbox file


class YoloLoss(nn.Module):
    def __init__(self, n_batch, B, C, lambda_coord, lambda_noobj, use_gpu=False):
        """
        :param n_batch: number of batches
        :param B: number of bounding boxes
        :param C: number of bounding classes
        :param lambda_coord: factor for loss which contain objects
        :param lambda_noobj: factor for loss which do not contain objects
        """
        super(YoloLoss, self).__init__()
        self.n_batch = n_batch
        self.B = B  # assume there are two bounding boxes
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj
        self.use_gpu = use_gpu

    def compute_iou(self, bbox1, bbox2):
        """
        Compute the intersection over union of two sets of boxes, each box is [x1,y1,w,h]
        :param bbox1: (tensor) bounding boxes, size [N,4]
        :param bbox2: (tensor) bounding boxes, size [M,4]
        :return:
        """
        # compute [x1,y1,x2,y2] w.r.t. top left and bottom right coordinates separately
        b1x1y1 = bbox1[:, :2] - bbox1[:, 2:] ** 2  # [N, (x1,y1)=2]
        b1x2y2 = bbox1[:, :2] + bbox1[:, 2:] ** 2  # [N, (x2,y2)=2]
        b2x1y1 = bbox2[:, :2] - bbox2[:, 2:] ** 2  # [M, (x1,y1)=2]
        b2x2y2 = bbox2[:, :2] + bbox2[:, 2:] ** 2  # [M, (x2,y2)=2]
        box1 = torch.cat((b1x1y1.view(-1, 2), b1x2y2.view(-1, 2)), dim=1)  # [N,4], 4=[x1,y1,x2,y2]
        box2 = torch.cat((b2x1y1.view(-1, 2), b2x2y2.view(-1, 2)), dim=1)  # [M,4], 4=[x1,y1,x2,y2]
        N = box1.size(0)
        M = box2.size(0)

        tl = torch.max(
            box1[:, :2].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, :2].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        br = torch.min(
            box1[:, 2:].unsqueeze(1).expand(N, M, 2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:, 2:].unsqueeze(0).expand(N, M, 2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        wh = br - tl  # [N,M,2]
        wh[(wh < 0).detach()] = 0
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        area1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])  # [N,]
        area2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]

        iou = inter / (area1 + area2 - inter)
        return iou

    def forward(self, pred_tensor, target_tensor):
        """
        :param pred_tensor: [batch,SxSx(Bx5+20)]
        :param target_tensor: [batch,S,S,Bx5+20]
        :return: total loss
        """
        n_elements = self.B * 5 + self.C
        batch = target_tensor.size(0)
        target_tensor = target_tensor.view(batch, -1, n_elements)
        pred_tensor = pred_tensor.view(batch, -1, n_elements)
        coord_mask = target_tensor[:, :, 5] > 0
        noobj_mask = target_tensor[:, :, 5] == 0
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)
        coord_target = target_tensor[coord_mask].view(-1, n_elements)
        coord_pred = pred_tensor[coord_mask].view(-1, n_elements)
        class_pred = coord_pred[:, self.B * 5:]
        class_target = coord_target[:, self.B * 5:]
        box_pred = coord_pred[:, :self.B * 5].contiguous().view(-1, 5)
        box_target = coord_target[:, :self.B * 5].contiguous().view(-1, 5)
        noobj_target = target_tensor[noobj_mask].view(-1, n_elements)
        noobj_pred = pred_tensor[noobj_mask].view(-1, n_elements)

        # compute loss for cells which do not contain objects
        if self.use_gpu:
            noobj_target_mask = torch.cuda.ByteTensor(noobj_target.size())
        else:
            noobj_target_mask = torch.ByteTensor(noobj_target.size())
        noobj_target_mask.zero_()
        for i in range(self.B):
            noobj_target_mask[:, i * 5 + 4] = 1
        noobj_target_c = noobj_target[noobj_target_mask]  # only compute loss of c, size [2*B*noobj_target.size(0)]
        noobj_pred_c = noobj_pred[noobj_target_mask]
        noobj_loss = functional.mse_loss(noobj_pred_c, noobj_target_c, size_average=False)

        # compute loss for cells which contain objects
        if self.use_gpu:
            coord_response_mask = torch.cuda.ByteTensor(box_target.size())
            coord_not_response_mask = torch.cuda.ByteTensor(box_target.size())
        else:
            coord_response_mask = torch.ByteTensor(box_target.size())
            coord_not_response_mask = torch.ByteTensor(box_target.size())
        coord_response_mask.zero_()
        coord_not_response_mask = ~coord_not_response_mask.zero_()
        for i in range(0, box_target.size()[0], self.B):
            box1 = box_pred[i:i + self.B]
            box2 = box_target[i:i + self.B]
            iou = self.compute_iou(box1[:, :4], box2[:, :4])
            max_iou, max_index = iou.max(0)
            if self.use_gpu:
                max_index = max_index.data.cuda()
            else:
                max_index = max_index.data
            coord_response_mask[i + max_index] = 1
            coord_not_response_mask[i + max_index] = 0

        # 1. response loss
        box_pred_response = box_pred[coord_response_mask].view(-1, 5)
        box_target_response = box_target[coord_response_mask].view(-1, 5)
        contain_loss = functional.mse_loss(box_pred_response[:, 4], box_target_response[:, 4], size_average=False)
        loc_loss = functional.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], size_average=False) + \
                   functional.mse_loss(box_pred_response[:, 2:4], box_target_response[:, 2:4], size_average=False)

        # 2. not response loss
        box_pred_not_response = box_pred[coord_not_response_mask].view(-1, 5)
        box_target_not_response = box_target[coord_not_response_mask].view(-1, 5)

        # compute class prediction loss
        class_loss = functional.mse_loss(class_pred, class_target, size_average=False)

        # compute total loss
        total_loss = self.lambda_coord * loc_loss + contain_loss + self.lambda_noobj * noobj_loss + class_loss
        return total_loss


def test():
    voc = False
    vot = 1 - voc
    if voc:
        img_folder = '../codedata/voc2012train/JPEGImages'
        file = '../voc2012.txt'
        img_size = 448
        train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size, S=7, B=2, C=20,
                                    transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        print(target.size())
        target = Variable(target)
        img = Variable(img)
        net = YOLO_V1()
        pred = net(img)
        yololoss = YoloLoss(n_batch=2, B=2, C=20, lambda_coord=5, lambda_noobj=0.5)
        print(pred.size())
        print(target.size())
        loss = yololoss(pred, target)
        print(loss)
    if vot:
        img_folder = './small_train_dataset'
        bboxes = dd.io.load('girl_bbox_4dim.h5')
        learning_rate = 0.0005
        img_size = 224
        num_epochs = 2
        lambda_coord = 5
        lambda_noobj = .5
        n_batch = 5
        S = 7
        B = 2
        C = 1
        train_dataset = VotDataset(img_folder=img_folder, bboxes=bboxes, img_size=img_size, S=S, B=B, C=C,
                                   transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=n_batch, shuffle=False, num_workers=2)
        yololoss = YoloLoss(n_batch=n_batch, B=B, C=C, lambda_coord=5, lambda_noobj=0.5)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        target = Variable(target)
        img = Variable(img)
        model = models.vgg16(pretrained=True)
        model.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 11 * 7 * 7),
            nn.Sigmoid(),
        )
        model.train()
        loss_fn = YoloLoss(n_batch, B, C, lambda_coord, lambda_noobj)
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-4)
        use_gpu = False
        for epoch in range(num_epochs):
            for i, (images, target) in enumerate(train_loader):
                images = Variable(images)
                target = Variable(target)
                if use_gpu:
                    images, target = images.cuda(), target.cuda()
                pred = model(images)
                print(pred.size())
                print(target.size())
                loss = loss_fn(pred, target)
                print(i + 1, loss)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if i == 10:
                    break
            break


if __name__ == '__main__':
    from own_yolo_v1.network import *
    from own_yolo_v1.load_dataset import *
    test()
import torch.nn as nn


class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        return x.view(x.size(0), -1)


class YOLO_V1(nn.Module):
    def __init__(self):
        super(YOLO_V1, self).__init__()
        C = 20  # number of classes
        print("\n------Initiating YOLO v1------\n")
        self.conv_layer1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=7//2),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(192),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer3 = nn.Sequential(
            nn.Conv2d(in_channels=192, out_channels=128, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer4 = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer5 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=2, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        )
        self.conv_layer6 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1)
        )
        self.flatten = Flatten()
        self.conn_layer1 = nn.Sequential(
            nn.Linear(in_features=7*7*1024, out_features=4096),
            nn.Dropout(),
            nn.LeakyReLU(0.1)
        )
        self.conn_layer2 = nn.Sequential(nn.Linear(in_features=4096, out_features=7 * 7 * (2 * 5 + C)))

    def forward(self, input):
        conv_layer1 = self.conv_layer1(input)
        conv_layer2 = self.conv_layer2(conv_layer1)
        conv_layer3 = self.conv_layer3(conv_layer2)
        conv_layer4 = self.conv_layer4(conv_layer3)
        conv_layer5 = self.conv_layer5(conv_layer4)
        conv_layer6 = self.conv_layer6(conv_layer5)
        flatten = self.flatten(conv_layer6)
        conn_layer1 = self.conn_layer1(flatten)
        output = self.conn_layer2(conn_layer1)
        return output


'''
def test():
    from own_yolo_v1.load_dataset import *
    from torch.autograd import Variable
    img_folder = '../codedata/voc2012train/JPEGImages'
    file = '../voc2012.txt'
    img_size = 448
    train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size,
                                transforms=[transforms.ToTensor()])
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
    train_iter = iter(train_loader)
    img, target = next(train_iter)
    img = Variable(img)
    net = YOLO_V1()
    output = net(img)
    print(output.size())


if __name__ == '__main__':
    test()
'''
2. Yolo V2
Yolo V2 was proposed in the 2016 paper YOLO9000: Better, Faster, Stronger. It adopts a new network model called Darknet-19, consisting of 19 convolutional layers and 5 max-pooling layers, which cuts the computation by roughly 33% relative to Yolo V1. Its structure is as follows:
The structure pretrained on ImageNet:
The model structure when training on the detection task (introducing feature fusion across scales):
Yolo V2 makes five main improvements over Yolo V1:
(1) Add Batch Normalization and remove dropout
(2) High resolution classifier
(3) Introduce anchors
(4) Fine-grained features (fusing low-level and high-level features)
(5) Multi-scale training (training on images of different sizes)
2.1 High resolution classifier (+4% mAP)
In Yolo V1 the classifier is pretrained on the ImageNet dataset (224*224), while detection uses 448*448 images, so the network has to adapt to the new size. Yolo V2 therefore adds a finetuning step; the procedure is:
a. Pretrain the classifier on ImageNet (224*224) for about 160 epochs
b. Resize the ImageNet images to 448*448 and finetune for another 10 epochs, letting the model adapt to large images
c. Starting from the pretrained weights above, finetune on the actual dataset (416*416); the final output is 13*13
2.2 Anchors
Borrowing the anchor idea from Faster R-CNN, Yolo V2 runs k-means clustering on the widths and heights of the objects in the VOC dataset (and the COCO dataset) and obtains 5 cluster centers, which are taken as the widths and heights of 5 anchors (the clustering distance metric is distance = 1 - IOU(bbox, cluster)):
COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)
Each grid cell thus corresponds to 5 anchors of different widths and heights, as shown in the figure below. (The widths and heights above are relative to the grid cell; multiply by 32 for the actual sizes.) A sketch of the clustering itself follows.
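This is not the authors' code; a minimal sketch of k-means with the 1 - IOU distance, assuming boxes is an (N, 2) NumPy array of (w, h) pairs already expressed in grid units:

import numpy as np

def wh_iou(boxes, clusters):
    """IOU between (w, h) pairs, treating all boxes as if they shared one center.
    boxes: (N, 2), clusters: (k, 2) -> (N, k)."""
    inter = np.minimum(boxes[:, None, 0], clusters[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + clusters[None, :, 0] * clusters[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster box shapes with distance = 1 - IOU(bbox, cluster).
    Assumes every cluster keeps at least one box (true for realistic datasets)."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1 - wh_iou(boxes, clusters), axis=1)  # nearest cluster per box
        new = np.array([np.median(boxes[assign == i], axis=0) for i in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters

Taking the median of each cluster (a common choice in reimplementations) keeps the centers robust to outlier boxes.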
On computing the predicted bboxes (taking 416*416 -> 13*13 as the example):
(1) The input image is 416*416 and the final output is 13*13*125, where 125 means 5*(5 + 20): 5 is the number of anchors, and 25 stands for [x, y, w, h, confidence] plus 20 class probabilities; that is, each anchor predicts one such group of values.
(2) Of the 25 values each anchor predicts, x and y are offsets relative to the top-left corner of the grid cell, squashed into 0-1 by a sigmoid. For a 13*13 grid, take the cell with index (6, 6): if the predicted x, y pass through the sigmoid to give xoffset, yoffset, the actual center is x = 6 + xoffset, y = 6 + yoffset; since 0 < xoffset < 1 and 0 < yoffset < 1, the predicted center always stays inside cell (6, 6). The predicted w, h are relative to the anchor: they are exponentiated and multiplied by the anchor's (w, h) to give the box size.
(3) Since the above values are at the 13*13 scale, multiply by the downsampling factor 32 to map them back to the actual image size.
In code, the computation looks roughly like this:
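The original post's snippet did not survive here; below is a minimal sketch of the standard Yolo V2 decoding (b_x = sigmoid(t_x) + c_x, b_w = p_w * exp(t_w), etc.), assuming anchors is a (5, 2) tensor in grid units (torch.meshgrid with indexing= needs PyTorch >= 1.10):

import torch

def decode_v2(pred, anchors, stride=32):
    """Decode a YOLO v2 output map (batch, 5*(5+nc), g, g) into boxes in input pixels."""
    b, _, g, _ = pred.shape
    a = anchors.shape[0]
    pred = pred.view(b, a, -1, g, g)                     # (batch, 5 anchors, 5+nc, g, g)
    grid = torch.arange(g, dtype=torch.float32)
    cy, cx = torch.meshgrid(grid, grid, indexing="ij")   # per-cell row/column indices
    x = (torch.sigmoid(pred[:, :, 0]) + cx) * stride     # center x in pixels
    y = (torch.sigmoid(pred[:, :, 1]) + cy) * stride     # center y in pixels
    w = torch.exp(pred[:, :, 2]) * anchors[:, 0].view(1, a, 1, 1) * stride
    h = torch.exp(pred[:, :, 3]) * anchors[:, 1].view(1, a, 1, 1) * stride
    conf = torch.sigmoid(pred[:, :, 4])
    cls_probs = torch.softmax(pred[:, :, 5:], dim=2)     # v2 uses a softmax over classes
    return x, y, w, h, conf, cls_probs

anchors = torch.tensor([[1.3221, 1.73145], [3.19275, 4.00944], [5.05587, 8.09892],
                        [9.47112, 4.84053], [11.2364, 10.0071]])  # the VOC anchors above
out = decode_v2(torch.rand(1, 125, 13, 13), anchors)
print([t.shape for t in out])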
2.3 Fine-Grained Features
In the network architecture above one can see a shortcut that concatenates a low-level feature map (26*26*512) with the final output feature map (13*13*1024), fusing low-level positional features with high-level semantic features. Because the 26*26 scale is larger, the network uses a Reorg layer to reshape it to 13*13, as shown in the figure below. A sketch of the reorg operation follows.
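A minimal sketch of the reorg as a space-to-depth transform (the exact element ordering in Darknet's reorg differs slightly, but the shape change is the same):

import torch

def reorg(x, stride=2):
    """Space-to-depth: move stride x stride spatial blocks into channels,
    (N, C, H, W) -> (N, C*stride*stride, H//stride, W//stride)."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

print(reorg(torch.rand(1, 512, 26, 26)).shape)  # torch.Size([1, 2048, 13, 13])

The resulting 13*13*2048 map can then be concatenated with the 13*13*1024 map along the channel dimension.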
2.4 Multi-scale Training
In the architecture above, the final layer (Conv22) is a 1*1*125 convolution that replaces the fully connected layer, so the network can handle input images of any size. During training, every 10 batches (the paper's schedule) the authors pick a new input size from [320*320, 352*352, ..., 608*608] (all multiples of 32, since the output is downsampled by 32x) to make the model more robust. (With a 416*416 input the output is 13*13*125; with a 320*320 input it is 10*10*125.)
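A minimal sketch of that schedule inside a training loop (not the original code; the real implementation resizes inside the data loader):

import random
import torch
import torch.nn.functional as F

SIZES = list(range(320, 609, 32))  # 320, 352, ..., 608: all multiples of 32

size = 416
for batch_i in range(100):
    imgs = torch.rand(8, 3, 416, 416)  # stand-in for a loaded batch
    if batch_i % 10 == 0:
        size = random.choice(SIZES)    # pick a new input size every 10 batches
    imgs = F.interpolate(imgs, size=size, mode="nearest")
    # ... forward pass / loss / optimizer step ...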
Characteristics of Yolo V2:
(1) Uses the Darknet-19 backbone, with fewer layers than Yolo V1 and no fully connected layers, so less computation and a faster model;
(2) Replacing the fully connected layers with convolutions removes the restriction on input size, and multi-scale training makes detection more robust across image scales;
(3) Each cell predicts with 5 anchor boxes, which is more effective for crowded and small objects.
3. Yolo 9000
Yolo 9000 was proposed in the same paper as Yolo V2. Built on top of YOLOv2, it is a model that can detect more than 9000 categories; its main contribution is a joint training strategy for classification and detection. For details see: https://zhuanlan.zhihu.com/p/35325884
4. Yolo V3
Yolo V3 was proposed in the 2018 paper YOLOv3: An Incremental Improvement. Its network is DarkNet-53, shown in the figure below (it borrows ideas from ResNet and FPN). Each grid cell in Yolo V3 predicts 3 anchor boxes; each box needs the five basic values (x, y, w, h, confidence) plus the probabilities of 80 classes (COCO dataset), hence 3*(5 + 80) = 255. (The depths of y1, y2, y3 are all 255.)
Compared with ResNet, the residual structure in Darknet looks as follows:
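A minimal sketch of the Darknet-53 residual unit (my reading of the structure, not the original code): a 1*1 convolution halves the channels, a 3*3 convolution restores them, and the input is added back via an identity shortcut:

import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1),
            nn.Conv2d(half, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # identity shortcut, no activation after the add

print(DarknetResidual(64)(torch.rand(1, 64, 52, 52)).shape)  # torch.Size([1, 64, 52, 52])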
Following the FPN idea, feature maps at different scales are fused, and predictions are made at each scale, as shown in the figure below:
Like v2, the yolo_v3 backbone downsamples the feature map to 1/32 of the input, so the input image is usually required to be a multiple of 32. A comparison of Yolo V3's DarkNet-53 and Yolo V2's DarkNet-19 is shown in the figure below:
A reference PyTorch implementation of Yolo V3:
from __future__ import division

from models import *
# from utils.logger import *
from utils.utils import *
from utils.datasets import *
from utils.parse_config import *
from test import evaluate

# from terminaltables import AsciiTable

import os
import sys
import time
import datetime
import argparse

import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms
from torch.autograd import Variable
import torch.optim as optim

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100, help="number of epochs")
    parser.add_argument("--batch_size", type=int, default=8, help="size of each image batch")
    parser.add_argument("--gradient_accumulations", type=int, default=2, help="number of gradient accums before step")
    parser.add_argument("--model_def", type=str, default="config/yolov3.cfg", help="path to model definition file")
    parser.add_argument("--data_config", type=str, default="config/coco.data", help="path to data config file")
    parser.add_argument("--pretrained_weights", type=str, help="if specified starts from checkpoint model")
    parser.add_argument("--n_cpu", type=int, default=8, help="number of cpu threads to use during batch generation")
    parser.add_argument("--img_size", type=int, default=416, help="size of each image dimension")
    parser.add_argument("--checkpoint_interval", type=int, default=1, help="interval between saving model weights")
    parser.add_argument("--evaluation_interval", type=int, default=1, help="interval evaluations on validation set")
    parser.add_argument("--compute_map", default=False, help="if True computes mAP every tenth batch")
    parser.add_argument("--multiscale_training", default=True, help="allow for multi-scale training")
    opt = parser.parse_args()
    print(opt)

    # logger = Logger("logs")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # os.makedirs("output", exist_ok=True)
    # os.makedirs("checkpoints", exist_ok=True)

    # Get data configuration
    data_config = parse_data_config(opt.data_config)
    train_path = data_config["train"]
    valid_path = data_config["valid"]
    class_names = load_classes(data_config["names"])

    # Initiate model
    model = Darknet(opt.model_def).to(device)
    model.apply(weights_init_normal)

    # If specified we start from checkpoint
    if opt.pretrained_weights:
        if opt.pretrained_weights.endswith(".pth"):
            model.load_state_dict(torch.load(opt.pretrained_weights))
        else:
            model.load_darknet_weights(opt.pretrained_weights)
    # model = torch.nn.DataParallel(model).cuda()

    # Get dataloader
    dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.n_cpu,
        pin_memory=True,
        collate_fn=dataset.collate_fn,
    )

    optimizer = torch.optim.Adam(model.parameters())

    metrics = [
        "grid_size",
        "loss",
        "x",
        "y",
        "w",
        "h",
        "conf",
        "cls",
        "cls_acc",
        "recall50",
        "recall75",
        "precision",
        "conf_obj",
        "conf_noobj",
    ]

    for epoch in range(opt.epochs):
        model.train()
        start_time = time.time()
        for batch_i, (img_pth, imgs, targets) in enumerate(dataloader):
            batches_done = len(dataloader) * epoch + batch_i
            # img: (batch_size, channel, height, width)
            # target: (num, 6), 6 => (batch_index, cls, center_x, center_y, width, height)
            imgs = Variable(imgs.to(device))
            targets = Variable(targets.to(device), requires_grad=False)

            loss, outputs = model(imgs, targets)
            loss.backward()

            if batches_done % opt.gradient_accumulations:
                # Accumulates gradient before each step
                optimizer.step()
                optimizer.zero_grad()

            # ----------------
            #   Log progress
            # ----------------
            log_str = "\n---- [Epoch %d/%d, Batch %d/%d] ----\n" % (epoch, opt.epochs, batch_i, len(dataloader))
            # metric_table = [["Metrics", *["YOLO Layer {}".format(i) for i in range(len(model.yolo_layers))]]]

            # Log metrics at each YOLO layer
            for i, metric in enumerate(metrics):
                formats = {m: "%.6f" for m in metrics}
                formats["grid_size"] = "%2d"
                formats["cls_acc"] = "%.2f%%"
                row_metrics = [formats[metric] % yolo.metrics.get(metric, 0) for yolo in model.yolo_layers]
                # metric_table += [[metric, *row_metrics]]

                # Tensorboard logging
                tensorboard_log = []
                for j, yolo in enumerate(model.yolo_layers):
                    for name, metric in yolo.metrics.items():
                        if name != "grid_size":
                            tensorboard_log += [("{}_{}".format(name, j + 1), metric)]
                tensorboard_log += [("loss", loss.item())]
                # logger.list_of_scalars_summary(tensorboard_log, batches_done)

            # log_str += AsciiTable(metric_table).table
            log_str += "\nTotal loss {}".format(loss.item())

            # Determine approximate time left for epoch
            epoch_batches_left = len(dataloader) - (batch_i + 1)
            time_left = datetime.timedelta(seconds=epoch_batches_left * (time.time() - start_time) / (batch_i + 1))
            log_str += "\n---- ETA {}".format(time_left)
            print(log_str)

            model.seen += imgs.size(0)
            # if batch_i > 10:
            #     break

        if epoch % opt.evaluation_interval == 0:
            print("\n---- Evaluating Model ----")
            # Evaluate the model on the validation set
            precision, recall, AP, f1, ap_class = evaluate(
                model,
                path=valid_path,
                iou_thres=0.5,
                conf_thres=0.5,
                nms_thres=0.5,
                img_size=opt.img_size,
                batch_size=8,
            )
            evaluation_metrics = [
                ("val_precision", precision.mean()),
                ("val_recall", recall.mean()),
                ("val_mAP", AP.mean()),
                ("val_f1", f1.mean()),
            ]
            # logger.list_of_scalars_summary(evaluation_metrics, epoch)

            # Print class APs and mAP
            ap_table = [["Index", "Class name", "AP"]]
            for i, c in enumerate(ap_class):
                ap_table += [[c, class_names[c], "%.5f" % AP[i]]]
            # print(AsciiTable(ap_table).table)
            print("---- mAP {}".format(AP.mean()))

        if epoch % opt.checkpoint_interval == 0:
            torch.save(model.state_dict(), "checkpoints/yolov3_ckpt_%d.pth" % epoch)
# -*- coding: utf-8 -*-
from __future__ import division

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np

from utils.parse_config import *
from utils.utils import build_targets, to_cpu, non_max_suppression

# import matplotlib.pyplot as plt
# import matplotlib.patches as patches


def create_modules(module_defs):
    """
    Constructs module list of layer blocks from module configuration in module_defs
    """
    hyperparams = module_defs.pop(0)
    output_filters = [int(hyperparams["channels"])]
    module_list = nn.ModuleList()
    for module_i, module_def in enumerate(module_defs):
        modules = nn.Sequential()

        if module_def["type"] == "convolutional":
            bn = int(module_def["batch_normalize"])
            filters = int(module_def["filters"])
            kernel_size = int(module_def["size"])
            pad = (kernel_size - 1) // 2
            modules.add_module(
                "conv_{}".format(module_i),
                nn.Conv2d(
                    in_channels=output_filters[-1],
                    out_channels=filters,
                    kernel_size=kernel_size,
                    stride=int(module_def["stride"]),
                    padding=pad,
                    bias=not bn,
                ),
            )
            if bn:
                modules.add_module("batch_norm_{}".format(module_i), nn.BatchNorm2d(filters, momentum=0.9, eps=1e-5))
            if module_def["activation"] == "leaky":
                modules.add_module("leaky_{}".format(module_i), nn.LeakyReLU(0.1))

        elif module_def["type"] == "maxpool":
            kernel_size = int(module_def["size"])
            stride = int(module_def["stride"])
            if kernel_size == 2 and stride == 1:
                modules.add_module("_debug_padding_{}".format(module_i), nn.ZeroPad2d((0, 1, 0, 1)))
            maxpool = nn.MaxPool2d(kernel_size=kernel_size, stride=stride, padding=int((kernel_size - 1) // 2))
            modules.add_module("maxpool_{}".format(module_i), maxpool)

        elif module_def["type"] == "upsample":
            upsample = Upsample(scale_factor=int(module_def["stride"]), mode="nearest")
            modules.add_module("upsample_{}".format(module_i), upsample)

        elif module_def["type"] == "route":
            layers = [int(x) for x in module_def["layers"].split(",")]
            filters = sum([output_filters[1:][i] for i in layers])
            modules.add_module("route_{}".format(module_i), EmptyLayer())

        elif module_def["type"] == "shortcut":
            filters = output_filters[1:][int(module_def["from"])]
            modules.add_module("shortcut_{}".format(module_i), EmptyLayer())

        elif module_def["type"] == "yolo":
            anchor_idxs = [int(x) for x in module_def["mask"].split(",")]
            # Extract anchors
            anchors = [int(x) for x in module_def["anchors"].split(",")]
            anchors = [(anchors[i], anchors[i + 1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in anchor_idxs]
            num_classes = int(module_def["classes"])
            img_size = int(hyperparams["height"])
            # Define detection layer
            yolo_layer = YOLOLayer(anchors, num_classes, img_size)
            modules.add_module("yolo_{}".format(module_i), yolo_layer)

        # Register module list and number of output filters
        module_list.append(modules)
        output_filters.append(filters)

    return hyperparams, module_list


class Upsample(nn.Module):
    """ nn.Upsample is deprecated """

    def __init__(self, scale_factor, mode="nearest"):
        super(Upsample, self).__init__()
        self.scale_factor = scale_factor
        self.mode = mode

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale_factor, mode=self.mode)
        return x


class EmptyLayer(nn.Module):
    """Placeholder for 'route' and 'shortcut' layers"""

    def __init__(self):
        super(EmptyLayer, self).__init__()


class YOLOLayer(nn.Module):
    """Detection layer"""

    def __init__(self, anchors, num_classes, img_dim=416):
        super(YOLOLayer, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.ignore_thres = 0.5
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        self.obj_scale = 1
        self.noobj_scale = 100
        self.metrics = {}
        self.img_dim = img_dim
        self.grid_size = 0  # grid size

    def compute_grid_offsets(self, grid_size, cuda=True):
        self.grid_size = grid_size
        g = self.grid_size
        FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
        self.stride = self.img_dim / self.grid_size  # downsampling factor
        # Calculate offsets for each grid; grid_x, grid_y: (1, 1, grid, grid)
        self.grid_x = torch.arange(g).repeat(g, 1).view([1, 1, g, g]).type(FloatTensor)
        self.grid_y = torch.arange(g).repeat(g, 1).t().view([1, 1, g, g]).type(FloatTensor)
        # The image is downsampled by `stride`, so the anchors must be scaled down by the same factor
        self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])
        # scaled_anchors has shape (3, 2): 3 anchors, each with a w and an h; the two are split apart below
        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))  # (1, 3, 1, 1)
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))  # (1, 3, 1, 1)

    def forward(self, x, targets=None, img_dim=None):
        # Tensors for cuda support
        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
        ByteTensor = torch.cuda.ByteTensor if x.is_cuda else torch.ByteTensor

        self.img_dim = img_dim   # (img_size)
        num_samples = x.size(0)  # (img_batch)
        grid_size = x.size(2)    # (feature_map_size)

        # x.shape: (batch_size, 255, grid_size, grid_size)
        prediction = (
            x.view(num_samples, self.num_anchors, 5 + self.num_classes, grid_size, grid_size)
            .permute(0, 1, 3, 4, 2)
            .contiguous()
        )
        # prediction.shape: (batch_size, num_anchors, grid_size, grid_size, 85)

        # Get outputs
        # `prediction` holds the raw predictions: each of the grid_size*grid_size cells carries
        # num_anchors (3) anchor boxes; x, y, w, h and pred_conf all have shape
        # (batch_size, num_anchors, grid_size, grid_size)
        x = torch.sigmoid(prediction[..., 0])  # Center x
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred. (batch_size, num_anchors, grid_size, grid_size, cls)

        # If grid size does not match current we compute new offsets
        if grid_size != self.grid_size:
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        # The offsets are per grid cell; each cell has unit length, and the predicted center (x, y)
        # is normalized to (0, 1), so they can be added directly
        pred_boxes[..., 0] = x.data + self.grid_x  # (1, 1, grid, grid)
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w  # (1, 3, 1, 1)
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

        # (batch_size, num_anchors*grid_size*grid_size, 85)
        output = torch.cat(
            (
                # (batch_size, num_anchors*grid_size*grid_size, 4)
                pred_boxes.view(num_samples, -1, 4) * self.stride,  # scale back to the input size
                # (batch_size, num_anchors*grid_size*grid_size, 1)
                pred_conf.view(num_samples, -1, 1),
                # (batch_size, num_anchors*grid_size*grid_size, 80)
                pred_cls.view(num_samples, -1, self.num_classes),
            ),
            -1,
        )

        if targets is None:
            return output, 0
        else:
            # pred_boxes => (batch_size, anchor_num, grid, grid, 4)
            # pred_cls => (batch_size, anchor_num, grid, grid, 80)
            # targets => (num, 6), 6 => (batch_index, cls, center_x, center_y, width, height)
            # scaled_anchors => (3, 2)
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )
            # iou_scores: IOU between the correct predicted boxes and the target boxes; larger IOU, higher score
            # class_mask: 1 where the prediction is correct (correct center cell, best-matching anchor, correct class)
            # obj_mask: 1 for the anchor assigned to each target box (one anchor per target box)
            # noobj_mask: 1 for all anchors whose IOU with every target box is below the threshold
            # tx, ty, tw, th: the coordinates and sizes the model should regress to
            # tcls: class of the target box
            # tconf: target confidence for all anchors
            # iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tconf are all (batch, anchor_num, grid, grid);
            # the predicted x, y, w, h, pred_conf are also (batch, anchor_num, grid, grid);
            # tcls and pred_cls are (batch, anchor_num, grid, grid, num_class)

            # Loss: mask outputs to ignore non-existing objects (except with conf. loss)
            # Coordinate and size losses:
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            # Anchor confidence losses:
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])        # tconf[obj_mask] is all ones
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])  # tconf[noobj_mask] is all zeros
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            # Class loss
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
            # Total loss
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()
            conf_obj = pred_conf[obj_mask].mean()
            conf_noobj = pred_conf[noobj_mask].mean()
            conf50 = (pred_conf > 0.5).float()
            iou50 = (iou_scores > 0.5).float()
            iou75 = (iou_scores > 0.75).float()
            detected_mask = conf50 * class_mask * tconf
            obj_mask = obj_mask.float()
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }
            return output, total_loss


class Darknet(nn.Module):
    """YOLOv3 object detection model"""

    def __init__(self, config_path, img_size=416):
        super(Darknet, self).__init__()
        self.module_defs = parse_model_config(config_path)
        self.hyperparams, self.module_list = create_modules(self.module_defs)
        self.yolo_layers = [layer[0] for layer in self.module_list if hasattr(layer[0], "metrics")]
        self.img_size = img_size
        self.seen = 0
        self.header_info = np.array([0, 0, 0, self.seen, 0], dtype=np.int32)

    def forward(self, x, targets=None):
        img_dim = x.shape[2]
        loss = 0
        layer_outputs, yolo_outputs = [], []
        for i, (module_def, module) in enumerate(zip(self.module_defs, self.module_list)):
            if module_def["type"] in ["convolutional", "upsample", "maxpool"]:
                x = module(x)
            elif module_def["type"] == "route":
                x = torch.cat([layer_outputs[int(layer_i)] for layer_i in module_def["layers"].split(",")], 1)
            elif module_def["type"] == "shortcut":
                layer_i = int(module_def["from"])
                x = layer_outputs[-1] + layer_outputs[layer_i]
            elif module_def["type"] == "yolo":
                x, layer_loss = module[0](x, targets, img_dim)
                loss += layer_loss
                yolo_outputs.append(x)
            layer_outputs.append(x)
        yolo_outputs = to_cpu(torch.cat(yolo_outputs, 1))
        return yolo_outputs if targets is None else (loss, yolo_outputs)

    def load_darknet_weights(self, weights_path):
        """Parses and loads the weights stored in 'weights_path'"""
        # Open the weights file
        with open(weights_path, "rb") as f:
            header = np.fromfile(f, dtype=np.int32, count=5)  # First five are header values
            self.header_info = header  # Needed to write header when saving weights
            self.seen = header[3]  # number of images seen during training
            weights = np.fromfile(f, dtype=np.float32)  # The rest are weights

        # Establish cutoff for loading backbone weights
        cutoff = None
        if "darknet53.conv.74" in weights_path:
            cutoff = 75

        ptr = 0
        for i, (module_def, module) in enumerate(zip(self.module_defs, self.module_list)):
            if i == cutoff:
                break
            if module_def["type"] == "convolutional":
                conv_layer = module[0]
                if module_def["batch_normalize"]:
                    # Load BN bias, weights, running mean and running variance
                    bn_layer = module[1]
                    num_b = bn_layer.bias.numel()  # Number of biases
                    # Bias
                    bn_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.bias)
                    bn_layer.bias.data.copy_(bn_b)
                    ptr += num_b
                    # Weight
                    bn_w = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.weight)
                    bn_layer.weight.data.copy_(bn_w)
                    ptr += num_b
                    # Running Mean
                    bn_rm = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.running_mean)
                    bn_layer.running_mean.data.copy_(bn_rm)
                    ptr += num_b
                    # Running Var
                    bn_rv = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.running_var)
                    bn_layer.running_var.data.copy_(bn_rv)
                    ptr += num_b
                else:
                    # Load conv. bias
                    num_b = conv_layer.bias.numel()
                    conv_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(conv_layer.bias)
                    conv_layer.bias.data.copy_(conv_b)
                    ptr += num_b
                # Load conv. weights
                num_w = conv_layer.weight.numel()
                conv_w = torch.from_numpy(weights[ptr : ptr + num_w]).view_as(conv_layer.weight)
                conv_layer.weight.data.copy_(conv_w)
                ptr += num_w

    def save_darknet_weights(self, path, cutoff=-1):
        """
        @:param path - path of the new weights file
        @:param cutoff - save layers between 0 and cutoff (cutoff = -1 -> all are saved)
        """
        fp = open(path, "wb")
        self.header_info[3] = self.seen
        self.header_info.tofile(fp)

        # Iterate through layers
        for i, (module_def, module) in enumerate(zip(self.module_defs[:cutoff], self.module_list[:cutoff])):
            if module_def["type"] == "convolutional":
                conv_layer = module[0]
                # If batch norm, save bn first
                if module_def["batch_normalize"]:
                    bn_layer = module[1]
                    bn_layer.bias.data.cpu().numpy().tofile(fp)
                    bn_layer.weight.data.cpu().numpy().tofile(fp)
                    bn_layer.running_mean.data.cpu().numpy().tofile(fp)
                    bn_layer.running_var.data.cpu().numpy().tofile(fp)
                # Save conv bias
                else:
                    conv_layer.bias.data.cpu().numpy().tofile(fp)
                # Save conv weights
                conv_layer.weight.data.cpu().numpy().tofile(fp)
        fp.close()
# coding: utf-8
from __future__ import division

import math
import time
import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np

# import matplotlib.pyplot as plt
# import matplotlib.patches as patches


def to_cpu(tensor):
    return tensor.detach().cpu()


def load_classes(path):
    """Loads class labels at 'path'"""
    fp = open(path, "r")
    names = fp.read().split("\n")[:-1]
    return names


def weights_init_normal(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)


def rescale_boxes(boxes, current_dim, original_shape):
    """Rescales bounding boxes to the original shape"""
    orig_h, orig_w = original_shape
    # The amount of padding that was added
    pad_x = max(orig_h - orig_w, 0) * (current_dim / max(original_shape))
    pad_y = max(orig_w - orig_h, 0) * (current_dim / max(original_shape))
    # Image height and width after padding is removed
    unpad_h = current_dim - pad_y
    unpad_w = current_dim - pad_x
    # Rescale bounding boxes to dimension of original image
    boxes[:, 0] = ((boxes[:, 0] - pad_x // 2) / unpad_w) * orig_w
    boxes[:, 1] = ((boxes[:, 1] - pad_y // 2) / unpad_h) * orig_h
    boxes[:, 2] = ((boxes[:, 2] - pad_x // 2) / unpad_w) * orig_w
    boxes[:, 3] = ((boxes[:, 3] - pad_y // 2) / unpad_h) * orig_h
    return boxes


def xywh2xyxy(x):
    y = x.new(x.shape)
    y[..., 0] = x[..., 0] - x[..., 2] / 2
    y[..., 1] = x[..., 1] - x[..., 3] / 2
    y[..., 2] = x[..., 0] + x[..., 2] / 2
    y[..., 3] = x[..., 1] + x[..., 3] / 2
    return y


def ap_per_class(tp, conf, pred_cls, target_cls):
    """ Compute the average precision, given the recall and precision curves.
    Source: https://github.com/rafaelpadilla/Object-Detection-Metrics.
    # Arguments
        tp: True positives (list).
        conf: Objectness value from 0-1 (list).
        pred_cls: Predicted object classes (list).
        target_cls: True object classes (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """
    # Sort by objectness
    i = np.argsort(-conf)
    tp, conf, pred_cls = tp[i], conf[i], pred_cls[i]

    # Find unique classes
    unique_classes = np.unique(target_cls)

    # Create Precision-Recall curve and compute AP for each class
    ap, p, r = [], [], []
    for c in tqdm.tqdm(unique_classes, desc="Computing AP"):
        i = pred_cls == c
        n_gt = (target_cls == c).sum()  # Number of ground truth objects
        n_p = i.sum()  # Number of predicted objects

        if n_p == 0 and n_gt == 0:
            continue
        elif n_p == 0 or n_gt == 0:
            ap.append(0)
            r.append(0)
            p.append(0)
        else:
            # Accumulate FPs and TPs
            fpc = (1 - tp[i]).cumsum()
            tpc = (tp[i]).cumsum()

            # Recall
            recall_curve = tpc / (n_gt + 1e-16)
            r.append(recall_curve[-1])

            # Precision
            precision_curve = tpc / (tpc + fpc)
            p.append(precision_curve[-1])

            # AP from recall-precision curve
            ap.append(compute_ap(recall_curve, precision_curve))

    # Compute F1 score (harmonic mean of precision and recall)
    p, r, ap = np.array(p), np.array(r), np.array(ap)
    f1 = 2 * p * r / (p + r + 1e-16)

    return p, r, ap, f1, unique_classes.astype("int32")


def compute_ap(recall, precision):
    """ Compute the average precision, given the recall and precision curves.
    Code originally from https://github.com/rbgirshick/py-faster-rcnn.
    # Arguments
        recall: The recall curve (list).
        precision: The precision curve (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """
    # correct AP calculation
    # first append sentinel values at the end
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # compute the precision envelope
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # to calculate area under PR curve, look for points
    # where X axis (recall) changes value
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # and sum (\Delta recall) * prec
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap


def get_batch_statistics(outputs, targets, iou_threshold):
    """ Compute true positives, predicted scores and predicted labels per sample """
    batch_metrics = []
    # outputs: (batch_size, pred_boxes_num, 7), 7 => x,y,w,h,conf,class_conf,class_pred
    # targets: (num, 6), 6 => (batch_index, cls, center_x, center_y, width, height)
    for sample_i in range(len(outputs)):

        if outputs[sample_i] is None:
            continue

        # output: (pred_boxes_num, 7), 7 => x,y,w,h,conf,class_conf,class_pred
        output = outputs[sample_i]
        pred_boxes = output[:, :4]   # predicted x,y,w,h
        pred_scores = output[:, 4]   # predicted confidence
        pred_labels = output[:, -1]  # predicted class label

        # Array of length pred_boxes_num, initialized to 0; entries are set to 1
        # when a prediction matches a ground-truth box
        true_positives = np.zeros(pred_boxes.shape[0])

        # Get the ground-truth boxes and class labels for this sample
        annotations = targets[targets[:, 0] == sample_i]
        annotations = annotations[:, 1:] if len(annotations) else []
        target_labels = annotations[:, 0] if len(annotations) else []
        if len(annotations):  # this image has ground-truth boxes
            detected_boxes = []
            target_boxes = annotations[:, 1:]  # ground-truth x,y,w,h

            for pred_i, (pred_box, pred_label) in enumerate(zip(pred_boxes, pred_labels)):

                # If all targets are found, break
                if len(detected_boxes) == len(annotations):
                    break

                # If this prediction's label does not appear among the target labels, it must be wrong
                if pred_label not in target_labels:
                    continue

                # Compute the IOU between this predicted box and all ground-truth boxes; take the
                # largest IOU (iou) and the corresponding ground-truth index (box_index)
                iou, box_index = bbox_iou(pred_box.unsqueeze(0), target_boxes).max(0)
                # If the best IOU exceeds the threshold, the ground-truth box counts as detected;
                # guard against counting the same ground-truth box twice
                if iou >= iou_threshold and box_index not in detected_boxes:
                    true_positives[pred_i] = 1     # mark this prediction as a true positive
                    detected_boxes += [box_index]  # record the matched ground truth so it is matched only once
        # Save this image's results:
        # true_positives: 1 for correct predictions, 0 for wrong ones
        # pred_scores: predicted confidences
        # pred_labels: predicted class labels
        batch_metrics.append([true_positives, pred_scores, pred_labels])
    return batch_metrics


def bbox_wh_iou(wh1, wh2):
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    return inter_area / union_area


def bbox_iou(box1, box2, x1y1x2y2=True):
    """Returns the IoU of two bounding boxes"""
    if not x1y1x2y2:
        # Transform from center and width to exact coordinates
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        # Get the coordinates of bounding boxes
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    # Get the coordinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    # Intersection area
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * torch.clamp(
        inter_rect_y2 - inter_rect_y1 + 1, min=0
    )
    # Union area
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)
    return iou


def non_max_suppression(prediction, conf_thres=0.5, nms_thres=0.4):
    """
    Removes detections with lower object confidence score than 'conf_thres' and performs
    Non-Maximum Suppression to further filter detections.
    Returns detections with shape: (x1, y1, x2, y2, object_conf, class_score, class_pred)
    """
    # prediction: (batch_size, num_anchors*grid_size*grid_size*3, 85), 85 => (x,y,w,h,conf,cls)
    # From (center x, center y, width, height) to (x1, y1, x2, y2)
    prediction[..., :4] = xywh2xyxy(prediction[..., :4])
    output = [None for _ in range(len(prediction))]
    for image_i, image_pred in enumerate(prediction):
        # Filter out predictions whose anchor confidence is below the threshold
        image_pred = image_pred[image_pred[:, 4] >= conf_thres]
        # If none remain, no object was detected; process next image
        if not image_pred.size(0):
            continue
        # Object confidence times class confidence:
        # take each prediction's largest class score and multiply it by the anchor confidence,
        # so both class accuracy and objectness are taken into account; each box gets one score
        score = image_pred[:, 4] * image_pred[:, 5:].max(1)[0]
        # Sort the confident predictions by score, descending
        # image_pred = image_pred[(-score).argsort()]
        image_pred = image_pred[torch.sort(-score, dim=0)[1]]
        # For each confident prediction, get its class score (class_confs) and class index (class_preds)
        class_confs, class_preds = image_pred[:, 5:].max(1, keepdim=True)
        # Join x,y,w,h,conf with the class score and class index
        # detections => (num_kept, 7), 7 => x,y,w,h,conf,class_conf,class_pred
        detections = torch.cat((image_pred[:, :5], class_confs.float(), class_preds.float()), 1)
        # Perform non-maximum suppression
        keep_boxes = []
        while detections.size(0):
            # detections[0, :4] is the highest-scoring box in the current sequence.
            # Compute its IOU with every box in the sequence; mark IOUs above the threshold with 1
            large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > nms_thres
            # Mark boxes carrying the same class label as the highest-scoring box
            label_match = detections[0, -1] == detections[:, -1]
            # Boxes with large IOU overlap the top box heavily; if their labels also match,
            # they are predicting the same object. Mark them invalid (including the top box itself)
            invalid = large_overlap & label_match
            # Use the confidences of these boxes as weights
            weights = detections[invalid, 4:5]
            # Merge the boxes predicting the same object, weighted by confidence;
            # the merged box is kept as the best prediction
            detections[0, :4] = (weights * detections[invalid, :4]).sum(0) / weights.sum()
            keep_boxes += [detections[0]]
            # ~invalid flips the mask: keep the remaining boxes and repeat
            detections = detections[~invalid]
        if keep_boxes:
            # Each image ends up with pred_boxes_num final boxes; output[image_i] has shape
            # (pred_boxes_num, 7), 7 => x,y,w,h,conf,class_conf,class_pred
            output[image_i] = torch.stack(keep_boxes)
    # (batch_size, pred_boxes_num, 7), 7 => x,y,w,h,conf,class_conf,class_pred
    return output


def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    # pred_boxes => (batch_size, anchor_num, grid, grid, 4)
    # pred_cls => (batch_size, anchor_num, grid, grid, 80)
    # target => (num, 6), 6 => (batch_index, cls, center_x, center_y, width, height)
    # anchors => (3, 2)
    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)  # batch num
    nA = pred_boxes.size(1)  # anchor num
    nC = pred_cls.size(-1)   # class num => 80
    nG = pred_boxes.size(2)  # grid

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)  # (batch_size, anchor_num, grid, grid)
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)  # (batch_size, anchor_num, grid, grid, class_num)

    # Convert to position relative to box:
    # turn the normalized x,y,w,h into real sizes; the current feature map size is nG, so multiply by nG
    target_boxes = target[:, 2:6] * nG  # (num, 4), 4 => (center_x, center_y, width, height)
    gxy = target_boxes[:, :2]  # (num, 2)
    gwh = target_boxes[:, 2:]  # (num, 2)

    # Get anchors with best IOU:
    # assign each target box the best of the three anchor shapes.
    # `anchors` holds the preset anchor shapes and gwh the ground-truth sizes; only the shapes are
    # compared, so the center coordinates don't matter here
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # (3, num)
    # ious is (3, num); pick the anchor with the largest IOU for each target box:
    # best_ious holds the max IOU values, best_n the index of the best anchor per target box
    best_ious, best_n = ious.max(0)

    # Separate target values
    # .t() transposes (num, 2) => (2, num); b holds the num batch indices, target_labels the num labels
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()  # gx holds the num x's, gy the num y's
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()  # .long() truncates the floats, giving the grid cell containing each center point

    # --------------------- build the object mask obj_mask and no-object mask noobj_mask: start ---------------------
    # Set masks
    # For image b in the batch, grid cell (gj, gi) contains a target box center whose best-matching anchor is best_n
    obj_mask[b, best_n, gj, gi] = 1    # set the best anchor of the cell containing the center to 1
    noobj_mask[b, best_n, gj, gi] = 0  # and to 0 in the complementary mask

    # Set noobj mask to zero where IOU exceeds the ignore threshold
    # ious.t(): (3, num) => (num, 3)
    # Unlike the previous step, which picked one best anchor per target box, here any anchor whose IOU
    # with a target box exceeds the threshold counts as valid, i.e. its noobj_mask entry is set to 0
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0
    # obj_mask now marks exactly one best anchor per object; noobj_mask marks anchors that are neither the
    # best match nor above the IOU threshold with any object, which makes positives and negatives more distinct
    # --------------------- build obj_mask and noobj_mask: end ---------------------

    # --------------------- normalized target coordinates (tx, ty, tw, th): start ---------------------
    # Coordinates: re-normalize x,y,w,h.
    # Note this normalization differs from the one in the incoming target, which is the actual
    # x,y,w,h / img_size, i.e. proportions of the whole image. Here the centers x,y are relative to
    # the grid cell and w,h are relative to the anchor; these are the values the model regresses to
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # --------------------- normalized target coordinates (tx, ty, tw, th): end ---------------------

    # One-hot encoding of label
    # For image b, grid cell (gj, gi) holds a target whose best anchor is best_n and whose class is target_labels
    tcls[b, best_n, gj, gi, target_labels] = 1
    # Compute label correctness and IOU at the best anchor
    # class_mask: 1 where the prediction is correct (correct center cell, best anchor, and class)
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    # iou_scores: IOU between the correct predicted boxes and the target boxes; larger IOU, higher score
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    # tconf: the confidence labels; anchors assigned to objects get confidence 1.
    # Converted to float for the loss computation against the predicted confidences
    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
References:
https://www.zybuluo.com/rianusr/note/1417734
https://www.cnblogs.com/hellcat/p/10375310.html
https://blog.csdn.net/leviopku/article/details/82660381