This article references the Mask R-CNN instance segmentation training tutorials:
- PyTorch's official Mask R-CNN instance segmentation training tutorial: TORCHVISION OBJECT DETECTION FINETUNING TUTORIAL
- A Chinese translation of the official Mask R-CNN training tutorial: 手把手教你訓練本身的Mask R-CNN圖像實例分割模型(PyTorch官方教程)
With only minor modifications on top of the Mask R-CNN training procedure, we can train a Faster R-CNN object detection model.
Related pages:
- torchvision's built-in models for image classification, semantic segmentation, object detection, instance segmentation, keypoint detection, and video classification: TORCHVISION.MODELS
- torchvision GitHub repository: https://github.com/pytorch/vision
1. Preparation
Besides pytorch and torchvision, you also need to install the COCO API, pycocotools.
How to install pycocotools on Windows: Windows下安裝pycocotools
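On Linux or macOS a plain PyPI install is often enough; the cell below is just that minimal attempt (on Windows the build may still fail without the VC++ build tools, in which case follow the guide linked above instead):

# try the plain PyPI install first; fall back to the Windows guide above if the build fails
!pip install pycocotools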
Import the required packages and modules:
import torch
import os
import numpy as np
import cv2
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
from PIL import Image
from xml.dom.minidom import parse

%matplotlib inline
2. Defining the dataset
I am using an object detection dataset that I annotated myself with labelme and then converted to VOC format. Besides the background there are two labels (mark_type_1 and mark_type_2), so num_classes=3.
My VOC dataset uses the standard directory layout: an Annotations folder holding the xml annotation files and a JPEGImages folder holding the images.
An example xml annotation from the Annotations folder:
<annotation>
    <folder/>
    <filename>172.26.80.5_01_20191128084208520_TIMING.jpg</filename>
    <database/>
    <annotation/>
    <image/>
    <size>
        <height>1536</height>
        <width>2048</width>
        <depth>3</depth>
    </size>
    <segmented/>
    <object>
        <name>mark_type_1</name>
        <pose/>
        <truncated/>
        <difficult/>
        <bndbox>
            <xmin>341.4634146341463</xmin>
            <ymin>868.2926829268292</ymin>
            <xmax>813.4146341463414</xmax>
            <ymax>986.5853658536585</ymax>
        </bndbox>
    </object>
    <object>
        <name>mark_type_1</name>
        <pose/>
        <truncated/>
        <difficult/>
        <bndbox>
            <xmin>1301.2195121951218</xmin>
            <ymin>815.8536585365853</ymin>
            <xmax>1769.512195121951</xmax>
            <ymax>936.5853658536585</ymax>
        </bndbox>
    </object>
</annotation>
This annotation contains two bboxes of a single class (mark_type_1).
Define the dataset:
class MarkDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "JPEGImages"))))
        self.bbox_xml = list(sorted(os.listdir(os.path.join(root, "Annotations"))))

    def __getitem__(self, idx):
        # load images and bbox
        img_path = os.path.join(self.root, "JPEGImages", self.imgs[idx])
        bbox_xml_path = os.path.join(self.root, "Annotations", self.bbox_xml[idx])
        img = Image.open(img_path).convert("RGB")

        # read the annotation file; a VOC-format dataset stores its annotations as xml
        dom = parse(bbox_xml_path)
        # get the document element
        data = dom.documentElement
        # get all objects
        objects = data.getElementsByTagName('object')
        # get bounding box coordinates
        boxes = []
        labels = []
        for object_ in objects:
            # the tag content is the label, either mark_type_1 or mark_type_2
            name = object_.getElementsByTagName('name')[0].childNodes[0].nodeValue
            # background has label 0; mark_type_1 and mark_type_2 have labels 1 and 2
            labels.append(int(name[-1]))

            bndbox = object_.getElementsByTagName('bndbox')[0]
            xmin = float(bndbox.getElementsByTagName('xmin')[0].childNodes[0].nodeValue)
            ymin = float(bndbox.getElementsByTagName('ymin')[0].childNodes[0].nodeValue)
            xmax = float(bndbox.getElementsByTagName('xmax')[0].childNodes[0].nodeValue)
            ymax = float(bndbox.getElementsByTagName('ymax')[0].childNodes[0].nodeValue)
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.int64)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((len(objects),), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        # we are training an object detection network, so unlike the tutorial
        # there is no target["masks"] = masks here
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            # note that the target (including the bbox) is transformed/augmented here too,
            # unlike the transforms imported with "from torchvision import transforms";
            # transforms.py in https://github.com/pytorch/vision/tree/master/references/detection
            # shows how the target is changed in RandomHorizontalFlip
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
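Before wiring this into training, it can help to index the dataset once and make sure images and targets line up; a quick sanity check, where the root path is just a placeholder for your own dataset root:

# without transforms, __getitem__ returns a PIL image and a target dict
dataset = MarkDataset(r'數據集路徑')
img, target = dataset[0]
print(img.size)           # PIL size (width, height), e.g. (2048, 1536)
print(target["boxes"])    # tensor of [xmin, ymin, xmax, ymax] rows
print(target["labels"])   # tensor of class indices (1 or 2)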
3. Defining the model
There are two ways to modify torchvision's default object detection models: first, take a pre-trained model, replace its last layer, and finetune it; second, swap out the model's backbone as needed, e.g. replace ResNet with MobileNet.
Both approaches are covered in the official tutorial and its translation linked at the beginning of this article; here I chose the first one.
Defining the model can be as simple as:
torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=3, pretrained_backbone=True)
You can also follow the tutorial and write it as:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_object_detection_model(num_classes):
    # load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # replace the classifier with a new one, that has num_classes which is user-defined
    num_classes = 3  # 3 classes: mark_type_1, mark_type_2 + background

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features

    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model
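To confirm that the head was actually replaced, you can print the new box predictor. This is only an optional check; the output sketched in the comments is what I would expect for the ResNet-50 FPN box head (whose representation size defaults to 1024), and it may look slightly different in your torchvision version:

model = get_object_detection_model(num_classes=3)
# the classification layer should now output 3 scores, the regression layer 3*4 values
print(model.roi_heads.box_predictor)
# FastRCNNPredictor(
#   (cls_score): Linear(in_features=1024, out_features=3, bias=True)
#   (bbox_pred): Linear(in_features=1024, out_features=12, bias=True)
# )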
4. Data augmentation
Before an image is fed into the network it needs to be augmented. Note that the Faster R-CNN model itself already handles normalization (by default it normalizes with the ImageNet mean and std) and scale changes, so there is no need to do mean/std normalization or image resizing here.
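You can verify this by inspecting the transform module built into the model; the output sketched in the comments is roughly what recent torchvision versions print, and the exact defaults may differ in yours:

# the resizing and ImageNet mean/std normalization live inside the detection model itself
print(torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False).transform)
# GeneralizedRCNNTransform(
#     Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
#     Resize(min_size=(800,), max_size=1333, mode='bilinear')
# )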
Because the transforms imported with from torchvision import transforms can only augment the image and cannot transform the corresponding label at the same time, we instead use the ready-made training and evaluation helpers from the torchvision GitHub repository: https://github.com/pytorch/vision/tree/master/references/detection
We will need engine.py, utils.py, transforms.py, coco_utils.py and coco_eval.py from there, so download these files. I put them next to the Faster-RCNN目標檢測模型訓練.ipynb notebook I am writing.
Looking at the downloaded transforms.py, you can see it already contains an example of transforming the target (bbox) when the image is flipped horizontally (RandomHorizontalFlip):
class RandomHorizontalFlip(object):
    def __init__(self, prob):
        self.prob = prob

    def __call__(self, image, target):
        if random.random() < self.prob:
            height, width = image.shape[-2:]
            image = image.flip(-1)
            bbox = target["boxes"]
            bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
            target["boxes"] = bbox
            if "masks" in target:
                target["masks"] = target["masks"].flip(-1)
            if "keypoints" in target:
                keypoints = target["keypoints"]
                keypoints = _flip_coco_person_keypoints(keypoints, width)
                target["keypoints"] = keypoints
        return image, target
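The line bbox[:, [0, 2]] = width - bbox[:, [2, 0]] mirrors and swaps xmin/xmax so that the flipped box is still valid. A tiny numeric illustration using the first box from the xml example above (values rounded):

import torch

width = 2048
bbox = torch.tensor([[341.46, 868.29, 813.41, 986.59]])  # [xmin, ymin, xmax, ymax]
bbox[:, [0, 2]] = width - bbox[:, [2, 0]]                 # new xmin = width - old xmax, etc.
print(bbox)  # roughly tensor([[1234.59,  868.29, 1706.54,  986.59]])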
Based on this, we can write the corresponding data augmentation function:
import utils
import transforms as T
from engine import train_one_epoch, evaluate
# utils, transforms and engine are the utils.py, transforms.py and engine.py we just downloaded

def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        # 50% chance of a horizontal flip
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
5. Training the model
At this point the dataset, the model, and the data augmentation are all written. Once the model is initialized and the optimizer and learning-rate schedule are chosen, training can begin. After each epoch we also evaluate the model's performance on the test set.
from engine import train_one_epoch, evaluate
import utils

root = r'數據集路徑'

# train on the GPU or on the CPU, if a GPU is not available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# 3 classes: mark_type_1, mark_type_2, background
num_classes = 3

# use our dataset and defined transformations
dataset = MarkDataset(root, get_transform(train=True))
dataset_test = MarkDataset(root, get_transform(train=False))

# split the dataset in train and test set
# my dataset has 492 images in total, so this is roughly a 4:1 train/validation split
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-100])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-100:])

# define training and validation data loaders
# when training in a jupyter notebook, num_workers must be 0 or an error is raised,
# so it is commented out here
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True,  # num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=2, shuffle=False,  # num_workers=4,
    collate_fn=utils.collate_fn)

# get the model using our helper function
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False, progress=True,
    num_classes=num_classes,
    pretrained_backbone=True)  # or get_object_detection_model(num_classes)

# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
# SGD
optimizer = torch.optim.SGD(params, lr=0.0003,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler
# cosine annealing learning rate
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1, T_mult=2)

# let's train it for num_epochs epochs
num_epochs = 31

for epoch in range(num_epochs):
    # train for one epoch, printing every 50 iterations
    # the train_one_epoch function in engine.py moves both images and targets .to(device)
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=50)

    # update the learning rate
    lr_scheduler.step()

    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)

    print('')
    print('==================================================')
    print('')

print("That's it!")
You will notice that the learning rate in the first epoch is not the configured 0.0003 but instead grows gradually from 0. The reason is that the train_one_epoch function in engine.py applies a warmup learning rate during the first epoch:
if epoch == 0:
    warmup_factor = 1. / 1000
    warmup_iters = min(1000, len(data_loader) - 1)

    lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
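For reference, the warmup_lr_scheduler helper in the downloaded utils.py is roughly the following: a LambdaLR whose factor grows linearly from warmup_factor to 1 over warmup_iters iterations. This is a sketch of the reference code; check your downloaded copy for the exact version:

def warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor):
    def f(x):
        if x >= warmup_iters:
            return 1
        alpha = float(x) / warmup_iters
        # linearly interpolate between warmup_factor and 1
        return warmup_factor * (1 - alpha) + alpha

    return torch.optim.lr_scheduler.LambdaLR(optimizer, f)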
In addition, because all the bboxes in my dataset are fairly large, the AP and AR for area=small are both -1.000 (in the COCO evaluation, "small" means objects with an area below 32² pixels, and my dataset has none).
Finally, save the model:
torch.save(model, r'保存路徑\modelname.pkl')
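Saving the whole pickled model ties the file to the exact class definitions in your code; an alternative is to save only the weights and rebuild the model before loading them. The file name below is just an example:

# alternative: save only the state_dict, then rebuild the model and call load_state_dict() later
torch.save(model.state_dict(), r'保存路徑\modelname_weights.pth')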
6. Viewing the results
Draw the bboxes with OpenCV:
def showbbox(model, img):
    # the input img is a tensor with values in the range 0-1
    model.eval()
    with torch.no_grad():
        '''
        prediction looks like:
        [{'boxes': tensor([[1492.6672,  238.4670, 1765.5385,  315.0320],
        [ 887.1390,  256.8106, 1154.6687,  330.2953]], device='cuda:0'), 
        'labels': tensor([1, 1], device='cuda:0'), 
        'scores': tensor([1.0000, 1.0000], device='cuda:0')}]
        '''
        prediction = model([img.to(device)])
        print(prediction)

    img = img.permute(1, 2, 0)  # C,H,W → H,W,C, for plotting
    img = (img * 255).byte().data.cpu()  # * 255: float in 0-1 → 0-255
    img = np.array(img)  # tensor → ndarray

    for i in range(prediction[0]['boxes'].cpu().shape[0]):
        xmin = round(prediction[0]['boxes'][i][0].item())
        ymin = round(prediction[0]['boxes'][i][1].item())
        xmax = round(prediction[0]['boxes'][i][2].item())
        ymax = round(prediction[0]['boxes'][i][3].item())

        label = prediction[0]['labels'][i].item()

        if label == 1:
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (255, 0, 0), thickness=2)
            cv2.putText(img, 'mark_type_1', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.7,
                        (255, 0, 0), thickness=2)
        elif label == 2:
            cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=2)
            cv2.putText(img, 'mark_type_2', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.7,
                        (0, 255, 0), thickness=2)

    plt.figure(figsize=(20, 15))
    plt.imshow(img)
View the results:
model = torch.load(r'保存路徑\modelname.pkl')
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

img, _ = dataset_test[0]
showbbox(model, img)