Machine Learning for Programmers (9): Object Detection with RCNN and Fast-RCNN

Date: 2020-11-28


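Because the restaurant business has picked up over the past few months, and studying Faster-RCNN took up a lot of time, I have not updated the blog for a while 🐶. Starting with this post I will introduce object detection models and how to implement them. This post covers the simplest ones, RCNN and Fast-RCNN; the next post covers Faster-RCNN, and the one after that covers YOLO.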

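Image classification and object detection

In the previous posts we saw how to use a CNN model to recognize what kind of object an image contains, or to recognize a fixed number of characters in an image (CAPTCHAs). Because the model takes the whole image as input and produces a fixed-size output, the image can only contain one main object or a fixed number of characters.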

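If an image contains several objects and we want to know which objects are present and where each one is, a CNN model alone is not enough. We need a model that finds which regions of the image contain objects and decides what each region contains. Such a model is called an object detection model. The earliest object detection model is RCNN, followed by Fast-RCNN (SPPnet), Faster-RCNN, YOLO and others. Object detection has to process a lot of data, so it tends to be slow (RCNN may need tens of seconds to detect the objects in a single image), yet it often has to run in real time (for example when the input is video from a camera). Improving detection speed is therefore a major topic, and the later Faster-RCNN and YOLO models can process dozens of images per second.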

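Object detection has a wide range of applications: face detection, license plate recognition, autonomous driving and so on all use it. It is one of the frontiers of machine learning today; the Mask-RCNN model developed in 2017 can even detect the outline of objects.

The more magical something looks, the harder it is to implement, so object detection models are considerably harder than the models introduced before. Please be mentally prepared 😱.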

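Training data for object detection models

Before introducing concrete models, let's look at what kind of training data an object detection model needs:

An object detection model needs every image annotated with its regions and the label of each region, so the training data for an image is a list. Regions usually come in one of two formats: (x, y, w, h), the top-left corner plus width and height, or (x1, y1, x2, y2), the top-left and bottom-right corners. The two formats can be converted into each other; you only need to keep track of which one you are handling. Besides the classes to recognize, the labels also need a special non-object (background) label, meaning that the region contains no recognizable object. Non-object regions can usually be generated automatically, so the training data does not need to include them.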
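The conversion between the two region formats can be sketched as follows. The function names `xywh_to_xyxy` and `xyxy_to_xywh` are made up for this illustration and do not appear in the article's code.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

corner_box = (192, 199, 230, 235)      # top-left and bottom-right corners
size_box = xyxy_to_xywh(corner_box)    # top-left corner plus width and height
print(size_box)                        # (192, 199, 38, 36)
print(xywh_to_xyxy(size_box))          # (192, 199, 230, 235)
```

Converting once when the data is loaded and sticking to a single format afterwards avoids mixing the two by accident.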

RCNN

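RCNN (Region Based Convolutional Neural Network) is the earliest object detection model. It is fairly simple to implement and consists of the following steps:

Use some algorithm to select 2000 regions of the image that may contain objects
Crop those 2000 regions into 2000 sub-images and resize them to a fixed size
Classify each of the 2000 sub-images with an ordinary CNN model
Discard the regions classified as "non-object"
Output the remaining regions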

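As you may have noticed from the steps, RCNN has several big problems 😠:

The accuracy of the result depends heavily on the region selection algorithm
The region selection algorithm is fixed and takes no part in learning; if it fails to select a region that contains an object, no amount of training will make the model detect that region
It is slow, really slow 🐢: detecting 1 image actually means classifying 2000 images

The models introduced later address these problems, but first we need to understand the simplest one, RCNN. Let's take a closer look at the important parts of an RCNN implementation.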

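Selecting regions that may contain objects

There are many algorithms for selecting regions that may contain objects, for example Sliding Window and Selective Search. Sliding window is very simple: pick a fixed region size, then slide it by a fixed distance to get the next region. It is easy to implement, but it produces a huge number of regions with low accuracy, so it is rarely used unless the objects have a fixed size and appear in predictable positions.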
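A minimal sketch of the sliding window idea, assuming a single window size and the same stride in both directions. The function name `sliding_windows` and the sizes used are illustrative only.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield every (x, y, w, h) window of a fixed size that fits inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# A 612x408 image scanned with a 64x64 window that moves 32 pixels at a time
boxes = list(sliding_windows(612, 408, 64, 64, 32))
print(len(boxes))   # 198
print(boxes[:2])    # [(0, 0, 64, 64), (32, 0, 64, 64)]
```

A single window size already gives 198 regions here; covering several sizes and aspect ratios multiplies that number, which is why the region count grows so quickly.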

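Selective search is more sophisticated. The following brief explanation is taken from an OpenCV article:

You can also refer to this article or the original paper for the exact computation.

If you find it hard to follow, feel free to skip it, because we will simply use the selective search function provided by the opencv library. Selective search is not very accurate either; the models introduced later use better methods.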

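# Example: using the selective search function provided by the opencv library
# (cv2.ximgproc is part of the opencv-contrib-python package)
import cv2

img = cv2.imread("path/to/image")
s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
s.setBaseImage(img)
s.switchToSelectiveSearchFast()
boxes = s.process() # all regions that may contain objects as (x, y, w, h), most likely first
candidate_boxes = boxes[:2000] # keep the first 2000 regions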

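Deciding whether a region contains an object by overlap ratio (IOU)

Regions selected by the algorithm usually do not coincide with the real regions; they only overlap partially. During training we need to decide in advance, from the ground-truth regions at hand, whether each selected region contains an object, and then tell the model whether its prediction was correct. This decision is based on the overlap ratio (IOU, Intersection Over Union): the area where two regions overlap divided by the area of their union, as shown in the figure below.

We can treat candidate regions with an IOU above 70% as containing an object and regions with an IOU below 30% as not containing one, while regions with an IOU between 30% and 70% should not take part in training. This gives the model unambiguous data, which makes learning more effective.

The code for computing the IOU is below; if the two regions do not overlap, the IOU is 0: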

def calc_iou(rect1, rect2):
    """Compute overlapping area / union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou
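The 70% / 30% rule described above can be sketched as a small labelling function. `label_candidate` and its parameter names are hypothetical helpers for this illustration; `calc_iou` is repeated in compact form only to keep the snippet self-contained.

```python
def calc_iou(rect1, rect2):
    """Intersection over union of two (x, y, w, h) regions."""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    wi = min(x1 + w1, x2 + w2) - max(x1, x2)
    hi = min(y1 + h1, y2 + h2) - max(y1, y2)
    if wi <= 0 or hi <= 0:
        return 0
    overlap = wi * hi
    return overlap / (w1 * h1 + w2 * h2 - overlap)

def label_candidate(candidate_box, true_boxes, pos_threshold=0.70, neg_threshold=0.30):
    """Return 1 (object), 0 (background) or None (ambiguous, excluded from training)."""
    iou_list = [calc_iou(candidate_box, true_box) for true_box in true_boxes]
    if any(iou > pos_threshold for iou in iou_list):
        return 1
    if all(iou < neg_threshold for iou in iou_list):
        return 0
    return None

true_boxes = [(0, 0, 10, 10)]
print(label_candidate((1, 0, 10, 10), true_boxes))   # 1    (IOU is about 0.82)
print(label_candidate((20, 20, 5, 5), true_boxes))   # 0    (no overlap)
print(label_candidate((0, 0, 10, 5), true_boxes))    # None (IOU is 0.5)
```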

Original paper

If you want to read the original RCNN paper, it is available at:

https://arxiv.org/pdf/1311.2524.pdf

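Detecting faces in images with RCNN

Good, by now you should have a rough idea of how RCNN works. Next let's try to train an RCNN on some images.

Collecting and labelling images is exhausting 🤕, so to be lazy I again use a ready-made dataset. The dataset below contains images of faces, annotated with the region of each face in (x1, y1, x2, y2) format. Downloading requires registering an account, but it is free 🤒.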

https://www.kaggle.com/vin1234/count-the-number-of-faces-present-in-an-image

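After downloading and extracting it, the images are under train/image_data and the annotations are in bbox_train.csv.

For example, the following image:

corresponds to the following annotations in the csv: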

Name,width,height,xmin,ymin,xmax,ymax
10001.jpg,612,408,192,199,230,235
10001.jpg,612,408,247,168,291,211
10001.jpg,612,408,321,176,366,222
10001.jpg,612,408,355,183,387,214

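The fields mean:

Name: file name
width: width of the whole image
height: height of the whole image
xmin: x coordinate of the top-left corner of the face region
ymin: y coordinate of the top-left corner of the face region
xmax: x coordinate of the bottom-right corner of the face region
ymax: y coordinate of the bottom-right corner of the face region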
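As a rough sketch of how these rows can be grouped per image, the snippet below reads two of the sample rows with the standard `csv` module and converts each box to (x, y, w, h). The helper name `load_boxes` is an assumption made for this illustration; the article's own code uses pandas for the same step.

```python
import csv
import io
from collections import defaultdict

SAMPLE_CSV = """Name,width,height,xmin,ymin,xmax,ymax
10001.jpg,612,408,192,199,230,235
10001.jpg,612,408,247,168,291,211
"""

def load_boxes(csv_file):
    """Build { file name: [(x, y, w, h), ...] } from the annotation rows."""
    box_map = defaultdict(list)
    for row in csv.DictReader(csv_file):
        x1, y1 = int(row["xmin"]), int(row["ymin"])
        x2, y2 = int(row["xmax"]), int(row["ymax"])
        box_map[row["Name"]].append((x1, y1, x2 - x1, y2 - y1))
    return dict(box_map)

box_map = load_boxes(io.StringIO(SAMPLE_CSV))
print(box_map["10001.jpg"])   # [(192, 199, 38, 36), (247, 168, 44, 43)]
```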

The code that uses RCNN to learn and detect the face regions in these images is as follows:

import os
import sys
import torch
import gzip
import itertools
import random
import numpy
import pandas
import torchvision
import cv2
from torch import nn
from matplotlib import pyplot
from collections import defaultdict

# Size that each region is resized to
REGION_IMAGE_SIZE = (32, 32)
# Folder containing the images to analyze
IMAGE_DIR = "./784145_1347673_bundle_archive/train/image_data"
# CSV file defining the face regions in each image
BOX_CSV_PATH = "./784145_1347673_bundle_archive/train/bbox_train.csv"

# Use the GPU when it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MyModel(nn.Module):
    """Classify a region as face or non-face (ResNet-18)"""
    def __init__(self):
        super().__init__()
        # ResNet implementation
        # Outputs two classes [non-face, face]
        self.resnet = torchvision.models.resnet18(num_classes=2)

    def forward(self, x):
        # Apply ResNet
        y = self.resnet(x)
        return y

def save_tensor(tensor, path):
    """保存 tensor 對象到文件"""
    torch.save(tensor, gzip.GzipFile(path, "wb"))

def load_tensor(path):
    """從文件讀取 tensor 對象"""
    return torch.load(gzip.GzipFile(path, "rb"))

def image_to_tensor(img):
    """轉換 opencv 圖片對象到 tensor 對象"""
    # 注意 opencv 是 BGR，但對訓練沒有影響因此不用轉爲 RGB
    img = cv2.resize(img, dsize=REGION_IMAGE_SIZE)
    arr = numpy.asarray(img)
    t = torch.from_numpy(arr)
    t = t.transpose(0, 2) # 轉換維度 H,W,C 到 C,W,H
    t = t / 255.0 # 正規化數值使得範圍在 0 ~ 1
    return t

def calc_iou(rect1, rect2):
    """計算兩個區域重疊部分 / 合併部分的比率 (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # 有重疊部分
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # 沒有重疊部分
        iou = 0
    return iou

def selective_search(img):
    """計算 opencv 圖片中可能出現對象的區域，只返回頭 2000 個區域"""
    # 算法參考 https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
    s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    s.setBaseImage(img)
    s.switchToSelectiveSearchFast()
    boxes = s.process()
    return boxes[:2000]

def prepare_save_batch(batch, image_tensors, image_labels):
    """準備訓練 - 保存單個批次的數據"""
    # 生成輸入和輸出 tensor 對象
    tensor_in = torch.stack(image_tensors) # 維度: B,C,W,H
    tensor_out = torch.tensor(image_labels, dtype=torch.long) # 維度: B

    # Split into training set (80%), validation set (10%) and testing set (10%)
    random_indices = torch.randperm(tensor_in.shape[0])
    training_indices = random_indices[:int(len(random_indices)*0.8)]
    validating_indices = random_indices[int(len(random_indices)*0.8):int(len(random_indices)*0.9)]
    testing_indices = random_indices[int(len(random_indices)*0.9):]
    training_set = (tensor_in[training_indices], tensor_out[training_indices])
    validating_set = (tensor_in[validating_indices], tensor_out[validating_indices])
    testing_set = (tensor_in[testing_indices], tensor_out[testing_indices])

    # 保存到硬盤
    save_tensor(training_set, f"data/training_set.{batch}.pt")
    save_tensor(validating_set, f"data/validating_set.{batch}.pt")
    save_tensor(testing_set, f"data/testing_set.{batch}.pt")
    print(f"batch {batch} saved")

def prepare():
    """準備訓練"""
    # 數據集轉換到 tensor 之後會保存在 data 文件夾下
    if not os.path.isdir("data"):
        os.makedirs("data")

    # 加載 csv 文件，構建圖片到區域列表的索引 { 圖片名: [ 區域, 區域, .. ] }
    box_map = defaultdict(lambda: [])
    df = pandas.read_csv(BOX_CSV_PATH)
    for row in df.values:
        filename, width, height, x1, y1, x2, y2 = row[:7]
        box_map[filename].append((x1, y1, x2-x1, y2-y1))

    # 從圖片裏面提取人臉 (正樣本) 和非人臉 (負樣本) 的圖片
    batch_size = 1000
    batch = 0
    image_tensors = []
    image_labels = []
    for filename, true_boxes in box_map.items():
        path = os.path.join(IMAGE_DIR, filename)
        img = cv2.imread(path) # 加載原始圖片
        candidate_boxes = selective_search(img) # 查找候選區域
        positive_samples = 0
        negative_samples = 0
        for candidate_box in candidate_boxes:
            # 若是候選區域和任意一個實際區域重疊率大於 70%，則認爲是正樣本
            # 若是候選區域和全部實際區域重疊率都小於 30%，則認爲是負樣本
            # 每一個圖片最多添加正樣本數量 + 10 個負樣本，須要提供足夠多負樣本避免僞陽性判斷
            iou_list = [ calc_iou(candidate_box, true_box) for true_box in true_boxes ]
            positive_index = next((index for index, iou in enumerate(iou_list) if iou > 0.70), None)
            is_negative = all(iou < 0.30 for iou in iou_list)
            result = None
            if positive_index is not None:
                result = True
                positive_samples += 1
            elif is_negative and negative_samples < positive_samples + 10:
                result = False
                negative_samples += 1
            if result is not None:
                x, y, w, h = candidate_box
                child_img = img[y:y+h, x:x+w].copy()
                # 檢驗計算是否有問題
                # cv2.imwrite(f"{filename}_{x}_{y}_{w}_{h}_{int(result)}.png", child_img)
                image_tensors.append(image_to_tensor(child_img))
                image_labels.append(int(result))
                if len(image_tensors) >= batch_size:
                    # 保存批次
                    prepare_save_batch(batch, image_tensors, image_labels)
                    image_tensors.clear()
                    image_labels.clear()
                    batch += 1
    # 保存剩餘的批次
    if len(image_tensors) > 10:
        prepare_save_batch(batch, image_tensors, image_labels)

def train():
    """開始訓練"""
    # 建立模型實例
    model = MyModel().to(device)

    # 建立損失計算器
    loss_function = torch.nn.CrossEntropyLoss()

    # 建立參數調整器
    optimizer = torch.optim.Adam(model.parameters())

    # 記錄訓練集和驗證集的正確率變化
    training_accuracy_history = []
    validating_accuracy_history = []

    # 記錄最高的驗證集正確率
    validating_accuracy_highest = -1
    validating_accuracy_highest_epoch = 0

    # 讀取批次的工具函數
    def read_batches(base_path):
        for batch in itertools.count():
            path = f"{base_path}.{batch}.pt"
            if not os.path.isfile(path):
                break
            yield [ t.to(device) for t in load_tensor(path) ]

    # 計算正確率的工具函數，正樣本和負樣本的正確率分別計算再平均
    def calc_accuracy(actual, predicted):
        predicted = torch.max(predicted, 1).indices
        acc_positive = ((actual > 0.5) & (predicted > 0.5)).sum().item() / ((actual > 0.5).sum().item() + 0.00001)
        acc_negative = ((actual <= 0.5) & (predicted <= 0.5)).sum().item() / ((actual <= 0.5).sum().item() + 0.00001)
        acc = (acc_positive + acc_negative) / 2
        return acc
 
    # 劃分輸入和輸出的工具函數
    def split_batch_xy(batch, begin=None, end=None):
        # shape = batch_size, channels, width, height
        batch_x = batch[0][begin:end]
        # shape = batch_size, num_labels
        batch_y = batch[1][begin:end]
        return batch_x, batch_y

    # 開始訓練過程
    for epoch in range(1, 10000):
        print(f"epoch: {epoch}")

        # 根據訓練集訓練並修改參數
        model.train()
        training_accuracy_list = []
        for batch_index, batch in enumerate(read_batches("data/training_set")):
            # 切分小批次，有助於泛化模型
            training_batch_accuracy_list = []
            for index in range(0, batch[0].shape[0], 100):
                # 劃分輸入和輸出
                batch_x, batch_y = split_batch_xy(batch, index, index+100)
                # 計算預測值
                predicted = model(batch_x)
                # 計算損失
                loss = loss_function(predicted, batch_y)
                # 從損失自動微分求導函數值
                loss.backward()
                # 使用參數調整器調整參數
                optimizer.step()
                # 清空導函數值
                optimizer.zero_grad()
                # 記錄這一個批次的正確率，torch.no_grad 表明臨時禁用自動微分功能
                with torch.no_grad():
                    training_batch_accuracy_list.append(calc_accuracy(batch_y, predicted))
            # 輸出批次正確率
            training_batch_accuracy = sum(training_batch_accuracy_list) / len(training_batch_accuracy_list)
            training_accuracy_list.append(training_batch_accuracy)
            print(f"epoch: {epoch}, batch: {batch_index}: batch accuracy: {training_batch_accuracy}")
        training_accuracy = sum(training_accuracy_list) / len(training_accuracy_list)
        training_accuracy_history.append(training_accuracy)
        print(f"training accuracy: {training_accuracy}")

        # 檢查驗證集
        model.eval()
        validating_accuracy_list = []
        for batch in read_batches("data/validating_set"):
            batch_x, batch_y = split_batch_xy(batch)
            predicted = model(batch_x)
            validating_accuracy_list.append(calc_accuracy(batch_y, predicted))
        validating_accuracy = sum(validating_accuracy_list) / len(validating_accuracy_list)
        validating_accuracy_history.append(validating_accuracy)
        print(f"validating accuracy: {validating_accuracy}")

        # 記錄最高的驗證集正確率與當時的模型狀態，判斷是否在 20 次訓練後仍然沒有刷新記錄
        if validating_accuracy > validating_accuracy_highest:
            validating_accuracy_highest = validating_accuracy
            validating_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest validating accuracy updated")
        elif epoch - validating_accuracy_highest_epoch > 20:
            # 在 20 次訓練後仍然沒有刷新記錄，結束訓練
            print("stop training because highest validating accuracy not updated in 20 epoches")
            break

    # 使用達到最高正確率時的模型狀態
    print(f"highest validating accuracy: {validating_accuracy_highest}",
        f"from epoch {validating_accuracy_highest_epoch}")
    model.load_state_dict(load_tensor("model.pt"))

    # 檢查測試集
    testing_accuracy_list = []
    for batch in read_batches("data/testing_set"):
        batch_x, batch_y = split_batch_xy(batch)
        predicted = model(batch_x)
        testing_accuracy_list.append(calc_accuracy(batch_y, predicted))
    testing_accuracy = sum(testing_accuracy_list) / len(testing_accuracy_list)
    print(f"testing accuracy: {testing_accuracy}")

    # 顯示訓練集和驗證集的正確率變化
    pyplot.plot(training_accuracy_history, label="training")
    pyplot.plot(validating_accuracy_history, label="validating")
    pyplot.ylim(0, 1)
    pyplot.legend()
    pyplot.show()

def eval_model():
    """使用訓練好的模型"""
    # 建立模型實例，加載訓練好的狀態，而後切換到驗證模式
    model = MyModel().to(device)
    model.load_state_dict(load_tensor("model.pt"))
    model.eval()

    # 詢問圖片路徑，並顯示全部多是人臉的區域
    while True:
        try:
            # 選取可能出現對象的區域一覽
            image_path = input("Image path: ")
            if not image_path:
                continue
            img = cv2.imread(image_path)
            candidate_boxes = selective_search(img)
            # 構建輸入
            image_tensors = []
            for candidate_box in candidate_boxes:
                x, y, w, h = candidate_box
                child_img = img[y:y+h, x:x+w].copy()
                image_tensors.append(image_to_tensor(child_img))
            tensor_in = torch.stack(image_tensors).to(device)
            # 預測輸出
            tensor_out = model(tensor_in)
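            # Use softmax to compute the probability of being a face
            tensor_out = nn.functional.softmax(tensor_out, dim=1)
            # Column 1 is the face class; indexing already yields a 1-D tensor,
            # so the deprecated Tensor.resize call is not needed
            tensor_out = tensor_out[:, 1]
            # Regions with a probability above 99% are treated as faces; draw their boxes on the image and save it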
            img_output = img.copy()
            indices = torch.where(tensor_out > 0.99)[0]
            result_boxes = []
            result_boxes_all = []
            for index in indices:
                box = candidate_boxes[index]
                for exists_box in result_boxes_all:
                    # 若是和現存找到的區域重疊度大於 30% 則跳過
                    if calc_iou(exists_box, box) > 0.30:
                        break
                else:
                    result_boxes.append(box)
                result_boxes_all.append(box)
            for box in result_boxes:
                x, y, w, h = box
                print(x, y, w, h)
                cv2.rectangle(img_output, (x, y), (x+w, y+h), (0, 0, 0xff), 1)
            cv2.imwrite("img_output.png", img_output)
            print("saved to img_output.png")
            print()
        except Exception as e:
            print("error:", e)

def main():
    """主函數"""
    if len(sys.argv) < 2:
        print(f"Please run: {sys.argv[0]} prepare|train|eval")
        exit()

    # 給隨機數生成器分配一個初始值，使得每次運行均可以生成相同的隨機數
    # 這是爲了讓過程可重現，你也能夠選擇不這樣作
    random.seed(0)
    torch.random.manual_seed(0)

    # 根據命令行參數選擇操做
    operation = sys.argv[1]
    if operation == "prepare":
        prepare()
    elif operation == "train":
        train()
    elif operation == "eval":
        eval_model()
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    main()

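Like the code examples in the previous posts, this code is split into three parts: prepare, train and eval. prepare selects regions and extracts sub-images for positive samples (regions containing a face) and negative samples (regions not containing a face); train trains an ordinary resnet model on the sub-images; eval selects regions in a given image and decides for each region whether it contains a face.

Apart from selecting regions and extracting sub-images, it is basically the same as the CNN models introduced before 🥳.

After running the following commands: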

python3 example.py prepare
python3 example.py train

the final output is as follows:

epoch: 101, batch: 106: batch accuracy: 0.9999996838862198
epoch: 101, batch: 107: batch accuracy: 0.999218446914751
epoch: 101, batch: 108: batch accuracy: 0.9999996211125055
training accuracy: 0.999441394076678
validating accuracy: 0.9687856357743619
stop training because highest validating accuracy not updated in 20 epoches
highest validating accuracy: 0.9766918253771755 from epoch 80
testing accuracy: 0.9729761086851993

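The accuracy on the training and validation sets changes as follows:

The accuracy looks high, but it only measures how well the already-selected regions are classified. Because the selection algorithm is mediocre and there are few samples, the final result is not really satisfying 😕.

Run the following command and enter an image path to detect faces with the trained model: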

python3 example.py eval

Here are some of the detection results:

The accuracy is so-so 😕.

Fast-RCNN

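RCNN is slow mainly because classifying several thousand sub-images is very expensive, and many of these sub-images overlap, which leads to a lot of repeated computation. Fast-RCNN focuses on this part: it first computes, for the whole image, feature data with the same width and height as the image (or scaled down proportionally), then crops the feature data for each region that may contain an object, and finally classifies each cropped piece of feature data. The difference between RCNN and Fast-RCNN is shown in the figure below:

Unfortunately Fast-RCNN only improves speed, not accuracy. However, the example below introduces an important step, adjusting the region, which lets the model output regions that are closer to the real ones.

Here are some details of the Fast-RCNN model.

Resizing the source image

In RCNN, the images passed to the CNN model are resized sub-images, whereas in Fast-RCNN we pass the original image to the CNN model, so the original image needs to be resized as well. The resizing uses padding, as shown in the figure below:

The code for resizing an image is below (opencv version):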

import cv2

IMAGE_SIZE = (128, 88)

def calc_resize_parameters(sw, sh):
    """Compute the parameters for resizing an image"""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2 # pad left and right
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2 # pad top and bottom
    return sw_new, sh_new, pad_w, pad_h

def resize_image(img):
    """Resize an opencv image, padding it when the aspect ratio differs"""
    sh, sw, _ = img.shape
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    # pass the border color by keyword; the positional slot after borderType is dst, not value
    img = cv2.copyMakeBorder(img, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, dsize=IMAGE_SIZE)
    return img

After resizing the image, the region coordinates need to be converted as well. The conversion code is below (all boring code 🤒):

IMAGE_SIZE = (128, 88)

def map_box_to_resized_image(box, sw, sh):
    """Map a region in the original image to the corresponding region in the resized image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int((x + pad_w) * scale)
    y = int((y + pad_h) * scale)
    w = int(w * scale)
    h = int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def map_box_to_original_image(box, sw, sh):
    """Map a region in the resized image back to the region in the original image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int(x / scale - pad_w)
    y = int(y / scale - pad_h)
    w = int(w / scale)
    h = int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

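Computing region features

As covered in earlier posts, a CNN model consists of convolutional layers, pooling layers and fully connected layers. The convolutional and pooling layers extract the features of each area of the image, and the fully connected layers flatten the features and hand them to a linear model. In Fast-RCNN we do not use the features of the whole image at once, only the features of some regions, so the CNN model in Fast-RCNN only needs the convolutional and pooling layers (some models can omit the pooling layers). The convolutional layers usually output more channels than the image has, while the width and height shrink in proportion to the original image. For example, an input of size 3,256,256 may produce an output of 512,32,32, which means every 8x8 pixel area corresponds to 512 features.

To keep things easy to understand, the Fast-RCNN code in this post makes the CNN model output features with exactly the same width and height as the input image, so extracting region features only needs the [] operator.

Extracting region features (ROI Pooling)

After Fast-RCNN has computed the features of the whole image, the next step is extracting region features (Region of Interest Pooling). Put simply, this crops the feature data at the position of each region in the image and then resizes the crops to the same size, as shown in the figure below:

The layer that extracts region features is also called the ROI layer.

If the features have the same width and height as the image, cropping the features is a simple [] operation. If the features are smaller than the image, cropping needs Nearest Neighbor Interpolation or Bilinear Interpolation; an ROI layer that crops with bilinear interpolation is also called ROI Align. The resizing after cropping can use MaxPool, nearest neighbor interpolation, bilinear interpolation or similar algorithms.

For a better understanding of ROI Align and bilinear interpolation, you can refer to this article.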
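To make the difference between the two interpolation methods concrete, here is a small pure-Python sketch that samples a single-channel feature grid at fractional coordinates. The function names are made up for this illustration, and the code assumes 0 <= x <= width - 1 and 0 <= y <= height - 1. It is not how ROI Align is implemented in any particular library.

```python
def nearest_sample(feature, x, y):
    """Take the value of the grid cell closest to (x, y)."""
    return feature[int(y + 0.5)][int(x + 0.5)]

def bilinear_sample(feature, x, y):
    """Blend the four grid cells around (x, y), weighted by distance."""
    h, w = len(feature), len(feature[0])
    x0, y0 = int(x), int(y)                          # cell to the top-left
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # cell to the bottom-right
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 10.0],
           [20.0, 30.0]]
print(nearest_sample(feature, 0.6, 0.2))    # 10.0
print(bilinear_sample(feature, 0.5, 0.5))   # 15.0
```

Nearest neighbor snaps to one cell and discards the fractional part of the position, while bilinear interpolation keeps it, which is why it gives smoother region features when the feature map is smaller than the image.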

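Adjusting the region

As mentioned earlier, regions selected by algorithms such as selective search may deviate somewhat from where the object really is, and this deviation can be corrected by the model. As a simple example, if a region contains the left half of a face, the model should learn that the region needs to be extended to the right.

The adjustment consists of four parameters:

Adjustment of the x coordinate of the top-left corner
Adjustment of the y coordinate of the top-left corner
Adjustment of the width
Adjustment of the height

The coordinates and sizes vary: the same left half of a face gives different x and y coordinates when it appears in the top-left versus the bottom-right of the image, and a different width and height when the distance differs. So we need to normalize the adjustment, using the following formulas:

x1, y1, w1, h1 = candidate region
x2, y2, w2, h2 = true region
x offset = (x2 - x1) / w1
y offset = (y2 - y1) / h1
w offset = log(w2 / w1)
h offset = log(h2 / h1)

After normalization the offsets are ratios rather than absolute values, so they are not affected by the concrete coordinates and sizes. The log in the formulas dampens the growth of the offsets, so the model can still learn well when the offset is large.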
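As a worked illustration, the normalization and its inverse can be written as below. The symbols t_x, t_y, t_w and t_h are notation introduced here for the four offsets, and log is taken to be the natural logarithm.

```latex
\begin{aligned}
t_x &= \frac{x_2 - x_1}{w_1}, & t_y &= \frac{y_2 - y_1}{h_1}, &
t_w &= \log\frac{w_2}{w_1}, & t_h &= \log\frac{h_2}{h_1} \\
x_2 &= x_1 + w_1 t_x, & y_2 &= y_1 + h_1 t_y, &
w_2 &= w_1 e^{t_w}, & h_2 &= h_1 e^{t_h}
\end{aligned}
```

For a candidate region (10, 10, 20, 20) and a true region (15, 10, 40, 20), this gives t_x = 0.25, t_y = 0, t_w = log 2 ≈ 0.693 and t_h = 0. Doubling the width therefore costs about 0.69 rather than 1, and an eightfold width gives about 2.08 rather than 7.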

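The code for computing the offsets and for adjusting a region by its offsets is below:

import math

def calc_box_offset(candidate_box, true_box):
    """Compute the offsets between a candidate region and the true region"""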
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1
    y_offset = (y2 - y1) / h1
    w_offset = math.log(w2 / w1)
    h_offset = math.log(h2 / h1)
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """Adjust a candidate region by the given offsets"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    x2 = w1 * x_offset + x1
    y2 = h1 * y_offset + y1
    w2 = math.exp(w_offset) * w1
    h2 = math.exp(h_offset) * h1
    return (x2, y2, w2, h2)

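Computing the loss

For each region, the Fast-RCNN model outputs two results: the label of the region (face, non-face) and the region offsets described above. Parameter updates have to take both results into account. To train on several outputs at once, add the losses together and then compute the gradients of the parameters:

features of each region = ROI layer(CNN model(image data))

linear model for labels(features of each region) - true labels = label loss
linear model for offsets(features of each region) - true offsets = offset loss

loss = label loss + offset loss

One thing to note is that in this example the label loss has to be computed separately for positive and negative samples, otherwise the model ends up outputting only negative results. The linear model may output positive (face) or negative (non-face) for the extracted features, but many of the features extracted by the ROI layer overlap, that is, they come from the same source. When there are more negative samples than positive ones, the overall direction leans negative, so every parameter update pushes towards outputting negative. If the losses are computed separately, the non-overlapping features can be pushed towards positive and negative respectively, and the model is able to learn.

In addition, the offset loss should be computed from positive samples only; there is no need to learn offsets for negative samples.

The final loss computation is:

features of each region = ROI layer(CNN model(image data))

linear model for labels(features of each region)[positive samples] - true labels[positive samples] = positive label loss
linear model for labels(features of each region)[negative samples] - true labels[negative samples] = negative label loss
linear model for offsets(features of each region)[positive samples] - true offsets[positive samples] = positive offset loss

loss = positive label loss + negative label loss + positive offset loss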
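The split loss can be sketched in plain Python as follows. This is only an illustration under stated assumptions: cross entropy for the label loss and mean squared error for the offset loss are choices made here, since this section does not say which loss functions the real code uses, and the name `combined_loss` is hypothetical. A real implementation would compute the same thing on torch tensors so that gradients are available.

```python
import math

def cross_entropy(logits, label):
    """Cross entropy of one sample, from raw class scores and the true class index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def combined_loss(pred_logits, pred_offsets, true_labels, true_offsets):
    """Positive label loss + negative label loss + positive offset loss."""
    pos = [i for i, label in enumerate(true_labels) if label == 1]
    neg = [i for i, label in enumerate(true_labels) if label == 0]
    # Each group is averaged on its own, so many negatives cannot drown out a few positives
    loss_pos = mean(cross_entropy(pred_logits[i], 1) for i in pos)
    loss_neg = mean(cross_entropy(pred_logits[i], 0) for i in neg)
    # Offsets are only learned for positive samples
    loss_offset = mean((p - t) ** 2
                       for i in pos
                       for p, t in zip(pred_offsets[i], true_offsets[i]))
    return loss_pos + loss_neg + loss_offset

# One positive and two negative regions, with an untrained model that outputs zeros
pred_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
pred_offsets = [[0.0, 0.0, 0.0, 0.0]] * 3
true_labels = [1, 0, 0]
true_offsets = [[1.0, 0.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4]
print(combined_loss(pred_logits, pred_offsets, true_labels, true_offsets))
```

In this example the single positive region contributes log 2 to the label loss, exactly as much as the two negative regions together, and the offset term is 0.25.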

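Merging result regions

The region selection algorithm returns many overlapping regions to begin with, so several regions may each have an IOU with the true region above the threshold (70%), and all of them are considered to contain the object:

After training, the model may also return such overlapping regions when predicting on an image. There are several ways to merge them:

Use the leftmost, rightmost, topmost or bottommost region
Use the first region (the region selection algorithm sorts regions by how likely they are to contain an object)
Combine all overlapping regions (if the region adjustment works poorly, the result may be much larger than the true region)

The RCNN code above already uses the second method to merge result regions, and the example below does the same. The Faster-RCNN in the next post will use the third method, because its region adjustment works better.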
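The third method can be sketched as a greedy single pass that grows a bounding box around every group of overlapping regions. The function name `merge_overlapping_boxes` and the 30% threshold are assumptions for this illustration, not the code used in the next post.

```python
def calc_iou(rect1, rect2):
    """Intersection over union of two (x, y, w, h) regions."""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    wi = min(x1 + w1, x2 + w2) - max(x1, x2)
    hi = min(y1 + h1, y2 + h2) - max(y1, y2)
    if wi <= 0 or hi <= 0:
        return 0
    overlap = wi * hi
    return overlap / (w1 * h1 + w2 * h2 - overlap)

def merge_overlapping_boxes(boxes, iou_threshold=0.30):
    """Combine each box with the first merged box it overlaps, else start a new one."""
    merged_boxes = []
    for box in boxes:
        x, y, w, h = box
        for i, (mx, my, mw, mh) in enumerate(merged_boxes):
            if calc_iou((mx, my, mw, mh), box) > iou_threshold:
                # Replace the merged box with the bounding box of both regions
                left, top = min(mx, x), min(my, y)
                right, bottom = max(mx + mw, x + w), max(my + mh, y + h)
                merged_boxes[i] = (left, top, right - left, bottom - top)
                break
        else:
            merged_boxes.append((x, y, w, h))
    return merged_boxes

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 5, 5)]
print(merge_overlapping_boxes(boxes))   # [(0, 0, 11, 11), (50, 50, 5, 5)]
```

The merged box can only grow, which is exactly the risk mentioned in the list above: a single badly adjusted region inflates the result.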

Original paper

If you want to read the original Fast-RCNN paper, it is available at:

https://arxiv.org/pdf/1504.08083.pdf

Detecting faces in images with Fast-RCNN

Code time 😱. This code uses a Fast-RCNN model to detect the faces in images, with the same dataset as the previous example.

import os
import sys
import torch
import gzip
import itertools
import random
import numpy
import math
import pandas
import cv2
from torch import nn
from matplotlib import pyplot
from collections import defaultdict

# Size that images are resized to
IMAGE_SIZE = (256, 256)
# Folder containing the images to analyze
IMAGE_DIR = "./784145_1347673_bundle_archive/train/image_data"
# CSV file defining the face regions in each image
BOX_CSV_PATH = "./784145_1347673_bundle_archive/train/bbox_train.csv"

# Use the GPU when it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BasicBlock(nn.Module):
    """ResNet 使用的基礎塊"""
    expansion = 1 # 定義這個塊的實際出通道是 channels_out 的幾倍，這裏的實現固定是一倍
    def __init__(self, channels_in, channels_out, stride):
        super().__init__()
        # 生成 3x3 的卷積層
        # 處理間隔 stride = 1 時，輸出的長寬會等於輸入的長寬，例如 (32-3+2)//1+1 == 32
        # 處理間隔 stride = 2 時，輸出的長寬會等於輸入的長寬的一半，例如 (32-3+2)//2+1 == 16
        # 此外 resnet 的 3x3 卷積層不使用偏移值 bias
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels_in, channels_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(channels_out))
        # 再定義一個讓輸出和輸入維度相同的 3x3 卷積層
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels_out, channels_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(channels_out))
        # 讓原始輸入和輸出相加的時候，須要維度一致，若是維度不一致則須要整合
        self.identity = nn.Sequential()
        if stride != 1 or channels_in != channels_out * self.expansion:
            self.identity = nn.Sequential(
                nn.Conv2d(channels_in, channels_out * self.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(channels_out * self.expansion))

    def forward(self, x):
        # x => conv1 => relu => conv2 => + => relu
        # |                              ^
        # |==============================|
        tmp = self.conv1(x)
        tmp = nn.functional.relu(tmp, inplace=True)
        tmp = self.conv2(tmp)
        tmp += self.identity(x)
        y = nn.functional.relu(tmp, inplace=True)
        return y

class MyModel(nn.Module):
    """Fast-RCNN (基於 ResNet-18 的變種)"""
    def __init__(self):
        super().__init__()
        # 記錄上一層的出通道數量
        self.previous_channels_out = 4
        # 把 3 通道轉換到 4 通道，長寬不變
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, self.previous_channels_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(self.previous_channels_out))
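        # ResNet that extracts the features of each area of the image (without AvgPool and the fully connected layer)
        # Unlike the original ResNet, the output has the same width and height as the input, so that the ROI layer can extract features by region
        # In addition, the number of channels is reduced so that the model can run with 4GB of GPU memory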
        self.layer1 = self._make_layer(BasicBlock, channels_out=4, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(BasicBlock, channels_out=4, num_blocks=2, stride=1)
        self.layer3 = self._make_layer(BasicBlock, channels_out=8, num_blocks=2, stride=1)
        self.layer4 = self._make_layer(BasicBlock, channels_out=8, num_blocks=2, stride=1)
        # ROI 層抽取各個子區域特徵後轉換到固定大小
        self.roi_pool = nn.AdaptiveMaxPool2d((5, 5))
        # 輸出兩個分類 [非人臉, 人臉]
        self.fc_labels_model = nn.Sequential(
            nn.Linear(8*5*5, 32),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(32, 2))
        # 計算區域偏移，分別輸出 x, y, w, h 的偏移
        self.fc_offsets_model = nn.Sequential(
            nn.Linear(8*5*5, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 4))

    def _make_layer(self, block_type, channels_out, num_blocks, stride):
        blocks = []
        # 添加第一個塊
        blocks.append(block_type(self.previous_channels_out, channels_out, stride))
        self.previous_channels_out = channels_out * block_type.expansion
        # 添加剩餘的塊，剩餘的塊固定處理間隔爲 1，不會改變長寬
        for _ in range(num_blocks-1):
            blocks.append(block_type(self.previous_channels_out, self.previous_channels_out, 1))
            self.previous_channels_out *= block_type.expansion
        return nn.Sequential(*blocks)

    def _roi_pooling(self, feature_mapping, roi_boxes):
        result = []
        for box in roi_boxes:
            image_index, x, y, w, h = map(int, box.tolist())
            feature_sub_region = feature_mapping[image_index][:,x:x+w,y:y+h]
            fixed_features = self.roi_pool(feature_sub_region).reshape(-1) # 順道扁平化
            result.append(fixed_features)
        return torch.stack(result)

    def forward(self, x):
        images_tensor = x[0]
        candidate_boxes_tensor = x[1]
        # 轉換出通道
        tmp = self.conv1(images_tensor)
        tmp = nn.functional.relu(tmp)
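        # Apply the ResNet layers
        # Resulting shape: B,8,W,H (layer4 outputs 8 channels)
        tmp = self.layer1(tmp)
        tmp = self.layer2(tmp)
        tmp = self.layer3(tmp)
        tmp = self.layer4(tmp)
        # Use the ROI layer to extract the features of each sub-region and convert them to a fixed size
        # Resulting shape: N,8*5*5 (N = number of candidate regions)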
        tmp = self._roi_pooling(tmp, candidate_boxes_tensor)
        # 根據抽取出來的子區域特徵分別計算分類 (是否人臉) 和區域偏移
        labels = self.fc_labels_model(tmp)
        offsets = self.fc_offsets_model(tmp)
        y = (labels, offsets)
        return y

def save_tensor(tensor, path):
    """保存 tensor 對象到文件"""
    torch.save(tensor, gzip.GzipFile(path, "wb"))

def load_tensor(path):
    """從文件讀取 tensor 對象"""
    return torch.load(gzip.GzipFile(path, "rb"))

def calc_resize_parameters(sw, sh):
    """計算縮放圖片的參數"""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2 # 填充左右
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2 # 填充上下
    return sw_new, sh_new, pad_w, pad_h

def resize_image(img):
    """縮放 opencv 圖片，比例不一致時填充"""
    sh, sw, _ = img.shape
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    img = cv2.copyMakeBorder(img, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, dsize=IMAGE_SIZE)
    return img

def image_to_tensor(img):
    """轉換 opencv 圖片對象到 tensor 對象"""
    # 注意 opencv 是 BGR，但對訓練沒有影響因此不用轉爲 RGB
    arr = numpy.asarray(img)
    t = torch.from_numpy(arr)
    t = t.transpose(0, 2) # 轉換維度 H,W,C 到 C,W,H
    t = t / 255.0 # 正規化數值使得範圍在 0 ~ 1
    return t

def map_box_to_resized_image(box, sw, sh):
    """把原始區域轉換到縮放後的圖片對應的區域"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int((x + pad_w) * scale)
    y = int((y + pad_h) * scale)
    w = int(w * scale)
    h = int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def map_box_to_original_image(box, sw, sh):
    """把縮放後圖片對應的區域轉換到縮放前的原始區域"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int(x / scale - pad_w)
    y = int(y / scale - pad_h)
    w = int(w / scale)
    h = int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def calc_iou(rect1, rect2):
    """計算兩個區域重疊部分 / 合併部分的比率 (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # 有重疊部分
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # 沒有重疊部分
        iou = 0
    return iou

def calc_box_offset(candidate_box, true_box):
    """計算候選區域與實際區域的偏移值"""
    # 這裏計算出來的偏移值基於比例，而不受具體位置和大小影響
    # w h 使用 log 是爲了減小過大的值的影響
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1
    y_offset = (y2 - y1) / h1
    w_offset = math.log(w2 / w1)
    h_offset = math.log(h2 / h1)
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """根據偏移值調整候選區域"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    x2 = w1 * x_offset + x1
    y2 = h1 * y_offset + y1
    w2 = math.exp(w_offset) * w1
    h2 = math.exp(h_offset) * h1
    return (x2, y2, w2, h2)

def selective_search(img):
    """計算 opencv 圖片中可能出現對象的區域，只返回頭 2000 個區域"""
    # 算法參考 https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
    s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    s.setBaseImage(img)
    s.switchToSelectiveSearchFast()
    boxes = s.process()
    return boxes[:2000]

def prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets):
    """準備訓練 - 保存單個批次的數據"""
    # 按索引值列表生成輸入和輸出 tensor 對象的函數
    def split_dataset(indices):
        image_in = []
        candidate_boxes_in = []
        labels_out = []
        offsets_out = []
        for new_image_index, original_image_index in enumerate(indices):
            image_in.append(image_tensors[original_image_index])
            for box, label, offset in zip(image_candidate_boxes, image_labels, image_box_offsets):
                box_image_index, x, y, w, h = box
                if box_image_index == original_image_index:
                    candidate_boxes_in.append((new_image_index, x, y, w, h))
                    labels_out.append(label)
                    offsets_out.append(offset)
        # 檢查計算是否有問題
        # for box, label in zip(candidate_boxes_in, labels_out):
        #    image_index, x, y, w, h = box
        #    child_img = image_in[image_index][:, x:x+w, y:y+h].transpose(0, 2) * 255
        #    cv2.imwrite(f"{image_index}_{x}_{y}_{w}_{h}_{label}.png", child_img.numpy())
        tensor_image_in = torch.stack(image_in) # shape: B,C,W,H
        tensor_candidate_boxes_in = torch.tensor(candidate_boxes_in, dtype=torch.float) # shape: N,5 (index, x, y, w, h)
        tensor_labels_out = torch.tensor(labels_out, dtype=torch.long) # shape: N
        tensor_box_offsets_out = torch.tensor(offsets_out, dtype=torch.float) # shape: N,4 (x_offset, y_offset, ..)
        return (tensor_image_in, tensor_candidate_boxes_in), (tensor_labels_out, tensor_box_offsets_out)

    # Split into a training set (80%), a validating set (10%) and a testing set (10%)
    random_indices = torch.randperm(len(image_tensors))
    training_indices = random_indices[:int(len(random_indices)*0.8)]
    validating_indices = random_indices[int(len(random_indices)*0.8):int(len(random_indices)*0.9)]
    testing_indices = random_indices[int(len(random_indices)*0.9):]
    training_set = split_dataset(training_indices)
    validating_set = split_dataset(validating_indices)
    testing_set = split_dataset(testing_indices)

    # Save to disk
    save_tensor(training_set, f"data/training_set.{batch}.pt")
    save_tensor(validating_set, f"data/validating_set.{batch}.pt")
    save_tensor(testing_set, f"data/testing_set.{batch}.pt")
    print(f"batch {batch} saved")

def prepare():
    """Prepare for training"""
    # The dataset is converted to tensors and saved under the data folder
    if not os.path.isdir("data"):
        os.makedirs("data")

    # Load the csv file and build an index from image to boxes { filename: [ box, box, .. ] }
    box_map = defaultdict(lambda: [])
    df = pandas.read_csv(BOX_CSV_PATH)
    for row in df.values:
        filename, width, height, x1, y1, x2, y2 = row[:7]
        box_map[filename].append((x1, y1, x2-x1, y2-y1))

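    # Extract face (positive) and non-face (negative) samples from the images
    batch_size = 50
    max_samples = 10
    batch = 0
    image_tensors = [] # list of images
    image_candidate_boxes = [] # candidate boxes of each image
    image_labels = [] # label of each candidate box (1 = face, 0 = non-face)
    image_box_offsets = [] # offset between each candidate box and its true box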
    for filename, true_boxes in box_map.items():
        path = os.path.join(IMAGE_DIR, filename)
        img_original = cv2.imread(path) # load the original image
        sh, sw, _ = img_original.shape # original image size
        img = resize_image(img_original) # resize the image
        candidate_boxes = selective_search(img) # find candidate boxes
        true_boxes = [ map_box_to_resized_image(b, sw, sh) for b in true_boxes ] # map the true boxes onto the resized image
        image_index = len(image_tensors) # index of the image within the batch
        image_tensors.append(image_to_tensor(img.copy()))
        positive_samples = 0
        negative_samples = 0
        for candidate_box in candidate_boxes:
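            # If a candidate box overlaps any true box with an IOU above 70%, treat it as a positive sample
            # If a candidate box overlaps every true box with an IOU below 30%, treat it as a negative sample
            # Each image contributes at most (number of positive samples + 10) negative samples;
            # enough negative samples are needed to avoid false positives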
            iou_list = [ calc_iou(candidate_box, true_box) for true_box in true_boxes ]
            positive_index = next((index for index, iou in enumerate(iou_list) if iou > 0.70), None)
            is_negative = all(iou < 0.30 for iou in iou_list)
            result = None
            if positive_index is not None:
                result = True
                positive_samples += 1
            elif is_negative and negative_samples < positive_samples + 10:
                result = False
                negative_samples += 1
            if result is not None:
                x, y, w, h = candidate_box
                # Sanity check for the computation above (uncomment to dump the cropped region)
                # child_img = img[y:y+h, x:x+w].copy()
                # cv2.imwrite(f"{filename}_{x}_{y}_{w}_{h}_{int(result)}.png", child_img)
                image_candidate_boxes.append((image_index, x, y, w, h))
                image_labels.append(int(result))
                if positive_index is not None:
                    image_box_offsets.append(calc_box_offset(
                        candidate_box, true_boxes[positive_index])) # positive samples record the offset
                else:
                    image_box_offsets.append((0, 0, 0, 0)) # negative samples have no offset
            if positive_samples >= max_samples:
                break
        # Save the batch
        if len(image_tensors) >= batch_size:
            prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets)
            image_tensors.clear()
            image_candidate_boxes.clear()
            image_labels.clear()
            image_box_offsets.clear()
            batch += 1
    # Save the remaining batch (skipped if 10 or fewer images are left)
    if len(image_tensors) > 10:
        prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets)

def train():
    """Start training"""
    # Create the model instance
    model = MyModel().to(device)

    # Create the multi-task loss function
    celoss = torch.nn.CrossEntropyLoss()
    mseloss = torch.nn.MSELoss()
    def loss_function(predicted, actual):
        # Label loss must be computed separately for positive and negative samples,
        # otherwise the model tends to always predict negative
        positive_indices = actual[0].nonzero(as_tuple=True)[0] # indices of positive samples
        negative_indices = (actual[0] == 0).nonzero(as_tuple=True)[0] # indices of negative samples
        loss1 = celoss(predicted[0][positive_indices], actual[0][positive_indices]) # label loss of positive samples
        loss2 = celoss(predicted[0][negative_indices], actual[0][negative_indices]) # label loss of negative samples
        loss3 = mseloss(predicted[1][positive_indices], actual[1][positive_indices]) # offset loss, positive samples only
        return loss1 + loss2 + loss3

    # Create the optimizer
    optimizer = torch.optim.Adam(model.parameters())

    # Record how the training and validating accuracy change over time
    training_label_accuracy_history = []
    training_offset_accuracy_history = []
    validating_label_accuracy_history = []
    validating_offset_accuracy_history = []

    # Record the highest validating accuracy
    validating_label_accuracy_highest = -1
    validating_label_accuracy_highest_epoch = 0
    validating_offset_accuracy_highest = -1
    validating_offset_accuracy_highest_epoch = 0

    # Helper that reads batches from disk
    def read_batches(base_path):
        for batch in itertools.count():
            path = f"{base_path}.{batch}.pt"
            if not os.path.isfile(path):
                break
            yield [ [ tt.to(device) for tt in t ] for t in load_tensor(path) ]

    # Helper that computes accuracy
    def calc_accuracy(actual, predicted):
        # Label accuracy: compute the accuracy of positive and negative samples separately, then average
        predicted_i = torch.max(predicted[0], 1).indices
        acc_positive = ((actual[0] > 0.5) & (predicted_i > 0.5)).sum().item() / ((actual[0] > 0.5).sum().item() + 0.00001)
        acc_negative = ((actual[0] <= 0.5) & (predicted_i <= 0.5)).sum().item() / ((actual[0] <= 0.5).sum().item() + 0.00001)
        acc_label = (acc_positive + acc_negative) / 2
        # print(acc_positive, acc_negative)
        # Offset accuracy
        valid_indices = actual[1].nonzero(as_tuple=True)[0]
        if valid_indices.shape[0] == 0:
            acc_offset = 1
        else:
            acc_offset = (1 - (predicted[1][valid_indices] - actual[1][valid_indices]).abs().mean()).item()
            acc_offset = max(acc_offset, 0)
        return acc_label, acc_offset

    # Start the training loop
    for epoch in range(1, 10000):
        print(f"epoch: {epoch}")

        # Train on the training set and update the parameters
        model.train()
        training_label_accuracy_list = []
        training_offset_accuracy_list = []
        for batch_index, batch in enumerate(read_batches("data/training_set")):
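            # Split into input and output
            batch_x, batch_y = batch
            # Compute the predictions
            predicted = model(batch_x)
            # Compute the loss
            loss = loss_function(predicted, batch_y)
            # Compute the gradients from the loss with automatic differentiation
            loss.backward()
            # Adjust the parameters with the optimizer
            optimizer.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Record the accuracy of this batch; torch.no_grad temporarily disables automatic differentiation
            with torch.no_grad():
                training_batch_label_accuracy, training_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
            # Print the batch accuracy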
            training_label_accuracy_list.append(training_batch_label_accuracy)
            training_offset_accuracy_list.append(training_batch_offset_accuracy)
            print(f"epoch: {epoch}, batch: {batch_index}: " +
                f"batch label accuracy: {training_batch_label_accuracy}, offset accuracy: {training_batch_offset_accuracy}")
        training_label_accuracy = sum(training_label_accuracy_list) / len(training_label_accuracy_list)
        training_offset_accuracy = sum(training_offset_accuracy_list) / len(training_offset_accuracy_list)
        training_label_accuracy_history.append(training_label_accuracy)
        training_offset_accuracy_history.append(training_offset_accuracy)
        print(f"training label accuracy: {training_label_accuracy}, offset accuracy: {training_offset_accuracy}")

        # Check the validating set
        model.eval()
        validating_label_accuracy_list = []
        validating_offset_accuracy_list = []
        for batch in read_batches("data/validating_set"):
            batch_x, batch_y = batch
            predicted = model(batch_x)
            validating_batch_label_accuracy, validating_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
            validating_label_accuracy_list.append(validating_batch_label_accuracy)
            validating_offset_accuracy_list.append(validating_batch_offset_accuracy)
        validating_label_accuracy = sum(validating_label_accuracy_list) / len(validating_label_accuracy_list)
        validating_offset_accuracy = sum(validating_offset_accuracy_list) / len(validating_offset_accuracy_list)
        validating_label_accuracy_history.append(validating_label_accuracy)
        validating_offset_accuracy_history.append(validating_offset_accuracy)
        print(f"validating label accuracy: {validating_label_accuracy}, offset accuracy: {validating_offset_accuracy}")

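        # Record the highest validating accuracy and the model state at that time,
        # and check whether the record has gone 20 epochs without being refreshed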
        if validating_label_accuracy > validating_label_accuracy_highest:
            validating_label_accuracy_highest = validating_label_accuracy
            validating_label_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest label validating accuracy updated")
        elif validating_offset_accuracy > validating_offset_accuracy_highest:
            validating_offset_accuracy_highest = validating_offset_accuracy
            validating_offset_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest offset validating accuracy updated")
        elif (epoch - validating_label_accuracy_highest_epoch > 20 and
            epoch - validating_offset_accuracy_highest_epoch > 20):
            # No new record after 20 epochs, stop training
            print("stop training because highest validating accuracy not updated in 20 epochs")
            break

    # Use the model state saved when the highest accuracy was reached
    print(f"highest label validating accuracy: {validating_label_accuracy_highest}",
        f"from epoch {validating_label_accuracy_highest_epoch}")
    print(f"highest offset validating accuracy: {validating_offset_accuracy_highest}",
        f"from epoch {validating_offset_accuracy_highest_epoch}")
    model.load_state_dict(load_tensor("model.pt"))

    # Check the testing set
    testing_label_accuracy_list = []
    testing_offset_accuracy_list = []
    for batch in read_batches("data/testing_set"):
        batch_x, batch_y = batch
        predicted = model(batch_x)
        testing_batch_label_accuracy, testing_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
        testing_label_accuracy_list.append(testing_batch_label_accuracy)
        testing_offset_accuracy_list.append(testing_batch_offset_accuracy)
    testing_label_accuracy = sum(testing_label_accuracy_list) / len(testing_label_accuracy_list)
    testing_offset_accuracy = sum(testing_offset_accuracy_list) / len(testing_offset_accuracy_list)
    print(f"testing label accuracy: {testing_label_accuracy}, offset accuracy: {testing_offset_accuracy}")

    # Plot how the training and validating accuracy change over time
    pyplot.plot(training_label_accuracy_history, label="training_label_accuracy")
    pyplot.plot(training_offset_accuracy_history, label="training_offset_accuracy")
    pyplot.plot(validating_label_accuracy_history, label="validating_label_accuracy")
    pyplot.plot(validating_offset_accuracy_history, label="validating_offset_accuracy")
    pyplot.ylim(0, 1)
    pyplot.legend()
    pyplot.show()

def eval_model():
    """Use the trained model"""
    # Create the model instance, load the trained state, then switch to evaluation mode
    model = MyModel().to(device)
    model.load_state_dict(load_tensor("model.pt"))
    model.eval()

    # Ask for an image path and mark every region that may be a face
    while True:
        try:
            # Select the regions that may contain objects
            image_path = input("Image path: ")
            if not image_path:
                continue
            img_original = cv2.imread(image_path) # load the original image
            sh, sw, _ = img_original.shape # original image size
            img = resize_image(img_original) # resize the image
            candidate_boxes = selective_search(img) # find candidate boxes
            # Build the input
            image_tensor = image_to_tensor(img).unsqueeze(dim=0).to(device) # shape: 1,C,W,H
            candidate_boxes_tensor = torch.tensor(
                [ (0, x, y, w, h) for x, y, w, h in candidate_boxes ],
                dtype=torch.float).to(device) # shape: N,5
            tensor_in = (image_tensor, candidate_boxes_tensor)
            # Predict the output
            labels, offsets = model(tensor_in)
            labels = nn.functional.softmax(labels, dim=1)
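            labels = labels[:, 1] # probability of the face class, shape: N
            # Treat regions with a probability above 99% as faces, adjust them by the offset, then draw the boxes on the image and save it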
            img_output = img_original.copy()
            for box, label, offset in zip(candidate_boxes, labels, offsets):
                if label.item() <= 0.99:
                    continue
                box = adjust_box_by_offset(box, offset.tolist())
                x, y, w, h = map_box_to_original_image(box, sw, sh)
                if w == 0 or h == 0:
                    continue
                print(x, y, w, h)
                cv2.rectangle(img_output, (x, y), (x+w, y+h), (0, 0, 0xff), 1)
            cv2.imwrite("img_output.png", img_output)
            print("saved to img_output.png")
            print()
        except Exception as e:
            print("error:", e)

def main():
    """Entry point"""
    if len(sys.argv) < 2:
        print(f"Please run: {sys.argv[0]} prepare|train|eval")
        exit()

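    # Seed the random number generators so that every run produces the same random numbers
    # This makes the process reproducible; you can also choose not to do it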
    random.seed(0)
    torch.random.manual_seed(0)

    # Choose the operation based on the command line argument
    operation = sys.argv[1]
    if operation == "prepare":
        prepare()
    elif operation == "train":
        train()
    elif operation == "eval":
        eval_model()
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    main()
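
The two offset helpers in the script above, calc_box_offset and adjust_box_by_offset, are inverses of each other: the first encodes how a true box differs from a candidate box, and the second applies an offset (in practice the one predicted by the model) back to the candidate box. Below is a minimal standalone sketch of that round trip, assuming boxes in (x, y, w, h) format. The two sample boxes are made-up values for illustration only.

```python
import math

def calc_box_offset(candidate_box, true_box):
    """Encode true_box relative to candidate_box (same formulas as the script above)"""
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    # Position offsets are divided by the candidate size, size offsets use a log ratio
    return ((x2 - x1) / w1, (y2 - y1) / h1, math.log(w2 / w1), math.log(h2 / h1))

def adjust_box_by_offset(candidate_box, offset):
    """Apply an offset to candidate_box (inverse of calc_box_offset)"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    return (w1 * x_offset + x1, h1 * y_offset + y1,
            math.exp(w_offset) * w1, math.exp(h_offset) * h1)

candidate_box = (50, 40, 100, 80)  # hypothetical region from selective search
true_box = (60, 36, 120, 100)      # hypothetical labeled face region

offset = calc_box_offset(candidate_box, true_box)
restored = adjust_box_by_offset(candidate_box, offset)
print("offset:", offset)
print("restored:", restored)  # matches true_box up to floating point error
```

Because the position offsets are relative to the candidate size and the size offsets are log ratios, the regression targets stay in a similar numeric range for small and large boxes, which is likely why this encoding pairs well with the MSE loss used in train().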

After running the following commands:

python3 example.py prepare
python3 example.py train

The output after 31 epochs of training is as follows (training takes a really long time, so I got lazy here 🥺):

epoch: 31, batch: 112: batch label accuracy: 0.9805490565092065, offset accuracy: 0.9293316006660461
epoch: 31, batch: 113: batch label accuracy: 0.9776784565994586, offset accuracy: 0.9191392660140991
epoch: 31, batch: 114: batch label accuracy: 0.9469732184008024, offset accuracy: 0.9101274609565735
training label accuracy: 0.9707166603858259, offset accuracy: 0.9191886570142663
validating label accuracy: 0.9306134214845806, offset accuracy: 0.9205827381299889
highest offset validating accuracy updated
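
Note that the label accuracy in this log is not the plain fraction of correct predictions. calc_accuracy averages the accuracy on positive samples and the accuracy on negative samples, so a model that always answers "non-face" scores about 0.5 instead of something close to 1. Here is a simplified pure-Python sketch of that metric. The script itself works on tensors, and the labels below are made up for illustration.

```python
def balanced_label_accuracy(actual, predicted):
    """Average of the per-class accuracy for binary labels (1 = face, 0 = non-face)"""
    eps = 0.00001  # avoids division by zero when a class is absent, as in the script
    pos_total = sum(1 for a in actual if a == 1)
    neg_total = sum(1 for a in actual if a == 0)
    pos_hit = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    neg_hit = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return (pos_hit / (pos_total + eps) + neg_hit / (neg_total + eps)) / 2

actual = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # hypothetical batch: 1 face, 9 non-faces
always_negative = [0] * len(actual)      # a model that never predicts "face"

plain = sum(a == p for a, p in zip(actual, always_negative)) / len(actual)
balanced = balanced_label_accuracy(actual, always_negative)
print("plain accuracy:", plain)
print("balanced accuracy:", balanced)
```

With heavily imbalanced candidate regions the plain accuracy would look good even for a useless model, which is why the averaged figure is the more informative one to watch during training.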

Run the following command and then enter an image path to detect faces in the image with the trained model: