GitHub repositories: Mask_RCNN
Mask_RCNN_KeyPoints
[Computer Vision] Mask-RCNN: Paper Notes
[Computer Vision] Mask-RCNN: Project Documentation Translation
[Computer Vision] Mask-RCNN Inference Network, Part 1: Overview
[Computer Vision] Mask-RCNN Inference Network, Part 2: The Shared FPN Network Based on ResNet101
[Computer Vision] Mask-RCNN Inference Network, Part 3: RPN Anchor Processing and Proposal Generation
[Computer Vision] Mask-RCNN Inference Network, Part 4: Coupling FPN and ROIAlign
[Computer Vision] Mask-RCNN Inference Network, Part 5: Refining the Detection Results
[Computer Vision] Mask-RCNN Inference Network, Part 6: Mask Generation
[Computer Vision] Mask-RCNN Inference Network, Final Part: Inference with the detect Method
[Computer Vision] Mask-RCNN: Anchor Generation
[Computer Vision] Mask-RCNN Training Network, Part 1: The Dataset and the Dataset Class
[Computer Vision] Mask-RCNN Training Network, Part 2: Training Network Structure & Loss Functions
[Computer Vision] Mask-RCNN Training Network, Part 3: Training the Model
The original paper mentions that Mask R-CNN can also perform keypoint detection, but the project we have been studying does not include a keypoint branch. Someone has extended the project accordingly: Mask_RCNN_Humanpose. In this article we will briefly look at how to add a keypoint-detection branch to the model and, going one step further, we will try to use Mask R-CNN on real data.
import os
import numpy as np
import pandas as pd
from PIL import Image

import utils as utils
import model as modellib
from config import Config

PART_INDEX = {'blouse': [0, 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14],
              'outwear': [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
              'dress': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 17, 18],
              'skirt': [15, 16, 17, 18],
              'trousers': [15, 16, 19, 20, 21, 22, 23]}
PART_STR = ['neckline_left', 'neckline_right', 'center_front',
            'shoulder_left', 'shoulder_right',
            'armpit_left', 'armpit_right',
            'waistline_left', 'waistline_right',
            'cuff_left_in', 'cuff_left_out', 'cuff_right_in', 'cuff_right_out',
            'top_hem_left', 'top_hem_right',
            'waistband_left', 'waistband_right',
            'hemline_left', 'hemline_right',
            'crotch',
            'bottom_left_in', 'bottom_left_out', 'bottom_right_in', 'bottom_right_out']
IMAGE_CATEGORY = ['blouse', 'outwear', 'dress', 'skirt', 'trousers'][0]


class FIConfig(Config):
    """Configuration for training on the fashion keypoint dataset.
    Derives from the base Config class and overrides values specific
    to this dataset.
    """
    # Give the configuration a recognizable name
    NAME = "FI"  # <----- dataset name

    # Train on 1 GPU and 1 image per GPU. Batch size is 1 (GPUs * images/GPU).
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

    NUM_KEYPOINTS = len(PART_INDEX[IMAGE_CATEGORY])  # <----- number of keypoints
    KEYPOINT_MASK_SHAPE = [56, 56]

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1

    RPN_TRAIN_ANCHORS_PER_IMAGE = 100
    VALIDATION_STEPS = 100
    STEPS_PER_EPOCH = 1000
    MINI_MASK_SHAPE = (56, 56)
    KEYPOINT_MASK_POOL_SIZE = 7

    # Pooled ROIs
    POOL_SIZE = 7
    MASK_POOL_SIZE = 14
    MASK_SHAPE = [28, 28]
    WEIGHT_LOSS = True
    KEYPOINT_THRESHOLD = 0.005
The constants record the garment categories, the keypoint names, and the mapping between the two.
The config class is mostly model settings and needs no changes; just make sure NAME and NUM_KEYPOINTS are set to match the dataset.
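As a quick sanity check we can instantiate the config and print it (a minimal sketch; display() is the helper of the Matterport-style Config base class in config.py):

config = FIConfig()
config.display()                          # print every resolved configuration value
print(config.NAME, config.NUM_KEYPOINTS)  # expect "FI" and 13 for the 'blouse' category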
Recalling the earlier dataset introduction, for a task without keypoint detection we need two kinds of data:
a. the original image files
b. the mask of each instance in the image
However, because Mask R-CNN post-processes the masks to obtain the bounding box of each instance, in practice we also need:
c. the bounding box of each instance
Since we want to detect keypoints here, we additionally need:
d. the keypoint annotations of the image
key_points: num_keypoints coordinates and visibility (x,y,v) [num_person,num_keypoints,3] of num_person
First, be clear that keypoints belong to an instance; that is where num_person above comes from (taking human pose estimation as the example, one instance is one person). Each instance has num_keypoints keypoints, and each keypoint is described by three values: x coordinate, y coordinate, and a visibility state. The state has three possibilities: the keypoint does not exist for this instance, it is occluded, or it is visible. For COCO, 0 means the keypoint is not annotated (in which case x=y=v=0), 1 means it is annotated but not visible (occluded), and 2 means it is annotated and visible.
Different datasets may use different numbers to encode these three states, but when training with this framework it is recommended to unify them to the COCO convention, so as to avoid modifying too much model code (mainly to avoid touching the keypoint loss function and introducing unnecessary surprises).
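For instance, the Tianchi annotations encode visibility as -1 (does not exist), 0 (occluded) and 1 (visible); adding one maps them onto the COCO convention, which is exactly what load_FI does further below (a minimal sketch with made-up coordinates):

import numpy as np

# one instance, three keypoints given as (x, y, v) with v in {-1, 0, 1}
tianchi_kps = np.array([[120,  80,  1],    # visible
                        [135,  90,  0],    # occluded
                        [ -1,  -1, -1]])   # keypoint does not exist for this category
coco_kps = tianchi_kps.copy()
coco_kps[:, -1] += 1                       # v: -1/0/1 -> 0/1/2, i.e. the COCO encoding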
With this background, let's take the Tianchi fashion keypoint dataset as an example and see how to design the Dataset class.
For the detailed data description, consult the dataset documentation yourself; this section focuses on the Mask R-CNN keypoint detection approach rather than on the data itself. Based on that documentation, the purpose of the Dataset class we design (see [Computer Vision] Mask-RCNN Training Network, Part 1: The Dataset and the Dataset Class) is to feed data into the network.
Note that the classification, detection and mask branches of Mask R-CNN are all multi-class tasks, but keypoint detection is inherently harder (one category has many keypoints, and the keypoint types of different categories are barely related or even completely different), so it is recommended to train a separate model per top-level category to detect its keypoints. This mirrors human pose estimation, which detects the boxes and masks of the single class person plus the keypoints of every body part of each instance (each person); the actual classification only distinguishes person and background. For the fashion dataset this means training five times, one model per garment category.
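One way to script the five runs is a per-category config (a hypothetical sketch; the actual FI_train.py fixes IMAGE_CATEGORY once per run instead of looping):

for category in ['blouse', 'outwear', 'dress', 'skirt', 'trousers']:
    class CategoryConfig(FIConfig):
        NAME = "FI_" + category                      # one model per garment category
        NUM_KEYPOINTS = len(PART_INDEX[category])    # keypoint count differs per category
    config = CategoryConfig()
    # ... build the FIDataset for this category, create MaskRCNN(mode='training', ...) and train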
The fashion annotations contain only keypoints, but bounding boxes are necessary for Mask R-CNN because the RPN needs them (the regression branches after the RPN can be commented out, but the RPN is part of the network's backbone pipeline and cannot be). So we follow the Mask R-CNN idea of deriving boxes, and generate a box from the keypoints of each instance. Since keypoints are not necessarily on the garment boundary (they usually are), we enlarge the box a little so that it covers the garment as completely as possible. The function below lives in utils.py (it is not used yet at this point; it is only shown here because it came up).
def extract_keypoint_bboxes(keypoints, image_size):
    """
    :param keypoints: [instances, keypoints_per_instance, 3]
    :param image_size: [w, h]
    :return:
    """
    bboxes = np.zeros([keypoints.shape[0], 4], dtype=np.int32)
    for i in range(keypoints.shape[0]):
        x = keypoints[i, :, 0][keypoints[i, :, 0] > 0]
        y = keypoints[i, :, 1][keypoints[i, :, 1] > 0]
        x1 = x.min()-10 if x.min()-10 > 0 else 0
        y1 = y.min()-10 if y.min()-10 > 0 else 0
        x2 = x.max()+11 if x.max()+11 < image_size[0] else image_size[0]
        y2 = y.max()+11 if y.max()+11 < image_size[1] else image_size[1]
        bboxes[i] = np.array([y1, x1, y2, x2], np.int32)
    return bboxes
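A quick usage sketch (the keypoint values are made up for illustration):

import numpy as np

kps = np.array([[[120, 80, 2],
                 [135, 90, 1],
                 [  0,  0, 0]]])                        # [1 instance, 3 keypoints, (x, y, v)]
boxes = extract_keypoint_bboxes(kps, image_size=[512, 512])
print(boxes)  # [[ 70 110 101 146]] -> (y1, x1, y2, x2), padded by roughly 10 px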
The fashion data has no mask information. According to the Mask R-CNN paper, a mask that is 1 at the keypoint locations and 0 everywhere else would do, which does not feel very reliable; on COCO (i.e. in the reference project Mask_RCNN_Humanpose) the person masks are used instead (see the figure below).
I generate mask information in the Dataset class as a demonstration, but remove the mask branch in the build method. The figure below, taken from Mu Li's Dive into Deep Learning, makes it easy to see why the mask branch can be dropped.
As the Dataset comments explain, to run our own dataset we first implement a method (load_shapes in the example; name it after your dataset) that collects the raw images and class information, and then implement two methods (load_image, load_mask) that return a single image and the masks and class IDs of the objects in that image. That basically completes the dataset class.
For this dataset:
"""
Returns:
key_points: num_keypoints coordinates and visibility (x,y,v) [num_person,num_keypoints,3] of num_person
masks: A bool array of shape [height, width, instance count] with
one mask per instance.
class_ids: a 1D array of class IDs of the instance masks, here is always equal to [num_person, 1]
"""
This covers the purpose of the Dataset class; the implementation is given below (see FI_train.py). Since training requires a validation set and, at the time of writing, I had not implemented the validation split (the training set stands in for it), the train_data parameter of load_FI has no effect. Updates will be made on GitHub; this article will not be revised further:
class FIDataset(utils.Dataset):
    """Fashion keypoint dataset. Image paths, class information and
    keypoint annotations are read from the Tianchi csv files.
    """
    def load_FI(self, train_data=True):
        """Collect the images and keypoint annotations of the chosen category."""
        if train_data:
            csv_data = pd.concat([pd.read_csv('../keypoint_data/train1.csv'),
                                  pd.read_csv('../keypoint_data/train2.csv')],
                                 axis=0,
                                 ignore_index=True  # recompute the row index instead of keeping the original indexes
                                 )
            class_data = csv_data[csv_data.image_category.isin(['blouse'])]

        # Add classes
        self.add_class(source="FI", class_id=1, class_name='blouse')

        # Add images
        for i in range(class_data.shape[0]):
            annotation = class_data.iloc[i]
            img_path = os.path.join("../keypoint_data", annotation.image_id)
            keypoints = np.array([p.split('_')
                                  for p in class_data.iloc[i][2:]], dtype=int)[PART_INDEX[IMAGE_CATEGORY], :]
            keypoints[:, -1] += 1
            self.add_image(source="FI",
                           image_id=i,
                           path=img_path,
                           annotations=keypoints)

    def load_keypoints(self, image_id, with_mask=True):
        """
        Returns:
        key_points: num_keypoints coordinates and visibility (x,y,v)  [num_person,num_keypoints,3] of num_person
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks, here is always equal to [num_person, 1]
        """
        key_points = np.expand_dims(self.image_info[image_id]["annotations"], 0)  # each image is known to contain exactly one instance
        class_ids = np.array([1])

        if with_mask:
            annotations = self.image_info[image_id]["annotations"]
            w, h = image_size(self.image_info[image_id]["path"])  # image_size: helper returning the image size from its path (not shown here)
            mask = np.zeros([w, h], dtype=int)
            mask[annotations[:, 1], annotations[:, 0]] = 1
            return key_points.copy(), np.expand_dims(mask, -1), class_ids
        return key_points.copy(), None, class_ids
To verify that the dataset class is built correctly, we can directly call load_image_gt_keypoints in model.py to obtain original_image, image_meta, gt_class_id, gt_bbox and gt_keypoint. In actual training the program also passes data from the Dataset class to the model through this function.
def load_image_gt_keypoints(dataset, config, image_id, augment=True,
use_mini_mask=False):
"""Load and return ground truth data for an image (image, keypoint_mask, keypoint_weight, mask, bounding boxes).
augment: If true, apply random image augmentation. Currently, only
horizontal flipping is offered.
use_mini_mask: If False, returns full-size masks and keypoints that are the same height
and width as the original image. These can be big, for example
1024x1024x100 (for 100 instances). Mini masks are smaller, typically,
224x224 and are generated by extracting the bounding box of the
object and resizing it to MINI_MASK_SHAPE.
Returns:
image: [height, width, 3]
shape: the original shape of the image before resizing and cropping.
keypoints:[num_person, num_keypoint, 3] (x, y, v) v value is as belows:
0: not visible and without annotations
1: not visible but with annotations
2: visible and with annotations
class_ids: [instance_count] Integer class IDs
bbox: [instance_count, (y1, x1, y2, x2)]
mask: [height, width, instance_count]. The height and width are those
of the image unless use_mini_mask is True, in which case they are
defined in MINI_MASK_SHAPE.
"""
The display_keypoints function in the visualize.py module takes the outputs of the function above and visualizes what the Dataset class yields through load_image_gt_keypoints (this is not a direct read of the raw data: the function applies a series of image pre-processing steps, which is exactly why visual verification is worthwhile). The flow is as follows, see FI_train.py:
config = FIConfig()

import visualize
from model import log

dataset = FIDataset()
dataset.load_FI()
dataset.prepare()

original_image, image_meta, gt_class_id, gt_bbox, gt_keypoint =\
    modellib.load_image_gt_keypoints(dataset, FIConfig, 0)
log("original_image", original_image)
log("image_meta", image_meta)
log("gt_class_id", gt_class_id)
log("gt_bbox", gt_bbox)
log("gt_keypoint", gt_keypoint)
visualize.display_keypoints(original_image, gt_bbox, gt_keypoint, gt_class_id, dataset.class_names)
The output image is shown below; you can clearly see that at least two pre-processing steps, padding and flipping, were applied. This is not the point here, so we move on:
After implementing your own Dataset class, use model.load_image_gt_keypoints and visualize.display_keypoints to verify that it behaves correctly.
data_tra = FIDataset()
data_tra.load_FI()
data_tra.prepare()

data_val = FIDataset()
data_val.load_FI()
data_val.prepare()

model = modellib.MaskRCNN(mode='training', config=config, model_dir='./')
model.load_weights('./mask_rcnn_coco.h5', by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
model.train(data_tra, data_val,
            learning_rate=config.LEARNING_RATE/10,
            epochs=400,
            layers='heads')
The biggest difference between the fashion keypoint data and the Humanpose data is that we have no mask annotations, so we need to modify the original model and remove the parts that involve masks (this refers to the Humanpose code, not the original Mask R-CNN; starting from the original would require far larger changes: 1. adding the whole pre-processing pipeline for keypoint annotations; 2. implementing every keypoint-related step of the model, including the keypoint loss function).
The modified build method is given below. Because Mask R-CNN simply sums the losses of all branches, we can comment out the mask branch without affecting the rest of the code (the program still runs normally).
def build(self, mode, config):
    """Build Mask R-CNN architecture.
        input_shape: The shape of the input image.
        mode: Either "training" or "inference". The inputs and
            outputs of the model differ accordingly.
    """
    assert mode in ['training', 'inference']

    # Image size must be dividable by 2 multiple times
    h, w = config.IMAGE_SHAPE[:2]
    if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6):
        raise Exception("Image size must be dividable by 2 at least 6 times "
                        "to avoid fractions when downscaling and upscaling."
                        "For example, use 256, 320, 384, 448, 512, ... etc. ")

    # Inputs
    input_image = KL.Input(
        shape=config.IMAGE_SHAPE.tolist(), name="input_image")
    input_image_meta = KL.Input(shape=[None], name="input_image_meta")
    if mode == "training":
        # RPN GT
        input_rpn_match = KL.Input(
            shape=[None, 1], name="input_rpn_match", dtype=tf.int32)
        input_rpn_bbox = KL.Input(
            shape=[None, 4], name="input_rpn_bbox", dtype=tf.float32)

        # Detection GT (class IDs, bounding boxes, and masks)
        # 1. GT Class IDs (zero padded)
        input_gt_class_ids = KL.Input(
            shape=[None], name="input_gt_class_ids", dtype=tf.int32)
        # 2. GT Boxes in pixels (zero padded)
        # [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] in image coordinates
        input_gt_boxes = KL.Input(
            shape=[None, 4], name="input_gt_boxes", dtype=tf.float32)
        # Normalize coordinates
        h, w = K.shape(input_image)[1], K.shape(input_image)[2]
        image_scale = K.cast(K.stack([h, w, h, w], axis=0), tf.float32)
        gt_boxes = KL.Lambda(lambda x: x / image_scale, name="gt_boxes")(input_gt_boxes)

        keypoint_scale = K.cast(K.stack([w, h, 1], axis=0), tf.float32)
        input_gt_keypoints = KL.Input(shape=[None, config.NUM_KEYPOINTS, 3])
        gt_keypoints = KL.Lambda(lambda x: x / keypoint_scale, name="gt_keypoints")(input_gt_keypoints)

        # 3. GT Masks (zero padded)
        # [batch, height, width, MAX_GT_INSTANCES]
        # if config.USE_MINI_MASK:
        #     input_gt_masks = KL.Input(
        #         shape=[config.MINI_MASK_SHAPE[0],
        #                config.MINI_MASK_SHAPE[1], None],
        #         name="input_gt_masks", dtype=bool)
        #     # input_gt_keypoint_masks = KL.Input(
        #     #     shape=[config.MINI_MASK_SHAPE[0],
        #     #            config.MINI_MASK_SHAPE[1], None, config.NUM_KEYPOINTS],
        #     #     name="input_gt_keypoint_masks", dtype=bool)
        # else:
        #     input_gt_masks = KL.Input(
        #         shape=[config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None],
        #         name="input_gt_masks", dtype=bool)
        #     input_gt_keypoint_masks = KL.Input(
        #         shape=[config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None, config.NUM_KEYPOINTS],
        #         name="input_gt_keypoint_masks", dtype=bool)
        # input_gt_keypoint_weigths = KL.Input(
        #     shape=[None, config.NUM_KEYPOINTS], name="input_gt_keypoint_weights", dtype=tf.int32)

    # Build the shared convolutional layers.
    # Bottom-up Layers
    # Returns a list of the last layers of each stage, 5 in total.
    # Don't create the thead (stage 5), so we pick the 4th item in the list.
    _, C2, C3, C4, C5 = resnet_graph(input_image, "resnet101", stage5=True)
    # Top-down Layers
    # TODO: add assert to varify feature map sizes match what's in config
    P5 = KL.Conv2D(256, (1, 1), name='fpn_c5p5')(C5)
    P4 = KL.Add(name="fpn_p4add")([
        KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
        KL.Conv2D(256, (1, 1), name='fpn_c4p4')(C4)])
    P3 = KL.Add(name="fpn_p3add")([
        KL.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
        KL.Conv2D(256, (1, 1), name='fpn_c3p3')(C3)])
    P2 = KL.Add(name="fpn_p2add")([
        KL.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
        KL.Conv2D(256, (1, 1), name='fpn_c2p2')(C2)])
    # Attach 3x3 conv to all P layers to get the final feature maps.
    P2 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p2")(P2)
    P3 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p3")(P3)
    P4 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p4")(P4)
    P5 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p5")(P5)
    # P6 is used for the 5th anchor scale in RPN. Generated by
    # subsampling from P5 with stride of 2.
    P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)

    # Note that P6 is used in RPN, but not in the classifier heads.
    rpn_feature_maps = [P2, P3, P4, P5, P6]
    mrcnn_feature_maps = [P2, P3, P4, P5]

    # Generate Anchors
    self.anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
                                                  config.RPN_ANCHOR_RATIOS,
                                                  config.BACKBONE_SHAPES,
                                                  config.BACKBONE_STRIDES,
                                                  config.RPN_ANCHOR_STRIDE)

    # RPN Model
    rpn = build_rpn_model(config.RPN_ANCHOR_STRIDE,
                          len(config.RPN_ANCHOR_RATIOS), 256)
    # Loop through pyramid layers
    layer_outputs = []  # list of lists
    for p in rpn_feature_maps:
        layer_outputs.append(rpn([p]))
    # Concatenate layer outputs
    # Convert from list of lists of level outputs to list of lists
    # of outputs across levels.
    # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]]
    output_names = ["rpn_class_logits", "rpn_class", "rpn_bbox"]
    outputs = list(zip(*layer_outputs))
    outputs = [KL.Concatenate(axis=1, name=n)(list(o))
               for o, n in zip(outputs, output_names)]

    rpn_class_logits, rpn_class, rpn_bbox = outputs

    # Generate proposals
    # Proposals are [batch, N, (y1, x1, y2, x2)] in normalized coordinates
    # and zero padded.
    proposal_count = config.POST_NMS_ROIS_TRAINING if mode == "training"\
        else config.POST_NMS_ROIS_INFERENCE
    rpn_rois = ProposalLayer(proposal_count=proposal_count,
                             nms_threshold=config.RPN_NMS_THRESHOLD,
                             name="ROI",
                             anchors=self.anchors,
                             config=config)([rpn_class, rpn_bbox])

    if mode == "training":
        # Class ID mask to mark class IDs supported by the dataset the image
        # came from.
        _, _, _, active_class_ids = KL.Lambda(lambda x: parse_image_meta_graph(x),
                                              mask=[None, None, None, None])(input_image_meta)

        if not config.USE_RPN_ROIS:
            # Ignore predicted ROIs and use ROIs provided as an input.
            input_rois = KL.Input(shape=[config.POST_NMS_ROIS_TRAINING, 4],
                                  name="input_roi", dtype=np.int32)
            # Normalize coordinates to 0-1 range.
            target_rois = KL.Lambda(lambda x: K.cast(
                x, tf.float32) / image_scale[:4])(input_rois)
        else:
            target_rois = rpn_rois

        # Generate detection targets
        # Subsamples proposals and generates target outputs for training
        # Note that proposal class IDs, gt_boxes and gt_masks are zero
        # padded. Equally, returned rois and targets are zero padded.
        # Every rois corresond to one target
        # rois, target_class_ids, target_bbox, target_mask =\
        #     DetectionTargetLayer(config, name="proposal_targets")([
        #         target_rois, input_gt_class_ids, gt_boxes, input_gt_masks])

        # Generate detection targets
        # Subsamples proposals and generates target outputs for training
        # Note that proposal class IDs, gt_boxes, gt_keypoint_masks and gt_keypoint_weights are zero
        # padded. Equally, returned rois and targets are zero padded.
        rois, target_class_ids, target_bbox, target_keypoint, target_keypoint_weight = \
            DetectionKeypointTargetLayer(config, name="proposal_targets")\
                ([target_rois, input_gt_class_ids, gt_boxes, gt_keypoints])

        # Network Heads
        # TODO: verify that this handles zero padded ROIs
        mrcnn_class_logits, mrcnn_class, mrcnn_bbox =\
            fpn_classifier_graph(rois, mrcnn_feature_maps, config.IMAGE_SHAPE,
                                 config.POOL_SIZE, config.NUM_CLASSES)

        # mrcnn_mask = build_fpn_mask_graph(rois, mrcnn_feature_maps,
        #                                   config.IMAGE_SHAPE,
        #                                   config.MASK_POOL_SIZE,
        #                                   config.NUM_CLASSES)

        # shape: batch_size, num_roi, num_keypoint, 56*56
        keypoint_mrcnn_mask = build_fpn_keypoint_graph(rois, mrcnn_feature_maps,
                                                       config.IMAGE_SHAPE,
                                                       config.KEYPOINT_MASK_POOL_SIZE,
                                                       config.NUM_KEYPOINTS)

        # TODO: clean up (use tf.identify if necessary)
        output_rois = KL.Lambda(lambda x: x * 1, name="output_rois")(rois)
        # keypoint_mrcnn_mask = KL.Lambda(lambda x: x * 1, name="keypoint_mrcnn_mask")(keypoint_mrcnn_mask)

        # Losses
        rpn_class_loss = KL.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
            [input_rpn_match, rpn_class_logits])
        rpn_bbox_loss = KL.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")(
            [input_rpn_bbox, input_rpn_match, rpn_bbox])
        class_loss = KL.Lambda(lambda x: mrcnn_class_loss_graph(*x), name="mrcnn_class_loss")(
            [target_class_ids, mrcnn_class_logits, active_class_ids])
        bbox_loss = KL.Lambda(lambda x: mrcnn_bbox_loss_graph(*x), name="mrcnn_bbox_loss")(
            [target_bbox, target_class_ids, mrcnn_bbox])
        # mask_loss = KL.Lambda(lambda x: mrcnn_mask_loss_graph(*x),
        #                       name="mrcnn_mask_loss")(
        #     [target_mask, target_class_ids, mrcnn_mask])
        keypoint_loss = KL.Lambda(lambda x: keypoint_mrcnn_mask_loss_graph(*x, weight_loss=config.WEIGHT_LOSS),
                                  name="keypoint_mrcnn_mask_loss")(
            [target_keypoint, target_keypoint_weight, target_class_ids, keypoint_mrcnn_mask])

        """
        target_keypoints: [batch, TRAIN_ROIS_PER_IMAGE, NUM_KEYPOINTS)
            Keypoint labels cropped to bbox boundaries and resized to neural
            network output size. Maps keypoints from the half-open interval [x1, x2) on continuous image
            coordinates to the closed interval [0, HEATMAP_SIZE - 1]
        target_keypoint_weights: [batch, TRAIN_ROIS_PER_IMAGE, NUM_KEYPOINTS), bool type
            Keypoint_weights, 0: isn't visible, 1: visilble
        """
        # test_target_keypoint_mask = test_keypoint_mrcnn_mask_loss_graph(target_keypoint, target_keypoint_weight,
        #                                                                 target_class_ids, keypoint_mrcnn_mask)

        # keypoint_weight_loss = KL.Lambda(lambda x: keypoint_weight_loss_graph(*x), name="keypoint_weight_loss")(
        #     [target_keypoint_weight, keypoint_weight_logits, target_class_ids])

        # Model generated
        # batch_images, batch_image_meta, batch_rpn_match, batch_rpn_bbox, batch_gt_class_ids, \
        # batch_gt_boxes, batch_gt_keypoint, batch_gt_masks
        inputs = [input_image, input_image_meta,
                  input_rpn_match, input_rpn_bbox, input_gt_class_ids, input_gt_boxes, input_gt_keypoints]

        if not config.USE_RPN_ROIS:
            inputs.append(input_rois)
        # add "test_target_keypoint_mask" in the output for test the keypoint loss function
        outputs = [rpn_class_logits, rpn_class, rpn_bbox,
                   mrcnn_class_logits, mrcnn_class, mrcnn_bbox, keypoint_mrcnn_mask,
                   rpn_rois, output_rois,
                   rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss,
                   keypoint_loss]  # + test_target_keypoint_mask for test the keypoint loss graph
        model = KM.Model(inputs, outputs, name='mask_keypoint_mrcnn')
    else:
        # Network Heads
        # Proposal classifier and BBox regressor heads
        mrcnn_class_logits, mrcnn_class, mrcnn_bbox =\
            fpn_classifier_graph(rpn_rois, mrcnn_feature_maps, config.IMAGE_SHAPE,
                                 config.POOL_SIZE, config.NUM_CLASSES)

        # Detections
        # output is
        #   detections: [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in image coordinates
        #   keypoint_weights: [batch, num_detections, num_keypoints]
        detections = DetectionLayer(config, name="mrcnn_detection")(
            [rpn_rois, mrcnn_class, mrcnn_bbox, input_image_meta])

        # Convert boxes to normalized coordinates
        # TODO: let DetectionLayer return normalized coordinates to avoid
        #       unnecessary conversions
        h, w = config.IMAGE_SHAPE[:2]
        detection_boxes = KL.Lambda(
            lambda x: x[..., :4] / np.array([h, w, h, w]))(detections)

        # Create masks for detections
        mrcnn_mask = build_fpn_mask_graph(detection_boxes, mrcnn_feature_maps,
                                          config.IMAGE_SHAPE,
                                          config.MASK_POOL_SIZE,
                                          config.NUM_CLASSES)
        keypoint_mrcnn = build_fpn_keypoint_graph(detection_boxes, mrcnn_feature_maps,
                                                  config.IMAGE_SHAPE,
                                                  config.KEYPOINT_MASK_POOL_SIZE,
                                                  config.NUM_KEYPOINTS)

        # shape: Batch, N_ROI, Number_Keypoint, height*width
        keypoint_mcrcnn_prob = KL.Activation("softmax", name="mrcnn_prob")(keypoint_mrcnn)

        model = KM.Model([input_image, input_image_meta],
                         [detections, mrcnn_class, mrcnn_bbox,
                          rpn_rois, rpn_class, rpn_bbox,
                          mrcnn_mask, keypoint_mcrcnn_prob],
                         name='keypoint_mask_rcnn')

    # Add multi-GPU support.
    if config.GPU_COUNT > 1:
        from parallel_model import ParallelModel
        model = ParallelModel(model, config.GPU_COUNT)

    return model
In the model compile method we can see the details of how the losses are added:
# Add Losses
# First, clear previously set losses to avoid duplication
self.keras_model._losses = []
self.keras_model._per_input_losses = {}
loss_names = ["rpn_class_loss", "rpn_bbox_loss",
              "mrcnn_class_loss", "mrcnn_bbox_loss", "keypoint_mrcnn_mask_loss"]
for name in loss_names:
    layer = self.keras_model.get_layer(name)
    if layer.output in self.keras_model.losses:
        continue
    self.keras_model.add_loss(
        tf.reduce_mean(layer.output, keepdims=True))

# Add L2 Regularization
# Skip gamma and beta weights of batch normalization layers.
reg_losses = [keras.regularizers.l2(self.config.WEIGHT_DECAY)(w) / tf.cast(tf.size(w), tf.float32)
              for w in self.keras_model.trainable_weights
              if 'gamma' not in w.name and 'beta' not in w.name]
self.keras_model.add_loss(tf.add_n(reg_losses))
With this, the keypoint detection branch is in place and we can train directly.
This loss function is also not part of the original Mask R-CNN; it was implemented by the Humanpose project and we do not need to modify it.
The idea is to compute a (sparse) cross-entropy over the visible keypoints of the positively matched targets. It is called a sparse cross-entropy because each keypoint target is a 56*56 heatmap that is 0 almost everywhere and 1 only at the keypoint location, so the label can be stored as a single index.
def keypoint_mrcnn_mask_loss_graph(target_keypoints, target_keypoint_weights, target_class_ids, pred_keypoints_logit,
                                   weight_loss=True, mask_shape=[56, 56], number_point=13):
    """Mask softmax cross-entropy loss for the keypoint head.
    Only keypoints in positive ROIs contribute to the loss:
        positions where the true class ID (target_class_ids) is greater than 0.
    Only visible keypoints contribute to the loss:
        positions where the true keypoint weight (target_keypoint_weights) equals 1.
    target_keypoints: ground-truth keypoint coordinates
    pred_keypoints_logit: predicted keypoint heatmaps

    target_keypoints: [batch, TRAIN_ROIS_PER_IMAGE, NUM_KEYPOINTS)
        Keypoint labels cropped to bbox boundaries and resized to neural
        network output size. Maps keypoints from the half-open interval [x1, x2) on continuous image
        coordinates to the closed interval [0, HEATMAP_SIZE - 1]
    target_keypoint_weights: [batch, TRAIN_ROIS_PER_IMAGE, NUM_KEYPOINTS), bool type
        Keypoint_weights, 0: isn't visible, 1: visilble
    target_class_ids: [batch, TRAIN_ROIS_PER_IMAGE]. Integer class IDs.
    pred_keypoints_logit: [batch_size, num_roi, num_keypoint, 56*56)
    """

    # Reshape for simplicity. Merge first two dimensions into one.
    # shape: [N]
    target_class_ids = K.reshape(target_class_ids, (-1,))
    # Only positive person ROIs contribute to the loss. And only
    # the people specific mask of each ROI.
    positive_people_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_people_ids = tf.cast(
        tf.gather(target_class_ids, positive_people_ix), tf.int64)

    ### Step 1: get the positive target and predicted keypoint masks
    # reshape target_keypoint_weights to [N, num_keypoints]
    target_keypoint_weights = K.reshape(target_keypoint_weights, (-1, number_point))  # keypoint visibility
    # reshape target_keypoint_masks to [N, num_keypoints]
    target_keypoints = K.reshape(target_keypoints, (-1, number_point))  # keypoint labels (heatmap indices)

    # reshape pred_keypoint_masks to [N, number_point, 56*56]
    pred_keypoints_logit = K.reshape(pred_keypoints_logit,
                                     (-1, number_point, mask_shape[0]*mask_shape[1]))  # ROI heatmap logits

    # Gather the keypoint masks (target and predict) that contribute to loss
    # shape: [N_positive, number_point]
    positive_target_keypoints = tf.cast(tf.gather(target_keypoints, positive_people_ix), tf.int32)
    # shape: [N_positive, number_point, 56*56]
    positive_pred_keypoints_logit = tf.gather(pred_keypoints_logit, positive_people_ix)
    # positive target_keypoint_weights to [N_positive, number_point]
    positive_keypoint_weights = tf.cast(
        tf.gather(target_keypoint_weights, positive_people_ix), tf.float32)

    loss = K.switch(tf.size(positive_target_keypoints) > 0,
                    lambda: tf.nn.sparse_softmax_cross_entropy_with_logits(logits=positive_pred_keypoints_logit,
                                                                           labels=positive_target_keypoints),
                    lambda: tf.constant(0.0))
    loss = loss * positive_keypoint_weights

    if (weight_loss):
        loss = K.switch(tf.reduce_sum(positive_keypoint_weights) > 0,
                        lambda: tf.reduce_sum(loss) / tf.reduce_sum(positive_keypoint_weights),
                        lambda: tf.constant(0.0))
    else:
        loss = K.mean(loss)
    loss = tf.reshape(loss, [1, 1])

    return loss
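To make the label convention concrete, here is a tiny numeric sketch (made-up coordinates) of how a keypoint inside an ROI becomes a single index into the flattened 56*56 heatmap, which is the integer label that tf.nn.sparse_softmax_cross_entropy_with_logits consumes. The row-major flattening shown here is an assumption for illustration; the real mapping is performed by the data-generator code.

HEATMAP_SIZE = 56

# an ROI spanning x in [250, 450) and y in [100, 350), with a keypoint at (300, 180)
x1, x2, y1, y2 = 250, 450, 100, 350
kx, ky = 300, 180

hx = int((kx - x1) / (x2 - x1) * HEATMAP_SIZE)   # column on the 56x56 grid -> 14
hy = int((ky - y1) / (y2 - y1) * HEATMAP_SIZE)   # row on the 56x56 grid    -> 17
label = hy * HEATMAP_SIZE + hx                   # flattened index in [0, 56*56) -> 966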
Finally, we pick a random image and run the demo_detect.ipynb notebook to check the training result:
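For reference, the inference flow of that notebook looks roughly like this (a sketch under assumptions: detect_keypoint is the inference entry point provided by the Mask_RCNN_Humanpose code base, find_last()[1] returns the latest checkpoint, and the image path is hypothetical):

import skimage.io

class InferenceConfig(FIConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode='inference', config=InferenceConfig(), model_dir='./')
model.load_weights(model.find_last()[1], by_name=True)         # latest checkpoint saved during training

image = skimage.io.imread('../keypoint_data/some_blouse.jpg')  # hypothetical test image
results = model.detect_keypoint([image], verbose=1)            # assumed API of the Humanpose fork
r = results[0]  # expected to contain 'rois', 'class_ids', 'scores' and 'keypoints'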