Mask R-CNN

https://www.cnblogs.com/kk17/p/9991446.html

For understanding the ROI part: https://blog.csdn.net/qinghuaci666/article/details/80900882  https://blog.csdn.net/yiyouxian/article/details/79221830

[Network Architecture] Mask R-CNN Paper Analysis (reposted)

 

Preface

I have an idea that needs verifying and have been busy lately. I have finished reading the Mask R-CNN paper and will study the Mask R-CNN code next; the paper analyses below are reposted from two blogs found online, by:
技術挖掘者
remanentedios

Article 1

Paper title: Mask R-CNN

Paper link: paper link

Paper code: Facebook code link; TensorFlow version code link; Keras and TensorFlow version code link; MXNet version code link

 

I. What is Mask R-CNN, and what tasks can it do?

Figure 1: Overall Mask R-CNN architecture

Mask R-CNN is an instance segmentation algorithm that can be used for object detection, object instance segmentation, and object keypoint detection.

1. The relationship and differences between instance segmentation and semantic segmentation

Relationship: semantic segmentation and instance segmentation are both subfields of image segmentation; both partition an input image into meaningful regions.

Differences:

Figure 2: Difference between instance segmentation and semantic segmentation

1. Image segmentation in the usual sense refers to semantic segmentation, which has a long history, has made good progress, and is studied by many researchers. Instance segmentation is a smaller field that split off from segmentation and has developed only in recent years; compared with semantic segmentation it is more complex, has fewer researchers, and leaves plenty of room for research — a hot, still-developing area, as shown in Figure 1.

Figure 3: Difference between instance segmentation and semantic segmentation

2. Compare panels c and d in Figure 3: c is the semantic segmentation of image a, and d is its instance segmentation. The biggest difference is the "cube" objects: semantic segmentation gives them all the same color, while instance segmentation gives each one a different color. In other words, instance segmentation must further separate objects of the same class on top of semantic segmentation.

Note: many blog posts get this wrong and treat the algorithm as semantic segmentation; it is in fact an instance segmentation algorithm.

2. Tasks Mask R-CNN can perform

Figure 4: Mask R-CNN for object detection and instance segmentation

Figure 5: Mask R-CNN for human pose estimation

In short, Mask R-CNN is a very flexible framework: different branches can be added to perform different tasks, including object classification, object detection, semantic segmentation, instance segmentation, and human pose estimation — a genuinely good algorithm.

3. Goals Mask R-CNN aims for

  • High speed
  • High accuracy (high classification accuracy, high detection accuracy, high instance-segmentation accuracy, and so on)
  • Simple and intuitive
  • Easy to use

4. How these goals are achieved

High speed and high accuracy: the authors build on the classic object detector Faster R-CNN and the classic semantic segmentation algorithm FCN. Faster R-CNN detects objects quickly and accurately; FCN segments precisely; both are landmark works in their fields. Mask R-CNN is more complex than Faster R-CNN yet still reaches about 5 fps, comparable to the original Faster R-CNN. Having identified the pixel misalignment in ROI Pooling, the authors propose ROIAlign; combined with FCN's precise pixel-level masks, this yields high accuracy.

Simple and intuitive: the idea is straightforward — add an FCN on top of the original Faster R-CNN to produce the mask branch. In short, Mask R-CNN = Faster R-CNN + FCN, or more precisely RPN + ROIAlign + Fast R-CNN + FCN.

Easy to use: the framework is very flexible and can perform many tasks — object classification, object detection, semantic segmentation, instance segmentation, human pose estimation — which shows how usable it is. Few algorithms offer this level of extensibility and ease of use, and it is worth learning from. Beyond that, different backbone and head architectures can be swapped in to obtain different performance.

II. Mask R-CNN Framework Analysis

Figure 6: Mask R-CNN algorithm framework

1. Mask R-CNN algorithm steps

  • First, input the image to be processed and apply the corresponding preprocessing (or start from an already preprocessed image);
  • Then feed it into a pretrained network (ResNeXt or similar) to obtain the feature map;
  • Next, place a predefined set of ROIs at every position of this feature map, producing many candidate ROIs;
  • Next, send these candidate ROIs into the RPN for binary classification (foreground vs. background) and bounding-box regression, filtering out a portion of them;
  • Next, apply ROIAlign to the remaining ROIs (i.e., first align feature-map pixels with the original image, then align the feature map with a fixed-size feature);
  • Finally, classify these ROIs (N-way classification), regress their bounding boxes, and generate masks (running an FCN inside each ROI).

2. Mask R-CNN architecture breakdown

Here I break Mask R-CNN into three modules — Faster R-CNN, ROIAlign, and FCN — and explain each in turn; these are the core of the algorithm.

3. Faster R-CNN (see this link, where I analyzed it in detail)

4. FCN

Figure 7: FCN network architecture

FCN is a classic semantic segmentation algorithm that segments the objects in an image accurately. As the figure shows, it is an end-to-end network whose main modules are convolution and deconvolution: the image is first convolved and pooled, shrinking the feature map; deconvolution (interpolation/upsampling) then progressively enlarges the feature map again, and finally each pixel is classified, giving an accurate segmentation of the input image. See this link for details.

5. Analysis and comparison of ROI Pooling and ROIAlign

Figure 8: Comparison of ROI Pooling and ROIAlign
As shown in Figure 8, the biggest difference is that ROI Pooling performs two quantization steps, while ROIAlign performs none and uses bilinear interpolation instead. The details are explained below.

Figure 9: ROI Pooling

As shown in Figure 9, to obtain a fixed-size (7x7) feature map, ROI Pooling quantizes twice: 1) image coordinates to feature-map coordinates, and 2) feature-map coordinates to ROI-feature coordinates. Concretely, suppose the input is an 800x800 image containing two objects (a cat and a dog), with the dog's bounding box 665x665. After VGG16, because the conv layers are padded, the spatial size is reduced only by the pooling layers; with 5 stride-2 poolings the feature map is 800/32 x 800/32 = 25x25 (an integer). Mapping the dog's box onto the feature map, however, gives 665/32 x 665/32 = 20.78 x 20.78 — a fractional value, while pixel indices have no fractions — so it is rounded (quantized) to 20 x 20, introducing the first quantization error. The ROIs on the feature map then come in different sizes, but the following layers require a fixed-size input, so each ROI must be converted to a fixed ROI feature, here 7x7. Mapping 20 x 20 to 7 x 7 gives bins of 20/7 x 20/7 = 2.86 x 2.86, again fractional, and the same rounding introduces the second quantization error. These errors cause a misalignment between image pixels and feature pixels: mapping an ROI from feature space back onto the image produces a large offset. Consider only the second rounding: 2.86 is quantized to 2, an error of 0.86, which looks tiny — but this is in feature space, which here has a 1:32 ratio to image space, so the offset in the image is 0.86 x 32 = 27.52 pixels. That is a substantial gap from the second quantization alone, and it noticeably hurts the detector's performance, which makes it a serious problem. That should make the issue clear.
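A tiny numeric sketch of the two roundings described above (illustrative values only, matching the 800x800 / stride-32 / 7x7 example):

# A toy calculation of the two quantization errors (illustrative values only).
stride, pool_size = 32, 7
box = 665.0

fm_box = box / stride              # 665 / 32 = 20.78... on the feature map
fm_box_q = int(fm_box)             # first rounding  -> 20
bin_size = fm_box_q / pool_size    # 20 / 7 = 2.857...
bin_size_q = int(bin_size)         # second rounding -> 2

# Error from the second rounding alone, mapped back to image pixels:
err_img = (bin_size - bin_size_q) * stride   # ~27.4 pixels
print(fm_box, fm_box_q, bin_size, bin_size_q, err_img)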

Figure 10: ROIAlign

As shown in Figure 10, to obtain the fixed-size (7x7) feature map, ROIAlign uses no quantization at all: if 665 / 32 = 20.78, we keep 20.78 rather than replacing it with 20; if 20.78 / 7 = 2.97, we keep 2.97 rather than 2. That is the whole point of ROIAlign. How are these fractional coordinates handled? With bilinear interpolation, a good image-resampling method that uses the four real pixels surrounding a virtual point (for example the fractional position 20.56, which falls between integer pixel locations) to estimate the value at that virtual point. As shown in Figure 11, the dashed blue grid is the feature map obtained after convolution, the solid black box is the ROI feature, and the required output is 2x2: bilinear interpolation estimates the values at the blue sampling points (virtual coordinates, also called the grid points of the interpolation), and the output is computed from them. The blue points are ordinary sampling points inside each 2x2 cell; the authors note that their number and placement have little effect on performance, and other choices are possible. Max pooling or average pooling over each orange region then produces the final 2x2 output. No quantization is used anywhere, so no error is introduced: pixels in the original image and pixels in the feature map stay fully aligned, which improves detection accuracy and also benefits instance segmentation. This kind of care matters — in research, details decide success or failure.
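As a concrete illustration, here is a minimal NumPy sketch of bilinear sampling at one fractional location, the core operation RoIAlign performs at each sampling point (the function name and test values are mine, not from the repo):

import numpy as np

# Bilinearly interpolate the value of `feature` at the fractional position (y, x)
# from its four surrounding integer-pixel neighbors.
def bilinear_sample(feature, y, x):
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    return (feature[y0, x0] * (1 - wy) * (1 - wx) +
            feature[y0, x1] * (1 - wy) * wx +
            feature[y1, x0] * wy * (1 - wx) +
            feature[y1, x1] * wy * wx)

feature = np.arange(25, dtype=np.float32).reshape(5, 5)
print(bilinear_sample(feature, 2.78, 1.42))   # value at a non-integer position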

we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.

Figure 11: Bilinear interpolation

6. Loss computation and analysis

With the added mask branch, the loss for each ROI is as follows:
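In the paper this multi-task loss on each sampled ROI is L = Lcls + Lbox + Lmask.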

Here Lcls and Lbox are defined as in Faster R-CNN. For each ROI, the mask branch has a K·m·m-dimensional output that encodes K masks of resolution m×m, one for each of the K classes. We apply a per-pixel sigmoid and define Lmask as the average binary cross-entropy loss. For an ROI whose ground-truth class is k, Lmask is defined only on the k-th mask (the other K−1 mask outputs contribute nothing to the loss). This definition lets the network generate a mask for every class without competition among classes; we rely on the class label predicted by the classification branch to select the output mask, which decouples classification from mask generation. This differs from the usual practice when applying FCNs to semantic segmentation, which uses a per-pixel softmax and a multinomial cross-entropy loss, so that masks across classes compete; with a per-pixel sigmoid and a binary loss, they do not. Experiments show this improves instance segmentation results.
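A minimal NumPy sketch of this per-ROI mask loss (purely illustrative, not the repository implementation; K, m, and the class index below are chosen arbitrarily):

import numpy as np

# Per-ROI mask loss: the branch predicts K masks of size m x m; only the mask of
# the ground-truth class contributes, via per-pixel sigmoid + binary cross-entropy.
def mask_loss(pred_logits, gt_mask, gt_class):
    # pred_logits: [K, m, m] raw outputs; gt_mask: [m, m] in {0, 1}; gt_class: int
    logits = pred_logits[gt_class]                      # select the k-th mask only
    prob = 1.0 / (1.0 + np.exp(-logits))                # per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(prob + eps) + (1 - gt_mask) * np.log(1 - prob + eps))
    return bce.mean()                                   # average binary cross-entropy

K, m = 80, 28
gt = (np.random.rand(m, m) > 0.5).astype(np.float32)
print(mask_loss(np.random.randn(K, m, m), gt, gt_class=17))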

A mask encodes an object's spatial layout in the input. Unlike class labels and box offsets, which are inevitably collapsed into short output vectors by fully connected layers, the spatial structure of a mask can be extracted naturally through the pixel-to-pixel correspondence that convolutions provide. Concretely, we predict an m×m mask from each ROI with an FCN, which allows every layer of the mask branch to maintain an explicit m×m spatial layout instead of flattening it into a vector representation that lacks spatial dimensions. Unlike earlier methods that used fc layers for mask prediction, our experiments show this representation needs fewer parameters and is more accurate. This pixel-to-pixel behavior requires the ROI features — which are small feature maps — to be well aligned, so as to faithfully preserve the explicit per-pixel spatial correspondence; this is what motivated the ROIAlign layer.

 

III. Mask R-CNN Details

1. Head Architecture

Figure 12: Head architecture

As shown above, two head architectures are proposed for producing masks: Faster R-CNN/ResNet on the left and Faster R-CNN/FPN on the right. In the left architecture the backbone is a pretrained ResNet, used up through its fourth stage. Each input ROI first yields a 7x7x1024 ROI feature, which is then expanded to 2048 channels (modifying the original ResNet architecture); after that there are two branches: the upper one handles classification and regression, and the lower one generates the mask. Because many convolutions and poolings have reduced the resolution, the mask branch uses deconvolution to raise it, while reducing the channel count, to 14x14x256, and finally outputs a 14x14x80 mask. The right architecture uses FPN as the backbone, a newer network that builds a feature pyramid from a single-scale input; see this link for details. FPN has been shown to improve detection accuracy and is used by many current methods. Because FPN already contains res5 and uses features more efficiently, fewer filters are needed here. This architecture also splits into two branches with the same roles, but its classification and mask branches differ considerably from the left one. Presumably because FPN can gather useful information from features at multiple scales, the classification branch uses fewer filters. The mask branch applies several convolutions: the ROI is first turned into a 14x14x256 feature, the same operation is repeated five times (I am not sure of the rationale here and would welcome an explanation), then a deconvolution is applied, and the final output is a 28x28x80 mask — a larger and therefore finer mask than in the left architecture.
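For concreteness, here is a hedged tensorflow.keras sketch of an FPN-style mask head along the lines described above (not the exact code from the paper or the repo; the number of 3x3 convolutions and all layer choices here are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# A sketch of the mask head: a few 3x3 convs on the 14x14x256 ROI feature,
# one 2x deconv to 28x28, then a 1x1 conv with per-pixel sigmoid over 80 classes.
def mask_head(num_classes=80, num_convs=4):
    inp = layers.Input(shape=(14, 14, 256))
    x = inp
    for _ in range(num_convs):
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(256, 2, strides=2, activation="relu")(x)   # 14x14 -> 28x28
    out = layers.Conv2D(num_classes, 1, activation="sigmoid")(x)          # 28x28x80 masks
    return tf.keras.Model(inp, out)

print(mask_head().output_shape)   # (None, 28, 28, 80)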


Figure 13: Mask produced for an output bounding box

As shown above, the red BB is the detected object. We can see with the naked eye that the detection is not great: the whole BB is shifted slightly to the right, and some pixels on the left are not covered by it — yet the final mask shown on the right is still very good.

2. Equivariance in Mask R-CNN
Equivariance means that when the input is transformed, the output transforms correspondingly.

Figure 14: Equivariance 1
That is, the fully convolutional features (of the Faster R-CNN network) are equivariant to image transformations: as the image is transformed, the convolutional features transform correspondingly;
Figure 15: Equivariance 2
Fully convolutional computation on an ROI (the FCN) is equivariant to transformations within the ROI;
Figure 16: Equivariance 3
 

ROIAlign操做保持了ROI變換先後的同變性;

圖17 ROI中的全卷積

圖18 ROIAlign的尺度同變性

圖19 Mask R-CNN中的同變性總結

3. Implementation details

Figure 20: Implementation details

From the figure above we can read off the following:

  • The hyperparameters in Mask R-CNN are simply taken from Faster R-CNN — a sensible time-saver that still works well, since they have already been tuned;
  • The pretrained backbones include ResNet50, ResNet101, and FPN, all strong networks; FPN in particular is analyzed below;
  • Overly large images are cropped to 800x800, since very large images would greatly increase the computation;
  • Training uses 8 GPUs with an initial learning rate of 0.01, decayed to 0.001 after 18k iterations; ResNet50-FPN trained for 32 hours and ResNet101-FPN for 44 hours;
  • Inference takes 195 ms per image on an Nvidia Tesla M40 GPU;
  • The MS COCO dataset is used: 120k images split into an 80k training set, a 35k validation set, and a 5k test set;

IV. Performance Comparison

1. Quantitative results

Table 1: ROI Pooling vs. ROIAlign

From the earlier analysis we could already conclude qualitatively that ROIAlign should bring a large gain in detection performance. The table quantifies this: ROIAlign raises mask AP by 10.5 points and box AP by 9.5 points.

Table 2: Multinomial vs. binary loss

As the table shows, Mask R-CNN uses two branches to decouple classification from mask generation and replaces the multinomial loss with a binary loss, so masks of different classes no longer compete. The output mask is selected by the class label predicted by the classification branch, so the mask branch does not have to redo classification; this improves performance.

Table 3: MLP vs. FCN mask heads

As shown above, MLP generates masks with fully connected layers, while FCN generates them with convolutions. In parameter count alone the latter uses far fewer, which saves a lot of memory and speeds up training (fewer parameters to infer and update). In addition, because the features an MLP produces are more abstract, some useful spatial information is lost in the final mask; the difference is visible on the right. Quantitatively, the FCN head raises mask AP by 2.1 points.

Table 4: Instance segmentation results

Table 5: Object detection results

Looking at the detection table, Faster R-CNN with ROIAlign gains 0.9 points, and Mask R-CNN outperforms the best Faster R-CNN entry by 2.6 points.

2. Qualitative results

Figure 21: Instance segmentation results 1
Figure 22: Instance segmentation results 2
Figure 23: Human pose estimation results
Figure 24: Failure case 1
Figure 25: Failure case 2

 

V. Summary

The main contributions of the Mask R-CNN paper are:

  • It analyzes the shortcomings of ROI Pooling and proposes ROIAlign, improving both detection and instance segmentation;
  • It decomposes instance segmentation into a classification branch and a mask-generation branch, selecting the output mask by the predicted class label, and replaces the multinomial loss with a binary loss, eliminating competition between masks of different classes and producing accurate binary masks;
  • Classification and mask generation run in parallel, which speeds up the model.

References:

[1] Kaiming He's slides at ICCV 2017; video link

[2] Ardian Umam's walkthrough of Mask R-CNN; video link

 

Notes:

[1] This is an original post; if you are interested in reposting it, please contact me (QQ email: 1575262785@qq.com) and I will reply as soon as I can. Thank you.

[2] Given my limited ability, this post may contain many problems; suggestions for improvement are welcome.

[3] If anything in this post is unclear, please contact me and I will reply promptly to exchange ideas. Thank you.


 

 

 

Article 2

Foreword: after ten-plus days exploring the R-CNN family of detection algorithms, I have gone from complete beginner to entry level and felt it necessary to write down what I learned; if anything is misunderstood, corrections are welcome. The post contains a lot of code and may take a while to read carefully; I will revise it after my thesis-proposal defense — thanks for your understanding.

 

Introduction to the Mask R-CNN algorithm:

    Mask R-CNN is another major work by Kaiming He, following Faster R-CNN. It integrates object detection and instance segmentation in one framework and also outperforms Faster R-CNN.

Overall framework:

Figure 1. Overall Mask R-CNN architecture

          For comparison, the Faster R-CNN framework is shown as well, taken directly from the paper.


Figure 2. Overall Faster R-CNN architecture

     Comparing the two figures, it is clear that on top of Faster R-CNN, Mask R-CNN adds a mask branch (an FCN) to generate the object mask, and replaces RoI Pooling with RoI Align to handle the misalignment between the mask and the object in the original image. Because it is awkward to draw the FPN structure inside the backbone conv layers that extract the feature maps, FPN's role does not show up in this diagram; it is described later.

   How the main components work

   Going bottom-up, I will cover the backbone, FPN, RPN, anchors, RoIAlign, classification, box regression, and mask in turn.

backbone

    The backbone is a stack of convolutional layers that extracts the image's feature maps; it can be VGG16, VGG19, GoogLeNet, ResNet50, ResNet101, and so on. Here I focus on the ResNet101 structure.

            ResNet (deep residual network) provides a practical way to train much deeper networks. For a while it was believed that deeper networks always give better results, but plain networks degrade once they get too deep. ResNet uses skip connections, which make training easier.

Figure 3. One ResNet block

        The network tries to make a block's output equal f(x) + x, where f(x) is the residual. When the network is very deep, the residual f(x) tends toward 0 (I have not fully worked out why; that is how the experts explain it), so f(x) + x approaches x — an identity mapping — and however deep the network, performance should at least not get worse.

        Only two types of block appear in the network, alternating or repeating throughout ResNet, so next I introduce these two block types (identity block and conv block):

Figure 4. Identity block that skips three convolutions

        The figure shows that the block adds the input x directly to the output of the third convolution, so this x path is called the shortcut. Note that the third convolution on the main path is not followed by an activation; the ReLU is applied only after the addition.

Figure 5. Conv block: skips three convolutions and has a convolution on the shortcut

        It is essentially the same as the identity block, except that a convolution is inserted on the shortcut before the addition. Note that neither the third convolution on the main path nor the convolution on the shortcut is activated; they are added first and then activated.

        In the author's code, the first and third convolutions on the main path are 1x1 convolutions (they change only the channel count of the feature maps, not the height and width), used to reduce dimensionality and speed up the convolutions. Note that the shortcut and the last convolution on the main path must have the same number of channels for the addition to work.
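A hedged tensorflow.keras sketch of the bottleneck identity block just described (1x1 -> 3x3 -> 1x1 on the main path, shortcut added before the final ReLU); the real repo version also carries BatchNorm layers and block naming, omitted here:

from tensorflow.keras import layers

def identity_block(x, filters):
    f1, f2, f3 = filters
    shortcut = x
    y = layers.Conv2D(f1, 1, activation="relu")(x)          # 1x1: reduce channels
    y = layers.Conv2D(f2, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(f3, 1)(y)                             # 1x1: restore channels, no activation yet
    y = layers.Add()([y, shortcut])                         # f(x) + x
    return layers.Activation("relu")(y)                     # ReLU only after the addition

x = layers.Input(shape=(56, 56, 256))
y = identity_block(x, (64, 64, 256))   # output keeps shape (None, 56, 56, 256)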

           下面展現一下ResNet101的總體框架:

  

圖6.ResNet101總體架構

        從圖中能夠得知ResNet分爲了5個stage,C1-C5分別爲每一個Stage的輸出,這些輸出在後面的FPN中會使用到。你能夠數數,看看是否是總共101層,數的時候除去BatchNorm層。注:stage4中是由一個conv_block和22個identity_block,若是要改爲ResNet50網絡的話只須要調整爲5個identity_block.

        ResNet101的介紹算是告一個段落了。

FPN (Feature Pyramid Network)

        FPN was proposed to fuse feature maps better. Typical networks use only the last layer's feature maps, which are semantically strong but low in resolution and localization accuracy, so small objects are easily missed. FPN fuses feature maps from low levels up to high levels, making full use of the features extracted at every stage (C2-C5 of ResNet).

Figure 7. FPN feature fusion

來講可能這

Figure 8. What the "+" in Figure 7 means

        The figure shows what "+" does: the lower (bottom-up) feature layer on the left passes through a 1x1 convolution to get the same channel count as the layer above it; the upper pyramid layer is upsampled to the same height and width as the layer below; the two are then added element-wise, producing a new fused feature layer. For example, C4 goes through a 1x1 convolution to match P5's channels, P5 is upsampled to match C4's height and width, and adding them gives the fused layer P4; the rest follow the same pattern.
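A hedged tensorflow.keras sketch of one such merge step (the function and tensor names are mine; the shapes follow the C4/P5 example above):

import tensorflow as tf
from tensorflow.keras import layers

# 1x1 conv on the bottom-up feature C_i to match channels, 2x upsampling of the
# higher pyramid level P_{i+1}, then an element-wise add to produce P_i.
def fpn_merge(c_i, p_above, channels=256):
    lateral = layers.Conv2D(channels, 1)(c_i)               # match channel count
    upsampled = layers.UpSampling2D(size=(2, 2))(p_above)   # match height and width
    return layers.Add()([lateral, upsampled])

c4 = layers.Input(shape=(50, 50, 1024))   # e.g. C4 from the backbone
p5 = layers.Input(shape=(25, 25, 256))    # P5 from the level above
p4 = fpn_merge(c4, p5)                    # fused P4, shape (None, 50, 50, 256)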

        Note: P2-P5 are later used to predict object bounding boxes, box regression, and masks, while P2-P6 are used to train the RPN; that is, P6 is used only inside the RPN.

anchors

       Anchors are the boxes generated at pixel positions of the feature maps. Each box's size is determined by two parameters, scale and ratio; for example, scale = [128] and ratio = [0.5, 1, 1.5] produce 3 boxes of different shapes per pixel. The three boxes keep the same area and change only their aspect ratio according to the ratio values.

        If we draw the anchors for a single pixel of a feature map, we get the figure below:

Figure 9. Anchors at one pixel

        Because FPN is used, the paper keeps the scale fixed within each feature level and varies only the ratio; deeper levels, whose feature maps are smaller, are assigned larger scales, since each of their positions covers a larger region of the image. The scales used for the five levels are (32, 64, 128, 256, 512) and the ratios are (0.5, 1, 2), so every pixel of every level produces 3 anchors, and in total there are 15 different anchor shapes (a small numeric sketch follows below).
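A small numeric illustration of how one scale and three ratios give three anchors of equal area but different aspect ratios (using scale 128 from the list above; the width/height convention follows generate_anchors later in this post):

import numpy as np

scale, ratios = 128, np.array([0.5, 1, 2])
heights = scale / np.sqrt(ratios)
widths = scale * np.sqrt(ratios)
for r, w, h in zip(ratios, widths, heights):
    # every anchor keeps the same area (128*128), only the aspect ratio changes
    print("ratio %.1f -> %.0f x %.0f (area %.0f)" % (r, w, h, w * h))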

        At the center point of the image there would thus be 15 anchors of different sizes, as shown below:

Figure 10. Anchors at the image center point

 

RPN (Region Proposal Network)

          RPN is, as the name says, a region proposal network: it recommends regions of interest to the rest of the network and is also a key part of Faster R-CNN.

Figure 11. RPN illustration from the paper

1. conv feature map: the P2-P6 levels described above.

2. k anchor boxes: the reference regions initialized at each sliding-window position. Every sliding-window position uses the same set of anchor boxes, so once the position's coordinates are known, each anchor box's exact coordinates can be computed. Each feature level uses k = 3: a base anchor size is fixed for the level, and keeping the area constant while varying the aspect ratio over (0.5, 1, 2) yields the 3 anchors.
3. intermediate layer: the author's code uses a 512-d convolutional intermediate layer, followed by 1x1 convolutions that produce the 2k scores and 4k coordinates. The author describes this as replacing fully connected layers with a fully convolutional head.
4. 2k scores: for each anchor, a softmax layer yields two confidences, one for foreground and one for background.

5. 4k coordinates: the coordinates for each box. These are not the anchor's absolute coordinates but regression offsets relative to the ground truth (a minimal sketch of such a head is given below).
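A hedged tensorflow.keras sketch of the RPN head outlined in points 3-5 (this is not the repo's build_rpn_model; the 512-channel intermediate layer follows the text, everything else is an assumption):

import tensorflow as tf
from tensorflow.keras import layers

def rpn_head(feature, k=3):
    shared = layers.Conv2D(512, 3, padding="same", activation="relu")(feature)  # intermediate layer
    class_logits = layers.Conv2D(2 * k, 1)(shared)   # later reshaped to [anchors, 2] and softmaxed
    bbox_deltas = layers.Conv2D(4 * k, 1)(shared)    # later reshaped to [anchors, (dy, dx, log(dh), log(dw))]
    return class_logits, bbox_deltas

p_level = layers.Input(shape=(32, 32, 256))          # one pyramid level, e.g. P4
logits, deltas = rpn_head(p_level)
model = tf.keras.Model(p_level, [logits, deltas])
print(model.output_shape)                            # [(None, 32, 32, 6), (None, 32, 32, 12)]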

        The concrete structure of the RPN in the author's code is as follows:

Figure 12. RPN structure

        Note: when I first read the code I was puzzled about why the RPN is given only the feature map and k, without the anchors created earlier. Later I realized the author handles this during data generation: when producing the data, every created anchor is already labeled positive or negative, together with the box regression targets, so only the RPN itself has to be trained.

RoIAlign

           Mask R-CNN introduces a new idea, RoIAlign, which is really a small modification of RoI Pooling. Why can't RoI Pooling be used in this model? Let's look at it intuitively.

    Figure 13. RoIAlign vs. RoI Pooling

        RoI Pooling rounds twice. Rounding on the feature map looks like a sub-pixel matter, but mapped back to the original image the deviation is large: the first rounding discards 0.78, which back in the original image is 0.78 * 32 ≈ 25 pixels, and the deviation after the second rounding is larger still. For classification and detection this may not be a big error, but for instance segmentation it is, because a misaligned mask is very visible. RoIAlign was proposed precisely to fix this misalignment.

        The idea behind RoIAlign is simple: drop the crude rounding and instead use bilinear interpolation (my senior mentioned a paper that used integration instead and gained some performance) to obtain the values at a fixed set of sampled points, so the otherwise discontinuous operation becomes continuous and the error when mapping back to the original image is much smaller.

        1. Divide the ROI into 7*7 bins (the ROI can be mapped onto the feature map exactly, without the first quantization).

Figure 14. Splitting the ROI into 7*7 bins

        2. Bilinearly interpolate four points inside each bin (the paper notes that interpolating a single point works about as well as four, so for convenience the author's code interpolates just one point).

Figure 15. Interpolation illustration

        3. After interpolation, max pooling over the bins gives the final 7*7 ROI, which completes RoIAlign — the method behind the impressive name is actually quite simple (a minimal crop-and-resize sketch follows below).
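A minimal TF2-style sketch of the bilinear crop that the repo's RoIAlign ultimately relies on (tf.image.crop_and_resize); the feature-map size and box values here are made up:

import tensorflow as tf

# crop_and_resize resamples each normalized box from the feature map to a fixed
# pool size with bilinear interpolation, without rounding the box coordinates.
feature_map = tf.random.normal([1, 32, 32, 256])       # [batch, H, W, C]
boxes = tf.constant([[0.12, 0.08, 0.61, 0.47]])        # (y1, x1, y2, x2), normalized, non-integer
box_indices = tf.constant([0])                         # which batch item each box belongs to
pooled = tf.image.crop_and_resize(feature_map, boxes, box_indices,
                                  crop_size=[7, 7], method="bilinear")
print(pooled.shape)   # (1, 7, 7, 256)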

classifier

         This part produces the final detection classes and bounding boxes. It classifies and regresses each of the previously obtained ROIs (done separately for every ROI).

Figure 16. Structure of the classifier head

        The paper describes fully connected layers with 1024 units, but in the code the author replaces them with convolution layers of depth 1024.

mask

         Mask prediction also happens after the ROIs, using an FCN (Fully Convolutional Network). Note that what it actually performs is semantic segmentation, not instance segmentation: since each ROI corresponds to a single object, semantic segmentation inside the ROI amounts to instance segmentation. This is what distinguishes Mask R-CNN from other segmentation frameworks — classify first, then segment.

Figure 17. Structure of the mask head

        Each ROI's mask output has 80 classes, because the COCO dataset has 80 categories; this design weakens competition between classes and gives better results.

        The model's training and prediction paths are separate, not one shared pipeline. During training, the classifier and mask branches run at the same time; during prediction, the classifier result is obtained first and then passed to the mask branch to produce the mask, so there is an ordering.

 

Mask R-CNN code walkthrough

     The code discussed here is by Matterport: code github link. It documents each part in detail and provides a demo plus step-by-step visualizations.

Overall code structure

       First, here is my understanding of the code's flow, together with the flowcharts I drew.

Figure 18. Training flow in the code

Figure 19. Prediction flow in the code

            The two flowcharts already show every key part of the code, marking which layers are used during training and which during prediction, along with each layer's inputs and outputs. Training and prediction clearly differ quite a lot: as noted earlier, during training the mask and classifier branches are parallel, while during prediction the classifier runs first and then the mask, and the inputs and outputs of the two models also differ considerably.

         There is already an excellent blog post that runs through the author's ipynb notebooks and adds its own commentary; it helped me a lot when I was first exploring Mask R-CNN. If you are interested: blog link.

         Here I mainly walk through a few of the author's .py files — visualize.py, utils.py, model.py — and finally show how to use the code to process video.

           Because the codebase is large, I only paste the parts I consider important.

visualize.py

##利用不一樣的顏色爲每一個instance標註出mask,根據box的座標在instance的周圍畫上矩形 ##根據class_ids來尋找到對於的class_names。三個步驟中的任何一個均可以去掉,好比把mask部分 ##去掉,那就只剩下box和label。同時能夠篩選出class_ids從而顯示制定類別的instance顯示,下面 ##這段就是用來顯示人的,其實也就把人的id選出來,而後記錄它們在輸入ids中的相對位置,從而獲得 ##相對應的box與mask的準確順序 def display_instances_person(image, boxes, masks, class_ids, class_names, scores=None, title="", figsize=(16, 16), ax=None): """ the funtion perform a role for displaying the persons who locate in the image boxes: [num_instance, (y1, x1, y2, x2, class_id)] in image coordinates. masks: [height, width, num_instances] class_ids: [num_instances] class_names: list of class names of the dataset scores: (optional) confidence scores for each box figsize: (optional) the size of the image. """ #compute the number of person temp = [] for i, person in enumerate(class_ids): if person == 1: temp.append(i) else: pass person_number = len(temp) person_site = {} for i in range(person_number): person_site[i] = temp[i] NN = boxes.shape[0] # Number of person'instances #N = boxes.shape[0] N = person_number if not N: print("\n*** No person to display *** \n") else: # assert boxes.shape[0] == masks.shape[-1] == class_ids.shape[0] pass if not ax: _, ax = plt.subplots(1, figsize=figsize) # Generate random colors colors = random_colors(NN) # Show area outside image boundaries. height, width = image.shape[:2] ax.set_ylim(height + 10, -10) ax.set_xlim(-10, width + 10) ax.axis('off') ax.set_title(title) masked_image = image.astype(np.uint32).copy() for a in range(N): color = colors[a] i = person_site[a] # Bounding box if not np.any(boxes[i]): # Skip this instance. Has no bbox. Likely lost in image cropping. continue y1, x1, y2, x2 = boxes[i] p = patches.Rectangle((x1, y1), x2 - x1, y2 - y1, linewidth=2, alpha=0.7, linestyle="dashed", edgecolor=color, facecolor='none') ax.add_patch(p) # Label class_id = class_ids[i] score = scores[i] if scores is not None else None label = class_names[class_id] x = random.randint(x1, (x1 + x2) // 2) caption = "{} {:.3f}".format(label, score) if score else label ax.text(x1, y1 + 8, caption, color='w', size=11, backgroundcolor="none") # Mask mask = masks[:, :, i] masked_image = apply_mask(masked_image, mask, color) # Mask Polygon # Pad to ensure proper polygons for masks that touch image edges. padded_mask = np.zeros( (mask.shape[0] + 2, mask.shape[1] + 2), dtype=np.uint8) padded_mask[1:-1, 1:-1] = mask contours = find_contours(padded_mask, 0.5) for verts in contours: # Subtract the padding and flip (y, x) to (x, y) verts = np.fliplr(verts) - 1 p = Polygon(verts, facecolor="none", edgecolor=color) ax.add_patch(p) ax.imshow(masked_image.astype(np.uint8)) plt.show()

utils.py

## Because a custom layer here only supports a batch size of 1, the input is split into
## batch-1 slices, each slice is run through graph_fn, and the outputs are stacked back
## together -- indirectly computing a whole batch.
# ## Batch Slicing
# Some custom layers support a batch size of 1 only, and require a lot of work
# to support batches greater than 1. This function slices an input tensor
# across the batch dimension and feeds batches of size 1. Effectively,
# an easy way to support batches > 1 quickly with little code modification.
# In the long run, it's more efficient to modify the code to support large
# batches and getting rid of this function. Consider this a temporary solution
def batch_slice(inputs, graph_fn, batch_size, names=None):
    """Splits inputs into slices and feeds each slice to a copy of the given
    computation graph and then combines the results. It allows you to run a
    graph on a batch of inputs even if the graph is written to support one
    instance only.

    inputs: list of tensors. All must have the same first dimension length
    graph_fn: A function that returns a TF tensor that's part of a graph.
    batch_size: number of slices to divide the data into.
    names: If provided, assigns names to the resulting tensors.
    """
    if not isinstance(inputs, list):
        inputs = [inputs]

    outputs = []
    for i in range(batch_size):
        inputs_slice = [x[i] for x in inputs]
        output_slice = graph_fn(*inputs_slice)
        if not isinstance(output_slice, (tuple, list)):
            output_slice = [output_slice]
        outputs.append(output_slice)
    # Change outputs from a list of slices where each is
    # a list of outputs to a list of outputs and each has
    # a list of slices
    outputs = list(zip(*outputs))

    if names is None:
        names = [None] * len(outputs)

    result = [tf.stack(o, axis=0, name=n)
              for o, n in zip(outputs, names)]
    if len(result) == 1:
        result = result[0]

    return result
############################################################ # Anchors ############################################################ ##對特徵圖上的pixel產生anchors,根據anchor_stride來肯定pixel產生anchors的密度 ##便是每一個像素點產生anchors,仍是每兩個產生,以此類推 def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride): """ scales: 1D array of anchor sizes in pixels. Example: [32, 64, 128] ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2] shape: [height, width] spatial shape of the feature map over which to generate anchors. feature_stride: Stride of the feature map relative to the image in pixels. anchor_stride: Stride of anchors on the feature map. For example, if the value is 2 then generate anchors for every other feature map pixel. """ # Get all combinations of scales and ratios scales, ratios = np.meshgrid(np.array(scales), np.array(ratios)) scales = scales.flatten() ratios = ratios.flatten() # Enumerate heights and widths from scales and ratios heights = scales / np.sqrt(ratios) widths = scales * np.sqrt(ratios) # Enumerate shifts in feature space shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y) # Enumerate combinations of shifts, widths, and heights box_widths, box_centers_x = np.meshgrid(widths, shifts_x) box_heights, box_centers_y = np.meshgrid(heights, shifts_y) # Reshape to get a list of (y, x) and a list of (h, w) box_centers = np.stack( [box_centers_y, box_centers_x], axis=2).reshape([-1, 2]) box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2]) # Convert to corner coordinates (y1, x1, y2, x2) boxes = np.concatenate([box_centers - 0.5 * box_sizes, box_centers + 0.5 * box_sizes], axis=1) return boxes #調用generate_anchors()爲每一層的feature map都生成anchors,最終在合成在一塊。本身層中的scale是相同的 def generate_pyramid_anchors(scales, ratios, feature_shapes, feature_strides, anchor_stride): """Generate anchors at different levels of a feature pyramid. Each scale is associated with a level of the pyramid, but each ratio is used in all levels of the pyramid. Returns: anchors: [N, (y1, x1, y2, x2)]. All generated anchors in one array. Sorted with the same order of the given scales. So, anchors of scale[0] come first, then anchors of scale[1], and so on. """ # Anchors # [anchor_count, (y1, x1, y2, x2)] anchors = [] for i in range(len(scales)): anchors.append(generate_anchors(scales[i], ratios, feature_shapes[i], feature_strides[i], anchor_stride)) return np.concatenate(anchors, axis=0) 

model.py

 

### Builds the ResNet101 (or ResNet50) backbone; identity_block and conv_block are the two block types explained above.
def resnet_graph(input_image, architecture, stage5=False):
    assert architecture in ["resnet50", "resnet101"]
    # Stage 1
    x = KL.ZeroPadding2D((3, 3))(input_image)
    x = KL.Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True)(x)
    x = BatchNorm(axis=3, name='bn_conv1')(x)
    x = KL.Activation('relu')(x)
    C1 = x = KL.MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
    # Stage 2
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1))
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b')
    C2 = x = identity_block(x, 3, [64, 64, 256], stage=2, block='c')
    # Stage 3
    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a')
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='b')
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='c')
    C3 = x = identity_block(x, 3, [128, 128, 512], stage=3, block='d')
    # Stage 4
    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a')
    block_count = {"resnet50": 5, "resnet101": 22}[architecture]
    for i in range(block_count):
        x = identity_block(x, 3, [256, 256, 1024], stage=4, block=chr(98 + i))
    C4 = x
    # Stage 5
    if stage5:
        x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a')
        x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b')
        C5 = x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c')
    else:
        C5 = None
    return [C1, C2, C3, C4, C5]

Proposal Layer:

class ProposalLayer(KE.Layer): """ Inputs: rpn_probs: [batch, anchors, (bg prob, fg prob)] rpn_bbox: [batch, anchors, (dy, dx, log(dh), log(dw))] Returns: Proposals in normalized coordinates [batch, rois, (y1, x1, y2, x2)] """ def __init__(self, proposal_count, nms_threshold, anchors, config=None, **kwargs): """ anchors: [N, (y1, x1, y2, x2)] anchors defined in image coordinates """ super(ProposalLayer, self).__init__(**kwargs) self.config = config self.proposal_count = proposal_count self.nms_threshold = nms_threshold self.anchors = anchors.astype(np.float32) def call(self, inputs): ###實現了將傳入的anchors,及其scores、deltas進行topK的推薦和nms的推薦,最終輸出 ###數量爲proposal_counts的proposals。其中的scores和deltas都是RPN網絡中獲得的 # Box Scores. Use the foreground class confidence. [Batch, num_rois, 1] scores = inputs[0][:, :, 1] # Box deltas [batch, num_rois, 4] deltas = inputs[1] deltas = deltas * np.reshape(self.config.RPN_BBOX_STD_DEV, [1, 1, 4]) # Base anchors anchors = self.anchors # Improve performance by trimming to top anchors by score # and doing the rest on the smaller subset. pre_nms_limit = min(6000, self.anchors.shape[0]) ix = tf.nn.top_k(scores, pre_nms_limit, sorted=True, name="top_anchors").indices scores = utils.batch_slice([scores, ix], lambda x, y: tf.gather(x, y), self.config.IMAGES_PER_GPU) deltas = utils.batch_slice([deltas, ix], lambda x, y: tf.gather(x, y), self.config.IMAGES_PER_GPU) anchors = utils.batch_slice(ix, lambda x: tf.gather(anchors, x), self.config.IMAGES_PER_GPU, names=["pre_nms_anchors"]) # Apply deltas to anchors to get refined anchors. # [batch, N, (y1, x1, y2, x2)] ##利用deltas在anchors上,獲得精化的boxs boxes = utils.batch_slice([anchors, deltas], lambda x, y: apply_box_deltas_graph(x, y), self.config.IMAGES_PER_GPU, names=["refined_anchors"]) # Clip to image boundaries. [batch, N, (y1, x1, y2, x2)] height, width = self.config.IMAGE_SHAPE[:2] window = np.array([0, 0, height, width]).astype(np.float32) boxes = utils.batch_slice(boxes, lambda x: clip_boxes_graph(x, window), self.config.IMAGES_PER_GPU, names=["refined_anchors_clipped"]) # Filter out small boxes # According to Xinlei Chen's paper, this reduces detection accuracy # for small objects, so we're skipping it. # Normalize dimensions to range of 0 to 1. normalized_boxes = boxes / np.array([[height, width, height, width]]) # Non-max suppression def nms(normalized_boxes, scores): indices = tf.image.non_max_suppression( normalized_boxes, scores, self.proposal_count, self.nms_threshold, name="rpn_non_max_suppression") proposals = tf.gather(normalized_boxes, indices) # Pad if needed padding = tf.maximum(self.proposal_count - tf.shape(proposals)[0], 0) ##填充到與proposal_count的數量同樣,往下填充。 proposals = tf.pad(proposals, [(0, padding), (0, 0)]) return proposals proposals = utils.batch_slice([normalized_boxes, scores], nms, self.config.IMAGES_PER_GPU) return proposals

 

RoIAlign Layer:

 

class PyramidROIAlign(KE.Layer): """Implements ROI Pooling on multiple levels of the feature pyramid. Params: - pool_shape: [height, width] of the output pooled regions. Usually [7, 7] - image_shape: [height, width, chanells]. Shape of input image in pixels Inputs: - boxes: [batch, num_boxes, (y1, x1, y2, x2)] in normalized coordinates. Possibly padded with zeros if not enough boxes to fill the array. - Feature maps: List of feature maps from different levels of the pyramid. Each is [batch, height, width, channels] Output: Pooled regions in the shape: [batch, num_boxes, height, width, channels]. The width and height are those specific in the pool_shape in the layer constructor. """ def __init__(self, pool_shape, image_shape, **kwargs): super(PyramidROIAlign, self).__init__(**kwargs) self.pool_shape = tuple(pool_shape) self.image_shape = tuple(image_shape) def call(self, inputs): ##計算在不一樣層的ROI下的ROIalig pooling,應該是計算了每個lever的全部channels # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords boxes = inputs[0] # Feature Maps. List of feature maps from different level of the # feature pyramid. Each is [batch, height, width, channels] feature_maps = inputs[1:] # Assign each ROI to a level in the pyramid based on the ROI area. y1, x1, y2, x2 = tf.split(boxes, 4, axis=2) h = y2 - y1 w = x2 - x1 # Equation 1 in the Feature Pyramid Networks paper. Account for # the fact that our coordinates are normalized here. # e.g. a 224x224 ROI (in pixels) maps to P4 ###計算ROI屬於FPN中的哪個level image_area = tf.cast( self.image_shape[0] * self.image_shape[1], tf.float32) roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area))) roi_level = tf.minimum(5, tf.maximum( 2, 4 + tf.cast(tf.round(roi_level), tf.int32))) roi_level = tf.squeeze(roi_level, 2) # Loop through levels and apply ROI pooling to each. P2 to P5. pooled = [] box_to_level = [] for i, level in enumerate(range(2, 6)): ##應該是一個二維的array,存儲這哪一層的哪些box的indicies ix = tf.where(tf.equal(roi_level, level)) level_boxes = tf.gather_nd(boxes, ix) # Box indicies for crop_and_resize. ##應該是隻存儲當前lever的box的indices box_indices = tf.cast(ix[:, 0], tf.int32) # Keep track of which box is mapped to which level box_to_level.append(ix) # Stop gradient propogation to ROI proposals level_boxes = tf.stop_gradient(level_boxes) box_indices = tf.stop_gradient(box_indices) # 由於插值一個點和四個點的性能影響不大故插一個點 pooled.append(tf.image.crop_and_resize( feature_maps[i], level_boxes, box_indices, self.pool_shape, method="bilinear")) # Pack pooled features into one tensor pooled = tf.concat(pooled, axis=0) # Pack box_to_level mapping into one array and add another # column representing the order of pooled boxes box_to_level = tf.concat(box_to_level, axis=0) box_range = tf.expand_dims(tf.range(tf.shape(box_to_level)[0]), 1) box_to_level = tf.concat([tf.cast(box_to_level, tf.int32), box_range], axis=1) # Rearrange pooled features to match the order of the original boxes # Sort box_to_level by batch then box index # TF doesn't have a way to sort by two columns, so merge them and sort. sorting_tensor = box_to_level[:, 0] * 100000 + box_to_level[:, 1] ix = tf.nn.top_k(sorting_tensor, k=tf.shape( box_to_level)[0]).indices[::-1] ix = tf.gather(box_to_level[:, 2], ix) pooled = tf.gather(pooled, ix) # Re-add the batch dimension pooled = tf.expand_dims(pooled, 0) return pooled

 

The main function of DetectionTargetLayer: detection_targets_graph()

 

#根據proposal和gt_box的overlap來肯定正樣本和負樣本,並按照sample_ratio和train_anchor_per_image #的大小進行採樣,最終得出rois(n&p),class_id,delta,masks,其中進行了padding def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config): #Subsamples 抽樣 """Generates detection targets for one image. Subsamples proposals and generates target class IDs, bounding box deltas, and masks for each. Inputs: proposals: [N, (y1, x1, y2, x2)] in normalized coordinates. Might be zero padded if there are not enough proposals. gt_class_ids: [MAX_GT_INSTANCES] int class IDs gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates. gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type. Returns: Target ROIs and corresponding class IDs, bounding box shifts, and masks. rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded. deltas: [TRAIN_ROIS_PER_IMAGE, NUM_CLASSES, (dy, dx, log(dh), log(dw))] Class-specific bbox refinments. masks: [TRAIN_ROIS_PER_IMAGE, height, width). Masks cropped to bbox boundaries and resized to neural network output size. Note: Returned arrays might be zero padded if not enough target ROIs. """ # Assertions asserts = [ tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals], name="roi_assertion"), ] with tf.control_dependencies(asserts): proposals = tf.identity(proposals) # Remove zero padding proposals, _ = trim_zeros_graph(proposals, name="trim_proposals") gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes") gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros, name="trim_gt_class_ids") gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2, name="trim_gt_masks") # Handle COCO crowds # A crowd box in COCO is a bounding box around several instances. Exclude # them from training. A crowd box is given a negative class ID. crowd_ix = tf.where(gt_class_ids < 0)[:, 0] non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0] crowd_boxes = tf.gather(gt_boxes, crowd_ix) crowd_masks = tf.gather(gt_masks, crowd_ix, axis=2) gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix) gt_boxes = tf.gather(gt_boxes, non_crowd_ix) gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2) # Compute overlaps matrix [proposals, gt_boxes] overlaps = overlaps_graph(proposals, gt_boxes) # Compute overlaps with crowd boxes [anchors, crowds] crowd_overlaps = overlaps_graph(proposals, crowd_boxes) crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1) no_crowd_bool = (crowd_iou_max < 0.001) # Determine postive and negative ROIs roi_iou_max = tf.reduce_max(overlaps, axis=1) # 1. Positive ROIs are those with >= 0.5 IoU with a GT box positive_roi_bool = (roi_iou_max >= 0.5) positive_indices = tf.where(positive_roi_bool)[:, 0] # 2. Negative ROIs are those with < 0.5 with every GT box. Skip crowds. negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0] # Subsample ROIs. Aim for 33% positive # Positive ROIs positive_count = int(config.TRAIN_ROIS_PER_IMAGE * config.ROI_POSITIVE_RATIO) positive_indices = tf.random_shuffle(positive_indices)[:positive_count] positive_count = tf.shape(positive_indices)[0] # Negative ROIs. Add enough to maintain positive:negative ratio. 
r = 1.0 / config.ROI_POSITIVE_RATIO negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count negative_indices = tf.random_shuffle(negative_indices)[:negative_count] # Gather selected ROIs positive_rois = tf.gather(proposals, positive_indices) negative_rois = tf.gather(proposals, negative_indices) # Assign positive ROIs to GT boxes. positive_overlaps = tf.gather(overlaps, positive_indices) ##最終須要的indices roi_gt_box_assignment = tf.argmax(positive_overlaps, axis=1) roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment) roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment) # Compute bbox refinement for positive ROIs deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes) deltas /= config.BBOX_STD_DEV # Assign positive ROIs to GT masks # Permute masks to [N, height, width, 1] transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1) # Pick the right mask for each ROI roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment) # Compute mask targets boxes = positive_rois if config.USE_MINI_MASK: # Transform ROI coordinates from normalized image space # to normalized mini-mask space. y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1) gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1) gt_h = gt_y2 - gt_y1 gt_w = gt_x2 - gt_x1 y1 = (y1 - gt_y1) / gt_h x1 = (x1 - gt_x1) / gt_w y2 = (y2 - gt_y1) / gt_h x2 = (x2 - gt_x1) / gt_w boxes = tf.concat([y1, x1, y2, x2], 1) box_ids = tf.range(0, tf.shape(roi_masks)[0]) masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes, box_ids, config.MASK_SHAPE) # Remove the extra dimension from masks. masks = tf.squeeze(masks, axis=3) # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with # binary cross entropy loss. masks = tf.round(masks) ##進行填充 # Append negative ROIs and pad bbox deltas and masks that # are not used for negative ROIs with zeros. rois = tf.concat([positive_rois, negative_rois], axis=0) N = tf.shape(negative_rois)[0] P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0) rois = tf.pad(rois, [(0, P), (0, 0)]) roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)]) roi_gt_cliass_ids = tf.pad(roi_gt_class_ids, [(0, N + P)]) deltas = tf.pad(deltas, [(0, N + P), (0, 0)]) masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)]) return rois, roi_gt_class_ids, deltas, masks

The main function of DetectionLayer: refine_detections()

#根據rios和probs(每一個ROI都有總類別個數的probs)和deltas進行檢測的精化,獲得固定數量的精化目標。 def refine_detections(rois, probs, deltas, window, config): """Refine classified proposals and filter overlaps and return final detections. #輸入爲N個rois、N個具備num_classes的probs,scores由probs得出 Inputs: rois: [N, (y1, x1, y2, x2)] in normalized coordinates probs: [N, num_classes]. Class probabilities. deltas: [N, num_classes, (dy, dx, log(dh), log(dw))]. Class-specific bounding box deltas. window: (y1, x1, y2, x2) in image coordinates. The part of the image that contains the image excluding the padding. Returns detections shaped: [N, (y1, x1, y2, x2, class_id, score)] """ # Class IDs per ROI class_ids = np.argmax(probs, axis=1) # Class probability of the top class of each ROI class_scores = probs[np.arange(class_ids.shape[0]), class_ids] # Class-specific bounding box deltas deltas_specific = deltas[np.arange(deltas.shape[0]), class_ids] # Apply bounding box deltas # Shape: [boxes, (y1, x1, y2, x2)] in normalized coordinates refined_rois = utils.apply_box_deltas( rois, deltas_specific * config.BBOX_STD_DEV) # Convert coordiates to image domain # TODO: better to keep them normalized until later height, width = config.IMAGE_SHAPE[:2] refined_rois *= np.array([height, width, height, width]) # Clip boxes to image window refined_rois = clip_to_window(window, refined_rois) # Round and cast to int since we're deadling with pixels now refined_rois = np.rint(refined_rois).astype(np.int32) # TODO: Filter out boxes with zero area # Filter out background boxes keep = np.where(class_ids > 0)[0] # Filter out low confidence boxes if config.DETECTION_MIN_CONFIDENCE: keep = np.intersect1d( keep, np.where(class_scores >= config.DETECTION_MIN_CONFIDENCE)[0]) #留下既知足是前景又知足scores大於MIN_CONFIDENCE的 # Apply per-class NMS pre_nms_class_ids = class_ids[keep] pre_nms_scores = class_scores[keep] pre_nms_rois = refined_rois[keep] nms_keep = [] #分類別的進行NMS。 for class_id in np.unique(pre_nms_class_ids): # Pick detections of this class ixs = np.where(pre_nms_class_ids == class_id)[0] # Apply NMS class_keep = utils.non_max_suppression( pre_nms_rois[ixs], pre_nms_scores[ixs], config.DETECTION_NMS_THRESHOLD) # Map indicies class_keep = keep[ixs[class_keep]] nms_keep = np.union1d(nms_keep, class_keep) keep = np.intersect1d(keep, nms_keep).astype(np.int32) # Keep top detections roi_count = config.DETECTION_MAX_INSTANCES top_ids = np.argsort(class_scores[keep])[::-1][:roi_count] keep = keep[top_ids] # Arrange output as [N, (y1, x1, y2, x2, class_id, score)] # Coordinates are in image domain. result = np.hstack((refined_rois[keep], class_ids[keep][..., np.newaxis], class_scores[keep][..., np.newaxis])) return result

        Network structures such as the RPN, fpn_classifier_graph, and build_fpn_mask_graph match the paper's description, so I won't paste that code here.

        Because the paper adds the mask branch, the mask loss — which did not exist in Faster R-CNN — is shown separately below.

## Computes the binary cross-entropy loss between the predicted masks and the ground-truth
## masks. Only positive ROIs contribute to the loss, and each ROI contributes only the mask
## of its own class (every ROI predicts num_classes masks, to avoid competition between classes).
def mrcnn_mask_loss_graph(target_masks, target_class_ids, pred_masks):
    """Mask binary cross-entropy loss for the masks head.

    target_masks: [batch, num_rois, height, width].
        A float32 tensor of values 0 or 1. Uses zero padding to fill array.
    target_class_ids: [batch, num_rois]. Integer class IDs. Zero padded.
    pred_masks: [batch, proposals, height, width, num_classes] float32 tensor
                with values from 0 to 1.
    """
    # Reshape for simplicity. Merge first two dimensions into one.
    target_class_ids = K.reshape(target_class_ids, (-1,))
    mask_shape = tf.shape(target_masks)
    target_masks = K.reshape(target_masks, (-1, mask_shape[2], mask_shape[3]))
    pred_shape = tf.shape(pred_masks)
    # shape: [batch*proposals, height, width, num_classes]
    pred_masks = K.reshape(pred_masks,
                           (-1, pred_shape[2], pred_shape[3], pred_shape[4]))
    # Permute predicted masks to [N, num_classes, height, width]
    pred_masks = tf.transpose(pred_masks, [0, 3, 1, 2])

    # Only positive ROIs contribute to the loss. And only
    # the class specific mask of each ROI.
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_class_ids = tf.cast(
        tf.gather(target_class_ids, positive_ix), tf.int64)
    indices = tf.stack([positive_ix, positive_class_ids], axis=1)

    # Gather the masks (predicted and true) that contribute to loss
    y_true = tf.gather(target_masks, positive_ix)
    y_pred = tf.gather_nd(pred_masks, indices)

    # Compute binary cross entropy. If no positive ROIs, then return 0.
    # shape: [batch, roi, num_classes]
    loss = K.switch(tf.size(y_true) > 0,
                    K.binary_crossentropy(target=y_true, output=y_pred),
                    tf.constant(0.0))
    loss = K.mean(loss)
    loss = K.reshape(loss, [1, 1])
    return loss

            The data-generation part contains three main functions:

            The first is load_image_gt(dataset, config, image_id, augment=False, use_mini_mask=False). It builds on the Dataset class in utils.py and, given an image_id, reads the image's gt_masks, gt_boxes, instances, and gt_class_ids. If this is unfamiliar, look at the functions of the Dataset parent class.

           The second is build_detection_targets(), which does roughly the same job as DetectionTargetLayer but is there to help readers when visualizing, or to debug and train Mask R-CNN without using the RPN.

       The third is build_rpn_targets(image_shape, anchors, gt_class_ids, gt_boxes, config). It matches anchors to gt_boxes by overlap, subsamples the anchors (discarding positives beyond half the sample and then computing how many negatives to keep), and finally computes the deltas between the kept positive anchors and their matched gt_boxes. In the returned rpn_match, -1 means negative, 0 neutral, and 1 positive; data_generator() uses this later.

            Next comes the star of this section, data_generator(), a generator that produces the data used for training and consumed by the various layers. It is worth noting what the generator yields.

###產生一系列的數據的generator def data_generator(dataset, config, shuffle=True, augment=True, random_rois=0, batch_size=1, detection_targets=False): """ - images: [batch, H, W, C] - image_meta: [batch, size of image meta] - rpn_match: [batch, N] Integer (1=positive anchor, -1=negative, 0=neutral) - rpn_bbox: [batch, N, (dy, dx, log(dh), log(dw))] Anchor bbox deltas. - gt_class_ids: [batch, MAX_GT_INSTANCES] Integer class IDs - gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] - gt_masks: [batch, height, width, MAX_GT_INSTANCES]. The height and width are those of the image unless use_mini_mask is True, in which case they are defined in MINI_MASK_SHAPE. outputs list: Usually empty in regular training. But if detection_targets is True then the outputs list contains target class_ids, bbox deltas, and masks. """ b = 0 # batch item index image_index = -1 image_ids = np.copy(dataset.image_ids) error_count = 0 # Anchors # [anchor_count, (y1, x1, y2, x2)] anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES, config.RPN_ANCHOR_RATIOS, config.BACKBONE_SHAPES, config.BACKBONE_STRIDES, config.RPN_ANCHOR_STRIDE) # Keras requires a generator to run indefinately. while True: try: ##只有在epoch的時候進行打亂 # Increment index to pick next image. Shuffle if at the start of an epoch. image_index = (image_index + 1) % len(image_ids) if shuffle and image_index == 0: np.random.shuffle(image_ids) #利用第一個函數獲得該圖像所對應的全部groundtruth值 # Get GT bounding boxes and masks for image. image_id = image_ids[image_index] image, image_meta, gt_class_ids, gt_boxes, gt_masks = \ load_image_gt(dataset, config, image_id, augment=augment, use_mini_mask=config.USE_MINI_MASK) # Skip images that have no instances. This can happen in cases # where we train on a subset of classes and the image doesn't # have any of the classes we care about. 
if not np.any(gt_class_ids > 0): continue # RPN Targets ##返回錨點中positive,neutral,negative分類信息和positive的anchors與gt_boxes的delta rpn_match, rpn_bbox = build_rpn_targets(image.shape, anchors, gt_class_ids, gt_boxes, config) # Mask R-CNN Targets if random_rois: rpn_rois = generate_random_rois( image.shape, random_rois, gt_class_ids, gt_boxes) if detection_targets: rois, mrcnn_class_ids, mrcnn_bbox, mrcnn_mask =\ build_detection_targets( rpn_rois, gt_class_ids, gt_boxes, gt_masks, config) # Init batch arrays if b == 0: batch_image_meta = np.zeros( (batch_size,) + image_meta.shape, dtype=image_meta.dtype) batch_rpn_match = np.zeros( [batch_size, anchors.shape[0], 1], dtype=rpn_match.dtype) batch_rpn_bbox = np.zeros( [batch_size, config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4], dtype=rpn_bbox.dtype) batch_images = np.zeros( (batch_size,) + image.shape, dtype=np.float32) batch_gt_class_ids = np.zeros( (batch_size, config.MAX_GT_INSTANCES), dtype=np.int32) batch_gt_boxes = np.zeros( (batch_size, config.MAX_GT_INSTANCES, 4), dtype=np.int32) if config.USE_MINI_MASK: batch_gt_masks = np.zeros((batch_size, config.MINI_MASK_SHAPE[0], config.MINI_MASK_SHAPE[1], config.MAX_GT_INSTANCES)) else: batch_gt_masks = np.zeros( (batch_size, image.shape[0], image.shape[1], config.MAX_GT_INSTANCES)) if random_rois: batch_rpn_rois = np.zeros( (batch_size, rpn_rois.shape[0], 4), dtype=rpn_rois.dtype) if detection_targets: batch_rois = np.zeros( (batch_size,) + rois.shape, dtype=rois.dtype) batch_mrcnn_class_ids = np.zeros( (batch_size,) + mrcnn_class_ids.shape, dtype=mrcnn_class_ids.dtype) batch_mrcnn_bbox = np.zeros( (batch_size,) + mrcnn_bbox.shape, dtype=mrcnn_bbox.dtype) batch_mrcnn_mask = np.zeros( (batch_size,) + mrcnn_mask.shape, dtype=mrcnn_mask.dtype) #超過了config中instance的最大數量則進行刪減。 # If more instances than fits in the array, sub-sample from them. if gt_boxes.shape[0] > config.MAX_GT_INSTANCES: ids = np.random.choice( np.arange(gt_boxes.shape[0]), config.MAX_GT_INSTANCES, replace=False) gt_class_ids = gt_class_ids[ids] gt_boxes = gt_boxes[ids] gt_masks = gt_masks[:, :, ids] ##把每張圖片的信息添加到一個batch中,直到滿爲止 # Add to batch batch_image_meta[b] = image_meta batch_rpn_match[b] = rpn_match[:, np.newaxis] batch_rpn_bbox[b] = rpn_bbox batch_images[b] = mold_image(image.astype(np.float32), config) batch_gt_class_ids[b, :gt_class_ids.shape[0]] = gt_class_ids batch_gt_boxes[b, :gt_boxes.shape[0]] = gt_boxes batch_gt_masks[b, :, :, :gt_masks.shape[-1]] = gt_masks if random_rois: batch_rpn_rois[b] = rpn_rois if detection_targets: batch_rois[b] = rois batch_mrcnn_class_ids[b] = mrcnn_class_ids batch_mrcnn_bbox[b] = mrcnn_bbox batch_mrcnn_mask[b] = mrcnn_mask b += 1 # Batch full? if b >= batch_size: inputs = [batch_images, batch_image_meta, batch_rpn_match, batch_rpn_bbox, batch_gt_class_ids, batch_gt_boxes, batch_gt_masks] outputs = [] if random_rois: inputs.extend([batch_rpn_rois]) if detection_targets: inputs.extend([batch_rois]) # Keras requires that output and targets have the same number of dimensions batch_mrcnn_class_ids = np.expand_dims( batch_mrcnn_class_ids, -1) outputs.extend( [batch_mrcnn_class_ids, batch_mrcnn_bbox, batch_mrcnn_mask]) yield inputs, outputs # start a new batch b = 0 except (GeneratorExit, KeyboardInterrupt): raise except: # Log it and skip the image logging.exception("Error processing image {}".format( dataset.image_info[image_id])) error_count += 1 if error_count > 5: raise

         The next and most important step is building the Mask R-CNN model itself. As the paper makes clear, the training and inference graphs have to be built separately because they differ. It helps to follow the flowcharts above while reading this part.

def build(self, mode, config): """Build Mask R-CNN architecture. input_shape: The shape of the input image. mode: Either "training" or "inference". The inputs and outputs of the model differ accordingly. """ assert mode in ['training', 'inference'] # Image size must be dividable by 2 multiple times h, w = config.IMAGE_SHAPE[:2] if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6): raise Exception("Image size must be dividable by 2 at least 6 times " "to avoid fractions when downscaling and upscaling." "For example, use 256, 320, 384, 448, 512, ... etc. ") ##構建全部須要的輸入,而且都爲神經網絡的輸入,可用KL.INPUT來轉化 # Inputs input_image = KL.Input( shape=config.IMAGE_SHAPE.tolist(), name="input_image") input_image_meta = KL.Input(shape=[None], name="input_image_meta") if mode == "training": # RPN GT input_rpn_match = KL.Input( shape=[None, 1], name="input_rpn_match", dtype=tf.int32) input_rpn_bbox = KL.Input( shape=[None, 4], name="input_rpn_bbox", dtype=tf.float32) # Detection GT (class IDs, bounding boxes, and masks) # 1. GT Class IDs (zero padded) input_gt_class_ids = KL.Input( shape=[None], name="input_gt_class_ids", dtype=tf.int32) # 2. GT Boxes in pixels (zero padded) # [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)] in image coordinates input_gt_boxes = KL.Input( shape=[None, 4], name="input_gt_boxes", dtype=tf.float32) # Normalize coordinates h, w = K.shape(input_image)[1], K.shape(input_image)[2] image_scale = K.cast(K.stack([h, w, h, w], axis=0), tf.float32) gt_boxes = KL.Lambda(lambda x: x / image_scale)(input_gt_boxes) # 3. GT Masks (zero padded) # [batch, height, width, MAX_GT_INSTANCES] if config.USE_MINI_MASK: input_gt_masks = KL.Input( shape=[config.MINI_MASK_SHAPE[0], config.MINI_MASK_SHAPE[1], None], name="input_gt_masks", dtype=bool) else: input_gt_masks = KL.Input( shape=[config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None], name="input_gt_masks", dtype=bool) ##實現FPN的多層特徵融合 # Build the shared convolutional layers. # Bottom-up Layers # Returns a list of the last layers of each stage, 5 in total. # Don't create the thead (stage 5), so we pick the 4th item in the list. _, C2, C3, C4, C5 = resnet_graph(input_image, "resnet101", stage5=True) # Top-down Layers # TODO: add assert to varify feature map sizes match what's in config P5 = KL.Conv2D(256, (1, 1), name='fpn_c5p5')(C5) P4 = KL.Add(name="fpn_p4add")([ KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5), KL.Conv2D(256, (1, 1), name='fpn_c4p4')(C4)]) P3 = KL.Add(name="fpn_p3add")([ KL.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4), KL.Conv2D(256, (1, 1), name='fpn_c3p3')(C3)]) P2 = KL.Add(name="fpn_p2add")([ KL.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3), KL.Conv2D(256, (1, 1), name='fpn_c2p2')(C2)]) # Attach 3x3 conv to all P layers to get the final feature maps. P2 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p2")(P2) P3 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p3")(P3) P4 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p4")(P4) P5 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p5")(P5) # P6 is used for the 5th anchor scale in RPN. Generated by # subsampling from P5 with stride of 2. P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5) # Note that P6 is used in RPN, but not in the classifier heads. 
rpn_feature_maps = [P2, P3, P4, P5, P6] mrcnn_feature_maps = [P2, P3, P4, P5] # Generate Anchors self.anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES, config.RPN_ANCHOR_RATIOS, config.BACKBONE_SHAPES, config.BACKBONE_STRIDES, config.RPN_ANCHOR_STRIDE) #構建RPN 網絡,用來接受上一級的feature maps #BACKBONE_SHAPES:[N,2] # RPN Model :RPN_ANCHOR_STRIDE爲產生anchors的pixels,len(config.RPN_ANCHOR_RATIOS)爲每一個pixels產生anchors的數量 #256爲接受feature maps的channel rpn = build_rpn_model(config.RPN_ANCHOR_STRIDE, len(config.RPN_ANCHOR_RATIOS), 256) # Loop through pyramid layers layer_outputs = [] # list of lists for p in rpn_feature_maps: layer_outputs.append(rpn([p])) # Concatenate layer outputs # Convert from list of lists of level outputs to list of lists # of outputs across levels. # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]] output_names = ["rpn_class_logits", "rpn_class", "rpn_bbox"] outputs = list(zip(*layer_outputs)) outputs = [KL.Concatenate(axis=1, name=n)(list(o)) for o, n in zip(outputs, output_names)] ## rpn_class_logits, rpn_class, rpn_bbox = outputs ##利用proposal_layer來產生一系列的ROIS,輸入爲RPN網絡中獲得的輸出:rpn_class, rpn_bbox # Generate proposals # Proposals are [batch, N, (y1, x1, y2, x2)] in normalized coordinates # and zero padded. proposal_count = config.POST_NMS_ROIS_TRAINING if mode == "training"\ else config.POST_NMS_ROIS_INFERENCE rpn_rois = ProposalLayer(proposal_count=proposal_count, nms_threshold=config.RPN_NMS_THRESHOLD, name="ROI", anchors=self.anchors, config=config)([rpn_class, rpn_bbox]) if mode == "training": # Class ID mask to mark class IDs supported by the dataset the image # came from. #active_class_ids表示的是當前數據集下含有的class類別 _, _, _, active_class_ids = KL.Lambda(lambda x: parse_image_meta_graph(x), mask=[None, None, None, None])(input_image_meta) if not config.USE_RPN_ROIS: # Ignore predicted ROIs and use ROIs provided as an input. input_rois = KL.Input(shape=[config.POST_NMS_ROIS_TRAINING, 4], name="input_roi", dtype=np.int32) # Normalize coordinates to 0-1 range. target_rois = KL.Lambda(lambda x: K.cast( x, tf.float32) / image_scale[:4])(input_rois) else: target_rois = rpn_rois # Generate detection targets # Subsamples proposals and generates target outputs for training # Note that proposal class IDs, gt_boxes, and gt_masks are zero # padded. Equally, returned rois and targets are zero padded. 
rois, target_class_ids, target_bbox, target_mask =\ DetectionTargetLayer(config, name="proposal_targets")([ target_rois, input_gt_class_ids, gt_boxes, input_gt_masks]) # Network Heads # TODO: verify that this handles zero padded ROIs mrcnn_class_logits, mrcnn_class, mrcnn_bbox =\ fpn_classifier_graph(rois, mrcnn_feature_maps, config.IMAGE_SHAPE, config.POOL_SIZE, config.NUM_CLASSES) mrcnn_mask = build_fpn_mask_graph(rois, mrcnn_feature_maps, config.IMAGE_SHAPE, config.MASK_POOL_SIZE, config.NUM_CLASSES) # TODO: clean up (use tf.identify if necessary) output_rois = KL.Lambda(lambda x: x * 1, name="output_rois")(rois) # Losses rpn_class_loss = KL.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")( [input_rpn_match, rpn_class_logits]) rpn_bbox_loss = KL.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")( [input_rpn_bbox, input_rpn_match, rpn_bbox]) class_loss = KL.Lambda(lambda x: mrcnn_class_loss_graph(*x), name="mrcnn_class_loss")( [target_class_ids, mrcnn_class_logits, active_class_ids]) bbox_loss = KL.Lambda(lambda x: mrcnn_bbox_loss_graph(*x), name="mrcnn_bbox_loss")( [target_bbox, target_class_ids, mrcnn_bbox]) mask_loss = KL.Lambda(lambda x: mrcnn_mask_loss_graph(*x), name="mrcnn_mask_loss")( [target_mask, target_class_ids, mrcnn_mask]) # Model inputs = [input_image, input_image_meta, input_rpn_match, input_rpn_bbox, input_gt_class_ids, input_gt_boxes, input_gt_masks] if not config.USE_RPN_ROIS: inputs.append(input_rois) outputs = [rpn_class_logits, rpn_class, rpn_bbox, mrcnn_class_logits, mrcnn_class, mrcnn_bbox, mrcnn_mask, rpn_rois, output_rois, rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss] model = KM.Model(inputs, outputs, name='mask_rcnn') else: # Network Heads # Proposal classifier and BBox regressor heads mrcnn_class_logits, mrcnn_class, mrcnn_bbox =\ fpn_classifier_graph(rpn_rois, mrcnn_feature_maps, config.IMAGE_SHAPE, config.POOL_SIZE, config.NUM_CLASSES) # Detections # output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in image coordinates detections = DetectionLayer(config, name="mrcnn_detection")( [rpn_rois, mrcnn_class, mrcnn_bbox, input_image_meta]) # Convert boxes to normalized coordinates # TODO: let DetectionLayer return normalized coordinates to avoid # unnecessary conversions h, w = config.IMAGE_SHAPE[:2] detection_boxes = KL.Lambda( lambda x: x[..., :4] / np.array([h, w, h, w]))(detections) # Create masks for detections mrcnn_mask = build_fpn_mask_graph(detection_boxes, mrcnn_feature_maps, config.IMAGE_SHAPE, config.MASK_POOL_SIZE, config.NUM_CLASSES) model = KM.Model([input_image, input_image_meta], [detections, mrcnn_class, mrcnn_bbox, mrcnn_mask, rpn_rois, rpn_class, rpn_bbox], name='mask_rcnn') # Add multi-GPU support. if config.GPU_COUNT > 1: from parallel_model import ParallelModel model = ParallelModel(model, config.GPU_COUNT)

        Once the model is built, the remaining pieces — compiling it and writing the training functions — are fairly simple and easy to follow, so I won't paste them.

If you have read this far, you should have a fairly good grasp of the Mask R-CNN code's structure and many of its details. Personally, I had to read the code two or three times before it clicked — good things take time, as they say.

       Finally, here is my code for processing video. Processing a video simply means splitting it into frames, running the model on each frame, and stitching the results back into a video, which does take quite a lot of time.

    

import os import sys import random import math import numpy as np import skimage.io import matplotlib import matplotlib.pyplot as plt import cv2 import coco import utils import model as modellib import video_visualize %matplotlib inline # Root directory of the project ROOT_DIR = os.getcwd() # Directory to save logs and trained model MODEL_DIR = os.path.join(ROOT_DIR, "logs") # Local path to trained weights file COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5") # Download COCO trained weights from Releases if needed if not os.path.exists(COCO_MODEL_PATH): utils.download_trained_weights(COCO_MODEL_PATH) # Directory of images to run detection on IMAGE_DIR = os.path.join(ROOT_DIR, "images") class InferenceConfig(coco.CocoConfig):     # Set batch size to 1 since we'll be running inference on     # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU     GPU_COUNT = 1     IMAGES_PER_GPU = 1 config = InferenceConfig() config.display() # Create model object in inference mode. model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config) # Load weights trained on MS-COCO model.load_weights(COCO_MODEL_PATH, by_name=True) # COCO Class names # Index of the class in the list is its ID. For example, to get ID of # the teddy bear class, use: class_names.index('teddy bear') class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',                'bus', 'train', 'truck', 'boat', 'traffic light',                'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',                'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',                'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',                'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',                'kite', 'baseball bat', 'baseball glove', 'skateboard',                'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',                'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',                'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',                'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',                'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',                'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',                'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',                'teddy bear', 'hair drier', 'toothbrush'] #處理視頻須要用到的文件及其文件夾 input_path = os.path.join(ROOT_DIR, "luozx") frame_interval = 1 ##列出全部的視頻文件名字 filenames = os.listdir(input_path) ##獲得文件夾的名字 video_prefix = input_path.split(os.sep)[-1] frame_path = "{}_frame".format(input_path) if not os.path.exists(frame_path):     os.mkdir(frame_path)     #讀取圖片而且保存其每一幀 cap = cv2.VideoCapture() # for filename in filenames: for filename in filenames: # if 1 == 1: #     filename = 'huan.mp4'     filepath = os.sep.join([input_path, filename])     flag = cap.open(filepath)     assert flag == True     ##獲取視頻幀     n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))     #或者直接用n_frames = cap.     
print(n_frames)     if n_frames > 800:         n_frames = 800          #     for i in range(42): #         cap.read()     for i in range(n_frames):         ret, frame = cap.read()         #assert ret == True         if i % frame_interval == 0:             #存儲圖片的路徑及其名字             imagename = '{}_{}_{:0>6d}.jpg'.format(video_prefix, filename.split('.')[0], i)             imagepath = os.sep.join([frame_path, imagename])             print("export{}!".format(imagepath))             cv2.imwrite(imagepath, frame)          fps = cap.get(5) cap.release() ##處理視頻中的每一幀圖片而且進行保留  for i in range(n_frames):     #find the direction of the images     imagename = '{}_{}_{:0>6d}.jpg'.format(video_prefix, filename.split('.')[0], i)     imagepath = os.sep.join([frame_path, imagename])     print(imagepath)     #load the image     image = skimage.io.imread(imagepath)          # Run detection     results = model.detect([image], verbose=1)     r = results[0]          # save the dealed image     video_visualize.save_dealed_image(filename, video_prefix, i, image, r['rois'], r['masks'], r['class_ids'],                       class_names, r['scores'], title="",                       figsize=(16, 16), ax=None) ##其中video_visaulize.save_dealed_imag函數就是把display_instance()函數小小的改動了一下,存儲了一下處理完後的相片。 ##把處理完的圖像進行視頻合成 #把處理好的每一幀再合成視頻 import os import cv2 import skimage.io fps = 22 n_frames = 200 ROOT = os.getcwd() save_path = os.path.join(ROOT,"save_images") fourcc = cv2.VideoWriter_fourcc(*'MJPG') #get the width and height of processed image imagepath = "/home/xiongliang/python/python_project/Mask_RCNN/save_images/luozx_promise_000001.jpg" image = skimage.io.imread(imagepath) width, height, _ = image.shape videoWriter = cv2.VideoWriter("save_video.mp4", fourcc, fps, (width, height)) video_prefix = "luozx" filename = "promise.mp4" for i in range(int(n_frames)):     imagename = '{}_{}_{:0>6d}.jpg'.format(video_prefix, filename.split('.')[0], i)     imagepath = os.sep.join([save_path, imagename])     frame = cv2.imread(imagepath)     videoWriter.write(frame) videoWriter.release() ###對視頻進行播放 ROOT = os.getcwd() path = os.path.join(ROOT, "save_video.mp4") cap = cv2.VideoCapture(path) assert cap.isOpened() == True while(cap.isOpened()):     ret, frame = cap.read()     cv2.imshow('frame',frame)     if cv2.waitKey(1) & 0xFF == ord('q'):  # 適當調整等待時間         break