This post covers "MobiFace: A Lightweight Deep Learning Face Recognition on Mobile Devices" (November 2018). The authors are from CMU and the University of Arkansas (UARK).
With the spread of DCNNs, fields such as object detection and segmentation have made considerable progress, but the high accuracy comes at the cost of huge parameter counts and heavy computation: AlexNet needs 61M parameters, VGG16 138M, ResNet-50 25M, and DenseNet-190 (k=40) 40M. Although these networks no longer look very deep by today's standards, they still require on the order of 200MB to 500MB of memory, so such models usually cannot be deployed on mobile or embedded devices. Consequently, many compressed models have recently been proposed for image classification and object detection, such as pruning [13,14,32], depthwise convolution [18,38], binary networks [3,4,22,36], and mimic networks [31,44]. These networks accelerate inference without losing much accuracy, but they have not been applied to face recognition. Compared with object detection and classification, face recognition usually requires a certain number of layers to extract sufficiently robust and discriminative facial features, since every face follows the same template (two eyes, one mouth).
The authors propose a lightweight yet high-performance deep neural network so that face recognition can be deployed on mobile devices. Compared with other networks, MobiNet has the following advantages:
- It makes the MobileNet architecture even lighter, and the proposed MobiNet model can be readily deployed on mobile devices;
- The proposed MobiNet can be optimized end-to-end;
- MobiNet is compared against mobile-oriented networks and large-scale deep networks on face recognition datasets.
To date, quite a few lightweight network designs have been proposed, including binarized networks, quantized networks, mimicked networks, designed compact modules, and pruned networks. This post focuses on the last two.
Designed compact modules
By combining small models or compact modules and layers, the number of weights can be reduced, which helps cut memory usage and inference-time cost. MobileNet replaces the traditional convolutional layer with a depthwise separable convolution module, significantly reducing the parameter count. The depthwise convolution operation first appeared in Sifre's work [41] and was later used in [2,18,38]. In MobileNet [18], the spatial input is convolved with a 3x3 per-channel (depthwise) filter to produce independent features, followed by a pointwise (1x1) convolution that generates new features. Replacing traditional convolution with this strategy leaves MobileNet with only 4.2M parameters and 569M MAdds, achieving 70.6% on ImageNet (VGG16 achieves 71.5%). To improve MobileNet's performance across tasks and benchmarks, Sandler et al. proposed inverted residuals and linear bottlenecks, i.e. MobileNet-v2. The inverted residual is similar to the residual bottleneck in [16], except that the intermediate features can be expanded to a chosen ratio of the number of input channels; the linear bottleneck is a block without a ReLU layer. MobileNet-v2 raises the accuracy to 72% while needing only 3.4M parameters and 300M MAdds. Although depthwise separable convolution has proven effective, [18,38] still consume considerable memory and compute on iPhone and Android devices, and at the time of publication the authors found no framework (TensorFlow, PyTorch, Caffe, MXNet) with an efficient CPU implementation of depthwise convolution. To reduce MobileNet's computation, FD-MobileNet introduces a fast downsampling strategy. Inspired by the MobileNet-v2 structure, MobileFaceNet adopts a similar architecture and reduces parameters by replacing the global average pooling layer with a global depthwise convolution layer.
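The parameter savings of depthwise separable convolution can be checked with a quick back-of-the-envelope calculation (a sketch; the 3x3, 256-to-256-channel layer below is illustrative, not a layer from the paper):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3x3 convolution mapping 256 -> 256 channels.
standard = conv_params(3, 256, 256)           # 589,824 weights
separable = dw_separable_params(3, 256, 256)  # 67,840 weights
print(standard, separable, round(standard / separable, 1))  # ~8.7x fewer weights
```

The same split also reduces MAdds by roughly the same factor, which is where MobileNet's 4.2M-parameter budget comes from.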
Pruned networks
DNNs have long suffered from huge parameter counts and heavy memory consumption. [14] prunes unimportant connections by weight magnitude, reducing the number of parameters by 9x on AlexNet and 13x on VGG16 with little accuracy loss. [32] slims networks using the scaling factors in BN layers (rather than weight magnitudes); these scaling factors are trained toward sparsity with an L1 penalty. On CIFAR, the slimmed VGG16, DenseNet, and ResNet of [32] even achieve better accuracy than the original networks. However, the index of every pruned connection must be kept in memory, which slows down both training and testing.
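The channel-selection step of network slimming [32] can be sketched in a few lines: after training with an L1 penalty on the BN scaling factors, the channels with the smallest factors are pruned globally (a minimal sketch; the gamma values and prune fraction below are illustrative):

```python
def slim_channels(gammas, prune_fraction):
    """Return indices of channels to KEEP, pruning the smallest |gamma| values.

    gammas: per-channel BN scaling factors learned with an L1 penalty.
    prune_fraction: fraction of channels to remove globally.
    """
    order = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    n_prune = int(len(gammas) * prune_fraction)
    pruned = set(order[:n_prune])
    return [i for i in range(len(gammas)) if i not in pruned]

# Illustrative factors: near-zero gammas mark unimportant channels.
gammas = [0.9, 0.01, 0.5, 0.002, 0.7, 0.03]
print(slim_channels(gammas, prune_fraction=0.5))  # keeps channels 0, 2, 4
```

Because whole channels are removed rather than individual weights, the resulting network stays dense and needs no per-connection index bookkeeping.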
Bottleneck Residual block with the expansion layers
[37] introduces the bottleneck residual block, which consists of three main transformations, namely two linear transformations and one nonlinear per-channel transformation:
- the nonlinear transformation learns complex mapping functions;
- the number of feature maps is increased in the inner layers;
- residuals are learned via a shortcut connection.
Given an input \(\mathbf{x}\) of size \(h\times w\times k\), a bottleneck residual block can be written as:
\[F(\mathbf{x})=[F_1\cdot F_2 \cdot F_3](\mathbf{x})\]
where \(F_1:R^{w\times h\times k}\mapsto R^{w\times h\times tk}\) and \(F_3:R^{\frac{w}{s}\times \frac{h}{s}\times tk}\mapsto R^{\frac{w}{s}\times \frac{h}{s}\times k_1}\) are linear functions implemented by 1x1 convolutions, with t the expansion factor, and \(F_2:R^{w\times h \times tk}\mapsto R^{\frac{w}{s}\times \frac{h}{s}\times tk}\) is the nonlinear mapping, implemented as the composition of three operations: ReLU, a 3x3 depthwise convolution with stride s, and ReLU. Given these domains, \(F_1\) is applied first.
A residual (shortcut) connection is adopted in the bottleneck block to prevent manifold collapse during the transformation and to increase the representational power of the feature embedding [37].
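The shape bookkeeping in the decomposition above can be traced with a tiny helper (a sketch following the notation of the formula; the 56x56x64 example input is illustrative):

```python
def bottleneck_shapes(h, w, k, t, s, k1):
    """Trace tensor shapes through F1 (1x1 expansion), F2 (3x3 depthwise
    convolution with stride s), and F3 (1x1 projection) of a bottleneck
    residual block."""
    f1 = (h, w, t * k)            # expansion by factor t
    f2 = (h // s, w // s, t * k)  # depthwise conv downsamples spatially
    f3 = (h // s, w // s, k1)     # linear projection to k1 output channels
    return f1, f2, f3

# Example: 56x56x64 input, expansion t=2, stride s=2, 128 output channels.
print(bottleneck_shapes(56, 56, 64, t=2, s=2, k1=128))
# ((56, 56, 128), (28, 28, 128), (28, 28, 128))
```

When s=1 and k1=k, the input and output shapes match and a shortcut connection can be added, which is exactly the residual variant described below.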
Fast downsampling
Under limited computational resources, a compact network should maximize the information transferred from the input image to the output features while avoiding expensive computation such as feature maps with large spatial dimensions (resolution). In large-scale deep networks, the information flow relies on a slow downsampling strategy, i.e. the spatial dimensions shrink gradually from layer to layer. A lightweight network cannot afford this.
Fast downsampling means applying downsampling steps consecutively at the very beginning of the feature embedding stage, so that large feature-map spatial dimensions are avoided, and then adding more feature maps in the later stages to keep the information flowing. Note that although adding more feature maps increases the number of channels, the extra computational cost is small because by then the feature-map resolution is already low.
The MobiFace network: given an input face image of size 112x112x3, this lightweight network aims to maximize the information flow while reducing computation. Based on the analysis above, the Residual Bottleneck block with expansion layers serves as MobiFace's building block. Table 1 gives the main structure of MobiFace.
MobiFace consists of:
- a 3x3 convolutional layer;
- a 3x3 depthwise separable convolutional layer;
- a series of bottleneck blocks and residual bottleneck blocks;
- a 1x1 convolutional layer;
- a fully connected layer.
The residual bottleneck block is very similar to the bottleneck block, except that it adds a shortcut connecting the input and the output of the 1x1 convolutional layers. Moreover, the bottleneck block uses stride 2, whereas every layer in the residual bottleneck block uses stride 1.
By adopting the fast downsampling strategy, MobiFace rapidly reduces the spatial dimensions of its layers/blocks: the 112x112x3 input is halved within the first two layers and shrunk by a further 8x across the following 7 bottleneck blocks. The expansion factor is kept at 2, while the number of channels doubles after each bottleneck block.
Except for the convolutional layers marked "linear", BN and a nonlinear activation are applied after every convolutional layer; this work mainly uses PReLU rather than ReLU. In the last layer of MobiFace, a fully connected layer is adopted instead of global average pooling. Global average pooling treats every neuron identically, even though neurons in the central region matter more than those near the edges; an FC layer can learn different weights for different neurons and thus embed this information into the final feature vector.
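The difference between the two heads can be shown in a few lines: global average pooling weights every spatial position by the same constant, whereas a fully connected layer learns a separate weight for each position (a minimal numpy sketch; the 7x7x512 feature map and 128-d output are illustrative, not MobiFace's actual sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((7, 7, 512))  # final feature map (illustrative)

# Global average pooling: every spatial position contributes equally (1/49).
gap_embedding = feat.mean(axis=(0, 1))                # shape (512,)

# FC layer: one learnable weight per (position, channel, output) triple,
# so central positions can end up weighted more heavily than border ones.
fc_weights = rng.standard_normal((7 * 7 * 512, 128))  # illustrative 128-d output
fc_embedding = feat.reshape(-1) @ fc_weights          # shape (128,)

print(gap_embedding.shape, fc_embedding.shape)
```

The price of the FC head is its weight matrix, which is why MobileFaceNet instead used a global depthwise convolution for the same purpose.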
Training is performed on the refined MS-Celeb-1M dataset (3.8M images, 85K identities), and the results are then evaluated on LFW and MegaFace.
In the preprocessing stage, the MTCNN model is used for face detection and five-landmark localization; the face is then aligned and cropped to 112x112x3, and normalized by subtracting 127.5 and dividing by 128. Training uses SGD with a batch size of 1024 and momentum 0.9; the learning rate is divided by 10 at 40K, 60K, and 80K iterations, for 100K iterations in total.
Table 2 gives the benchmark results on LFW.
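The normalization and step schedule above can be summarized in two small helpers (a sketch; the base learning rate is a hypothetical placeholder, since the post does not state it):

```python
def normalize_pixel(p):
    """Map an 8-bit pixel value to roughly [-1, 1]: (p - 127.5) / 128."""
    return (p - 127.5) / 128.0

def learning_rate(step, base_lr=0.1):
    """Divide the learning rate by 10 at 40K, 60K, and 80K iterations.

    base_lr is hypothetical; the post only gives the drop schedule.
    """
    drops = sum(step >= boundary for boundary in (40_000, 60_000, 80_000))
    return base_lr * (0.1 ** drops)

print(normalize_pixel(0), normalize_pixel(255))  # ~-1.0 and ~1.0
print(learning_rate(50_000))                     # base_lr / 10
```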
References:
[1] S. Chen, Y. Liu, X. Gao, and Z. Han. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. arXiv preprint arXiv:1804.07573, 2018.
[2] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807. IEEE Computer Society, 2017.
[3] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
[4] M. Courbariaux, Y. Bengio, and J. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, pages 3123–3131, 2015.
[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[6] C. N. Duong, K. Luu, K. Quach, and T. Bui. Beyond principal components: Deep boltzmann machines for face modeling. In CVPR, 2015.
[7] C. N. Duong, K. Luu, K. Quach, and T. Bui. Longitudinal face modeling via temporal deep restricted boltzmann machines. In CVPR, 2016.
[8] C. N. Duong, K. Luu, K. Quach, and T. Bui. Deep appearance models: A deep boltzmann machine approach for face modeling. Intl Journal of Computer Vision (IJCV), 2018.
[9] C. N. Duong, K. G. Quach, K. Luu, T. H. N. Le, and M. Savvides. Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition. In ICCV, 2017.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. 2014.
[11] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pages 87–102. Springer, 2016.
[12] M. S. H. N. Le, R. Gummadi. Deep recurrent level set for segmenting brain tumors. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 646–653. Springer, 2018.
[13] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
[14] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
[15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.
[21] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
[23] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448–456. JMLR.org, 2015.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678. ACM, 2014.
[25] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4873–4882, 2016.
[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[28] H. N. Le, C. N. Duong, K. Luu, and M. Savvides. Deep contextual recurrent residual networks for scene labeling. In Journal of Pattern Recognition, 2018.
[29] H. N. Le, K. G. Quach, K. Luu, and M. Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. In Trans. on Image Processing (TIP), 2018.
[30] H. N. Le, C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Robust hand detection in vehicles. In Intl. Conf. on Pattern Recognition (ICPR), 2016.
[31] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. 2017 IEEE Conference on CVPR, pages 7341–7349, 2017.
[32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR.
[34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[35] Z. Qin, Z. Zhang, X. Chen, C. Wang, and Y. Peng. Fd-mobilenet: Improved mobilenet with a fast downsampling strategy. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1363–1367. IEEE, 2018.
[36] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 525–542. Springer, 2016.
[37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[38] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
[39] M. W. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, 2007.
[40] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[41] L. Sifre. Rigid-motion scattering for image classification, 2014.
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
[43] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[44] Y. Wei, X. Pan, H. Qin, and J. Yan. Quantization mimic: Towards very tiny cnn for object detection. CoRR, abs/1805.02152, 2018.
[45] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884–2896, 2018.
[46] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[47] Y. Zheng, C. Zhu, K. Luu, H. N. Le, C. Bhagavatula, and M. Savvides. Towards a deep learning framework for unconstrained face detection. In BTAS, 2016.
[48] C. Zhu, Y. Ran, K. Luu, and M. Savvides. Seeing small faces from robust anchor's perspective. In CVPR, 2018.
[49] C. Zhu, Y. Zheng, K. Luu, H. N. Le, C. Bhagavatula, and M. Savvides. Weakly supervised facial analysis with dense hyper-column features. In IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2016.
[50] C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Enhancing interior and exterior deep facial features for face detection in the wild. In Intl Conf. on Automatic Face and Gesture Recognition (FG), 2018.