CV: Translation and Interpretation of the 2019 Survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapters 1-3


Overview: In the field of artificial intelligence, a historical review of and observations on the latest computer vision literature, and a survey of the recent architectures of deep convolutional neural networks.

 

Original authors

Asifullah Khan1, 2*, Anabia Sohail1, 2, Umme Zahoora1, and Aqsa Saeed Qureshi1
1 Pattern Recognition Lab, DCIS, PIEAS, Nilore, Islamabad 45650, Pakistan
2 Deep Learning Lab, Center for Mathematical Sciences, PIEAS, Nilore, Islamabad 45650, Pakistan
asif@pieas.edu.pk

 

 

Contents

Abstract

1 Introduction

2 Basic CNN Components

2.1 Convolutional Layer

2.2 Pooling Layer

2.3 Activation Function

2.4 Batch Normalization

2.5 Dropout

2.6 Fully Connected Layer

3 Architectural Evolution of Deep CNN

3.1 Late 1980s-1999: Origin of CNN

3.2 Early 2000: Stagnation of CNN

3.3 2006-2011: Revival of CNN

3.4 2012-2014: Rise of CNN

3.5 2015-Present: Rapid increase in Architectural Innovations and Applications of CNN


 

 

Abstract

        Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art performance on various competitive benchmarks. The powerful learning ability of deep CNN is largely due to the use of multiple feature extraction stages (hidden layers) that can automatically learn representations from the data. The availability of a large amount of data and improvements in hardware processing units have accelerated the research in CNNs, and recently very interesting deep CNN architectures have been reported. The recent race in developing deep CNNs shows that innovative architectural ideas, as well as parameter optimization, can improve CNN performance. In this regard, different ideas in CNN design have been explored, such as the use of different activation and loss functions, parameter optimization, regularization, and restructuring of the processing units. However, the major improvement in the representational capacity of the deep CNN is achieved by the restructuring of the processing units. Especially, the idea of using a block as a structural unit instead of a layer is receiving substantial attention. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and, consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention. Additionally, this survey also covers the elementary understanding of CNN components and sheds light on its current challenges and applications.


Keywords: Deep Learning, Convolutional Neural Networks, Architecture, Representational Capacity, Residual Learning, and Channel Boosted CNN.


 

1 Introduction

        Machine Learning (ML) algorithms belong to a specialized area in Artificial Intelligence (AI), which endows intelligence to computers by learning the underlying relationships among the data and making decisions without being explicitly programmed. Different ML algorithms have been developed since the late 1990s, for the emulation of human sensory responses such as speech and vision, but they have generally failed to achieve human-level satisfaction [1]–[6]. The challenging nature of Machine Vision (MV) tasks gives rise to a specialized class of Neural Networks (NN), known as Convolutional Neural Network (CNN) [7].


     CNNs are considered as one of the best techniques for learning image content and have shown state-of-the-art results on image recognition, segmentation, detection, and retrieval related tasks [8], [9]. The success of CNN has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have developed active research groups for exploring new architectures of CNN [10]. At present, most of the frontrunners of image processing competitions are employing deep CNN based models.

The topology of CNN is divided into multiple learning stages composed of a combination of the convolutional layer, non-linear processing units, and subsampling layers [11]. Each layer performs multiple transformations using a bank of convolutional kernels (filters) [12]. The convolution operation extracts locally correlated features by dividing the image into small slices (similar to the retina of the human eye), making it capable of learning suitable features. The output of the convolutional kernels is assigned to non-linear processing units, which not only helps in learning abstractions but also embeds non-linearity in the feature space. This non-linearity generates different patterns of activations for different responses and thus facilitates learning of semantic differences in images. The output of the non-linear function is usually followed by subsampling, which helps in summarizing the results and also makes the input invariant to geometrical distortions [12], [13].
The architectural design of CNN was inspired by Hubel and Wiesel's work and thus largely follows the basic structure of the primate's visual cortex [14], [15]. CNN first came to the limelight through the work of LeCun in 1989 for the processing of grid-like topological data (images and time series data) [7], [16]. The popularity of CNN is largely due to its hierarchical feature extraction ability. The hierarchical organization of CNN emulates the deep and layered learning process of the Neocortex in the human brain, which automatically extracts features from the underlying data [17]. The staging of the learning process in CNN shows a close resemblance with the primate's ventral pathway of the visual cortex (V1-V2-V4-IT/VTC) [18]. The visual cortex of primates first receives input from the retinotopic area, where multi-scale highpass filtering and contrast normalization are performed by the lateral geniculate nucleus. After this, detection is performed by different regions of the visual cortex categorized as V1, V2, V3, and V4. In fact, the V1 and V2 portions of the visual cortex are similar to the convolutional and subsampling layers, whereas the inferior temporal region resembles the higher layers of CNN, which make inferences about the image [19]. During training, CNN learns through the backpropagation algorithm, by regulating the change in weights with respect to the input. Minimization of a cost function by CNN using the backpropagation algorithm is similar to the response-based learning of the human brain. CNN has the ability to extract low, mid, and high-level features. High-level features (more abstract features) are a combination of lower and mid-level features. With its automatic feature extraction ability, CNN reduces the need for synthesizing a separate feature extractor [20]. Thus, CNN can learn good internal representations from raw pixels with minimal processing.
The main boom in the use of CNN for image classification and segmentation occurred after it was observed that the representational capacity of a CNN can be enhanced by increasing its depth [21]. Deep architectures have an advantage over shallow architectures when dealing with complex learning problems. Stacking multiple linear and non-linear processing units in a layer-wise fashion provides deep networks the ability to learn complex representations at different levels of abstraction. In addition, advancements in hardware, and thus the availability of high computing resources, is also one of the main reasons for the recent success of deep CNNs. Deep CNN architectures have shown significant performance improvements over shallow and conventional vision-based models. Apart from their use in supervised learning, deep CNNs have the potential to learn useful representations from large-scale unlabeled data. The use of multiple mapping functions by CNN enables it to improve the extraction of invariant representations and, consequently, makes it capable of handling recognition tasks of hundreds of categories. Recently, it has been shown that different levels of features, including both low and high-level, can be transferred to a generic recognition task by exploiting the concept of Transfer Learning (TL) [22]–[24]. Important attributes of CNN are hierarchical learning, automatic feature extraction, multi-tasking, and weight sharing [25]–[27].
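In practice, the transfer learning mentioned above usually amounts to reusing the feature-extraction layers of an already trained CNN and retraining only a new task-specific head. The following minimal PyTorch sketch illustrates the idea; the backbone module, the 512-dimensional feature size, and the 5-class head are hypothetical placeholders for illustration, not something prescribed by the survey.

    import torch
    import torch.nn as nn

    def build_transfer_model(backbone, feature_dim=512, num_classes=5):
        """Freeze the pretrained feature extractor and attach a fresh classifier head."""
        for param in backbone.parameters():
            param.requires_grad = False          # reuse the transferred low/mid-level features as-is
        return nn.Sequential(
            backbone,                            # pretrained representation (assumed)
            nn.Flatten(),
            nn.Linear(feature_dim, num_classes), # only this layer is trained on the new task
        )

    # toy stand-in for a pretrained backbone, just to make the sketch runnable
    toy_backbone = nn.Sequential(nn.Conv2d(3, 512, kernel_size=3, padding=1),
                                 nn.AdaptiveAvgPool2d(1))
    model = build_transfer_model(toy_backbone)
    logits = model(torch.randn(2, 3, 32, 32))    # output shape: (2, 5)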

        Various improvements in the CNN learning strategy and architecture were performed to make CNN scalable to large and complex problems. These innovations can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that CNN based applications became prevalent after the exemplary performance of AlexNet on the ImageNet dataset [21]. Thus, major innovations in CNN have been proposed since 2012 and were mainly due to the restructuring of processing units and the designing of new blocks. Similarly, Zeiler and Fergus [28] introduced the concept of layer-wise visualization of features, which shifted the trend towards extraction of features at low spatial resolution in deep architectures such as VGG [29]. Nowadays, most of the new architectures are built upon the principle of simple and homogenous topology introduced by VGG. On the other hand, the Google group introduced an interesting idea of split, transform, and merge, and the corresponding block is known as the inception block. The inception block for the very first time gave the concept of branching within a layer, which allows abstraction of features at different spatial scales [30]. In 2015, the concept of skip connections introduced by ResNet [31] for the training of deep CNNs became popular, and afterwards, this concept was used by most of the succeeding networks, such as Inception-ResNet, WideResNet, ResNeXt, etc. [32]–[34].

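To make the skip-connection idea concrete, the sketch below shows a minimal residual block in the spirit of ResNet, written in PyTorch. Only the identity-shortcut case is shown and the channel count is arbitrary; the actual architectures also use projection shortcuts, bottleneck designs, and other details.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = F(x) + x: the block learns a residual mapping F(x), and the skip
        connection gives gradients a direct path to earlier layers."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)            # skip (shortcut) connection

    y = ResidualBlock(64)(torch.randn(1, 64, 16, 16))   # output shape equals input shape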

        In order to improve the learning capacity of a CNN, different architectural designs such as WideResNet, Pyramidal Net, Xception etc. explored the effect of multilevel transformations in terms of an additional cardinality and increase in width [32], [34], [35]. Therefore, the focus of research shifted from parameter optimization and connections readjustment towards improved architectural design (layer structure) of the network. This shift resulted in many new architectural ideas such as channel boosting, spatial and channel wise exploitation and attention based information processing etc. [36]–[38].

In the past few years, different interesting surveys have been conducted on deep CNNs that elaborate on the basic components of CNN and their alternatives. The survey reported in [39] reviewed the famous architectures from 2012-2015 along with their components. Similarly, in the literature, there are prominent surveys that discuss different algorithms of CNN and focus on applications of CNN [20], [26], [27], [40], [41]. Likewise, the survey presented in [42] discussed a taxonomy of CNNs based on acceleration techniques. On the other hand, in this survey, we discuss the intrinsic taxonomy present in the recent and prominent CNN architectures. The various CNN architectures discussed in this survey can be broadly classified into seven main categories, namely: spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention based CNNs. The rest of the paper is organized in the following order (shown in Fig. 1): Section 1 summarizes the underlying basics of CNN, its resemblance with the primate's visual cortex, as well as its contribution to MV. In this regard, Section 2 provides an overview of basic CNN components and Section 3 discusses the architectural evolution of deep CNNs. Section 4 discusses the recent innovations in CNN architectures and categorizes CNNs into seven broad classes. Sections 5 and 6 shed light on applications of CNNs and current challenges, whereas Section 7 discusses future work and the last section draws conclusions.

                                                                  Fig. 1: Organization of the survey paper.

 

2 Basic CNN Components

        Nowadays, CNN is considered the most widely used ML technique, especially in vision related applications. CNNs have recently shown state-of-the-art results in various ML applications. A typical block diagram of an ML system is shown in Fig. 2. Since CNN possesses both good feature extraction and strong discrimination ability, in an ML system it is mostly used for feature extraction and classification.


A typical CNN architecture generally comprises alternating layers of convolution and pooling followed by one or more fully connected layers at the end. In some cases, the fully connected layer is replaced with a global average pooling layer. In addition to the various learning stages, different regulatory units such as batch normalization and dropout are also incorporated to optimize CNN performance [43]. The arrangement of CNN components plays a fundamental role in designing new architectures and thus achieving enhanced performance. This section briefly discusses the role of these components in CNN architecture.

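As a concrete illustration of this layout, the PyTorch sketch below alternates convolution and pooling, adds batch normalization and ReLU after each convolution, uses global average pooling in place of a large fully connected stack, and includes dropout as a regulatory unit. The channel sizes, input resolution, and class count are arbitrary choices for the example, not values taken from the survey.

    import torch
    import torch.nn as nn

    # A minimal "convolution -> pooling" stack with regulatory units, assuming
    # 3-channel 32x32 inputs and 10 output classes purely for illustration.
    typical_cnn = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),
        nn.BatchNorm2d(32),
        nn.ReLU(),
        nn.MaxPool2d(2),                  # 32x32 -> 16x16

        nn.Conv2d(32, 64, kernel_size=3, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(),
        nn.MaxPool2d(2),                  # 16x16 -> 8x8

        nn.AdaptiveAvgPool2d(1),          # global average pooling instead of large FC layers
        nn.Flatten(),
        nn.Dropout(p=0.5),
        nn.Linear(64, 10),                # fully connected classifier
    )

    logits = typical_cnn(torch.randn(4, 3, 32, 32))   # output shape: (4, 10)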

 

2.1 Convolutional Layer

        The convolutional layer is composed of a set of convolutional kernels (each neuron acts as a kernel). These kernels are associated with a small area of the image known as a receptive field. The layer works by dividing the image into small blocks (receptive fields) and convolving them with a specific set of weights (multiplying elements of the filter with the corresponding receptive field elements) [43]. The convolution operation can be expressed as follows:

                                           F_l^k = I_{x,y} * K_l^k                                           (1)

where the input image is represented by I_{x,y}, (x, y) denotes the spatial locality, and K_l^k represents the lth convolutional kernel of the kth layer. Division of the image into small blocks helps in extracting locally correlated pixel values. This locally aggregated information is also known as a feature motif. Different sets of features within the image are extracted by sliding the convolutional kernel over the whole image with the same set of weights. This weight-sharing feature of the convolution operation makes CNN parameter-efficient as compared to fully connected networks. The convolution operation may further be categorized into different types based on the type and size of the filters, the type of padding, and the direction of convolution [44]. Additionally, if the kernel is symmetric, the convolution operation becomes a correlation operation [16].
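To make equation (1) concrete, the following is a minimal NumPy sketch of the convolution (strictly, correlation) operation for a single-channel image and a single kernel, using valid padding and stride 1. The image size, kernel, and variable names are illustrative assumptions; multi-channel inputs, padding, and striding are omitted for brevity.

    import numpy as np

    def conv2d_single(image, kernel):
        """Slide one kernel over one 2-D image (valid padding, stride 1) and
        multiply-accumulate over each receptive field, a toy form of equation (1)."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                receptive_field = image[y:y + kh, x:x + kw]
                out[y, x] = np.sum(receptive_field * kernel)   # same (shared) weights at every location
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 single-channel image
    kernel = np.ones((3, 3)) / 9.0                     # simple averaging filter
    feature_map = conv2d_single(image, kernel)         # result: 3x3 feature map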

 

2.2 Pooling Layer

        Feature motifs, which result as an output of the convolution operation, can occur at different locations in the image. Once a feature is extracted, its exact location becomes less important as long as its approximate position relative to others is preserved. Pooling or downsampling, like convolution, is an interesting local operation. It sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region [45].

                                           Z_l = f_p(F_{x,y}^l)                                           (2)

Equation (2) shows the pooling operation, in which Z_l represents the lth output feature map, F_{x,y}^l shows the lth input feature map, and f_p(.) defines the type of pooling operation. The use of the pooling operation helps to extract a combination of features which are invariant to translational shifts and small distortions [13], [46]. Reduction of the size of the feature map to an invariant feature set not only regulates the complexity of the network but also helps in increasing generalization by reducing overfitting. Different types of pooling formulations such as max, average, L2, overlapping, spatial pyramid pooling, etc. are used in CNN [47]–[49].
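As an illustration of equation (2), the NumPy sketch below implements non-overlapping 2x2 max pooling (i.e., f_p = max); the input size and names are assumed for the example, and replacing np.max with np.mean would give average pooling instead.

    import numpy as np

    def max_pool2d(feature_map, size=2):
        """Summarize each non-overlapping size x size neighborhood by its
        dominant (maximum) response, a toy form of equation (2)."""
        H, W = feature_map.shape
        H_out, W_out = H // size, W // size
        pooled = np.zeros((H_out, W_out))
        for y in range(H_out):
            for x in range(W_out):
                window = feature_map[y * size:(y + 1) * size, x * size:(x + 1) * size]
                pooled[y, x] = np.max(window)   # np.mean(window) would give average pooling
        return pooled

    F = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map
    Z = max_pool2d(F)                              # 2x2 pooled map, tolerant to small shifts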

 

2.3 Activation Function

        Activation function serves as a decision function and helps in learning a complex pattern. Selection of an appropriate activation function can accelerate the learning process. Activation function for a convolved feature map is defined in equation (3).

                                           T_l^k = f_A(F_l^k)                                           (3)

In the above equation, F_l^k is the output of a convolution operation, which is assigned to the activation function f_A(.) that adds non-linearity and returns a transformed output T_l^k for the kth layer. In the literature, different activation functions such as sigmoid, tanh, maxout, ReLU, and variants of ReLU such as leaky ReLU, ELU, and PReLU [39], [48], [50], [51] are used to inculcate non-linear combinations of features. However, ReLU and its variants are preferred over other activations as they help in overcoming the vanishing gradient problem [52], [53].
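A few of the activation functions mentioned above are one-liners in NumPy; the sketch below applies ReLU and two of its variants to sample convolution outputs as in equation (3). The slope and alpha values are common defaults, chosen here only for illustration.

    import numpy as np

    def relu(F):
        return np.maximum(0.0, F)                     # max(0, F)

    def leaky_relu(F, slope=0.01):
        return np.where(F > 0, F, slope * F)          # small response for negative inputs

    def elu(F, alpha=1.0):
        return np.where(F > 0, F, alpha * (np.exp(F) - 1.0))

    F = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])         # sample convolved feature-map values
    T = relu(F)                                       # result: [0.0, 0.0, 0.0, 1.5, 3.0]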
         Fig. 2: Basic layout of a typical ML system. In ML related tasks, initially data is preprocessed and then assigned to a classification system. A typical ML problem follows three steps: stage 1 is related to data gathering and generation, stage 2 performs preprocessing and feature selection, whereas stage 3 is based on model selection, parameter tuning, and analysis. CNN has good feature extraction and strong discrimination ability; therefore, in an ML system it can be used for feature extraction and classification.

 

 

2.4 Batch Normalization

Note: In the blogger's experience, this part is a frequent exam point!

        Batch normalization is used to address the issues related to internal covariance shift within feature maps. The internal covariance shift is a change in the distribution of hidden units' values, which slows down the convergence (by forcing the learning rate to a small value) and requires careful initialization of parameters. Batch normalization for a transformed feature map F_l^k is shown in equation (4).

                                           N_l^k = (F_l^k - μ_B) / √(σ_B² + ε)                                           (4)

In equation (4), N_l^k represents the normalized feature map, F_l^k is the input feature map, and μ_B and σ_B² depict the mean and variance of a feature map for a mini-batch, respectively (ε is a small constant added for numerical stability). Batch normalization unifies the distribution of feature map values by bringing them to zero mean and unit variance [54]. Furthermore, it smoothens the flow of gradients and acts as a regulating factor, which thus helps in improving the generalization of the network.
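The normalization of equation (4) can be sketched in NumPy as follows. The learnable scale/shift parameters (gamma, beta) and the small constant eps follow the standard batch-normalization formulation and are assumptions added for completeness.

    import numpy as np

    def batch_norm(F, gamma=1.0, beta=0.0, eps=1e-5):
        """Bring a mini-batch of feature-map values to zero mean and unit
        variance (equation 4), then apply an optional scale and shift."""
        mu_B = F.mean()
        var_B = F.var()
        N = (F - mu_B) / np.sqrt(var_B + eps)
        return gamma * N + beta

    F = np.random.randn(8, 4, 4) * 3.0 + 5.0   # toy mini-batch: 8 samples of a 4x4 feature map
    N = batch_norm(F)                          # now approximately zero mean and unit variance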

 

2.5 Dropout

        Dropout introduces regularization within the network, which ultimately improves generalization by randomly skipping some units or connections with a certain probability. In NNs, multiple connections that learn a non-linear relation are sometimes co-adapted, which causes overfitting [55]. This random dropping of some connections or units produces several thinned network architectures, and finally one representative network is selected with small weights. This selected architecture is then considered as an approximation of all of the proposed networks [56].

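A minimal NumPy sketch of dropout as described above: during training, each unit is kept with probability 1 - p and the surviving activations are rescaled. This "inverted dropout" convention is a common implementation choice rather than something specified in the survey.

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        """Randomly zero units with probability p; rescale the rest so the
        expected activation stays the same at test time."""
        if not training:
            return activations                      # the full (approximating) network is used
        mask = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
        return activations * mask / (1.0 - p)

    a = np.ones((2, 6))
    a_thinned = dropout(a, p=0.5)   # roughly half of the units are dropped on each call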

 

2.6 Fully Connected Layer

       The fully connected layer is mostly used at the end of the network for classification purposes. Unlike pooling and convolution, it is a global operation. It takes input from the previous layer and globally analyses the output of all the preceding layers [57]. This makes a non-linear combination of selected features, which is used for the classification of data [58].

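The fully connected layer is essentially a global affine map over the flattened feature maps, usually followed by a softmax for classification. The NumPy sketch below uses illustrative sizes (8 feature maps of 4x4, 10 classes) that are not taken from the survey.

    import numpy as np

    def fully_connected(feature_maps, W, b):
        """Global operation: every input unit is connected to every output unit."""
        x = feature_maps.reshape(-1)              # flatten all preceding feature maps
        return W @ x + b

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    feature_maps = np.random.randn(8, 4, 4)       # 8 pooled feature maps of size 4x4 (assumed)
    W = 0.01 * np.random.randn(10, 8 * 4 * 4)     # weights for 10 output classes (assumed)
    b = np.zeros(10)
    class_probs = softmax(fully_connected(feature_maps, W, b))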

                                                             Fig. 3: Evolutionary history of deep CNNs

 

3 Architectural Evolution of Deep CNN

       Nowadays, CNNs are considered as the most widely used algorithms among biologically inspired AI techniques. CNN history begins from the neurobiological experiments conducted by Hubel and Wiesel (1959, 1962) [14], [59]. Their work provided a platform for many cognitive models, almost all of which were latterly replaced by CNN. Over the decades, different efforts have been carried out to improve the performance of CNNs. This history is pictorially represented in Fig. 3. These improvements can be categorized into five different eras and are discussed below.


 

3.1 Late 1980s-1999: Origin of CNN

       CNNs have been applied to visual tasks since the late 1980s. In 1989, LeCun et al. proposed the first multilayered CNN, named ConvNet, whose origin is rooted in Fukushima's Neocognitron [60], [61]. LeCun proposed supervised training of ConvNet using the backpropagation algorithm [7], [62], in comparison to the unsupervised reinforcement learning scheme used by its predecessor, the Neocognitron. LeCun's work thus laid a foundation for the modern 2D CNNs. Supervised training in CNN provides automatic feature learning ability from raw input, rather than the designing of handcrafted features used by traditional ML methods. This ConvNet showed successful results for handwritten digit and zip code recognition related problems [63]. In 1998, ConvNet was improved by LeCun and used for classifying characters in a document recognition application [64]. This modified architecture was named LeNet-5, which was an improvement over the initial CNN as it can extract feature representations in a hierarchical way from raw pixels [65]. The reliance of LeNet-5 on fewer parameters, along with consideration of the spatial topology of images, enabled CNN to recognize rotational variants of the image [65]. Due to the good performance of CNN in optical character recognition, its commercial use in ATMs and banks started in 1993 and 1996, respectively. Though many successful milestones were achieved by LeNet-5, the main concern associated with it was that its discrimination power did not scale to classification tasks other than handwriting recognition.


 

3.2 Early 2000: Stagnation of CNN

       In the late 1990s and early 2000s, interest in NNs reduced, and less attention was given to exploring the role of CNNs in different applications such as object detection, video surveillance, etc. The use of CNN in ML related tasks became dormant due to the insignificant improvement in performance at the cost of high computational time. At that time, other statistical methods and, in particular, SVM became more popular than CNN due to their relatively high performance [66]–[68]. It was widely presumed in the early 2000s that the backpropagation algorithm used for training of CNN was not effective in converging to optimal points and therefore unable to learn useful features in a supervised fashion as compared to handcrafted features [69]. Meanwhile, different researchers kept working on CNN and tried to optimize its performance. In 2003, Simard et al. improved the CNN architecture and showed good results compared to SVM on a handwritten digit benchmark dataset, MNIST [64], [68], [70]–[72]. This performance improvement expedited the research in CNN by extending its application in optical character recognition (OCR) to other scripts' character recognition [72]–[74], deployment in image sensors for face detection in video conferencing, and regulation of street crimes, etc. Likewise, CNN based systems were industrialized in markets for tracking customers [75]–[77]. Moreover, CNN's potential in other applications such as medical image segmentation, anomaly detection, and robot vision was also explored [78]–[80].


 

3.3 2006-2011: Revival of CNN

       Deep NNs generally have complex architectures and a time-intensive training phase that sometimes spanned weeks and even months. In the early 2000s, there were only a few techniques for the training of deep networks. Additionally, it was considered that CNN is not able to scale for complex problems. These challenges halted the use of CNN in ML related tasks.


      To address these problems, in 2006 many interesting methods were reported to overcome the difficulties encountered in the training of deep CNNs and the learning of invariant features. Hinton proposed a greedy layer-wise pre-training approach for deep architectures in 2006, which revived and reinstated the importance of deep learning [81], [82]. The revival of deep learning [83], [84] was one of the factors which brought deep CNNs into the limelight. Huang et al. (2006) used max pooling instead of subsampling, which showed good results by learning invariant features [46], [85].

        In late 2006, researchers started using graphics processing units (GPUs) [86], [87] to accelerate the training of deep NN and CNN architectures [88], [89]. In 2007, NVIDIA launched the CUDA programming platform [90], [91], which allows exploitation of the parallel processing capabilities of GPUs to a much greater degree [92]. In essence, the use of GPUs for NN training [88], [93] and other hardware improvements were the main factors which revived the research in CNN. In 2010, Fei-Fei Li's group at Stanford established a large database of images known as ImageNet, containing millions of labeled images [94]. This database was coupled with the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions, where the performances of various models have been evaluated and scored [95]. Consequently, ILSVRC and NIPS have been very active in strengthening research and increasing the use of CNN and thus making it popular. This was a turning point in improving the performance and increasing the use of CNN.

 

3.4 2012-2014: Rise of CNN

       Availability of big training data, hardware advancements, and computational resources contributed to the advancement in CNN algorithms. A renaissance of CNN in object detection, image classification, and segmentation related tasks was observed in this period [9], [96]. However, the success of CNN in image classification tasks was not only the result of the aforementioned factors but was also largely contributed by architectural modifications, parameter optimization, incorporation of regulatory units, and reformulation and readjustment of connections within the network [39], [42], [97].


       The main breakthrough in CNN performance was brought by AlexNet [21]. AlexNet won the 2012-ILSVRC competition, which has been one of the most difficult challenges in image detection and classification. AlexNet improved performance by exploiting depth (incorporating multiple levels of transformation) and introduced a regularization term in CNN. The exemplary performance of AlexNet [21] compared to conventional ML techniques in 2012-ILSVRC (AlexNet reduced the error rate from 25.8 to 16.4) suggested that the saturation in CNN performance before 2006 was largely due to the unavailability of enough training data and computational resources. In summary, before 2006, these resource deficiencies made it hard to train a high-capacity CNN without deterioration of performance [98].

      With CNN becoming more of a commodity in the computer vision (CV) field, a number of attempts have been made to improve the performance of CNN with reduced computational cost. Therefore, each new architecture tries to overcome the shortcomings of previously proposed architectures in combination with new structural reformulations. In 2013 and 2014, researchers mainly focused on parameter optimization to accelerate CNN performance in a range of applications with a small increase in computational complexity. In 2013, Zeiler and Fergus [28] defined a mechanism to visualize the learned filters of each CNN layer. The visualization approach was used to improve the feature extraction stage by reducing the size of the filters. Similarly, the VGG architecture [29] proposed by the Oxford group, which was runner-up at the 2014-ILSVRC competition, made the receptive field much smaller in comparison to that of AlexNet, but with increased volume. In VGG, depth was increased from 9 layers to 16 by making the volume of feature maps double at each layer. In the same year, GoogLeNet [99], which won the 2014-ILSVRC competition, not only exerted its efforts to reduce computational cost by changing the layer design, but also widened the width in compliance with depth to improve CNN performance. GoogLeNet introduced the concept of split, transform, and merge based blocks, within which multiscale and multilevel transformation is incorporated to capture both local and global information [33], [99], [100]. The use of multilevel transformations helps CNN in tackling details of images at various levels. In the years 2012-14, the main improvement in the learning capacity of CNN was achieved by increasing its depth and by parameter optimization strategies. This suggested that the depth of a CNN helps in improving the performance of a classifier.

 

3.5 2015-Present: Rapid increase in Architectural Innovations and Applications of CNN

       It is generally observed that the major improvements in CNN performance occurred from 2015-2019. The research in CNN is still ongoing and has a significant potential for improvement. The representational capacity of CNN depends on its depth and, in a sense, can help in learning complex problems by defining diverse levels of features ranging from simple to complex. Multiple levels of transformation make learning easy by chopping complex problems into smaller modules. However, the main challenge faced by deep architectures is the problem of negative learning, which occurs due to diminishing gradients at lower layers of the network. To handle this problem, different research groups worked on readjustment of layer connections and design of new modules. In early 2015, Srivastava et al. used the concept of cross-channel connectivity and an information gating mechanism to solve the vanishing gradient problem and to improve the network representational capacity [101]–[103]. This idea became popular in late 2015, and a similar concept of residual blocks or skip connections was coined [31]. Residual blocks are a variant of cross-channel connectivity, which smoothen learning by regularizing the flow of information across blocks [104]–[106]. This idea was used in the ResNet architecture for the training of a 150 layers deep network [31]. The idea of cross-channel connectivity was further extended to multilayer connectivity by Deluge, DenseNet, etc. to improve representation [107], [108].


        In the year 2016, the width of the network was also explored in connection with depth to improve feature learning [34], [35]. Apart from this, no new architectural modification became prominent, but instead, different researchers used hybrids of the already proposed architectures to improve deep CNN performance [33], [104]–[106], [109], [110]. This fact gave the intuition that there might be other factors more important than the appropriate assembly of the network units for effectively regulating CNN performance. In this regard, Hu et al. (2017) identified that the network representation has a role in the learning of deep CNNs [111]. Hu et al. introduced the idea of feature map exploitation and pinpointed that less informative and domain extraneous features may affect the performance of the network to a larger extent. They exploited the aforementioned idea and proposed a new architecture named Squeeze and Excitation Network (SE-Network) [111]. It exploits feature map (commonly known as channel in the literature) information by designing a specialized SE-block. This block assigns a weight to each feature map depending upon its contribution to class discrimination. This idea was further investigated by different researchers, who assign attention to important regions by exploiting both spatial and feature map (channel) information [37], [38], [112]. In 2018, a new idea of channel boosting was introduced by Khan et al. [36]. The motivation behind the training of a network with boosted channel representation was to use an enriched representation. This idea effectively boosts the performance of a CNN by learning diverse features as well as exploiting the already learnt features through the concept of TL.

       From 2012 up till now, a lot of improvements have been reported in CNN architecture. As regards the architectural advancement of CNNs, recently the focus of research has been on designing new blocks that can boost network representation by exploiting both feature maps and spatial information or by adding artificial channels.