Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

This paper appeared in the SIGCOMM 2018 Workshop on Mobile Edge Communications (MECOMM).

I have translated this paper. As time was short and my English is limited, errors are inevitable; readers are welcome to point them out.

This paper and its translation are for learning purposes only. If anything is inappropriate, please contact me and I will remove it.

The paper has three authors: En Li, Zhi Zhou, and Xu Chen, School of Data and Computer Science, Sun Yat-sen University.

ABSTRACT

As the backbone technology of machine learning, deep neural networks (DNNs) have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead; offloading DNNs to the cloud for execution, meanwhile, suffers from unpredictable performance due to the uncontrolled, long wide-area network latency. To address these challenges, in this paper we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning, which adaptively partitions DNN computation between device and edge in order to leverage hybrid computation resources in proximity for real-time DNN inference; and (2) DNN right-sizing, which accelerates DNN inference through early exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.

1 INTRODUCTION & RELATED WORK

As the backbone technology supporting modern intelligent mobile applications, Deep Neural Networks (DNNs) represent the most commonly adopted machine learning technique and have become increasingly popular. Due to DNNs' ability to perform highly accurate and reliable inference tasks, they have witnessed successful applications in a broad spectrum of domains from computer vision [14] to speech recognition [12] and natural language processing [16]. However, as DNN-based applications typically require a tremendous amount of computation, they cannot be well supported by today's mobile devices with reasonable latency and energy consumption.

In response to the excessive resource demand of DNNs, the traditional wisdom resorts to the powerful cloud datacenter for training and evaluating DNNs. Input data generated from mobile devices is sent to the cloud for processing, and the results are sent back to the mobile devices after the inference. However, with such a cloud-centric approach, large amounts of data (e.g., images and videos) are uploaded to the remote cloud via a long wide-area network data transmission, resulting in high end-to-end latency and energy consumption on the mobile devices. To alleviate the latency and energy bottlenecks of the cloud-centric approach, a better solution is to exploit the emerging edge computing paradigm. Specifically, by pushing the cloud capabilities from the network core to the network edges (e.g., base stations and WiFi access points) in close proximity to devices, edge computing enables low-latency and energy-efficient DNN inference.

While recognizing the benefits of edge-based DNN inference, our empirical study reveals that the performance of edge-based DNN inference is highly sensitive to the available bandwidth between the edge server and the mobile device. Specifically, as the bandwidth drops from 1Mbps to 50Kbps, the latency of edge-based DNN inference climbs from 0.123s to 2.317s and becomes on par with the latency of local processing on the device. Considering the vulnerable and volatile network bandwidth in realistic environments (e.g., due to user mobility and bandwidth contention among various apps), a natural question is whether we can further improve the performance (i.e., latency) of edge-based DNN execution, especially for mission-critical applications such as VR/AR games and robotics [13].

To answer the above question in the positive, in this paper we propose Edgent, a deep learning model co-inference framework with device-edge synergy. Towards low-latency edge intelligence (as an initial exploration, we consider only execution latency here and leave energy consumption to future work), Edgent pursues two design knobs. The first is DNN partitioning, which adaptively partitions DNN computation between the mobile device and the edge server based on the available bandwidth, so as to take advantage of the processing power of the edge server while reducing the data transfer delay. It is worth noting, however, that the latency after DNN partitioning is still constrained by the part that remains running on the device side. Therefore, Edgent further combines DNN partitioning with DNN right-sizing, which accelerates DNN inference through early exit at an intermediate DNN layer. Needless to say, early exit naturally gives rise to a latency-accuracy tradeoff (i.e., early exit harms the accuracy of the inference). To address this challenge, Edgent jointly optimizes DNN partitioning and right-sizing in an on-demand manner. That is, for mission-critical applications that typically have a predefined deadline, Edgent maximizes the accuracy without violating the deadline. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent's effectiveness in enabling on-demand low-latency edge intelligence.

While the topic of edge intelligence has begun to garner much attention recently, our study is different from and complementary to existing pilot efforts. On one hand, for fast and low-power DNN inference on the mobile device side, various approaches as exemplified by DNN compression and DNN architecture optimization have been proposed [3–5, 7, 9]. Different from these works, we take a scale-out approach to unleash the benefits of collaborative edge intelligence between the edge and mobile devices, and thus mitigate the performance and energy bottlenecks of the end devices. On the other hand, though the idea of DNN partitioning between the cloud and end devices is not new [6], realistic measurements show that DNN partitioning alone is not enough to satisfy the stringent timeliness requirements of mission-critical applications. Therefore, we further apply the approach of DNN right-sizing to speed up DNN inference.

2 BACKGROUND & MOTIVATION

In this section, we first give a primer on DNN, then analyze the inefficiency of edge- or device-based DNN execution, and finally illustrate the benefits of DNN partitioning and right-sizing with device-edge synergy towards low-latency edge intelligence.

2.1 A Primer on DNN

DNN represents the core machine learning technique for a broad spectrum of intelligent applications spanning computer vision, automatic speech recognition and natural language processing. As illustrated in Fig. 1, computer vision applications use DNNs to extract features from an input image and classify the image into one of the pre-defined categories. A typical DNN model is organized as a directed graph that includes a series of inter-connected layers, each comprising some number of nodes. Each node is a neuron that applies certain operations to its input and generates an output. The input layer of nodes is set by the raw data, while the output layer determines the category of the data. The process of passing forward from the input layer to the output layer is called model inference. For a typical DNN containing tens of layers and hundreds of nodes per layer, the number of parameters can easily reach the scale of millions; thus, DNN inference is computationally intensive. Note that in this paper we focus on DNN inference, since DNN training is generally delay-tolerant and is typically conducted in an offline manner using powerful cloud resources.

Figure 1: A 4-layer DNN for computer vision.
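
To make the forward pass concrete, here is a minimal NumPy sketch of inference through a small fully-connected DNN (the layer widths are hypothetical, chosen only for illustration; this is not the paper's code):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Hypothetical network: 3072-d input (a 32x32x3 image) down to 10 classes.
weights = [rng.standard_normal((3072, 256)) * 0.01,
           rng.standard_normal((256, 128)) * 0.01,
           rng.standard_normal((128, 10)) * 0.01]

def infer(x):
    """Model inference: pass the input forward, layer by layer."""
    for W in weights[:-1]:
        x = relu(x @ W)                 # hidden layers: affine transform + ReLU
    return softmax(x @ weights[-1])     # output layer: class probabilities

image = rng.standard_normal(3072)       # stand-in for raw input data
print(infer(image).argmax())            # index of the predicted category
```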

2.2 Inefficiency of Device- or Edge-based DNN Inference

Currently, the status quo of mobile DNN inference is either direct execution on the mobile device or offloading to the cloud/edge server for execution. Unfortunately, both approaches may suffer from poor performance (i.e., end-to-end latency) and are hard-pressed to satisfy real-time intelligent mobile applications (e.g., AR/VR mobile gaming and intelligent robots) [2]. As an illustration, we take a Raspberry Pi tiny computer and a desktop PC to emulate the mobile device and the edge server respectively, running the classical AlexNet [1] DNN model for image recognition over the Cifar-10 dataset [8]. Fig. 2 plots the breakdown of the end-to-end latency of the different approaches under varying bandwidth between the edge and the mobile device. It clearly shows that it takes more than 2s to execute the model on the resource-limited Raspberry Pi. Moreover, the performance of the edge-based execution approach is dominated by the input data transmission time (the edge server computation time stays at ∼10ms) and is thus highly sensitive to the available bandwidth. Specifically, as the available bandwidth drops from 1Mbps to 50Kbps, the end-to-end latency climbs from 0.123s to 2.317s. Considering the scarcity of network bandwidth in practice (e.g., due to network resource contention among users and apps) and the computing resource limitations of mobile devices, neither the device-based nor the edge-based approach can well support emerging real-time intelligent mobile applications with stringent latency requirements.

Figure 2: AlexNet runtime.
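
A quick back-of-envelope check (our own arithmetic from the reported numbers, not from the paper) confirms that transmission dominates: the edge-based latency is roughly Input/B plus the ∼10ms server computation. At B = 50Kbps, 2.317s − 0.01s ≈ 2.31s of transfer implies an input of about 2.31s × 50Kbps ≈ 115Kb ≈ 14KB; the same ∼14KB input at 1Mbps takes ≈ 0.115s, consistent with the measured 0.123s.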

2.3 Enabling Edge Intelligence with DNN Partitioning and Right-Sizing

DNN Partitioning: For a better understanding of the performance bottleneck of DNN execution, we further break down the runtime (on Raspberry Pi) and the output data size of each layer in Fig. 3. Interestingly, we can see that the runtime and the output data size of different layers exhibit great heterogeneity, and layers with a long runtime do not necessarily have a large output data size. An intuitive idea, then, is DNN partitioning, i.e., partitioning the DNN into two parts and offloading the computationally intensive one to the server at a low transmission overhead, thus reducing the end-to-end latency. For illustration, we choose the second local response normalization layer (i.e., lrn_2) in Fig. 3 as the partition point and offload the layers before the partition point to the edge server while running the remaining layers on the device. By partitioning the DNN between device and edge, we are able to combine the hybrid computation resources in proximity for low-latency DNN inference.

Figure 3: Layer-wise runtime of AlexNet on the Raspberry Pi.
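
As a concrete picture of the co-inference flow that this partitioning produces, here is a minimal Python sketch (the helper names and interfaces are ours, purely illustrative; Sec. 3 describes the actual system):

```python
def co_inference(x, layers, partition, run_on_edge, run_on_device, send):
    """Partitioned DNN inference: layers[:partition] run on the edge server,
    the remaining layers on the mobile device.

    `run_on_edge`, `run_on_device` and `send` are hypothetical helpers that
    execute one layer remotely/locally and model the network transfer.
    """
    x = send(x, to="edge")              # upload the raw input to the edge server
    for layer in layers[:partition]:
        x = run_on_edge(layer, x)       # compute-heavy front layers on the server
    x = send(x, to="device")            # ship the (small) intermediate result back
    for layer in layers[partition:]:
        x = run_on_device(layer, x)     # remaining layers finish on the device
    return x
```

Choosing the partition at a layer with a small output (such as lrn_2 above) keeps the second transfer cheap.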

DNN Right-Sizing: While DNN partitioning greatly reduces the latency by blending the computing power of the edge server and the mobile device, we should note that the optimal DNN partitioning is still constrained by the runtime of the layers that run on the device. For a further reduction of latency, the approach of DNN right-sizing can be combined with DNN partitioning. DNN right-sizing promises to accelerate model inference through an early-exit mechanism. That is, by training a DNN model with multiple exit points, each of a different size, we can choose a small-sized DNN tailored to the application demand, alleviating the computing burden of model inference and thus reducing the total latency. Fig. 4 illustrates a branchy AlexNet with five exit points. Currently, the early-exit mechanism is supported by the open-source framework BranchyNet [15]. Intuitively, DNN right-sizing further reduces the amount of computation required by DNN inference tasks.

Figure 4: An illustration of the early-exit mechanism for DNN right-sizing.
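
Conceptually, the branchy model bundles several sub-models that share a common trunk, and right-sizing simply selects which exit to stop at. A minimal sketch (our own illustrative structure, not BranchyNet's actual API):

```python
def branchy_inference(x, trunk_layers, exits, exit_point):
    """Run the shared trunk up to the chosen exit, then its branch classifier.

    trunk_layers: ordered list of layer functions shared by all sub-models
    exits:        list of (trunk_index, branch_classifier) pairs, one per exit
    exit_point:   exit index chosen by the optimizer; smaller = faster but
                  less accurate, larger = slower but more accurate
    """
    stop_at, classifier = exits[exit_point]
    for layer in trunk_layers[:stop_at]:
        x = layer(x)                # layers past `stop_at` are skipped entirely
    return classifier(x)
```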

Problem Description: Obviously, DNN right-sizing incurs a latency-accuracy tradeoff: while early exit reduces the computing time on the device side, it also deteriorates the accuracy of the DNN inference. Considering the fact that some applications (e.g., VR/AR games) have stringent deadline requirements while tolerating moderate accuracy loss, we strike a balance between the latency and the accuracy in an on-demand manner. Particularly, given a predefined and stringent latency goal, we maximize the accuracy without violating the deadline requirement. More specifically, the problem to be addressed in this paper can be summarized as: given a predefined latency requirement, how to jointly optimize the decisions of DNN partitioning and right-sizing, in order to maximize the DNN inference accuracy.

3 FRAMEWORK

We now outline the initial design of Edgent, a framework that automatically and intelligently selects the best partition point and exit point of a DNN model to maximize the accuracy while satisfying the execution latency requirement.

3.1 System Overview

Fig. 5 shows the overview of Edgent. Edgent consists of three stages: the offline training stage, the online optimization stage, and the co-inference stage.

Figure 5: Edgent overview.

At the offline training stage, Edgent performs two initializations: (1) profiling the mobile device and the edge server to generate regression-based performance prediction models (Sec. 3.2) for the different types of DNN layers (e.g., convolution, pooling, etc.); and (2) using BranchyNet to train DNN models with various exit points, thus enabling early exit. Note that the performance profiling is infrastructure-dependent, while the DNN training is application-dependent. Thus, given the sets of infrastructures (i.e., mobile devices and edge servers) and applications, the two initializations only need to be done once, in an offline manner.

At the online optimization stage, the DNN optimizer selects the best partition point and early-exit point of the DNN to maximize accuracy while providing a performance guarantee on the end-to-end latency, based on three inputs: (1) the profiled layer-latency prediction models and the BranchyNet-trained DNN models of various sizes; (2) the observed available bandwidth between the mobile device and the edge server; and (3) the pre-defined latency requirement. The optimization algorithm is detailed in Sec. 3.3.

At the co-inference stage, according to the partition and early-exit plan, the edge server executes the layers before the partition point, and the rest run on the mobile device.

3.2 Layer Latency Prediction

When estimating the runtime of a DNN, Edgent models the per-layer latency rather than modeling at the granularity of the whole DNN. This greatly reduces the profiling overhead, since there are only a very limited number of layer classes. Through experiments, we observe that the latency of different layers is determined by various independent variables (e.g., input data size, output data size), which are summarized in Table 1. Note that we also observe that the DNN model loading time has an obvious impact on the overall runtime. Therefore, we further take the DNN model size as an input parameter to predict the model loading time. Based on the above inputs of each layer, we establish a regression model to predict the latency of each layer from its profiles. The final regression models of some typical layers are shown in Table 2 (sizes are in bytes and latencies in ms).
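
As an illustration of this profiling step, here is a minimal sketch of fitting one such per-layer-type regression model (assuming the profiled measurements have already been collected; the code is ours, not the paper's):

```python
import numpy as np

def fit_latency_model(features, latencies):
    """Least-squares fit of latency ~ w . features + b for one layer type.

    features:  (n_samples, n_vars) array of profiled independent variables,
               e.g. input and output data sizes (bytes) for a convolution layer
    latencies: (n_samples,) measured runtimes in ms on one device type
    """
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(X, latencies, rcond=None)
    return lambda f: float(np.append(f, 1.0) @ coef)

# One model is fit per (layer type, hardware) pair, once, offline, e.g.:
# predict_conv_ms = fit_latency_model(conv_features, conv_latencies)
# est = predict_conv_ms([input_bytes, output_bytes])
```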

3.3 Joint Optimization on DNN Partition and DNN Right-Sizing

At the online optimization stage, the DNN optimizer receives the latency requirement from the mobile device, and then searches for the optimal exit point and partition point of the trained BranchyNet model. The whole process is given in Algorithm 1. For a branchy model with M exit points, we denote that the i-th exit point has N_i layers, where a larger index i corresponds to a larger and more accurate inference model. We use the above-mentioned regression models to predict ED_j, the runtime of the j-th layer when it runs on the device, and ES_j, the runtime of the j-th layer when it runs on the server, and we let D_p be the output data size of the p-th layer. Under a specific bandwidth B and with input data Input, we then calculate A_{i,p}, the whole runtime when the p-th layer is the partition point of the model with the i-th exit point:

A_{i,p} = \sum_{j=1}^{p-1} ES_j + \sum_{j=p}^{N_i} ED_j + Input/B + D_{p-1}/B

When p = 1, the model runs only on the device, so ES_j = 0, D_{p-1}/B = 0 and Input/B = 0; when p = N_i, the model runs only on the server, so ED_j = 0 and D_{p-1}/B = 0. In this way, we can find the partition point with the smallest latency for the model with the i-th exit point. Since model partitioning does not affect the inference accuracy, we can then sequentially try the DNN inference models with different exit points (i.e., with different accuracies) and find the one that has the largest size while still satisfying the latency requirement. Note that since the regression models for layer latency prediction are trained beforehand, Algorithm 1 mainly involves linear search operations and can be done very fast (no more than 1ms in our experiments).
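
The search itself is short enough to sketch directly (our reading of Algorithm 1; the data-structure and helper names are ours, not the paper's exact interfaces): for each exit point from the most accurate down, find the latency-minimizing partition point, and return the first combination that meets the deadline.

```python
def select_exit_and_partition(layer_counts, ED, ES, D, input_size, B, deadline):
    """Jointly pick the exit point and partition point.

    layer_counts: N_i for each exit point, ordered most accurate first
    ED[i][j], ES[i][j]: predicted device/server runtime of layer j (ms),
                        from the per-layer regression models
    D[i][j]: output data size of layer j; B: current bandwidth
    """
    for i, n in enumerate(layer_counts):        # largest (most accurate) model first
        best_latency, best_p = float("inf"), None
        for p in range(1, n + 1):               # candidate partition points
            edge = sum(ES[i][j] for j in range(p - 1))          # layers 1..p-1 on edge
            device = sum(ED[i][j] for j in range(p - 1, n))     # layers p..N_i on device
            # p == 1 means device-only execution: nothing is transferred
            transfer = 0.0 if p == 1 else input_size / B + D[i][p - 2] / B
            latency = edge + device + transfer
            if latency < best_latency:
                best_latency, best_p = latency, p
        if best_latency <= deadline:            # partitioning never hurts accuracy,
            return i, best_p                    # so the first fit is the most accurate
    return None                                 # no configuration meets the deadline
```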

4 EVALUATION

We now present our preliminary implementation and evaluation results.

4.1 Prototype

We have implemented a simple prototype of Edgent to verify the feasibility and efficacy of our idea. To this end, we take a desktop PC to emulate the edge server, which is equipped with a quad-core Intel processor at 3.4 GHz and 8 GB of RAM, and runs Ubuntu. We further use a Raspberry Pi 3 tiny computer to act as the mobile device. The Raspberry Pi 3 has a quad-core ARM processor at 1.2 GHz and 1 GB of RAM. The available bandwidth between the edge server and the mobile device is controlled by the WonderShaper [10] tool. As for the deep learning framework, we choose Chainer [11], which can well support branchy DNN structures.

For the branchy model, based on the standard AlexNet model, we train a branchy AlexNet for image recognition over the large-scale Cifar-10 dataset [8]. The branchy AlexNet has five exit points, as shown in Fig. 4 (Sec. 2), and each exit point corresponds to a sub-model of the branchy AlexNet. Note that in Fig. 4 we only draw the convolution layers and the fully-connected layers, ignoring the other layers for ease of illustration. The five sub-models have 12, 16, 19, 20 and 22 layers, respectively.

For the regression-based latency prediction models of each layer, the independent variables are shown in Table 1, and the obtained regression models are shown in Table 2.

4.2 Results

We deploy the branchy model on the edge server and the mobile device to evaluate the performance of Edgent. Specifically, since both the pre-defined latency requirement and the available bandwidth play vital roles in Edgent's optimization logic, we evaluate the performance of Edgent under various latency requirements and available bandwidths.

We first investigate the effect of the bandwidth by fixing the latency requirement at 1000ms and varying the bandwidth from 50kbps to 1.5Mbps. Fig. 6(a) shows the best partition point and exit point under different bandwidths. While the best partition point may fluctuate, we can see that the best exit point gets higher as the bandwidth increases, meaning that higher bandwidth leads to higher accuracy. Fig. 6(b) shows that as the bandwidth increases, the model runtime first drops substantially and then suddenly ascends. This is reasonable, since the accuracy gets better when the bandwidth increases from 1.2Mbps to 2Mbps, while the latency still stays within the latency requirement. It also shows that our proposed regression-based latency prediction approach can well estimate the actual DNN model runtime. We further fix the bandwidth at 500kbps and vary the latency requirement from 100ms to 1000ms. Fig. 6(c) shows the best partition point and exit point under the different latency requirements. As expected, the best exit point gets higher as the latency requirement increases, meaning that a larger latency goal gives more room for accuracy improvement.

Figure 6: Results under different bandwidths and latency requirements.

Fig. 7 shows the model accuracy of the different inference methods under different latency requirements. The accuracy is marked as negative if the inference cannot satisfy the latency requirement. The network bandwidth is set to 400kbps. As seen in Fig. 7, at a very low latency requirement (100ms), none of the four methods can satisfy the requirement. As the latency requirement increases, inference by Edgent starts to work earlier than the other methods, at the 200ms to 300ms requirements, by using a small model with moderate inference accuracy to meet the deadline. The accuracy of the model selected by Edgent gets higher as the latency requirement relaxes.

Figure 7: Accuracy comparison under different latency requirements.
