Placer: The Placement Heuristics Module in TensorFlow

Date: 2019-11-17

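Background

[Author: DeepLearningStack, algorithm engineer at Alibaba, open-source TensorFlow contributor]

Because a single device has limited compute capacity and memory, many deep learning models need model sharding or a related strategy. In essence, model sharding splits the model and its computation across different devices. This not only solves the problem of a large model that does not fit on a single device, it may also speed up computation. Among deep learning frameworks, TensorFlow is clearly more flexible than Caffe, largely thanks to its Placement mechanism. Placement is a concept specific to TensorFlow: it refers to which device a given Op is placed on, so model sharding is really a matter of setting the Placement of every Op in the model. At the Python level there are two Placement-related APIs; they are used widely inside the framework code and can also be called directly by users. However, Placement information specified by users is not entirely reliable, and it often conflicts with what an Op can actually support. The Placer is the TensorFlow module that resolves this problem.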

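What the Placer Does

Start with the structure of a NodeDef, which has two Placement-related parts. One is the device attribute, which explicitly specifies the device the Node should be placed on. The other is a string marker of the form loc:@xxxx, a placement constraint that implicitly states which Nodes this Node's Placement must agree with. More precisely, the Node's Placement must match that of every Node in the group named xxxx. These two pieces of information sometimes contradict each other.

The Placer has to resolve conflicts between the two and also apply rules that avoid, as far as possible, performance problems caused by poor Placement. After the Placer has processed a Node, the Node has its final Placement, which overwrites the device attribute in the NodeDef. Put simply, the Placer's job is to compute and fill in the device attribute of every NodeDef.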

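Prerequisites

When reading the code you will run into terminology coined specifically for this problem, as well as some classic algorithms. It is worth making sure you understand the following before reading about the Placer module, to avoid unnecessary detours.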

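Key Concepts

Placement: an attribute of each Op that explicitly states which device the Op should be placed on for computation.

Colocation Group: another Placement-related attribute of each Op; in a NodeDef it appears as a string of the form loc:@xxxx. It is a set of Nodes, and the algorithm also calls it a constraint. All Nodes in the same Colocation Group are required to have the same Placement. It is an implicit way of expressing Placement and can be specified together with an explicit Placement, so the two can contradict each other. If they conflict, an error is raised immediately.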
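To make the conflict case concrete, here is a minimal sketch of merging two partial device requests, for example an explicit device attribute and the request accumulated by a Colocation Group. It is not TensorFlow's implementation: the dict representation and the name merge_specs are assumptions made for illustration.

```python
def merge_specs(a, b):
    """Merge two partial device requests.

    Each request is a dict that may hold "type" (e.g. "GPU") and "id"
    (e.g. 0); a missing key means that field was left unspecified.
    """
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            # Both sides pin the same field to different values: a conflict.
            raise ValueError(f"conflicting {key}: {merged[key]!r} vs {value!r}")
        merged[key] = value
    return merged


# Compatible requests combine into a more specific one.
print(merge_specs({"type": "GPU"}, {"id": 0}))  # {'type': 'GPU', 'id': 0}
```

Merging {"type": "GPU"} with {"type": "CPU"} raises ValueError in this sketch, mirroring the "conflict means error" behaviour described above.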

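Basic Principles Behind the Placer's Decisions

The Placer analyzes the Graph to some extent and, taking the user's requirements into account, fine-tunes the Placement of each Node. The principles behind this fine-tuning can be summarized in four points.

1. User Requirement First: each Node's placement satisfies the user's requirements as far as possible.

2. High Performance Device: if the user has not specified a Placement for a Node, a faster device is preferred.

3. Runnable: if no implementation of a Node exists for the Placement the user requested, the Placer falls back to another implementation so that the program can still run.

4. Near When Possible: when fine-tuning Placement, the Placer takes node adjacency into account and keeps pointless copies to a minimum.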

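User Requirement First

User requirements come in two forms. One is explicit, expressed as the device information set on the Node; the other is implicit, expressed as the loc:@xxxx attribute, that is, the Colocation Group. The Placer fills in and fine-tunes Placement based on both kinds of requirement and on what is actually feasible. The screenshot at the beginning of the article shows the NodeDef of one Node: an Op of type MatMul that the user explicitly placed on '/device:GPU:0' and also asked to join the Colocation Group named global_step. The device attribute and the loc:@xxxx attribute in a NodeDef are introduced by the two Python-level APIs below. Both are under the user's control, and some uses are wrapped inside higher-level APIs.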

# device attribute
@tf_export("device")
def device(device_name_or_function):
    ...  # body omitted in this excerpt

# colocation attribute
@tf_export("colocate_with")
def colocate_with(op, ignore_existing=False):
    ...  # body omitted in this excerpt

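High Performance Device

If a Node's device attribute does not include a device_type (that is, GPU or CPU), the Placer must decide which kind of device to use. Every device type is registered in TensorFlow with a priority, and a higher-priority device usually offers better compute performance. When an Op has implementations for several device types, the Placer chooses the highest-priority one by setting device_type to the highest-priority device type among all available implementations.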
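The selection rule can be sketched in a few lines. This is an illustration only: the priority values and the name pick_device_type are hypothetical, and TensorFlow's real device registry is not modelled.

```python
# Hypothetical priorities: a larger number means a more preferred device type.
DEVICE_PRIORITY = {"GPU": 2, "CPU": 1}


def pick_device_type(kernel_device_types, priority=DEVICE_PRIORITY):
    """Return the highest-priority device type that has a kernel for the Op."""
    return max(kernel_device_types, key=lambda t: priority[t])


print(pick_device_type(["CPU", "GPU"]))  # GPU
print(pick_device_type(["CPU"]))         # CPU
```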

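Runnable

This is guaranteed by the Soft Placement mechanism. If a Node is explicitly pinned to a particular device but the system has no implementation for that device, Soft Placement steps in to keep the program usable. It ignores the device type, picks another available implementation according to device priority, and rewrites the Placement. For example, suppose a Node's op is SparseToDense and its device_type is set to GPU, but SparseToDense currently has only a CPU implementation in TensorFlow; Soft Placement then rewrites the Node's device_type to CPU.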
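The SparseToDense example can be sketched as follows. The function name resolve_device_type, its parameters, and the priority table are assumptions for illustration; the sketch only mirrors the fallback logic described above and is not the Placer's code.

```python
def resolve_device_type(requested, kernel_device_types, priority,
                        allow_soft_placement=True):
    """Return the device type to use for an Op with a requested device type."""
    if requested in kernel_device_types:
        return requested  # the user's request can be honoured as-is
    if not allow_soft_placement:
        raise ValueError(f"no {requested} kernel and soft placement is off")
    # Soft placement: ignore the request and fall back by priority.
    return max(kernel_device_types, key=lambda t: priority[t])


priority = {"GPU": 2, "CPU": 1}  # hypothetical priorities
# SparseToDense is assumed here to have a CPU kernel only.
print(resolve_device_type("GPU", ["CPU"], priority))  # CPU
```

With allow_soft_placement=False the same call raises ValueError in this sketch instead of rewriting the request.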

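Near When Possible

The Placer implements this principle with the following three heuristic rules.

a. If a Node is a GeneratorNode (0 in-degree, 1 out-degree, and a non-reference output type), giving it the same Placement as its consumer avoids pointless cross-device copies. The algorithm calls this Heuristic A.

b. If a Node is a MetaDataNode (it operates only on a Tensor's metadata, for example Shape), giving it the same Placement as its producer avoids pointless cross-device copies. The algorithm calls this Heuristic B.

c. If a Node's input is a reference type or a resource type, the Placer tries to put the Node in the same Colocation Group as that input. The algorithm has no name for this step; for convenience we call it Heuristic C.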

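Overall Flow of the Placer's Decision Algorithm

The overall flow has four steps; the figure below shows the flowchart at a high level. The last two steps are more involved, and the next section breaks their flowcharts down further.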

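Step-by-Step Walkthrough of the Placer Algorithm with Key Code

Step 1: Build Colocation Groups from User Specifications

In general, a Node with no user-specified Colocation Group is put into a Group of its own as the only member, and the Group is named after the Node, so every Node in the Graph has a Colocation Group. Logically, merging Groups is a simple problem, but here a Group is more than a set of Nodes: it also carries attributes, such as possible devices, the set of all devices the Group can use. We therefore need a data structure and algorithm that make it easy to produce the new Group and its attributes when two Groups are merged (a convenient Union) and to look up quickly all attributes of the Group a Node belongs to (a fast Find). This is exactly what Find-Union (the union-find, or disjoint-set, algorithm) is good at. The algorithm itself is not explained here; we only show Member, the basic data structure the code uses for Find-Union, which describes a Group's basic information. Before reading the comments in the code below, you should have a basic understanding of the tree structure used in Find-Union.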

// Represents a node in the disjoint node set forest, and the
// accumulated constraints on the device used by that node.
struct Member {
  Member() = default;
  // The id of the node that is the parent of this one, or its own
  // id if it is a root. parent <= 0 indicates that this member is invalid.
  int parent = -1;

  // A proxy for the depth of the tree that is used to prefer
  // connecting smaller trees to larger trees when merging disjoint
  // sets.
  int rank = 0;

  // The intersection of all device types supported by this node,
  // and those of all of its children, in priority order
  // of the preferred device.
  DeviceTypeVector supported_device_types;

  // The merged form of the device requested for this node, with
  // those of all of its children.
  DeviceNameUtils::ParsedName device_name;

  // If this node is a root, stores a list of Devices to which this node
  // and all of its children have been assigned, or nullptr if this
  // has not yet been computed.
  std::vector<Device*> possible_devices;
};
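For readers less familiar with union-find, the following is a minimal Python sketch of the same idea: each root carries the intersection of the device types supported by its group, and merging two groups intersects those lists. The class name ColocationGroups and its fields are assumptions for illustration; they only loosely mirror parent, rank, and supported_device_types above.

```python
class ColocationGroups:
    """Union-find over node ids; each root holds its group's device types."""

    def __init__(self, supported_types):
        # supported_types[i]: priority-ordered device types for node i.
        n = len(supported_types)
        self.parent = list(range(n))  # every node starts as its own root
        self.rank = [0] * n
        self.types = [list(t) for t in supported_types]

    def find(self, x):
        # Walk up to the root, halving the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra  # attach the shallower tree under the deeper one
        common = [t for t in self.types[ra] if t in self.types[rb]]
        if not common:
            raise ValueError("no device type satisfies both groups")
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        self.types[ra] = common  # the root keeps the intersection
        return ra


groups = ColocationGroups([["GPU", "CPU"], ["CPU"], ["GPU", "CPU"]])
root = groups.union(0, 1)
print(groups.types[root])  # ['CPU']
```

Keeping the merged attributes on the root is what makes both operations cheap: a merge touches only two roots, and a lookup is a single find.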

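The code below is the core of this step. It first creates a ColocationGraph object, a utility class for handling Colocation Groups that uses the Find-Union algorithm internally to merge Groups. After InitializeMembers initializes the basic Find-Union data structures, ColocateAllNodes is called to merge Groups according to all the colocation information specified by the user.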

ColocationGraph colocation_graph(
    graph_, devices_,
    options_ == nullptr || options_->config.allow_soft_placement(),
    default_device_);

TF_RETURN_IF_ERROR(colocation_graph.InitializeMembers());

// 1. First add all of the nodes. Note that steps (1) and (2)
// requires two passes over the nodes because the graph (and hence
// the constraints) may not be acyclic.
TF_RETURN_IF_ERROR(colocation_graph.ColocateAllNodes());

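Step 2: Applying Heuristic C

This step adjusts the Colocation Groups. While traversing each Node in the Graph, the Placer looks at the Node's inputs to decide whether to merge the Node's Group with the Group of the source Node. If an input of the Node is a ref_type or DT_RESOURCE, it attempts the merge. (DT_RESOURCE generally shows up only when ResourceVariable is used. Compared with Variable, ResourceVariable has many new features, which are promoted in TF 2.0. We will not go into its advantages here and only note the Op types involved: at the C++ level a Variable is a VariableV2 Op, while a ResourceVariable is a VarHandleOp, and the Tensor the latter produces is a DT_RESOURCE.) Before merging, the necessary feasibility checks are made and errors are raised where appropriate. For example, besides the connection between this pair of nodes, the merge also has to consider whether the Node's other inputs are ref_type or DT_RESOURCE. This part of the code is long but fairly simple, so it is not shown here.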
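Since the original code is not shown, here is a heavily simplified sketch of the idea only: walk the edges and merge the groups of producer and consumer whenever the tensor on the edge is a reference or resource type. The edge-triple representation, the dtype labels, and the name apply_heuristic_c are assumptions, and the feasibility checks mentioned above are left out.

```python
REF_OR_RESOURCE = {"ref", "resource"}  # hypothetical dtype labels


def find(parent, x):
    # Follow parent links up to the group's representative.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def apply_heuristic_c(nodes, edges):
    """Return a mapping from node name to its group representative.

    edges is a list of (src, dst, dtype) triples.
    """
    parent = {n: n for n in nodes}
    for src, dst, dtype in edges:
        if dtype in REF_OR_RESOURCE:
            # Colocate the consumer with the node that produces the ref/resource.
            parent[find(parent, dst)] = find(parent, src)
    return {n: find(parent, n) for n in nodes}


groups = apply_heuristic_c(
    ["var", "assign", "const", "add"],
    [("var", "assign", "ref"), ("const", "assign", "float"),
     ("assign", "add", "float")],
)
print(groups["assign"] == groups["var"])  # True
```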

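Step 3: Applying Heuristic B

From this step on, the Placer actually starts assigning a device to each Node; the flowchart below shows this step.

1. If the current Node's device attribute already has a value, the Placer does not assign it again and skips the Node.

2. If the current Node is a GeneratorNode, it is first put into a vector named second_pass.

3. Otherwise, the Node is what this step needs to handle. The Placer first gets the available devices from the Node's Colocation Group as candidates (this lookup is affected by Soft Placement). If the Node is a MetaDataNode, it tries to apply Heuristic B; otherwise, it assigns the highest-priority device in the candidate set.

The code below shows the handling of MetaDataNodes, which is the code for Heuristic B.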

int assigned_device = -1;

// Heuristic B: If the node only operates on metadata, not data,
// then it is desirable to place that metadata node with its
// input.
if (IsMetadata(node)) {
  // Make sure that the input device type is in the list of supported
  // device types for this node.
  const Node* input = (*node->in_edges().begin())->src();
  // TODO(vrv): if the input is empty, consider postponing this
  // node's assignment to the second pass, so that we handle the
  // case where a metadata node's input comes from a backedge
  // of a loop.
  if (CanAssignToDevice(input->assigned_device_name(), *devices)) {
    assigned_device = input->assigned_device_name_index();
  }
}

// Provide the default, if necessary.
if (assigned_device == -1) {
  assigned_device = graph_->InternDeviceName((*devices)[0]->name());
}

AssignAndLog(assigned_device, node);
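The three cases of this step can be put together in a small, self-contained sketch. Everything here is illustrative: the Node dataclass, the METADATA_OPS set, the simplified is_generator test, and the candidates mapping (priority-ordered device names per node) are assumptions, not the Placer's data structures.

```python
from dataclasses import dataclass, field

METADATA_OPS = {"Shape", "Size", "Rank"}  # assumed set of metadata-only ops


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)  # names of producer nodes
    device: str = ""           # empty string: not yet assigned
    ref_output: bool = False   # True if the output is a reference type


def is_generator(node):
    # Simplified: no inputs and a non-reference output (one output assumed).
    return not node.inputs and not node.ref_output


def first_pass(nodes, candidates):
    """Assign devices in order; return the generator nodes left for later."""
    by_name = {n.name: n for n in nodes}
    second_pass = []
    for node in nodes:
        if node.device:            # case 1: already assigned, skip
            continue
        if is_generator(node):     # case 2: defer to the second pass
            second_pass.append(node)
            continue
        devices = candidates[node.name]  # case 3: ordered by priority
        assigned = ""
        if node.op in METADATA_OPS and node.inputs:
            input_device = by_name[node.inputs[0]].device
            if input_device in devices:  # Heuristic B: follow the producer
                assigned = input_device
        node.device = assigned or devices[0]
    return second_pass
```

For instance, if a MatMul can only run on '/device:CPU:0' and a Shape node consumes its output, the Shape node ends up on '/device:CPU:0' as well, even when a GPU device is first in its own candidate list.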

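Step 4: Applying Heuristic A

This step assigns a device to every Node in the second_pass vector; the flowchart below shows this step.

Every Node placed in second_pass is a GeneratorNode, so only Heuristic A needs to be applied. As in step 3, Heuristic A is applied on a best-effort basis: if it cannot be satisfied, the highest-priority candidate device is assigned directly. The code below is the part that applies Heuristic A.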

int assigned_device = -1;

// Heuristic A application.
if (IsGeneratorNode(node)) {
  const Node* output = (*node->out_edges().begin())->dst();
  int output_device_name = output->assigned_device_name_index();

  const bool consumers_on_same_device = std::all_of(
      node->out_edges().begin(), node->out_edges().end(),
      [output_device_name](const Edge* e) {
        return e->dst()->assigned_device_name_index() == output_device_name;
      });

  if (consumers_on_same_device &&
      CanAssignToDevice(output->assigned_device_name(), *devices)) {
    assigned_device = output_device_name;
  }
}

// Provide the default, if necessary.
if (assigned_device == -1) {
  assigned_device = graph_->InternDeviceName((*devices)[0]->name());
}

AssignAndLog(assigned_device, node);
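The same logic reads as follows in a compact Python sketch. The dictionaries consumers, device_of, and candidates are hypothetical stand-ins for the graph's out-edges, the assigned device names, and the priority-ordered candidate devices; the empty-consumer guard is an addition of this sketch.

```python
def second_pass(generators, consumers, device_of, candidates):
    """Place each deferred generator node next to its consumers if possible."""
    for name in generators:
        outs = consumers.get(name, [])
        assigned = None
        if outs:
            target = device_of[outs[0]]
            # Heuristic A: only follow the consumers if they all agree and
            # the generator itself can run on that device.
            if (all(device_of[c] == target for c in outs)
                    and target in candidates[name]):
                assigned = target
        # Otherwise fall back to the highest-priority candidate.
        device_of[name] = assigned or candidates[name][0]
    return device_of


placed = second_pass(
    ["c"],
    {"c": ["m"]},
    {"m": "/device:CPU:0"},
    {"c": ["/device:GPU:0", "/device:CPU:0"]},
)
print(placed["c"])  # /device:CPU:0
```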

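At this point, the Placement of every Node has been assigned and fine-tuned.

Summary

A GraphDef that has been processed by the Placer is guaranteed to be free of conflicts at the Placement level, so the Placer is regarded as the last line of defense against Placement conflicts. After the Placer, the GraphDef is passed to the GraphPartitioner module, which splits it into subgraphs according to each Node's device and inserts Send, Recv, and any necessary ControlFlow nodes. The walkthrough above also shows that the core of the Placer is applying several heuristic rules to fine-tune Placement, but these rules are still fairly simple and do not fully solve the performance problem. When looking for performance headroom in Placement, one thing comes to mind immediately: in distributed mode, a crude Placement scheme can make job performance very poor, because it introduces communication overhead on top of computation. For the sake of flexibility, TensorFlow leaves the burden of the Placement strategy to the user, which is one reason some users' distributed TensorFlow programs perform so badly. From a framework point of view, TensorFlow should relieve users of this programming burden so that they can focus entirely on the model and algorithm. But automatically searching for the best Placement strategy is very hard, because it has to account for the cluster's communication bandwidth and the computational cost of each Op, which makes it a complex problem tightly coupled to hardware and environment. Moreover, deep learning models typically contain thousands or tens of thousands of Nodes, which makes the search space enormous. Google has proposed using reinforcement learning to search for the best model sharding strategy; interested readers can refer to the ICML paper Device Placement Optimization with Reinforcement Learning.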