mxnet's (github-mxnet) python interface is quite complete: you can train models without ever reading the C++ code. But if you want to study the C++ code, the python training and prediction flow shows exactly how the C++ gets called. In the previous post I explained how mshadow works (mshadow的原理--MXNet); in this post I walk through mxnet's training process and show which C++ interfaces python calls. The C++ interfaces themselves are not explained in much detail here; you can read the source yourself, and later posts may cover them.
Below is a simple mxnet training example. The python debugger used here is Wing Pro; for C++ I recommend Qt Creator, which requires a CMakeLists.txt, and the relevant .so files must be built in Debug mode before they can be debugged.
```python
# -*- coding: utf-8 -*-
import mxnet as mx
import numpy as np
import logging
logging.getLogger().setLevel(logging.DEBUG)

# produce data
def productData(Dim, half_len):
    '''
    produce data for training or eval
    Dim : dimension
    half_len : 2*half_len is the number of training data
    '''
    data = np.append(np.random.uniform(-1, 0, [half_len, Dim]),
                     np.random.uniform(0, 1, [half_len, Dim]), axis = 0)
    label = np.append(np.zeros(half_len), np.ones(half_len))
    return data, label

#get the data
np.random.seed(1)
Dim = 3
train_data, train_label = productData(Dim, 1)
eval_data, eval_label = productData(Dim, 1)

#data iter
batch_size = 1
train_iter = mx.io.NDArrayIter(train_data, train_label, batch_size, shuffle=True)
eval_iter = mx.io.NDArrayIter(eval_data, eval_label, batch_size, shuffle=False)

#input variable
X = mx.sym.Variable('data')
Y = mx.symbol.Variable('softmax_label')

#network config
fc_1 = mx.sym.FullyConnected(data=X, name='fc1', num_hidden = 2)
fc_2 = mx.sym.FullyConnected(data=fc_1, name='fc2', num_hidden = 3)
fc_3 = mx.sym.FullyConnected(data=fc_2, name='fc3', num_hidden = 4)
lro = mx.sym.SoftmaxOutput(data=fc_3, label=Y, name="softmax")

#build the model
model = mx.mod.Module(
    symbol = lro,
    data_names=['data'],
    label_names = ['softmax_label']   # network structure
)

#train the model
model.fit(train_iter, eval_iter,
          optimizer_params={'learning_rate':0.5, 'momentum': 0.9},
          num_epoch=1,
          eval_metric='mse',
          batch_end_callback = mx.callback.Speedometer(batch_size, 1))

#predict the result
pre = model.predict(eval_iter).asnumpy()
print np.argmax(pre, axis = 1)
```
The code above is very simple; anyone who has trained models with mxnet's python API can read it at a glance. I won't go through what each line of python means, but rather how this code interacts with mxnet's underlying C++ code. Python talks to C++ through the ctypes library. The mxnet version used here is 0.7; the code structure of other versions should not differ much.
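To make that interaction concrete, here is a minimal sketch, assuming libmxnet.so is already on the loader path (mxnet's own wrapper in base.py resolves the real path and wraps the error handling in check_call(); the snippet below is only an illustration, not the library's wrapper code):

```python
import ctypes

# load the shared library; "libmxnet.so" here is an assumption for illustration
_LIB = ctypes.CDLL("libmxnet.so")
_LIB.MXGetLastError.restype = ctypes.c_char_p
SymbolHandle = ctypes.c_void_p

# every function in c_api.h returns 0 on success and a non-zero code on error
handle = SymbolHandle()
ret = _LIB.MXSymbolCreateVariable(ctypes.c_char_p(b"data"), ctypes.byref(handle))
if ret != 0:
    raise RuntimeError(_LIB.MXGetLastError())
```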
mx.io.NDArrayIter does not call into C++ at all. When a variable symbol (Symbol Variable) is created, however, the MXSymbolCreateVariable function is called. Note that when the python function being called belongs to the mxnet package, it dispatches to the package's corresponding function; all the C++ functions that get called are exposed in c_api.h, with their implementations under ./src/c_api. The call chain is: Variable() (python) --> MXSymbolCreateVariable() (C++) --> CreateVariable() (C++). Let's look at the Symbol class in C++ and the structs related to it:
```cpp
/*!
 * \brief Symbol is used to represent dynamically generated symbolic computation graph.
 *
 * This class is used as a tool to generate computation graphs(aka. configuration) of the network.
 * Symbol is always composite, the head Node is the output node of the symbol.
 * An atomic symbol can be seen as a special case of the composite symbol with only the head node.
 */
class Symbol {
 public:
  ...
 protected:
  // Declare node, internal data structure.
  struct Node;
  /*! \brief an entry that represents output data from a node */
  struct DataEntry {
    /*! \brief the source node of this data */
    std::shared_ptr<Node> source;
    /*! \brief index of output from the source. */
    uint32_t index;
    /*! \brief enabled default copy constructor */
    DataEntry() {}
    /*! \brief constructor from index */
    DataEntry(std::shared_ptr<Node> source, uint32_t index)
        : source(source), index(index) {}
  };
  /*!
   * \brief the head nodes of Symbols
   * This head is only effective when
   */
  std::vector<DataEntry> heads_;
  ...
};

/*!
 * \brief Node is represents node of an operator in the symbolic graph.
 *
 * It stores connection to the inputs to function represented by OperatorProperty
 * NOTE on data structure: there are three types of node:
 * - Normal node: contains all the necessary elements of a graph.
 * - OperatorProperty: the inputs_ is empty, represents an OperatorProperty that has not been applied.
 * - Variable: the sym_ is nullptr, represents an named Variable of tensors that can be composed.
 */
struct Symbol::Node {
  /*! \brief Operator of this node */
  std::unique_ptr<OperatorProperty> op;
  /*! \brief name of the node */
  std::string name;
  /*! \brief inputs to this node */
  std::vector<DataEntry> inputs;
  /*! \brief source node of the current node */
  std::shared_ptr<Symbol::Node> backward_source_node;
  /*!
   * \brief additional attributes about the node,
   *  Use pointer to save space, as attr can be accessed in a slow way,
   *  not every node will have attributes.
   */
  std::unique_ptr<std::map<std::string, std::string> > attr;
  /*!
   * \brief constructor
   * \param op the OperatorProperty to construct the Node
   * \param name the name of the symbol
   */
  explicit Node(OperatorProperty *op, const std::string& name)
      : op(op), name(name) {}
  /*!
   * \brief copy constructor constructor
   */
  explicit Node(const Node& other)
      : name(other.name) {
    if (other.op != nullptr) {
      op.reset(other.op->Copy());
    }
    if (other.attr.get() != nullptr) {
      attr.reset(new std::map<std::string, std::string>(*(other.attr)));
    }
  }
  ~Node() {
    ...
  }
  /*! \return Whether the symbol is atomic */
  inline bool is_atomic() const {
    return inputs.size() == 0 && op != nullptr;
  }
  /*! \return Whether it is unit variable */
  inline bool is_variable() const {
    return op == nullptr && !backward_source_node;
  }
  /*! \return Whether it is backward op */
  inline bool is_backward() const {
    return backward_source_node.get() != nullptr;
  }
};

/*! \return whwther the symbol is atomic */
inline bool Symbol::is_atomic() const {
  return heads_[0].source->is_atomic();
}
```
The inline bool is_variable() function above shows what characterizes a variable. Creating one is also very simple: construct a Symbol and push the initial entry into the heads_ container, as follows:
```cpp
Symbol Symbol::CreateVariable(const std::string &name) {
  Symbol s;
  s.heads_.push_back(DataEntry(std::make_shared<Node>(nullptr, name), 0));
  return s;
}
```
In mxnet, both layers (mx.sym.FullyConnected, mx.sym.SoftmaxOutput, etc.) and variables are Symbols.
The set of layers in mxnet may change over time: whenever a new layer is written in C++, it must first be registered with the dmlc core, and when python imports the symbol module it dynamically loads all the registered layers. Let's first look at how dynamic loading works in plain python, and then at how mxnet's python code does it.
```python
import sys

def fib(n):
    a, b = 0, 1
    result = []
    while(b < n):
        result.append(b)
        a, b = b, a+b
    print(result)

print("load function in here")
setattr(sys.modules[__name__], "FIBC", fib)
```
Suppose the code above is saved as load_test.py. When you import load_test, the first line and the last two lines of the script are executed; the last line binds the name FIBC to fib, so FIBC can be called just like a function. The result looks like this:
```python
>>> import load_test
load function in here
>>> load_test.fib(16)
[1, 1, 2, 3, 5, 8, 13]
>>> load_test.FIBC(16)
[1, 1, 2, 3, 5, 8, 13]
```
So how does mxnet's python achieve the same thing? When the symbol module is imported, _init_symbol_module() runs; this function loads every Symbol registered in the mxnet core. Look at the following two functions:
```python
def _init_symbol_module():
    """List and add all the atomic symbol functions to current module."""
    plist = ctypes.POINTER(ctypes.c_void_p)()
    size = ctypes.c_uint()

    check_call(_LIB.MXSymbolListAtomicSymbolCreators(ctypes.byref(size),
                                                     ctypes.byref(plist)))
    module_obj = sys.modules[__name__]
    module_internal = sys.modules["mxnet._symbol_internal"]
    for i in range(size.value):
        hdl = SymbolHandle(plist[i])
        function = _make_atomic_symbol_function(hdl)
        if function.__name__.startswith('_'):
            setattr(module_internal, function.__name__, function)
        else:
            setattr(module_obj, function.__name__, function)


def _make_atomic_symbol_function(handle):
    """Create an atomic symbol function by handle and funciton name."""
    name = ctypes.c_char_p()
    desc = ctypes.c_char_p()
    key_var_num_args = ctypes.c_char_p()
    num_args = mx_uint()
    arg_names = ctypes.POINTER(ctypes.c_char_p)()
    arg_types = ctypes.POINTER(ctypes.c_char_p)()
    arg_descs = ctypes.POINTER(ctypes.c_char_p)()
    ret_type = ctypes.c_char_p()

    check_call(_LIB.MXSymbolGetAtomicSymbolInfo(
        handle, ctypes.byref(name), ctypes.byref(desc),
        ctypes.byref(num_args),
        ctypes.byref(arg_names),
        ctypes.byref(arg_types),
        ctypes.byref(arg_descs),
        ctypes.byref(key_var_num_args),
        ctypes.byref(ret_type)))
    param_str = ctypes2docstring(num_args, arg_names, arg_types, arg_descs)
    key_var_num_args = py_str(key_var_num_args.value)
    func_name = py_str(name.value)
    desc = py_str(desc.value)
    if key_var_num_args:
        desc += '\nThis function support variable length of positional input.'
    doc_str = ('%s\n\n' +
               '%s\n' +
               'name : string, optional.\n' +
               '    Name of the resulting symbol.\n\n' +
               'Returns\n' +
               '-------\n' +
               'symbol: Symbol\n' +
               '    The result symbol.')
    doc_str = doc_str % (desc, param_str)
    extra_doc = "\n" + '\n'.join([x.__doc__ for x in type.__subclasses__(SymbolDoc)
                                  if x.__name__ == '%sDoc' % func_name])
    doc_str += re.sub(re.compile("    "), "", extra_doc)

    def creator(*args, **kwargs):
        """Activation Operator of Neural Net.
        The parameters listed below can be passed in as keyword arguments.

        Parameters
        ----------
        name : string, required.
            Name of the resulting symbol.

        Returns
        -------
        symbol: Symbol
            the resulting symbol
        """
        param_keys = []
        param_vals = []
        symbol_kwargs = {}
        name = kwargs.pop('name', None)
        attr = kwargs.pop('attr', None)

        if key_var_num_args and key_var_num_args not in kwargs:
            param_keys.append(c_str(key_var_num_args))
            param_vals.append(c_str(str(len(args))))

        for k, v in kwargs.items():
            if isinstance(v, Symbol):
                symbol_kwargs[k] = v
            else:
                param_keys.append(c_str(k))
                param_vals.append(c_str(str(v)))
        # create atomic symbol
        param_keys = c_array(ctypes.c_char_p, param_keys)
        param_vals = c_array(ctypes.c_char_p, param_vals)
        sym_handle = SymbolHandle()
        check_call(_LIB.MXSymbolCreateAtomicSymbol(
            handle,
            mx_uint(len(param_keys)),
            param_keys, param_vals,
            ctypes.byref(sym_handle)))

        if len(args) != 0 and len(symbol_kwargs) != 0:
            raise TypeError(
                '%s can only accept input'
                'Symbols either as positional or keyword arguments, not both' % func_name)
        if key_var_num_args and len(symbol_kwargs) != 0:
            raise ValueError('This function supports variable length of Symbol arguments.\n' +
                             'Please pass all the input Symbols via positional arguments' +
                             ' instead of keyword arguments.')
        s = Symbol(sym_handle)
        attr = AttrScope.current.get(attr)
        if attr:
            s._set_attr(**attr)
        hint = func_name.lower()
        name = NameManager.current.get(name, hint)
        s._compose(*args, name=name, **symbol_kwargs)
        return s

    creator.__name__ = func_name
    creator.__doc__ = doc_str
    return creator
```
MXSymbolListAtomicSymbolCreators returns the array of OperatorPropertyReg objects registered in the core. _make_atomic_symbol_function fetches the information of the corresponding Symbol and returns a creator object; note that creator.__name__ is set to the Symbol's name. setattr(module_obj, function.__name__, function) then writes the returned creator into the module, so once the module is imported you can call the corresponding creator(*args, **kwargs) function directly under that name. As for how a layer is registered with the mxnet core, here is the FullyConnected example:
```cpp
DMLC_REGISTER_PARAMETER(FullyConnectedParam);

MXNET_REGISTER_OP_PROPERTY(FullyConnected, FullyConnectedProp)
.describe("Apply matrix multiplication to input then add a bias.")
.add_argument("data", "Symbol", "Input data to the FullyConnectedOp.")
.add_argument("weight", "Symbol", "Weight matrix.")
.add_argument("bias", "Symbol", "Bias parameter.")
.add_arguments(FullyConnectedParam::__FIELDS__());

struct FullyConnectedParam : public dmlc::Parameter<FullyConnectedParam> {
  int num_hidden;
  bool no_bias;
  DMLC_DECLARE_PARAMETER(FullyConnectedParam) {
    // TODO(bing) add support for boolean
    DMLC_DECLARE_FIELD(num_hidden).set_lower_bound(1)
    .describe("Number of hidden nodes of the output.");
    DMLC_DECLARE_FIELD(no_bias).set_default(false)
    .describe("Whether to disable bias parameter.");
  }
};
```
I am not sure what the best title for this part is; essentially it creates a Symbol for a layer, a Symbol whose Node holds the operator of that layer. The layer calls below all follow the same process, creating one Symbol per layer. As shown above, calling these functions actually calls a creator object, so single-stepping the python code drops you straight into creator(*args, **kwargs). Let's look at what happens inside this function, taking fc_3 = mx.sym.FullyConnected(data=fc_2, name='fc3', num_hidden = 4) as the example.
```python
#network config
fc_1 = mx.sym.FullyConnected(data=X, name='fc1', num_hidden = 2)
fc_2 = mx.sym.FullyConnected(data=fc_1, name='fc2', num_hidden = 3)
fc_3 = mx.sym.FullyConnected(data=fc_2, name='fc3', num_hidden = 4)
lro = mx.sym.SoftmaxOutput(data=fc_3, label=Y, name="softmax")
```
Inside creator(*args, **kwargs), the Symbol arguments (here fc_2) are first separated from the non-Symbol arguments (num_hidden, defined in FullyConnectedParam). The non-Symbol parameters are passed to the C++ function MXSymbolCreateAtomicSymbol, which creates the Symbol and hangs the operator on the Symbol's heads_[0].source.
After the Symbol is created, the previous layer's Symbol still has to be attached to this layer, which is what s._compose(*args, name=name, **symbol_kwargs) does. It calls MXSymbolCompose --> Compose in C++. Compose hangs the upstream Symbol objects at the corresponding positions of heads_[0].source->inputs; those positions are determined by this Symbol's heads_[0].source->op->ListArguments(). In this example, fc3.heads_[0].source->inputs[0] = fc2. FullyConnectedProp's ListArguments is shown below; the remaining empty slots are filled with nodes whose op is NULL (from is_variable() above you can see that such a node is a variable), and finally this operator Symbol is returned.
```cpp
std::vector<std::string> ListArguments() const override {
  if (!param_.no_bias) {
    return {"data", "weight", "bias"};
  } else {
    return {"data", "weight"};
  }
}
```
Once lro = mx.sym.SoftmaxOutput(data=fc_3, label=Y, name="softmax") has run, we obtain the network structure diagram shown in Figure 1, although this is still not the computation graph. Here I divide Symbols into two kinds: layers, labelled Symbol:OP, and variables, labelled Symbol:Var.
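You can already see both kinds of Symbol from the python side of the example network; here is a small sketch that uses only list_arguments(), which is discussed in the next section:

```python
# fc_2 was the only Symbol passed to the FullyConnected creator, so the weight and bias
# inputs of every layer were auto-filled with variable Symbols and appear as arguments
print fc_1.list_arguments()   # ['data', 'fc1_weight', 'fc1_bias']
print lro.list_arguments()    # all variables hanging under the whole network,
                              # including 'softmax_label' (the full list appears below)
```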
```python
#build the model
model = mx.mod.Module(
    symbol = lro,
    data_names=['data'],
    label_names = ['softmax_label']   # network structure
)
```
This builds the model. The part of this constructor I want to talk about is arg_names = symbol.list_arguments(), which involves a depth-first search of the graph. It calls MXSymbolListArguments in C++, where the three functions below perform the depth-first search and return the list of variables.
```cpp
std::vector<std::string> Symbol::ListArguments() const {
  std::vector<std::string> ret;
  if (this->is_atomic()) {
    return heads_[0].source->op->ListArguments();
  } else {
    this->DFSVisit([&ret](const std::shared_ptr<Node> &node) {
        if (node->is_variable()) {
          ret.push_back(node->name);
        }
      });
    return ret;
  }
}

template<typename FVisit>
inline void Symbol::DFSVisit(FVisit fvisit) const {
  typedef const std::shared_ptr<Node>* GNode;
  std::vector<GNode> head_nodes(heads_.size());
  std::transform(heads_.begin(), heads_.end(), head_nodes.begin(),
                 [](const DataEntry& e)->GNode {
                   return &e.source;
                 });
  graph::PostOrderDFSVisit<GNode, Node*>(
      head_nodes,
      [fvisit](GNode n) { fvisit(*n); },        // FVisit
      [](GNode n)->Node* { return n->get(); },  // HashFunc
      [](GNode n)->uint32_t {                   // InDegree
        return (*n)->inputs.size() + static_cast<int>((*n)->is_backward());
      },
      [](GNode n, uint32_t index)->GNode {      // GetInput
        if (index < (*n)->inputs.size()) {
          return &(*n)->inputs.at(index).source;
        } else {
          return &(*n)->backward_source_node;
        }
      });
}

template <typename GNode, typename HashType,
          typename FVisit, typename HashFunc,
          typename InDegree, typename GetInput>
void PostOrderDFSVisit(const std::vector<GNode>& heads,
                       FVisit fvisit,
                       HashFunc hash,
                       InDegree indegree,
                       GetInput getinput) {
  std::vector<std::pair<GNode, uint32_t> > stack;
  std::unordered_set<HashType> visited;
  for (auto& head : heads) {
    HashType head_hash = hash(head);
    if (visited.count(head_hash) == 0) {
      stack.push_back(std::make_pair(head, 0));
      visited.insert(head_hash);
    }
    while (!stack.empty()) {
      std::pair<GNode, uint32_t>& back = stack.back();
      if (back.second == indegree(back.first)) {
        fvisit(back.first);
        stack.pop_back();
      } else {
        const GNode& input = getinput(back.first, back.second++);
        HashType input_hash = hash(input);
        if (visited.count(input_hash) == 0) {
          stack.push_back(std::make_pair(input, 0));
          visited.insert(input_hash);
        }
      }
    }
  }
}
```
The first function, ListArguments(), shows that a node is pushed into the output ret only if it is a variable. The second function, DFSVisit(FVisit fvisit), merely builds the lambdas needed by the third function, PostOrderDFSVisit(...). The third function is the key one. The symbol we attached when constructing the model is lro, which is Symbol:OP--Out in Figure 1. The depth-first search (DFS) proceeds in these steps:
1. Look at the element at the back of the stack, back; back.second records how many of its inputs have been visited so far.
2. If back.second equals the in-degree of back.first, remove back from the stack, and if back.first is a variable push it into the output result ret.
3. Otherwise take the input input[back.second] of back.first, push it onto the back of the stack, and increase back.second by one.

Starting the DFS from the top of Figure 1 and repeating these steps gives the result below (note that this order is unique):
```python
['data', 'fc1_weight', 'fc1_bias', 'fc2_weight', 'fc2_bias', 'fc3_weight', 'fc3_bias', 'softmax_label']
```
This ordering also shows why DFS is used: the traversal order is exactly the order in which the forward pass is computed.
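To make the traversal easier to follow, here is a small python sketch of the same iterative post-order DFS, run on a hand-written adjacency dict for the example network (the dict is only an illustration; the real code walks Symbol::Node objects):

```python
# an iterative post-order DFS mirroring PostOrderDFSVisit above; `inputs` maps a node
# name to the list of its input nodes, and nodes absent from the dict are variables
def post_order_dfs(heads, inputs):
    ret, visited, stack = [], set(), []
    for head in heads:
        if head not in visited:
            stack.append([head, 0])        # second field = number of inputs visited so far
            visited.add(head)
        while stack:
            node, n_visited = stack[-1]
            if n_visited == len(inputs.get(node, [])):   # in-degree reached: emit the node
                ret.append(node)
                stack.pop()
            else:
                stack[-1][1] += 1                        # back.second++
                child = inputs[node][n_visited]
                if child not in visited:
                    stack.append([child, 0])
                    visited.add(child)
    return ret

graph = {'softmax': ['fc3', 'softmax_label'],
         'fc3': ['fc2', 'fc3_weight', 'fc3_bias'],
         'fc2': ['fc1', 'fc2_weight', 'fc2_bias'],
         'fc1': ['data', 'fc1_weight', 'fc1_bias']}
print post_order_dfs(['softmax'], graph)
# prints every node in post order; ListArguments() keeps only the variable nodes,
# which gives exactly the list printed above
```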
Before training, executors are bound according to the devices (Bind Executor). When no device is specified explicitly, the default is cpu(0); in general one Executor corresponds to one hardware device, for example one cpu or one gpu. The python call chain is:

base_module.py : model.fit --> module.py : bind --> executor_group.py : DataParallelExecutorGroup.__init__ --> bind_exec --> _bind_ith_exec --> symbol.py : bind --> C++ : MXExecutorBindEX
_bind_ith_exec is the most important piece of python code here: it not only binds the executor, it also allocates the memory needed by the forward pass (arg_arrays) and the backward pass (grad_arrays), records whether each Symbol needs a gradient (grad_req), and infers the tensor shapes (infer shape). The infer shape step also calls into C++, where iterators are used to build TShape objects, together with a topological sort and so on.
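The shape inference for the example network can also be reproduced directly from python; a small sketch, assuming batch_size=1 and Dim=3 as in the sample code (simple_bind is shown only to illustrate that binding allocates NDArrays with exactly these shapes):

```python
# infer_shape propagates the known input shapes through the graph and returns the
# shapes of all arguments, outputs and auxiliary states
arg_shapes, out_shapes, aux_shapes = lro.infer_shape(data=(1, 3), softmax_label=(1,))
print zip(lro.list_arguments(), arg_shapes)
# e.g. ('data', (1, 3)), ('fc1_weight', (2, 3)), ('fc1_bias', (2,)), ...

# binding then allocates arg_arrays/grad_arrays with these shapes; simple_bind does
# the inference and the allocation in one call (default context cpu(0), as in the text)
exe = lro.simple_bind(ctx=mx.cpu(0), data=(1, 3), softmax_label=(1,))
print [a.shape for a in exe.arg_arrays]
```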
The C++ call chain is:

MXExecutorBindEX() --> Executor::Bind() --> GraphExecutor::Init()
Let's see what GraphExecutor::Init() actually does. InitGraph builds the computation graph, which covers both the forward and the backward pass; InitDataEntryInfo initializes the variables passed in; InitDataEntryMemory allocates memory for the intermediate outputs, and this is where two memory-saving strategies come in:

- ForwardInplaceOption and BackwardInplaceOption, which let an operator write its output into the memory of one of its inputs during the forward and backward passes;
- GraphStoragePool, a pool of memory blocks shared across nodes (and, through shared_exec, across executors) so that freed blocks are reused instead of re-allocated.

There is actually one more memory-saving strategy, but it has nothing to do with the computation graph: it is the one described in my previous post, mshadow的原理--MXNet.
```cpp
inline void Init(Symbol symbol,
                 const Context& default_ctx,
                 const std::map<std::string, Context>& ctx_map,
                 const std::vector<NDArray> &in_args,
                 const std::vector<NDArray> &arg_grad_store,
                 const std::vector<OpReqType> &grad_req_type,
                 const std::vector<NDArray> &aux_states,
                 Executor* shared_exec = nullptr) {
  enable_inplace_allocation_ = dmlc::GetEnv("MXNET_EXEC_ENABLE_INPLACE", true);
  prefer_bulk_execution_ = dmlc::GetEnv("MXNET_EXEC_PREFER_BULK_EXEC", true);
  if (shared_exec != NULL) {
    GraphExecutor* gexec = dynamic_cast<GraphExecutor*>(shared_exec);
    CHECK(gexec) << "Input executor for sharing memory must have GraphExecutor type.";
    shared_mem_ = gexec->shared_mem_;
  } else {
    shared_mem_ = std::make_shared<GraphStoragePool>();
  }

  CHECK_EQ(grad_req_type.size(), arg_grad_store.size());
  bool need_backward = false;
  for (auto req : grad_req_type) {
    if (req != kNullOp) need_backward = true;
  }
  this->InitGraph(symbol, default_ctx, ctx_map,
                  in_args, arg_grad_store, grad_req_type,
                  need_backward);
  this->InitDataEntryInfo(in_args, arg_grad_store, grad_req_type, aux_states);
  this->InitOperators();
  this->InitDataEntryMemory();
  this->InitResources();
  this->InitCachedOps();
  this->InitOpSegs();
}
```
Figure 2 shows the effect of mxnet's memory-saving strategies.
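To build some intuition for these two strategies, here is a toy numpy illustration (this is not mxnet code): the inplace option corresponds to writing an operator's output into the buffer of one of its inputs, and the storage pool corresponds to handing freed blocks to later nodes instead of allocating new ones.

```python
import numpy as np

pool = []                      # freed blocks that later nodes may reuse (GraphStoragePool idea)

def alloc(shape):
    for i, blk in enumerate(pool):
        if blk.shape == shape:         # reuse a compatible freed block
            return pool.pop(i)
    return np.empty(shape)             # otherwise allocate for real

def free(blk):
    pool.append(blk)

def relu_forward_inplace(x):
    np.maximum(x, 0, out=x)            # ForwardInplaceOption idea: output overwrites the input
    return x
```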
Before training starts, all variables except the input data are initialized, as is the optimization algorithm; this happens in base_module.py:
```python
self.init_params(initializer=initializer, arg_params=arg_params,
                 aux_params=aux_params, allow_missing=allow_missing,
                 force_init=force_init)
self.init_optimizer(kvstore=kvstore, optimizer=optimizer,
                    optimizer_params=optimizer_params)
```
The main steps of training are forward_backward and update; the code is as follows:
```python
################################################################################
# training loop
################################################################################
for epoch in range(begin_epoch, num_epoch):
    tic = time.time()
    eval_metric.reset()
    for nbatch, data_batch in enumerate(train_data):
        if monitor is not None:
            monitor.tic()
        self.forward_backward(data_batch)
        self.update()
        self.update_metric(eval_metric, data_batch.label)

        if monitor is not None:
            monitor.toc_print()

        if batch_end_callback is not None:
            batch_end_params = BatchEndParam(epoch=epoch, nbatch=nbatch,
                                             eval_metric=eval_metric,
                                             locals=locals())
            for callback in _as_list(batch_end_callback):
                callback(batch_end_params)

    # one epoch of training is finished
    for name, val in eval_metric.get_name_value():
        self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)
    toc = time.time()
    self.logger.info('Epoch[%d] Time cost=%.3f', epoch, (toc-tic))

    if epoch_end_callback is not None:
        arg_params, aux_params = self.get_params()
        for callback in _as_list(epoch_end_callback):
            callback(epoch, self.symbol, arg_params, aux_params)

    #----------------------------------------
    # evaluation on validation set
    if eval_data:
        res = self.score(eval_data, validation_metric,
                         batch_end_callback=eval_batch_end_callback, epoch=epoch)
        for name, val in res:
            self.logger.info('Epoch[%d] Validation-%s=%f', epoch, name, val)

    # end of 1 epoch, reset the data-iter for another epoch
    train_data.reset()
```
forward and backward both end up calling void RunOps(bool is_train, size_t topo_start, size_t topo_end). This function is presumably the real core of training, but it involves the synchronous and asynchronous machinery of the parameter server (PS), which is complicated, so I will not expand on it here.
[Link to the original post, kept to guard against formatting problems when crawlers repost it]:
http://www.cnblogs.com/heguanyou/p/7604326.html