[Repost] Kaldi ASR: DNN Training




$ copy-int-vector "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1
speaker001_00003 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16 15 15 15 18 890 889 889 889 889 889 889 892 894 893 893 893 86 88 87 90 89 89 89 89 89 89 89 89 89 89 89 89 89 89 194 193 196 195 195 198 197 386 385 385 385 385 385 385 385 385 388 387 387 390 902 901 901 904 903 906 905 905 905 905 905 905 905 905 905 905 905 914 913 913 916 918 917 917 917 917 917 917 752 751 751 751 751 751 754 753 753 753 753 753 753 753 753 756 755 755 926 925 928 927 927 927 927 927 927 927 930 929 929 929 929 929 929 929 929 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 16 18


$ show-transitions phones.txt final.mdl

Transition-state 1: phone = sil hmm-state = 0 pdf = 0
 Transition-id = 1 p = 0.966816 [self-loop]
 Transition-id = 2 p = 0.01 [0 -> 1]
 Transition-id = 3 p = 0.01 [0 -> 2]
 Transition-id = 4 p = 0.013189 [0 -> 3]
Transition-state 2: phone = sil hmm-state = 1 pdf = 1
 Transition-id = 5 p = 0.970016 [self-loop]
 Transition-id = 6 p = 0.01 [1 -> 2]
 Transition-id = 7 p = 0.01 [1 -> 3]
 Transition-id = 8 p = 0.01 [1 -> 4]
Transition-state 3: phone = sil hmm-state = 2 pdf = 2
 Transition-id = 9 p = 0.01 [2 -> 1]
 Transition-id = 10 p = 0.968144 [self-loop]
 Transition-id = 11 p = 0.01 [2 -> 3]
 Transition-id = 12 p = 0.0118632 [2 -> 4]
Transition-state 4: phone = sil hmm-state = 3 pdf = 3
 Transition-id = 13 p = 0.01 [3 -> 1]
 Transition-id = 14 p = 0.01 [3 -> 2]
 Transition-id = 15 p = 0.932347 [self-loop]
 Transition-id = 16 p = 0.0476583 [3 -> 4]
Transition-state 5: phone = sil hmm-state = 4 pdf = 4
 Transition-id = 17 p = 0.923332 [self-loop]
 Transition-id = 18 p = 0.0766682 [4 -> 5]
Transition-state 6: phone = a1 hmm-state = 0 pdf = 5
 Transition-id = 19 p = 0.889764 [self-loop]
 Transition-id = 20 p = 0.110236 [0 -> 1]

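Each transition-state corresponds to exactly one pdf and contains several transition-ids. (The reverse does not hold: several transition-states may share the same pdf.)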
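The relationship between transition-ids and pdfs can be made concrete with a minimal Python sketch that parses a few transition-states copied from the `show-transitions` output above. The function name `parse_show_transitions` is hypothetical and not part of Kaldi; it only assumes the text layout printed above.

```python
import re

# Three of the transition-states from the show-transitions output above.
SHOW_TRANSITIONS = """\
Transition-state 1: phone = sil hmm-state = 0 pdf = 0
 Transition-id = 1 p = 0.966816 [self-loop]
 Transition-id = 2 p = 0.01 [0 -> 1]
 Transition-id = 3 p = 0.01 [0 -> 2]
 Transition-id = 4 p = 0.013189 [0 -> 3]
Transition-state 4: phone = sil hmm-state = 3 pdf = 3
 Transition-id = 13 p = 0.01 [3 -> 1]
 Transition-id = 14 p = 0.01 [3 -> 2]
 Transition-id = 15 p = 0.932347 [self-loop]
 Transition-id = 16 p = 0.0476583 [3 -> 4]
Transition-state 5: phone = sil hmm-state = 4 pdf = 4
 Transition-id = 17 p = 0.923332 [self-loop]
 Transition-id = 18 p = 0.0766682 [4 -> 5]
"""


def parse_show_transitions(text):
    """Return {transition_id: pdf_id} parsed from show-transitions text.

    Hypothetical helper: it relies only on the line layout shown above.
    """
    tid_to_pdf = {}
    current_pdf = None
    for line in text.splitlines():
        state = re.match(r"Transition-state \d+: .* pdf = (\d+)", line)
        if state:
            # Every transition-id listed under this header shares this pdf.
            current_pdf = int(state.group(1))
            continue
        tid = re.match(r"\s*Transition-id = (\d+) ", line)
        if tid and current_pdf is not None:
            tid_to_pdf[int(tid.group(1))] = current_pdf
    return tid_to_pdf


tid_to_pdf = parse_show_transitions(SHOW_TRANSITIONS)
print(tid_to_pdf[4], tid_to_pdf[1], tid_to_pdf[16], tid_to_pdf[18])
```

Transition-ids 1 to 4 all land on pdf 0, which is why the many 1s (and the leading 4) in the alignment above all describe the same silence pdf.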

 labels_tr="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"

feats_tr="ark:copy-feats scp:$dir/train.scp ark:- |"
# input-dim,
  num_fea=$(feat-to-dim "$feats_tr nnet-forward \"$get_dim_from\" ark:- ark:- |" -)
# output-dim,
  num_tgt=$(hmm-info --print-args=false $alidir/final.mdl | grep pdfs | awk '{ print $NF }')

 utils/nnet/make_nnet_proto.py $proto_opts \
   ${bn_dim:+ --bottleneck-dim=$bn_dim} \
   $num_fea $num_tgt $hid_layers $hid_dim >$nnet_proto


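  • ali-to-pdf: converts the transition-ids in the alignment file above into the corresponding pdf-ids.
  • ali-to-post: from those pdf-ids, generates [pdf, post] pairs, i.e. each pdf-id together with its posterior probability.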
$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1
 speaker001_00003 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 3 3 4 440 440 440 440 440 440 440 441 442 442 442 442 38 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 92 92 93 93 93 94 94 188 188 188 188 188 188 188 188 188 189 189 189 190 446 446 446 447 447 448 448 448 448 448 448 448 448 448 448 448 448 452 452 452 453 454 454 454 454 454 454 454 371 371 371 371 371 371 372 372 372 372 372 372 372 372 372 373 373 373 458 458 459 459 459 459 459 459 459 459 460 460 460 460 460 460 460 460 460 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 4

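Look at the first two frames and compare them with the alignment at the start of this article: the transition-ids are 4 and 1, and both map to pdf 0. Running ali-to-post on this result gives: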

$ ali-to-pdf final.mdl "ark:gunzip -c ali.1.gz|" ark,t:- | head -n 1 | ali-to-post ark,t:- ark,t:-
 speaker001_00003 [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] ...... [ 3 1 ] [ 3 1 ] [ 3 1 ] [ 3 1 ] [ 4 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 440 1 ] [ 441 1 ] [ 442 1 ] [ 442 1 ] [ 442 1 ] [ 442 1 ] [ 38 1 ] [ 39 1 ] [ 39 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 40 1 ] [ 92 1 ] [ 92 1 ]...... [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 0 1 ] [ 3 1 ] [ 4 1 ]
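What the two tools do to a single utterance can be imitated in a few lines of Python. This is only a sketch: `TID_TO_PDF` is a hand-written lookup table covering just the transition-ids used here (values read off the outputs above), and `ali_to_pdf` / `ali_to_post` are hypothetical stand-ins for the Kaldi binaries of the same name.

```python
# Hypothetical lookup table; a real one would come from final.mdl.
TID_TO_PDF = {1: 0, 4: 0, 15: 3, 16: 3, 18: 4}


def ali_to_pdf(transition_ids, tid_to_pdf):
    """Map each frame's transition-id to its pdf-id."""
    return [tid_to_pdf[t] for t in transition_ids]


def ali_to_post(pdf_ids):
    """Turn a hard alignment into per-frame [(pdf_id, posterior)] lists.

    A hard alignment puts all probability mass on one pdf, hence 1.0.
    """
    return [[(p, 1.0)] for p in pdf_ids]


# A few frames taken from the alignment above: "4 1 1 ... 16 15 15 15 18".
frames = [4, 1, 1, 16, 15, 15, 15, 18]
pdfs = ali_to_pdf(frames, TID_TO_PDF)
post = ali_to_post(pdfs)
print(pdfs)
print(post[0], post[-1])
```

Each frame ends up with exactly one pair whose weight is 1, matching the `[ 0 1 ]`, `[ 3 1 ]`, `[ 4 1 ]` entries printed by ali-to-post.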


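This gives us the training data and the corresponding target labels. Next, consider the input and output dimensions of the neural network. utils/nnet/make_nnet_proto.py writes the network structure to the nnet_proto file; two important arguments of that Python script, num_fea and num_tgt, are the input and output dimensions of the network respectively. num_fea is obtained from feat-to-dim: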

$ feat-to-dim scp:../tri4b_dnn/train.scp ark,t:- | grep speaker001_00003 
speaker001_00003 40


$ more final.feature_transform 
<Splice> 440 40 
[ -5 -4 -3 -2 -1 0 1 2 3 4 5 ]
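The `<Splice> 440 40` line with offsets -5 to 5 says that each 40-dimensional frame is concatenated with its 5 left and 5 right neighbours, giving 11 × 40 = 440 dimensions. A pure-Python sketch of that operation follows; the function `splice` is hypothetical, and clamping out-of-range indices to the first or last frame is an assumption about how utterance edges are handled.

```python
def splice(frames, offsets):
    """Concatenate each frame with its neighbours at the given offsets.

    Assumption: indices before the first or after the last frame are
    clamped, so edge frames are repeated rather than zero-padded.
    """
    n = len(frames)
    out = []
    for t in range(n):
        row = []
        for off in offsets:
            idx = min(max(t + off, 0), n - 1)
            row.extend(frames[idx])
        out.append(row)
    return out


offsets = list(range(-5, 6))                    # [-5, ..., 0, ..., 5]
feats = [[float(t)] * 40 for t in range(20)]    # 20 toy frames of dim 40
spliced = splice(feats, offsets)
print(len(spliced), len(spliced[0]))
```

The number of frames is unchanged; only the per-frame dimension grows from 40 to 440, which is the input dimension of the first `<AffineTransform>` in the network.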


$ hmm-info final.mdl
number of phones 218
number of pdfs 1026
number of transition-ids 2834
number of transition-states 1413

$ hmm-info final.mdl |  grep pdfs | awk '{ print $NF }'
1026

<AffineTransform> <InputDim> 440 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.037344 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024 <BiasMean> -2.000000 <BiasRange> 4.000000 <ParamStddev> 0.109375 <MaxNorm> 0.000000
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1026 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.109322 <LearnRateCoef> 1.000000 <BiasLearnRateCoef> 0.100000
<Softmax> <InputDim> 1026 <OutputDim> 1026
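To get a feel for the size of this network, the prototype can be read back and its trainable parameters counted. The sketch below assumes that each `<AffineTransform>` holds an InputDim × OutputDim weight matrix plus one bias per output; `count_parameters` is a hypothetical helper, and `PROTO` repeats only the dimension fields of the lines above.

```python
import re

# Dimension fields of the prototype above (other attributes dropped).
PROTO = """\
<AffineTransform> <InputDim> 440 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1024
<Sigmoid> <InputDim> 1024 <OutputDim> 1024
<AffineTransform> <InputDim> 1024 <OutputDim> 1026
<Softmax> <InputDim> 1026 <OutputDim> 1026
"""


def count_parameters(proto_text):
    """Sum weights and biases over all <AffineTransform> layers."""
    pattern = r"<AffineTransform> <InputDim> (\d+) <OutputDim> (\d+)"
    total = 0
    for in_dim, out_dim in re.findall(pattern, proto_text):
        total += int(in_dim) * int(out_dim) + int(out_dim)
    return total


print(count_parameters(PROTO))
```

The first layer takes the 440-dimensional spliced features and the last one emits 1026 scores, one per pdf.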



Perform one iteration (epoch) of Neural Network training with mini-batch Stochastic Gradient Descent. The training targets are usually pdf-posteriors, prepared by ali-to-post.


  • Parse the training options and configure the network.
  • Read the feature vectors and target labels. The input is of type Matrix<BaseFloat> and the targets are of type Posterior, i.e. <pdf-id, posterior> pairs.
    // get feature / target pair,
    Matrix<BaseFloat> mat = feature_reader.Value();
    Posterior targets = targets_reader.Value(utt);
  • Shuffle the training data randomly; it serves as the network input and the expected output:
    const CuMatrixBase<BaseFloat>& nnet_in = feature_randomizer.Value();
    const Posterior& nnet_tgt = targets_randomizer.Value();
    const Vector<BaseFloat>& frm_weights = weights_randomizer.Value();
  • Forward pass: compute the estimate nnet_out.
    // forward pass,
    nnet.Propagate(nnet_in, &nnet_out);
  • Compute the cost. Cross-entropy, mean squared error and multitask objectives are supported; the result is obj_diff.
    // evaluate objective function we've chosen,
    if (objective_function == "xent") {
     // gradients re-scaled by weights in Eval,
     xent.Eval(frm_weights, nnet_out, nnet_tgt, &obj_diff);
    } else if (objective_function == "mse") {
     // gradients re-scaled by weights in Eval,
     mse.Eval(frm_weights, nnet_out, nnet_tgt, &obj_diff);
    }
  • Back-propagate the error and update the parameters.
    if (!crossvalidate) {
     // back-propagate, and do the update,
     nnet.Backpropagate(obj_diff, NULL);
    }
  • One parameter update is complete; continue iterating.
    total_frames += nnet_in.NumRows();
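The forward pass, objective and update can be imitated on a toy problem. The following is a minimal sketch, not Kaldi's implementation: a single affine layer followed by softmax, plain SGD without momentum or frame weights, and hypothetical names throughout. It relies on the standard result that for softmax with cross-entropy the gradient at the output is prediction minus target, which plays the role of obj_diff here.

```python
import math


def softmax(z):
    m = max(z)                                  # subtract max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def forward(W, b, x):
    """Affine transform followed by softmax; W is [num_out][num_in]."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(z)


def sgd_step(W, b, batch, lr):
    """One mini-batch update; returns mean cross-entropy before the update."""
    num_out, num_in = len(W), len(W[0])
    grad_W = [[0.0] * num_in for _ in range(num_out)]
    grad_b = [0.0] * num_out
    loss = 0.0
    for x, target_pdf in batch:
        y = forward(W, b, x)
        loss -= math.log(y[target_pdf])
        for k in range(num_out):
            # Output-layer error: prediction minus one-hot target.
            diff = y[k] - (1.0 if k == target_pdf else 0.0)
            grad_b[k] += diff
            for j in range(num_in):
                grad_W[k][j] += diff * x[j]
    scale = lr / len(batch)
    for k in range(num_out):
        b[k] -= scale * grad_b[k]
        for j in range(num_in):
            W[k][j] -= scale * grad_W[k][j]
    return loss / len(batch)


# Toy data: (feature vector, target pdf-id) pairs, 3 hypothetical pdfs.
batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0] * 3
losses = [sgd_step(W, b, batch, lr=0.5) for _ in range(50)]
print(losses[0], losses[-1])
```

With all-zero weights the first loss equals log 3 (a uniform guess over three classes), and it shrinks as the updates are applied. The real tool runs the same cycle over multiple layers and many mini-batches per epoch.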


accepting: the loss was better, or we had fixed learn-rate, or we had fixed epoch-number



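  • Train a GMM-HMM model, cluster the states, and obtain posteriors over phones (or states).
  • Align the speech data; this gives, for each utterance, the time-ordered correspondence between transition-ids and frame feature vectors.
  • Generate <pdf-id, posterior> pairs as the training targets.
  • Transform the feature vectors of each utterance: take the 5 frames before and the 5 frames after each frame and splice the 11 frames into one higher-dimensional feature vector, which is the network input.
  • Feed the transformed feature vector to the network; after the forward pass and the Softmax layer, this gives the predicted probability of each pdf for that frame.
  • For each pdf, look up the target posterior from the <pdf-id, posterior> pairs and compute the error against the prediction.
  • Back-propagate to update the parameters.
  • Keep iterating until the maximum number of epochs is reached, or until the cross-validation loss is low enough, then stop training.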
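The softmax output is a posterior over HMM states given a frame, whereas HMM decoding needs the likelihood of the frame given the state. Assuming the standard hybrid DNN-HMM formulation, the two are connected by Bayes' rule:

```latex
p(x_t \mid q_t = s_i) = \frac{p(q_t = s_i \mid x_t)\, p(x_t)}{p(s_i)}
```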



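Here x_t is the observation (input) at time t, and q_t = s_i means that the state at time t is s_i. p(x_t) is the probability of the observation itself; it is the same for every state at a given frame, so it has little effect on the result. p(s_i) is the prior probability of s_i, which can be estimated by counting over the corpus. The end result serves the same purpose as the GMM: the output probability of an observed frame feature vector given an HMM state. This gives the schematic below: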
