Neural Machine Translation

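The goal of machine translation is to automatically translate text from one language into another. Given a text sequence in the source language, there is generally no single translation that is the best one.
This is due to the inherent ambiguity and flexibility of human language, which makes automatic machine translation difficult; it is perhaps one of the hardest challenges in artificial intelligence.
The main approaches to machine translation are statistical machine translation and neural machine translation; here we focus on neural machine translation.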

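As the figure above shows, the main task of translation is to learn a mapping from source-side words to target-side words, together with reordering, for example translating "read a book" before "on Sunday".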

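So how do we evaluate translation quality?

Human evaluation by professional translators (more accurate, but time-consuming and labor-intensive)
Automatic evaluation (fast and convenient for model iteration, but has shortcomings)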

Experiment 1

In[36]

# run prediction

!tar -zxf /home/aistudio/data/data13032/ddle_ai_course.t -C /home/aistudio
WORK_PATH = "/home/aistudio/paddle_ai_course"

# decompress pretrained models
!tar -zxf {WORK_PATH}/model_big.tgz -C {WORK_PATH}
!tar -zxf {WORK_PATH}/model_small.tgz -C {WORK_PATH}

!cd {WORK_PATH} && sh infer_small.sh model_small trans_res eval/test_enzh FILE
!cd {WORK_PATH}/eval && sh eval.sh {WORK_PATH}/trans_res/small_trans_res test_reference FILE
!cd {WORK_PATH}/eval && sed -r 's/(@@ )|(@@ ?$)//g' test_enzh > input.tok.txt && head -1 input.tok.txt && head -1 predict.tok.txt

memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 13:58:19.488519  2083 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 13:58:19.492918  2083 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 13:58:19.634228  2083 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 13:58:19.638108  2083 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
BLEU = 22.21, 57.5/29.8/16.6/8.6 (BP=1.000, ratio=1.037, hyp_len=2318, ref_len=2236)
last week the US Senate Finance Committee overwhelmingly approved a bill requiring the Treasury to identify a list of countries with &quot; fundamentally misaligned &quot; currency exchange rates . this opens the door for potential economic sanctions to be brought against Beijing .
上週 ， 美國參議院 財務 委員會 以 壓倒性 多數 批准 了 一項 法案 ， 要求 財政部 肯定 一個 「 根本 錯位 」 匯率 國家 名單 ， 這 開啓 了 可能 對 北京 實施 經濟制裁 的 大門 。

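Automatic Evaluation

In machine translation, the most common automatic evaluation metric is BLEU. Before describing how it is computed, we introduce a few basic concepts:

N-gram

Brevity penalty

The BLEU algorithm

1. N-gram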

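An N-gram model is a statistical language model that represents a sentence as sequences of n consecutive words. It uses the co-occurrence information of adjacent words in context to compute the probability of a sentence, and thereby to judge whether the sentence is fluent.
BLEU also uses N-gram matching: it computes the proportion of n-word groups in the machine translation that also appear in the reference translation.

For example: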

1.1 1-gram

The machine translation contains 6 words, 5 of which match the reference translation, so its match ratio is 5/6.


1.2 2-gram

The match ratio for 2-grams is 3/5.

1.3 3-gram

The match ratio for 3-grams is 1/4, and it is 0 for 4-grams and above.
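The match ratios above can be reproduced with a few lines of Python 3. This is a minimal sketch: the two sentences are hypothetical stand-ins chosen because they yield the same ratios (5/6, 3/5, 1/4, 0), and the helper names `ngrams` and `match_ratio` are illustrative only.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_ratio(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (no clipping yet)."""
    cand = ngrams(candidate.split(), n)
    ref = set(ngrams(reference.split(), n))
    if not cand:
        return 0.0
    return sum(1 for gram in cand if gram in ref) / len(cand)


# Hypothetical sentences, assumed here only to illustrate the ratios in the text.
candidate = "the cat sat on the mat"
reference = "the cat is on the mat"

ratios = [match_ratio(candidate, reference, n) for n in range(1, 5)]
for n, ratio in enumerate(ratios, start=1):
    print(f"{n}-gram match ratio: {ratio:.3f}")
```

This simple version counts a candidate n-gram as matched whenever it appears anywhere in the reference, which is exactly the weakness addressed in section 1.4.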

1.4 Correcting the computation

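However, there are cases where n-gram matching alone fails to reflect the correctness of a translation. For example:

Computing 1-gram matches, every "the" is matched, giving a match ratio of 7/7, which is clearly wrong. BLEU therefore corrects the calculation: the count of each n-gram is clipped to the minimum of its count in the machine translation and its count in the reference translation.

With this correction, the 1-gram match ratio in the example above is 2/7.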

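The n-gram precision is therefore computed as follows (i in the formula denotes the n-gram length, i.e. i-gram):

$$P_i = \frac{\sum_{g \in G_i} \min\big(h_g(c),\, h_g(s)\big)}{\sum_{g \in G_i} h_g(c)}$$

Here $G_i$ is the set of distinct i-grams in the machine translation $c$, and $s$ is the reference translation.

$\min\big(h_g(c), h_g(s)\big)$ is the smaller of the number of times the n-gram $g$ occurs in the machine translation and in the reference translation.

$h_g(c)$ is the number of times the n-gram $g$ occurs in the machine translation.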
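The clipped count described above can be sketched in Python 3 with `collections.Counter`. The sentences are an assumption (a sentence of seven "the" against a short reference), picked because they give the 7/7 and 2/7 figures quoted in the text; the sketch handles a single reference only, and the function names are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


# Assumed example sentences; "the" occurs 7 times in the candidate, 2 times in the reference.
candidate = "the the the the the the the"
reference = "the cat is on the mat"

p1 = clipped_precision(candidate, reference, 1)
print(f"clipped 1-gram precision: {p1:.4f}")
```

Without the `min(...)` clipping, the same candidate would score 7/7; with it, the score drops to 2/7.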

2. Brevity penalty

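Consider another example: the machine translation is "The dog" and the reference translation is "The dog is on the floor." Computed with the formula above, the final score would be 1.
But this translation is incomplete and should in principle receive a low score, so we introduce the following penalty on the score:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

Here c is the number of words in the machine translation and r is the number of words in the reference translation. If c is much smaller than r, BP becomes very small, and so does the final score.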
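As a rough illustration, the penalty can be written as a small Python 3 function. This sketch assumes the standard BLEU definition of BP (1 when the translation is longer than the reference, exp(1 - r/c) otherwise); the function name is illustrative.

```python
import math


def brevity_penalty(c, r):
    """Brevity penalty for a translation of c words against a reference of r words."""
    if c == 0:
        return 0.0  # an empty translation gets no credit
    if c > r:
        return 1.0  # longer than the reference: no penalty
    return math.exp(1.0 - float(r) / c)


# "The dog" (2 words) against "The dog is on the floor" (6 words).
bp_short = brevity_penalty(2, 6)
# Lengths taken from the Experiment 1 output (hyp_len=2318, ref_len=2236).
bp_log = brevity_penalty(2318, 2236)

print(f"BP for the short translation: {bp_short:.3f}")
print(f"BP for the experiment output: {bp_log:.3f}")
```

For "The dog" the penalty is exp(-2), roughly 0.135, so a perfect n-gram precision of 1 is scaled down sharply. For the experiment output the translation is longer than the reference, which agrees with the BP=1.000 printed by the evaluation script.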

3. The BLEU algorithm

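The final BLEU score is computed as:

$$BLEU = BP \cdot \exp\left(\sum_{i=1}^{N} W_i \log P_i\right)$$

Here $W_i$ is the weight of the i-gram precision $P_i$; the weights are usually taken to be equal, $W_i = 1/N$.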
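To check the formula against real numbers, the sketch below (Python 3) recombines the four n-gram precisions printed in the Experiment 1 output (57.5/29.8/16.6/8.6, BP=1.000) with uniform weights. The function name `bleu` is illustrative, and returning 0 when any precision is 0 is an assumption of this sketch (no smoothing).

```python
import math


def bleu(precisions, bp, weights=None):
    """Combine n-gram precisions (fractions in [0, 1]) and a brevity penalty into BLEU."""
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights W_i = 1/N
    if min(precisions) <= 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is taken as 0 here
    log_average = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_average)


# Precisions and BP as printed in the Experiment 1 output.
score = 100 * bleu([0.575, 0.298, 0.166, 0.086], bp=1.0)
print(f"BLEU = {score:.2f}")
```

The result is about 22.2, close to the reported 22.21; the small gap comes from the precisions in the log being rounded to one decimal place.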

Experiment 2

In[37]

# Train the model
!cd {WORK_PATH} && sh train_small.sh

+ export FLAGS_eager_delete_tensor_gb=0.0
+ export FLAGS_fraction_of_gpu_memory_to_use=0.98
+ CUDA_VISIBLE_DEVICES=0 python -u src/train.py --src_vocab_fpath ./data/vocab.source --trg_vocab_fpath ./data/vocab.target --train_file_pattern ./data/translate-train-000* --token_delimiter   --use_token_batch True --batch_size 4096 --sort_type pool --pool_size 200000 --fetch_steps 50 save_freq 50 n_head 8 d_model 256 d_inner_hid 1024 n_layer 3 prepostprocess_dropout 0.1 ckpt_path ./model_small
14
['save_freq', '50', 'n_head', '8', 'd_model', '256', 'd_inner_hid', '1024', 'n_layer', '3', 'prepostprocess_dropout', '0.1', 'ckpt_path', './model_small']
10
[2019-09-20 13:58:34,031 INFO train.py:656] Namespace(batch_size=4096, device='GPU', enable_ce=False, fetch_steps=50, local=True, opts=['save_freq', '50', 'n_head', '8', 'd_model', '256', 'd_inner_hid', '1024', 'n_layer', '3', 'prepostprocess_dropout', '0.1', 'ckpt_path', './model_small'], pool_size=200000, shuffle=True, shuffle_batch=True, sort_type='pool', special_token=['0', '<EOS>', 'UNK'], src_vocab_fpath='./data/vocab.source', sync=True, token_delimiter=' ', train_file_pattern='./data/translate-train-000*', trg_vocab_fpath='./data/vocab.target', update_method='pserver', use_mem_opt=True, use_py_reader=False, use_token_batch=True, val_file_pattern=None)
[2019-09-20 13:58:34,158 INFO train.py:707] before adam
memory_optimize is deprecated. Use CompiledProgram and Executor
[2019-09-20 13:58:40,254 INFO train.py:725] local start_up:
W0920 13:58:41.012140  2131 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 13:58:41.016459  2131 device_context.cc:267] device: 0, cuDNN Version: 7.3.
[2019-09-20 13:58:41,043 INFO train.py:505] load checkpoint from ./model_small
[2019-09-20 13:58:41,165 INFO train.py:512] begin reader
[2019-09-20 13:58:46,721 INFO train.py:539] begin executor
I0920 13:58:46.764124  2131 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 13:58:46.832293  2131 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
[2019-09-20 13:58:46,870 INFO train.py:561] begin train
[2019-09-20 13:58:47,462 INFO train.py:594] step_idx: 0, epoch: 0, batch: 0, avg loss: 3.179328, normalized loss: 1.775048
[2019-09-20 13:58:52,159 INFO train.py:602] step_idx: 50, epoch: 0, batch: 50, avg loss: 3.173360, normalized loss: 1.769080, speed: 10.64 step/s
[2019-09-20 13:58:57,980 INFO train.py:602] step_idx: 100, epoch: 0, batch: 100, avg loss: 3.126150, normalized loss: 1.721869, speed: 8.59 step/s
[2019-09-20 13:59:06,231 INFO train.py:602] step_idx: 150, epoch: 0, batch: 150, avg loss: 3.107801, normalized loss: 1.703520, speed: 6.06 step/s
[2019-09-20 13:59:14,703 INFO train.py:602] step_idx: 200, epoch: 0, batch: 200, avg loss: 3.102083, normalized loss: 1.697803, speed: 5.90 step/s
[2019-09-20 13:59:23,036 INFO train.py:602] step_idx: 250, epoch: 0, batch: 250, avg loss: 3.259453, normalized loss: 1.855173, speed: 6.00 step/s
^C
Traceback (most recent call last):
  File "src/train.py", line 807, in <module>
    train(args)
  File "src/train.py", line 727, in train
    token_num, predict, pyreader)
  File "src/train.py", line 579, in train_loop
    feed=feed_dict_list)
  File "/opt/conda/envs/python27-paddle120-env/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 280, in run
    return_numpy=return_numpy)
  File "/opt/conda/envs/python27-paddle120-env/lib/python2.7/site-packages/paddle/fluid/executor.py", line 666, in run
    return_numpy=return_numpy)
  File "/opt/conda/envs/python27-paddle120-env/lib/python2.7/site-packages/paddle/fluid/executor.py", line 521, in _run_parallel
    tmp.set(tensor, program._places[i])
KeyboardInterrupt

In[38]

# Select a model using the dev set
!cd {WORK_PATH} && rm trans_res/*
!cd {WORK_PATH} && sh infer_small.sh trained_models trans_res eval/dev_enzh DIR
!cd {WORK_PATH}/eval && sh eval.sh ../trans_res dev_reference DIR

memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 13:59:38.703399  2178 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 13:59:38.708161  2178 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 13:59:38.845857  2178 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 13:59:38.849900  2178 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 13:59:45.884814  2216 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 13:59:45.889055  2216 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 13:59:46.049649  2216 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 13:59:46.053624  2216 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 13:59:53.189880  2254 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 13:59:53.194249  2254 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 13:59:53.344911  2254 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 13:59:53.348949  2254 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 14:00:00.514881  2292 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:00:00.518779  2292 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 14:00:00.664440  2292 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:00:00.668455  2292 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 14:00:07.745339  2330 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:00:07.749605  2330 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 14:00:07.892396  2330 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:00:07.896674  2330 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
iter_100.infer.model
BLEU = 11.22, 43.9/16.2/6.9/3.2 (BP=1.000, ratio=1.057, hyp_len=2436, ref_len=2305)
iter_150.infer.model
BLEU = 11.54, 44.0/16.5/7.2/3.4 (BP=1.000, ratio=1.057, hyp_len=2437, ref_len=2305)
iter_200.infer.model
BLEU = 10.84, 43.8/16.0/6.6/3.0 (BP=1.000, ratio=1.050, hyp_len=2420, ref_len=2305)
iter_250.infer.model
BLEU = 11.23, 44.0/16.3/6.9/3.2 (BP=1.000, ratio=1.051, hyp_len=2422, ref_len=2305)
iter_50.infer.model
BLEU = 11.56, 44.0/16.4/7.1/3.5 (BP=1.000, ratio=1.054, hyp_len=2430, ref_len=2305)
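Picking the checkpoint with the highest dev-set BLEU from this output can also be done programmatically. The following Python 3 sketch assumes the evaluation output has exactly the shape shown above (a checkpoint name line followed by a `BLEU = ...` line); `best_checkpoint` is a hypothetical helper, not part of the course scripts.

```python
import re

# Dev-set results copied from the output above.
log_text = """\
iter_100.infer.model
BLEU = 11.22, 43.9/16.2/6.9/3.2 (BP=1.000, ratio=1.057, hyp_len=2436, ref_len=2305)
iter_150.infer.model
BLEU = 11.54, 44.0/16.5/7.2/3.4 (BP=1.000, ratio=1.057, hyp_len=2437, ref_len=2305)
iter_200.infer.model
BLEU = 10.84, 43.8/16.0/6.6/3.0 (BP=1.000, ratio=1.050, hyp_len=2420, ref_len=2305)
iter_250.infer.model
BLEU = 11.23, 44.0/16.3/6.9/3.2 (BP=1.000, ratio=1.051, hyp_len=2422, ref_len=2305)
iter_50.infer.model
BLEU = 11.56, 44.0/16.4/7.1/3.5 (BP=1.000, ratio=1.054, hyp_len=2430, ref_len=2305)
"""


def best_checkpoint(text):
    """Return (checkpoint_name, bleu) for the highest BLEU found in the text."""
    scores = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith(".infer.model"):
            name = line  # remember which checkpoint the next BLEU line belongs to
            continue
        match = re.match(r"BLEU = ([0-9.]+)", line)
        if match and name is not None:
            scores[name] = float(match.group(1))
    best = max(scores, key=scores.get)
    return best, scores[best]


best_name, best_bleu = best_checkpoint(log_text)
print(best_name, best_bleu)
```

Here the winner is iter_50.infer.model, which is the checkpoint used in the next cell. The gaps between checkpoints are only a few tenths of a BLEU point, so the ranking should be read with caution.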

In[39]

# Run prediction with the selected model and check its performance on the test set
!cd {WORK_PATH} && rm trans_res/*
!cd {WORK_PATH} && sh infer_small.sh trained_models/iter_50.infer.model trans_res eval/test_enzh FILE
!cd {WORK_PATH}/eval && sh eval.sh {WORK_PATH}/trans_res/small_trans_res test_reference FILE
!cd {WORK_PATH}/eval && sed -r 's/(@@ )|(@@ ?$)//g' test_enzh > input.tok.txt && head -1 input.tok.txt && head -1 predict.tok.txt

memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 14:00:24.761777  2385 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:00:24.765815  2385 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 14:00:24.912528  2385 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:00:24.916476  2385 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
BLEU = 22.04, 57.3/29.6/16.4/8.5 (BP=1.000, ratio=1.038, hyp_len=2320, ref_len=2236)
last week the US Senate Finance Committee overwhelmingly approved a bill requiring the Treasury to identify a list of countries with &quot; fundamentally misaligned &quot; currency exchange rates . this opens the door for potential economic sanctions to be brought against Beijing .
上週 ， 美國參議院 財務 委員會 以 壓倒性 多數 批准 了 一項 法案 ， 要求 財政部 肯定 一個 「 根本 錯位 」 匯率 國家 名單 ， 這 開啓 了 可能 對 北京 實施 經濟制裁 的 大門 。

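Understanding the Transformer

This section introduces the overall flow of the Transformer and the self-attention mechanism.

The overall architecture has two parts. The encoder produces a semantic vector representation of the source-language input; the decoder generates the target-side sentence. As the figure shows, the Transformer is built mainly from Multi-Head Attention and MLP blocks. Nx in the figure means the block is repeated N times, i.e. N layers are stacked.
Let us look at how Multi-Head Attention works:

Multi-Head Attention splits the inputs Q, K and V into multiple channels (heads), computes Scaled Dot-Product Attention on each channel separately, and concatenates the results. The main purpose of Scaled Dot-Product Attention is to use Q and K to compute weights over the values in V. In translation, this means: when translating the T-th target word, which source words should I focus on to decide the translation?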
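The split, attend and concatenate steps can be sketched in plain Python 3 with lists instead of tensors. This is a simplified illustration, not the code in src/model.py: it omits the learned projection matrices, masking and dropout that a real Transformer layer applies, and all function names are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: [n_q][d], K: [n_k][d], V: [n_k][d_v] -> one output row of size d_v per query."""
    d = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs


def split_heads(X, n_head):
    """Split the feature dimension of X into n_head equal slices (one matrix per head)."""
    d_head = len(X[0]) // n_head
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(n_head)]


def multi_head_attention(Q, K, V, n_head):
    """Run attention independently per head, then concatenate along the feature dimension."""
    heads = [
        scaled_dot_product_attention(q, k, v)
        for q, k, v in zip(split_heads(Q, n_head), split_heads(K, n_head), split_heads(V, n_head))
    ]
    return [sum((head[i] for head in heads), []) for i in range(len(Q))]


# Toy input: one target-side query attending over two source-side positions.
Q = [[1.0, 0.0, 0.0, 1.0]]
K = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
V = K
out = multi_head_attention(Q, K, V, n_head=2)
print(out)
```

In the toy input the query is most similar to the first source position in both heads, so the first source position receives the larger attention weight.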

BeamSearch

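During decoding, the search space is huge (exponential), so pruning is generally used to reduce it. A common algorithm is BeamSearch; the figure below shows an example with a beam size of 2.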
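The pruning idea can be sketched in a few lines of Python 3. The "model" here is a hypothetical lookup table of next-token probabilities (`TABLE`, `toy_step`), standing in for the Transformer decoder; length normalization and batching, which practical decoders usually add, are left out.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def toy_step(prefix):
    """Return log-probabilities of the next token; unknown prefixes can only end."""
    probs = TABLE.get(prefix, {"<EOS>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(step_fn, beam_size, max_len, eos="<EOS>"):
    """Keep only the beam_size best partial hypotheses at every step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Prune: sort all expansions and keep the top beam_size.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda item: item[1])


greedy_seq, greedy_score = beam_search(toy_step, beam_size=1, max_len=5)
beam_seq, beam_score = beam_search(toy_step, beam_size=2, max_len=5)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

With a beam of 1 (greedy search) the decoder commits to "a" and ends with probability 0.3. With a beam of 2 it keeps "b" alive for one more step and finds the better sequence "b z" with probability 0.36.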

In summary, Transformer + BeamSearch is enough to perform prediction for sequence generation tasks.

Deeper/Bigger is Better

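With the Transformer architecture, a wider (hidden_size) and deeper (number of layers) model generally changes results significantly. Next, we simply increase the hidden_size, layers and heads parameters and observe how BLEU changes on the English-to-Chinese translation task.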
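To get a feel for how much larger the wide model is, the sketch below (Python 3) estimates the weight count of one encoder layer from the d_model and d_inner_hid values passed to the two training commands in this tutorial (256/1024 for the small model, 1024/4096 for the big one). It assumes the standard Transformer layer layout (four d_model x d_model attention projections plus a two-layer feed-forward block) and ignores biases, layer normalization and embeddings; whether src/model.py matches this layout exactly is an assumption.

```python
def encoder_layer_params(d_model, d_inner_hid):
    """Approximate weight count of one standard Transformer encoder layer."""
    attention = 4 * d_model * d_model         # Q, K, V and output projections
    feed_forward = 2 * d_model * d_inner_hid  # two linear maps of the feed-forward block
    return attention + feed_forward


small = encoder_layer_params(d_model=256, d_inner_hid=1024)
big = encoder_layer_params(d_model=1024, d_inner_hid=4096)

print(f"small model, per layer: {small:,}")
print(f"big model, per layer:   {big:,}")
print(f"ratio: {big / small:.1f}x")
```

Under these assumptions a layer of the big model holds 16 times as many weights as a layer of the small one, which is consistent with the much lower training speed (steps per second) seen in the big-model log.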

Go Further

bpe (a newer word segmentation method that greatly alleviates the OOV problem):
https://arxiv.org/abs/1508.07909;
github: https://github.com/rsennrich/subword-nmt
T2T model paper: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Modify the model architecture: src/model.py -> def dense_encoder
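The experiment cells in this tutorial undo the BPE segmentation with `sed -r 's/(@@ )|(@@ ?$)//g'`, which strips the `@@` continuation markers that subword-nmt appends to non-final subword pieces. A Python 3 equivalent of that one substitution might look like the sketch below; the segmented input strings are hypothetical examples, not taken from the course data.

```python
import re

# Same pattern as the sed command: "@@ " inside a line, or "@@" at the end of a line.
BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line):
    """Join BPE subword pieces back into whole words."""
    return BPE_MARKER.sub("", line)


# Hypothetical BPE-segmented text for illustration.
restored = remove_bpe("fund@@ ament@@ ally mis@@ aligned currency exchange rates")
print(restored)
```

Because rare words are broken into frequent subword pieces before translation and joined again afterwards, the model rarely has to fall back to an unknown-word token.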

Experiment 3

In[40]

# Train a deeper and wider model
!cd {WORK_PATH} && rm -rf trained_*
!cd {WORK_PATH} && sh train.sh

+ export FLAGS_eager_delete_tensor_gb=0.0
+ export FLAGS_fraction_of_gpu_memory_to_use=0.98
+ CUDA_VISIBLE_DEVICES=0 python -u src/train.py --src_vocab_fpath ./data/vocab.source --trg_vocab_fpath ./data/vocab.target --train_file_pattern ./data/translate-train-000* --token_delimiter   --use_token_batch True --batch_size 4096 --sort_type pool --pool_size 200000 --fetch_steps 10 save_freq 20 n_head 16 d_model 1024 d_inner_hid 4096 prepostprocess_dropout 0.3 ckpt_path ./model_big
12
['save_freq', '20', 'n_head', '16', 'd_model', '1024', 'd_inner_hid', '4096', 'prepostprocess_dropout', '0.3', 'ckpt_path', './model_big']
10
[2019-09-20 14:00:42,344 INFO train.py:656] Namespace(batch_size=4096, device='GPU', enable_ce=False, fetch_steps=10, local=True, opts=['save_freq', '20', 'n_head', '16', 'd_model', '1024', 'd_inner_hid', '4096', 'prepostprocess_dropout', '0.3', 'ckpt_path', './model_big'], pool_size=200000, shuffle=True, shuffle_batch=True, sort_type='pool', special_token=['0', '<EOS>', 'UNK'], src_vocab_fpath='./data/vocab.source', sync=True, token_delimiter=' ', train_file_pattern='./data/translate-train-000*', trg_vocab_fpath='./data/vocab.target', update_method='pserver', use_mem_opt=True, use_py_reader=False, use_token_batch=True, val_file_pattern=None)
[2019-09-20 14:00:42,665 INFO train.py:707] before adam
memory_optimize is deprecated. Use CompiledProgram and Executor
[2019-09-20 14:01:04,859 INFO train.py:725] local start_up:
W0920 14:01:05.594372  2435 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:01:05.598951  2435 device_context.cc:267] device: 0, cuDNN Version: 7.3.
[2019-09-20 14:01:05,667 INFO train.py:505] load checkpoint from ./model_big
[2019-09-20 14:01:06,284 INFO train.py:512] begin reader
[2019-09-20 14:01:11,938 INFO train.py:539] begin executor
I0920 14:01:12.019623  2435 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:01:12.219225  2435 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
[2019-09-20 14:01:12,312 INFO train.py:561] begin train
[2019-09-20 14:01:13,236 INFO train.py:594] step_idx: 0, epoch: 0, batch: 0, avg loss: 2.630473, normalized loss: 1.226193
[2019-09-20 14:01:19,283 INFO train.py:602] step_idx: 10, epoch: 0, batch: 10, avg loss: 2.690705, normalized loss: 1.286424, speed: 1.65 step/s
[2019-09-20 14:01:25,591 INFO train.py:602] step_idx: 20, epoch: 0, batch: 20, avg loss: 2.393026, normalized loss: 0.988745, speed: 1.59 step/s
[2019-09-20 14:01:39,230 INFO train.py:602] step_idx: 30, epoch: 0, batch: 30, avg loss: 2.518420, normalized loss: 1.114139, speed: 0.73 step/s
[2019-09-20 14:01:45,339 INFO train.py:602] step_idx: 40, epoch: 0, batch: 40, avg loss: 2.413766, normalized loss: 1.009486, speed: 1.64 step/s
[2019-09-20 14:02:40,083 INFO train.py:602] step_idx: 50, epoch: 0, batch: 50, avg loss: 2.590959, normalized loss: 1.186678, speed: 0.18 step/s
^C
Traceback (most recent call last):
  File "src/train.py", line 807, in <module>
    train(args)
  File "src/train.py", line 727, in train
    token_num, predict, pyreader)
  File "src/train.py", line 575, in train_loop
    init_flag, dev_count)
  File "src/train.py", line 375, in prepare_feed_dict_list
    ModelHyperParams.d_model)
  File "src/train.py", line 250, in prepare_batch_input
    [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False)
  File "src/train.py", line 233, in pad_batch_data
    return_list += [slf_attn_bias_data.astype("float32")]
KeyboardInterrupt

In[41]

# Select a model using the dev set and verify the result on the test set
!cd {WORK_PATH} && rm trans_res/*
!cd {WORK_PATH} && sh infer.sh trained_models trans_res eval/dev_enzh DIR
!cd {WORK_PATH}/eval && sh eval.sh ../trans_res dev_reference DIR

memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 14:02:52.170383  2482 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:02:52.174947  2482 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 14:02:52.953883  2482 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:02:52.961632  2482 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 14:03:05.511006  2520 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:03:05.514842  2520 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 14:03:06.248040  2520 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:03:06.256811  2520 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
iter_20.infer.model
BLEU = 14.35, 47.1/19.3/9.4/5.0 (BP=1.000, ratio=1.046, hyp_len=2412, ref_len=2305)
iter_40.infer.model
BLEU = 14.51, 47.1/19.5/9.5/5.1 (BP=1.000, ratio=1.046, hyp_len=2411, ref_len=2305)

In[44]

# Run prediction with the selected model and check its performance on the test set
!cd {WORK_PATH} && rm trans_res/*
!cd {WORK_PATH} && sh infer.sh trained_models/iter_40.infer.model trans_res eval/test_enzh FILE
!cd {WORK_PATH}/eval && sh eval.sh {WORK_PATH}/trans_res/big_trans_res test_reference FILE
!cd {WORK_PATH}/eval && sed -r 's/(@@ )|(@@ ?$)//g' test_enzh > input.tok.txt && head -1 input.tok.txt && head -1 predict.tok.txt

memory_optimize is deprecated. Use CompiledProgram and Executor
W0920 14:05:38.702764  2713 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W0920 14:05:38.706779  2713 device_context.cc:267] device: 0, cuDNN Version: 7.3.
I0920 14:05:39.428685  2713 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0920 14:05:39.436707  2713 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1

Results

	Base Model	Big Model
BLEU	22.21	30.39

As shown, the wider and deeper model improves results substantially.


Click the link to try this hands-on project in AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/120044

Download and install commands

## Install command for the CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## Install command for the GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

