Facebook論文：爲實現跨語種Zero-Shot遷移的巨量多語言句子Embeddings

時間 2019-12-12

標籤論文實現語種 zero shot 遷移巨量多語言句子 embeddings 欄目硅谷简体版

原文原文鏈接

Mikel Artetxe
Holger Schwenk (Facebook)
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyondhtml

Abstract

本文介紹了一種能夠學習多語言句子表示的方法，可用於30多個語種，93種語言and written in 28 different scripts.
系統用了全部語言共享BPE詞彙表的單BiLSTM 編碼器，同時又在parallel corpora上訓練的auxiliary解碼器。
這種技術容許咱們只在英語上annotated data訓練出的句子embedding模型的基礎上訓練分類器，而後遷移到其餘93種語言上，不須要任何修改ios

它由編碼器（encoder）、解碼器（decoder）兩大部分組成。其中，編碼器是個無關語種的BiLSTM，負責構建句嵌入，這些句嵌入接下來會經過線性變來換初始化LSTM解碼器。爲了讓這樣一對編碼器、解碼器能處理全部語言，還有個小條件：編碼器最好不知道輸入的到底是什麼語言，這樣才能學會獨立於語種的表示。因此，還要從全部輸入語料中學習出一個「比特對嵌入詞庫」（BPE）。
不過，解碼器又有着徹底相反的需求：它得知道輸入的到底是什麼語言，才能得出相應的輸出。因而，Facebook就爲解碼器附加了一項輸入：語言ID，也就是上圖的Lid
訓練這樣一個系統，Facebook用了16個英偉達V100 GPU，將batch size設置爲12.8萬個token，花5天時間訓練了17個週期。
用包含14種語言的跨語種天然語言推斷數據集（cross-lingual natural language inference，簡稱XNLI）來測試，這種多語種句嵌入（上圖的Proposed method）零數據（Zero-Shot）遷移成績，在其中13種語言上都創造了新紀錄，只有西班牙語例外。另外，Facebook用其餘任務測試了這個系統，包括ML-Doc數據集上的分類任務、BUCC雙語文本數據挖掘。他們還在收集了衆多外語學習者翻譯例句的Tatoeba數據集基礎上，製造了一個122種語言對齊句子的測試集，來證實自家算法在多語言類似度搜索任務上的能力。
http://www.sohu.com/a/2854308...

BPE vocabulary （Byte Pair Encoding：Byte pair encoding是一種簡單的數據壓縮技術，它把句子中常常出現的字節pairs用一個沒有出現的字節去替代。）

A new state-of-the-art on zero-shot cross-lingual natural language inference for all the 14 languages in the XNLI dataset but one.web

The Cross-lingual Natural Language Inference ( XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations. The corpus is made to evaluate how to perform inference in any language (including low-resources ones like Swahili or Urdu) when only English NLI data is available at training time. One solution is cross-lingual sentence encoding, for which XNLI is an evaluation benchmark.

Also achieve very competitive results in cross-lingual document classification(MLDoc dataset)
Our sentence embeddings are also strong at parallel corpus mining, establishing a new state-of-the-art in the BUCC shared task for 3 of its 4 language pairs.
也製做了一個新的測試集of aligned sentences in 122 languages based on the Tatoeba corpus and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages.
Our Pytorch implementation, pre-trained encoder and the multilingual test set will be freely available.算法

Natural language inference
Natural language inference is the task of determining whether a 「hypothesis」 is true (entailment), false (contradiction), or undetermined (neutral) given a 「premise」.

MultiNLI
The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. The data can be downloaded from the MultiNLI website.
SciTail
The SciTail entailment dataset consists of 27k. In contrast to the SNLI and MultiNLI, it was not crowd-sourced but created from sentences that already exist 「in the wild」. Hypotheses were created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus were used as premises. Models are evaluated based on accuracy.api

Public leaderboards for in-genre (matched) and cross-genre (mismatched) evaluation are available, but entries do not correspond to published models.app

State-of-the-art results can be seen on the SNLI website.less

SNLI：The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy.

Introduction

The advance techiniques in NLP are known to be particularly data hungry, limiting their applicability in many practical scenarios.
An increasingly popular approach to alleviate this issue is to first learn general language representations on unlabeled data, which are then integrated in task-specific downstream systems
This approach was first popularized by word embeddings, but has recently been superseded by sentence-level representaions.
Nevertheless, all these works learn a separate model for each language and are thus unable to leverage information across different languages, greatly limiiting their potential performance for low-resource languges
Universal langauage agnostic sentence embeddings, that is, vertor representations of sentences that are general with respect to two dimensions: the input language and the NLP tasks機器學習

因爲語料不足，因此你們都如今無監督學習數據表徵，如word embeddings，如今轉向了sentence embeddings。你們也都在嘗試跨語言和多任務學習。分佈式

The hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (e.g. English) to another,
And the possibility to handle code-switching.

We achieve the this by using a single encoder that can handle multiple languages, so that senantically similar sentences in different languages are close in the resulting embedding space.ide

語碼轉換
說明語碼轉換是一個常見的語言現象，指一我的在一個對話中交替使用多於一種語言或其變體。此現象是衆多語言接觸現象之一，常出現於多語者的平常語言。除了平常語言的對話，語碼轉換也出現於文字書寫中。「語碼轉換」之討論一定會牽涉「雙語」以內容。語碼轉換的語料中可見兩種以上語言在語音、句法結構等多方面的相互影響。

Contributions

We learn one shared encoder that can handle 93 different languages. All languages are jointly embedded in a shared space, in contrast to most other works which usally consider separate English/foreign alignments.
Cross-lingual 1)natural lanuage inference (XNLI dataset) and 2)classification (ML-Doc dataset), 3)bitext mining (BUCC dataset) and 4)multilingual similarity search (Tatoeba dataset)

推理、分類、bitext，多語言類似搜索

We define a new test set based on the freely available Tatoeba corpus and provide baseline results for 122 languages. We report accuracy for multilingual similarity search on this test set, but the corpus could also be used for MT evaluation.

Tatoeba
English-German Sentence Translation Database (Manythings/Tatoeba)The Tatoeba Project is also run by volunteers and is set to make the most bilingual sentence translations available between many different languages.Manythings.org compiles the data and makes it accessible. http://www.manythings.org/cor...

Bitext API是另外一個深度語言分析工具，提供易於導出到各類數據管理工具的數據。該平臺產品可用於聊天機器人和智能助手、CS和Sentiment，以及一些其餘核心NLP任務。這個API的重點是語義、語法、詞典和語料庫，可用於80多種語言。此外，該API是客戶反饋分析自動化方面的最佳API之一。該公司聲稱能夠將洞察的準確度作到90%
文檔: https://docs.api.bitext.com/
Demo: http://parser.bitext.com/
強烈推薦20個必須瞭解的API，涉及機器學習、NLP和人臉檢測

Related Work

Word Embeddings (Distributed Representations of Words and Phrases and their Compositionality)
Glove GloVe: Global Vectors for Word Representation

There has been an increasing interest in learning continuous vector representations of longer linguistic units like sentences.
These sentence embeddins are commonly obtained using a Recurrent Neural Network (RNN) encoder, which is typically trained in an unsupervised way over large collections of unlabelled corpora.

相關知識補充

1、文本表示和各詞向量間的對比
一、文本表示哪些方法？
下面對文本表示進行一個概括，也就是對於一篇文本能夠如何用數學語言表示呢？
基於one-hot、tf-idf、textrank等的bag-of-words；
主題模型：LSA（SVD）、pLSA、LDA；
基於詞向量的固定表徵：word2vec、fastText、glove
基於詞向量的動態表徵：elmo、GPT、bert
二、怎麼從語言模型理解詞向量？怎麼理解分佈式假設？
上面給出的4個類型也是nlp領域最爲經常使用的文本表示了，文本是由每一個單詞構成的，而談起詞向量， one-hot是可認爲是最爲簡單的詞向量，但存在維度災難和語義鴻溝等問題； 經過構建共現矩陣並利用SVD求解構建詞向量，則計算複雜度高；而早期詞向量的研究一般來源於語言模型，好比NNLM和RNNLM，其主要目的是語言模型，而詞向量只是一個副產物。

所謂分佈式假設，用一句話能夠表達：相同上下文語境的詞有似含義。而由此引伸出了word2vec、fastText，在此類詞向量中，雖然其本質仍然是語言模型，可是它的目標並非語言模型自己，而是詞向量，其所做的一系列優化，都是爲了更快更好的獲得詞向量。glove則是基於全局語料庫、並結合上下文語境構建詞向量，結合了LSA和word2vec的優勢。
三、傳統的詞向量有什麼問題？怎麼解決？各類詞向量的特色是什麼？
上述方法獲得的詞向量是固定表徵的，沒法解決一詞多義等問題，如「川普」。爲此引入基於語言模型的動態表徵方法：elmo、GPT、bert。
各類詞向量的特色：
（1）One-hot 表示：維度災難、語義鴻溝；
（2）分佈式表示 (distributed representation) ：

矩陣分解（LSA）：利用全局語料特徵，但SVD求解計算複雜度大；
基於NNLM/RNNLM的詞向量：詞向量爲副產物，存在效率不高等問題；
word2vec、fastText：優化效率高，可是基於局部語料；
glove：基於全局預料，結合了LSA和word2vec的優勢；
elmo、GPT、bert：動態特徵；

五、word2vec和fastText對比有什麼區別？（word2vec vs fastText）
1）均可以無監督學習詞向量， fastText訓練詞向量時會考慮subword；
2） fastText還能夠進行有監督學習進行文本分類，其主要特色：
結構與CBOW相似，但學習目標是人工標註的分類結果；
採用hierarchical softmax對輸出的分類標籤創建哈夫曼樹，樣本中標籤多的類別被分配短的搜尋路徑；
引入N-gram，考慮詞序特徵；
引入subword來處理長詞，處理未登錄詞問題；

六、glove和word2vec、 LSA對比有什麼區別？（word2vec vs glove vs LSA）
1）glove vs LSA
LSA（Latent Semantic Analysis）能夠基於co-occurance matrix構建詞向量，實質上是基於全局語料採用SVD進行矩陣分解，然而SVD計算複雜度高；
glove可看做是對LSA一種優化的高效矩陣分解算法，採用Adagrad對最小平方損失進行優化；
2）word2vec vs glove
word2vec是局部語料庫訓練的，其特徵提取是基於滑窗的；而glove的滑窗是爲了構建co-occurance matrix，是基於全局語料的，可見glove須要事先統計共現機率；所以，word2vec能夠進行在線學習，glove則須要統計固定語料信息。
word2vec是無監督學習，一樣因爲不須要人工標註；glove一般被認爲是無監督學習，但實際上glove仍是有label的，即共現次數
$log(X_{ij})$。
word2vec損失函數實質上是帶權重的交叉熵，權重固定；glove的損失函數是最小平方損失函數，權重能夠作映射變換。
整體來看，glove能夠被看做是更換了目標函數和權重函數的全局word2vec。

elmo、GPT、bert三者之間有什麼區別？（elmo vs GPT vs bert）
以前介紹詞向量均是靜態的詞向量，沒法解決一次多義等問題。下面介紹三種elmo、GPT、bert詞向量，它們都是基於語言模型的動態詞向量。下面從幾個方面對這三者進行對比：
（1）特徵提取器：elmo採用LSTM進行提取，GPT和bert則採用Transformer進行提取。不少任務代表Transformer特徵提取能力強於LSTM，elmo採用1層靜態向量+2層LSTM，多層提取能力有限，而GPT和bert中的Transformer可採用多層，並行計算能力強。
（2）單/雙向語言模型：
GPT採用單向語言模型，elmo和bert採用雙向語言模型。可是elmo其實是兩個單向語言模型（方向相反）的拼接，這種融合特徵的能力比bert一體化融合特徵方式弱。
GPT和bert都採用Transformer，Transformer是encoder-decoder結構，GPT的單向語言模型採用decoder部分，decoder的部分見到的都是不完整的句子；bert的雙向語言模型則採用encoder部分，採用了完整句子。

深刻解剖word2vec

見知乎: nlp中的詞向量對比：word2vec/glove/fastText/elmo/GPT/bert

Motivation

Skip-thought model 2015 couple the encoder with an auxiliary decoder, and train the entire system end-to-end to predict the surrounding sentences over a large collection of books.
It was later shown that more competitive results could be obtained by training the encoder over labeled Natural Language INference (NLI) data 2017
This was recently extended to multitask learning , combining different training objectives like that of skip-though, NLI and manchine traslation 2018.

we introduce auxiliary decoders: separate decoder models which are only used to provide alearning signal to the encoders.
Hierarchical Autoregressive Image Models with Auxiliary Decoders

While the previous methods consider a single language at a time, multilingual representaions have attracted a large attention in recent times.

Most of research focuses on cross-lingual word embeddings 2017 which are commonly learned jointly from parallel corpora 2015.
An alternative approach that is becoming increasingly polular is to train word embeddings independently for each language over monoligual corpora, and then map them to a shared space based on a bilingual dictionary 2013 2018.
Cross-lingual word embeddins are often used to build bag-of-word representations of longer linguistic units by taking their centroid 2012.

While this approach has the advantage of requireing a weak (or even no) cross-lingual signal, it has been shown that the resulting sentence embeddings words raher poorly in practical cross-lingual transfer settings 2018.

A more competitive approach that we follow here is to use a sequence-to-sequence encoder-decoder architecture.

The full system is trained end-to-end on parallel corpora akin to neural machine translation: the encoder maps the source sequence into a fixed-length vector representaion, which is used by the decoder to create the target sequence.
This decoder is then discarded, and the encoder is kept to embed sentences in any of the training languages
While some proposals use a separate encoder for each language 2018 sharing a single encoder for all languages also gives strong results.

Nevertheless, most existing work is either limited to few, rather close languages or more commonly, consider pairwise joint embeddings with English and one foreign luaguage only.
Tothe best of our knowledge, all existing word on learning multilingual representations for a large number of languages is limited to word embeddings, our being the first paper exporing massively multilingual sentence representatios.
While all the previous approaches learn a fixed-length representation for each sentence, a recent research line has obtained very strong results using variable-length representation instead, consisting of contextualized embeddings of the words in the sentence.

For that purpose, these methods train either an RNN or self-attentional encoder over unnaotated corpora using some form of language modeling. A classifier can then learned on top of the resulting encoder,
which is commnly further fine-tuned during this supervised training.
Despite the strong performance of these approaches in monolingual settings, we argue that fixed-length approaches provide a more generic, flexible and compatible representation form for our multilingual scenario,
and our model indeed outperforms the multilingual BERT modelin zero-shot transfer

Proposed method

做者用了一個single, language agnostic的BiLSTM encoder來構建sentence embeddings，並一併在parallel corpora上生成了auxiliary decoder。

laser 主要原理
laser 主要原理是將全部語種的用多層bi-lstm encode，最後state 拿出來，而後用max-pooling 變成固定維度的向量，用這個向量去decode，訓練時候翻譯成2個語種，論文說1個語種的效果很差，翻譯成2個目標語種就行，也不必定全部的都須要翻譯成2個，大部分翻譯就行。而後在下游應用，把encoder拿回來用，decoder就沒啥用了。
而後發現語料少的語種在和語料的多一塊兒訓練過程當中有受益。
知乎：Google bert 和Facebook laser 的對比

As it can be seen, sentence embeddings are obtained by applying a max-pooling operation over the output of a BiLSTM encoder.
These sentence embeddings are used to initialize the decoder LSTM through a linear transformation, and are also concatenated to its input embeddings at every time step.
Note that there is no other connection between the encoder and the decoder, as we want all relevent information of the input sequence to captured by the sentence embedding
For the purpose, we build a joint byte-pair encoding (BPE) vocabulary with 50k operations, which is learned on the concatentaion of all training corpora.
This way, the encoder has no explicit signal on what the input language is, encouraging it to learn language is, encourageing it to learn language independent representations.
In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step.

*In this paper, we limit our study to a stacked BiLSTM with 1 to 5 layers, each 512-dimensional.
The resulting sentence represtations (after concatenating both directions) are 1024 dimensional.
The decoder has always one layer of dimension 2048. The input embedding size is set to 320, while the language ID embedding has 32 dimensions.

In preceding work, each sentence at the input was jontly translated into all other languges. While this approach was shown to learn high-quality representaions,
it poses two obvious drawbacks when trying to scale to a large number of languages.

First, it requires an N-way parallel corpus, which is difficult to obtain for all languages.
Second, it has a quadratic cost with respect to the number of languages, making training prohibitively slow as the number of languages is increased.

In our preliminary experiments, we observed that similar results can be obtained by using less target languages - two seem to be enough. (Note that, if we had a single target language, the only way to train the encoder for that language would be auto-encoding, which we observe to work poorly. Having two target languages avoids this problem.)

At the same time, we relax the requirement for N-way parallel corpora by considering independent alignments with the two target languages, e.g. we do not require each source sentence to be translated into two target languages.
Training minimizes the cross-entropy loss on the training corpus, alternating over all combinations of the languages involved.
For thea purpose, we use Adam with a constant learning rate of 0.001 and dropout set to 0.1, and train for a fixed number of epochs.(Implementation based on fairseq)
With a total batch size of 128,000 tokens. Unless otherwise specified, we train our model for 17 epochs, which takes about 5 days. Stopping traiing early decreases the overall performance only slightly.