Page 1
Published as a conference paper at ICLR 2017
AS IMPLE BUT T OUGH - TO -B EAT B ASELINE FOR S EN -
TENCE E MBEDDINGS
Sanjeev Arora, Yingyu Liang, Tengyu Ma
Princeton University
{arora,yingyul,tengyu}@cs.princeton.edu
A BSTRACT
The success of neural network methods for computing word embeddings has mo-
tivated methods for generating semantic embeddings of longer pieces of text, such
as sentences and paragraphs. Surprisingly, Wieting et al (ICLR'16) showed that
such complicated methods are outperformed, especially in out-of-domain (transfer
learning) settings, by simpler methods involving mild retraining of word embed-
dings and basic linear regression. The method of Wieting et al. requires retraining
with a substantial labeled dataset such as Paraphrase Database (Ganitkevitch et
al., 2013).
The current paper goes further, showing that the following completely unsuper-
vised sentence embedding is a formidable baseline: Use word embeddings com-
puted using one of the popular methods on unlabeled corpus like Wikipedia, rep-
resent the sentence by a weighted average of the word vectors, and then modify
them a bit using PCA/SVD. This weighting improves performance by about 10%
to 30% in textual similarity tasks, and beats sophisticated supervised methods in-
cluding RNN's and LSTM's. It even improves Wieting et al.'s embeddings. This
simple method should be used as the baseline to beat in future, especially when
labeled training data is scarce or nonexistent.
The paper also gives a theoretical explanation of the success of the above unsu-
pervised method using a latent variable generative model for sentences, which is
a simple extension of the model in Arora et al. (TACL'16) with new "smoothing"
terms that allow for words occurring out of context, as well as high probabilities
for words like and, not in all contexts.
1 I NTRODUCTION
Word embeddings computed using diverse methods are basic building blocks for Natural Language
Processing (NLP) and Information Retrieval (IR). They capture the similarities between words
(eg, (Bengio et al ., 2003; Collobert & Weston, 2008; Mikolov et al., 2013a; Pennington et al.,
2014) ). Recent work has tried to compute embeddings that capture the semantics of word sequences
(phrases, sentences, and paragraphs), with methods ranging from simple additional composition of
the word vectors to sophisticated architectures such as convolutional neural networks and recurrent
neural networks (eg, ( Iyyer et al ., 2015; Le & Mikolov, 2014; Kiros et al., 2015; Socher et al.,
2011; Blunsom et al., 2014; Tai et al., 2015; Wang et al., 2016)). Recently, (Wieting et al., 2016)
learned general-purpose, paraphrastic sentence embeddings by starting with standard word embed-
dings and modifying them based on supervision from the Paraphrase pairs dataset (PPDB), and
constructing sentence embeddings by training a simple word averaging model. This simple method
leads to better performance on textual similarity tasks than a wide variety of methods and serves as a
good initialization for textual classification tasks. However, supervision from the paraphrase dataset
seems crucial, since they report that simple average of the initial word embeddings does not work
very well.
Here we give a new sentence embedding method that is embarrassingly simple: just compute the
weighted average of the word vectors in the sentence and then remove the projections of the average
vectors on their first principal component (「common component removal」). Here the weight of a
word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency; we call
1
Page 2
Published as a conference paper at ICLR 2017
this smooth inverse frequency (SIF). This method achieves significantly better performance than the
unweighted average on a variety of textual similarity tasks, and on most of these tasks even beats
some sophisticated supervised methods tested in (Wieting et al ., 2016), including some RNN and
LSTM models. The method is well-suited for domain adaptation settings, ie, word vectors trained
on various kinds of corpora are used for computing the sentence embeddings in different testbeds.
It is also fairly robust to the weighting scheme: using the word frequencies estimated from different
corpora does not harm the performance; a wide range of the parameters a can achieve close-to-best
results, and an even wider range can achieve significant improvement over unweighted average.
Of course, this SIF reweighting is highly reminiscent of TF-IDF reweighting from information
retrieval (Sparck Jones , 1972; Robertson, 2004) if one treats a 「sentence」 as a 「document」 and
make the reasonable assumption that the sentence doesn't typically contain repeated words. Such
reweightings (or related ideas like removing frequent words from the vocabulary) are a good rule of
thumb but has not had theoretical justification in a word embedding setting.
The current paper provides a theoretical justification for the reweighting using a generative model for
sentences, which is a simple modification for the Random Walk on Discourses model for generating
text in (Arora et al ., 2016). In that paper, it was noted that the model theoretically implies a sentence
embedding, namely, simple average of embeddings of all the words in it.
We modify this theoretical model, motivated by the empirical observation that most word embedding
methods, since they seek to capture word cooccurence probabilities using vector inner product, end
up giving large vectors to frequent words, as well as giving unnecessarily large inner products to
word pairs, simply to fit the empirical observation that words sometimes occur out of context in
documents. These anomalies cause the average of word vectors to have huge components along
semantically meaningless directions. Our modification to the generative model of ( Arora et al.,
2016) allows 「smoothing」 terms, and then a max likelihood calculation leads to our SIF reweighting.
Interestingly, this theoretically derived SIF does better (by a few percent points) than traditional TF-
IDF in our setting. The method also improves the sentence embeddings of Wieting et al., as seen
in Table 1. Finally, we discovered that —contrary to widespread belief—Word2Vec(CBOW) also
does not use simple average of word vectors in the model, as misleadingly suggested by the usual
expression Pr[w|w 1 ,w 2 ,...,w 5 ] ∝ exp(v w · ( 1
5
∑
i v w i )). A dig into the implementation shows
it implicitly uses a weighted average of word vectors —again, different from TF-IDF— and this
weighting turns out to be quite similar in effect to ours. (See Section 3.1. )
2 R ELATED W ORK
Word embeddings. Word embedding methods represent words as continuous vectors in a low
dimensional space which capture lexical and semantic properties of words. They can be obtained
from the internal representations from neural network models of text ( Bengio et al. , 2003; Collobert
& Weston, 2008; Mikolov et al., 2013a) or by low rank approximation of co-occurrence statis-
tics (Deerwester et al ., 1990; Pennington et al., 2014). The two approaches are known to be closely
related (Levy & Goldberg, 2014; Hashimoto et al., 2016; Arora et al., 2016).
Our work is most directly related work to (Arora et al ., 2016), which proposed a random walk model
for generating words in the documents. Our sentence vector can be seen as approximate inference
of the latent variables in their generative model.
Phrase/Sentence/Paragraph embeddings. Previous works have computed phrase or sentence
embeddings by composing word embeddings using operations on vectors and matrices eg,
(Mitchell & Lapata, 2008; 2010; Blacoe & Lapata, 2012). They found that coordinate-wise multipli-
cation of the vectors performed very well among the binary operations studied. Unweighted averag-
ing is also found to do well in representing short phrases (Mikolov et al ., 2013a). Another approach
is recursive neural networks (RNNs) defined on the parse tree, trained with supervision (Socher
et al ., 2011) or without (Socher et al., 2014). Simple RNNs can be viewed as a special case where
the parse tree is replaced by a simple linear chain. For example, the skip-gram model ( Mikolov et al.,
2013b) is extended to incorporate a latent vector for the sequence, or to treat the sequences rather
than the word as basic units. In (Le & Mikolov , 2014) each paragraph was assumed to have a latent
paragraph vector, which influences the distribution of the words in the paragraph. Skip-thought
2
Page 3
Published as a conference paper at ICLR 2017
of (Kiros et al ., 2015) tries to reconstruct the surrounding sentences from surrounded one and treats
the hidden parameters as their vector representations. RNNs using long short-term memory (LSTM)
capture long-distance dependency and have also been used for modeling sentences (Tai et al., 2015).
Other neural network structures include convolution neural networks, such as (Blunsom et al. , 2014)
that uses a dynamic pooling to handle input sentences of varying length and do well in sentiment
prediction and classification tasks.
The directed inspiration for our work is (Wieting et al ., 2016) which learned paraphrastic sentence
embeddings by using simple word averaging and also updating standard word embeddings based on
supervision from paraphrase pairs; the supervision being used for both initialization and training.
3 AS IMPLE M ETHOD FOR S ENTENCE E MBEDDING
We briefly recall the latent variable generative model for text in (Arora et al ., 2016). The model treats
corpus generation as a dynamic process, where the t-th word is produced at step t. The process is
driven by the random walk of a discourse vector c t ∈ d . Each word w in the vocabulary has a
vector in d as well; these are latent variables of the model. The discourse vector represents 「what
is being talked about.」 The inner product between the discourse vector c t and the (time-invariant)
word vector v w for word w captures the correlations between the discourse and the word. The
probability of observing a word w at time t is given by a log-linear word production model from
Mnih and Hinton:
Pr[w emitted at time t | c t ] ∝ exp (⟨c t ,v w ⟩) .
(1)
The discourse vector c t does a slow random walk (meaning that c t+1 is obtained from c t by adding a
small random displacement vector), so that nearby words are generated under similar discourses. It
was shown in ( Arora et al. , 2016) that under some reasonable assumptions this model generates be-
havior –in terms of word-word cooccurrence probabilities—that fits empirical works like word2vec
and Glove. The random walk model can be relaxed to allow occasional big jumps in c t , since a
simple calculation shows that they have negligible effect on cooccurrence probabilities of words.
The word vectors computed using this model are reported to be similar to those from Glove and
word2vec(CBOW).
Our improved Random Walk model. Clearly, it is tempting to define the sentence embedding as
follows: given a sentence s, do a MAP estimate of the discourse vectors that govern this sentence.
We note that we assume the discourse vector c t doesn't change much while the words in the sentence
were emitted, and thus we can replace for simplicity all the c t ’s in the sentence s by a single discourse
vector c s . In the paper (Arora et al ., 2016), it was shown that the MAP estimate of c s is —up to
multiplication by scalar—the average of the embeddings of the words in the sentence.
In this paper, towards more realistic modeling, we change the model ( 1 ) as follows. This model has
two types of 「smoothing term」, which are meant to account for the fact that some words occur out
of context, and that some frequent words (presumably 「the」, 「and 」 etc.) appear often regardless of
the discourse. We first introduce an additive term αp(w) in the log-linear model, where p(w) is the
unigram probability (in the entire corpus) of word and α is a scalar. This allows words to occur even
if their vectors have very low inner products with c s . Secondly, we introduce a common discourse
vector c 0 ∈ d which serves as a correction term for the most frequent discourse that is often related
to syntax. (Other possible correction is left to future work.) It boosts the co-occurrence probability
of words that have a high component along c 0 .
Concretely, given the discourse vector c s , the probability of a word w is emitted in the sentence s is
modeled by,
Pr[w emitted in sentence s | c s ] = αp(w) + (1 − α)
exp (⟨˜c s ,v w ⟩)
Z ˜c s
,
(2)
where ˜c s = βc 0 + (1 − β)c s , c 0 ⊥ c s
where α and β are scalar hyperparameters, and Z ˜c s
= ∑ w∈V exp (⟨˜c s ,v w ⟩) is the normalizing
constant (the partition function). We see that the model allows a word w unrelated to the discourse
c s to be emitted for two reasons: a) by chance from the term αp(w); b) if w is correlated with the
common discourse vector c 0 .
3
Page 4
Published as a conference paper at ICLR 2017
Algorithm 1 Sentence Embedding
Input: Word embeddings {v w : w ∈ V}, a set of sentences S, parameter a and estimated probabil-
ities {p(w) : w ∈ V} of the words.
Output: Sentence embeddings {v s : s ∈ S}
1: for all sentence s in S do
2:
v s ← 1
|s|
∑
w∈s
a
a+p(w)
v w
3: end for
4: Form a matrix X whose columns are {v s : s ∈ S}, and let u be its first singular vector
5: for all sentence s in S do
6:
v s ← v s − uu v s
7: end for
Computing the sentence embedding. The word embeddings yielded by our model are actually
the same. 1 The sentence embedding will be defined as the max likelihood estimate for the vector c s
that generated it. ( In this case MLE is the same as MAP since the prior is uniform.) We borrow the
key modeling assumption of ( Arora et al. , 2016), namely that the word v w ’s are roughly uniformly
dispersed, which implies that the partition function Z c is roughly the same in all directions. So
assume that Z ˜c s is roughly the same, say Z for all ˜c s . By the model (2 ) the likelihood for the
sentence is
p[s | c s ] =
∏
w∈s
p(w | c s ) =
∏
w∈s
[
αp(w) + (1 − α)
exp (⟨v w , ˜c s ⟩)
Z
]
.
Let
f w (˜c s ) = log
[
αp(w) + (1 − α)
exp (⟨v w , ˜c s ⟩)
Z
]
denote the log likelihood of sentence s. Then, by simple calculus we have,
∇f w (˜c s ) =
1
αp(w) + (1 − α) exp (⟨v w , ˜c s ⟩) /Z
1 − α
Z
exp (⟨v w , ˜c s ⟩) v w .
Then by Taylor expansion, we have,
f w (˜c s ) ≈ f w (0) + ∇f w (0) ˜c s
= constant +
(1 − α)/(αZ)
p(w) + (1 − α)/(αZ)
⟨v w , ˜c s ⟩ .
Therefore, the maximum likelihood estimator for ˜c s on the unit sphere (ignoring normalization) is
approximately, 2
arg max
∑
w∈s
f w (˜c s ) ∝
∑
w∈s
a
p(w) + a
v w , where a =
1 − α
αZ
.
(3)
That is, the MLE is approximately a weighted average of the vectors of the words in the sentence.
Note that for more frequent words w, the weight a/(p(w) + a) is smaller, so this naturally leads to
a down weighting of the frequent words.
To estimate c s , we estimate the direction c 0 by computing the first principal component of ˜c s ’s for
a set of sentences. 3 In other words, the final sentence embedding is obtained by subtracting the
projection of ˜c s ’s to their first principal component. This is summarized in Algorithm 1.
1 We empirically discovered the significant common component c 0 in word vectors built by existing meth-
ods, which inspired us to propose our theoretical model of this paper.
2 Note that max c:c=1 C + ⟨c, g⟩ = g/g for any constant C.
3 Here the first principal component is computed without centralizing ˜c s '.
4
Page 5
Published as a conference paper at ICLR 2017
10 -10
10 -8
10 -6
10 -4
10 -2
10 0
Word frequency
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Weight
word2vec weighting
our weighting (a=0.0001)
Figure 1: The subsampling probabilities in word2vec are similar to our weighting scheme.
3.1 C ONNECTION TO S UBSAMPLING P ROBABILITIES IN WORD 2 VEC
Word2vec ( Mikolov et al. , 2013b) uses a sub-sampling technique which downsamples word w with
probability proportional to 1/√p(w) where p(w) is the marginal probability of the word w. This
heuristic not only speeds up the training but also learns more regular word representations. Here
we explain that this corresponds to an implicit reweighting of the word vectors in the model and
therefore the statistical benefit should be of no surprise.
Recall the vanilla CBOW model of word2vec:
Pr[w t | w t−1 ,...,w t−5 ] ∝ exp (⟨¯v t ,v w ⟩) , where ¯v t =
1
5
5
∑
i=1
v w t−i .
(4)
It can be shown that the loss (MLE) for the single word vector v w (from this occurrence) can be
abstractly written in the form,
g(v w ) = γ(⟨¯v t ,v w ⟩) + negative sampling terms ,
where γ(x) = log(1/(1 + e −x )) is the logistic function. Therefore, the gradient of g(v w ) is
∇g(v w ) = γ (⟨¯v t ,v w ⟩)¯v t = α(v w t−5 + v w t−4 + v w t−3 + v w t−2 + v w t−1 ) ,
(5)
where α is a scalar. That is, without the sub-sampling trick, the update direction is the average of
the word vectors in the window.
The sub-sampling trick in (Mikolov et al. , 2013b) randomly selects the summands in equation (5) to
「estimate」 the gradient. Specifically, the sampled update direction is
˜∇g(v w ) = α(J 5 v w t−5 + J 4 v w t−4 + J 3 v w t−3 + J 2 v w t−2 + J 1 v w t−1 )
(6)
where J k ’s are Bernoulli random variables with Pr [J k = 1] = q(w t−k )
min
{
1,
√ 10 −5
p(w t−k )
}
.
However, we note that˜∇g(v w ) is (very) biased estimator! We have that the expectation of ˜∇g(v w )
is a weighted sum of the word vectors,
E[ ˜∇g(v w )
]
= α(q(w t−5 )v w t−5 + q(w t−4 )v w t−4 + q(w t−3 )v w t−3 + q(w t−2 )v w t−2 + q(w t−1 )v w t−1 ) .
In fact, the expectation E[˜∇g(v w )] corresponds to the gradient of a modified word2vec model with
the average ¯v t (in equation (4 )) being replaced by the weighted average ∑
5
k=1 q(w t−k )v w t−k . Such
a weighted model can also share the same form of what we derive from our random walk model
as in equation ( 3). Moreover, the weighting q(w i ) closely tracks our weighting scheme a/(a +
p(w)) when using parameter a = 10 −4 ; see Figure 1 for an illustration. Therefore, the expected
gradient here is approximately the estimated discourse vector in our model! Thus, word2vec with
sub-sampling gradient heuristic corresponds to a stochastic gradient update method for using our
weighting scheme.
5
Page 6
Published as a conference paper at ICLR 2017
Results collected from ( Wieting et al. , 2016) except tfidf-GloVe
Our approach
Supervised
Su.
Un.
Se.
Un.
Se.
or not
Tasks
PP
PP
DAN
RNN
iRNN
LSTM
LSTM
ST
avg-
tfidf-
avg-
GloVe
PSL
-proj.
(no)
(og)
GloVe
GloVe
PSL
+WR
+WR
STS'12
58.7
60.0
56.0
48.1
58.4
51.0
46.4
30.8
52.5
58.7
52.8
56.2
59.5
STS'13
55.8
56.8
54.2
44.7
56.7
45.2
41.5
24.8
42.3
52.1
46.4
56.6
61.8
STS'14
70.9
71.3
69.5
57.7
70.9
59.8
51.5
31.4
54.2
63.8
59.5
68.5
73.5
STS'15
75.8
74.8
72.7
57.2
75.6
63.9
56.0
31.0
52.7
60.6
60.0
71.7
76.3
SICK'14
71.6
71.6
70.7
61.2
71.2
63.9
59.0
49.8
65.9
69.4
66.4
72.2
72.9
Twitter'15
52.9
52.8
53.7
45.1
52.9
47.6
36.1
24.7
30.3
33.8
36.3
48.0
49.0
Table 1: Experimental results (Pearson's r × 100) on textual similarity tasks. The highest score in
each row is in boldface. The methods can be supervised (denoted as Su.), semi-supervised (Se.),
or unsupervised (Un.). 「GloVe+WR」 stands for the sentence embeddings obtained by applying our
method to the GloVe word vectors; 「PSL+WR」 is for PSL word vectors. See the main text for the
description of the methods.
4 E XPERIMENTS
4.1 T EXTUAL S IMILARITY T ASKS
Datasets. We test our methods on the 22 textual similarity datasets including all the datasets from
SemEval semantic textual similarity (STS) tasks (2012-2015) ( Agirre et al. , 2012; 2013; 2014; Agir-
rea et al ., 2015), and the SemEval 2015 Twitter task (Xu et al., 2015) and the SemEval 2014 Seman-
tic Relatedness task ( Marelli et al. , 2014). The objective of these tasks is to predict the similarity
between two given sentences. The evaluation criterion is the Pearson's coefficient between the pre-
dicted scores and the ground-truth scores.
Experimental settings. We will compare our method with the following:
1. Unsupervised: ST, avg-GloVe, tfidf-GloVe. ST denotes the skip-thought vectors ( Kiros
et al ., 2015), avg-GloVe denotes the unweighted average of the GloVe vectors (Pennington
et al. , 2014), 4 and tfidf-GloVe denotes the weighted average of GloVe vectors using TF-IDF
weights.
2. Semi-supervised: avg-PSL. This method uses the unweighted average of the PARAGRAM-
SL999 (PSL) word vectors from ( Wieting et al., 2015). The word vectors are trained using
labeled data, but the sentence embedding are computed by unweighted average without
training.
3. Supervised: PP, PP-proj., DAN, RNN, iRNN, LSTM (og), LSTM (no). All these methods
are initialized with PSL word vectors and then trained on the PPDB dataset. PP and PP-
proj. are proposed in ( Wieting et al. , 2016). The first is an average of the word vectors, and
the second additionally adds a linear projection. The word vectors are updated during the
training. DAN denotes the deep averaging network of ( Iyyer et al ., 2015). RNN denotes the
classical recurrent neural network, and iRNN denotes a variant with the activation being the
identity, and the weight matrices initialized to identity. The LSTM is the version from (Gers
et al ., 2002), either with output gates (denoted as LSTM(og)) or without (denoted as
LSTM (no)).
Our method can be applied to any types of word embeddings. So we denote the sentence embeddings
obtained by applying our method to word embeddings method 「XXX」 as 「XXX+WR」. 5 To get a
completely unsupervised method, we apply it to the GloVe vectors, denoted as GloVe+WR. The
weighting parameter a is fixed to 10 −3 , and the word frequencies p(w) are estimated from the
4 We used the vectors that are publicly available at http://nlp.stanford.edu/projects/glove/. They are 300-
dimensional vectors that were trained on the 840 billion token Common Crawl corpus.
5 「W」 stands for the smooth inverse frequency weighting scheme, and 「R」 stands for removing the common
components.
6
Page 7
Published as a conference paper at ICLR 2017
10-5
10-4
10-3
10-2
10-1
100
Weighting parameter a
0.30
0.35
0.40
0.45
0.50
0.55
Pearson's coefficient
PSL+WR
GloVe+WR
SN+WR
PSL
GloVe
SN
(a)
enwiki poliblogs commoncrawl text8
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Pearson's coefficient
PSL+WR
GloVe+WR
SN+WR
(b)
Figure 2: Effect of weighting scheme in our method on the average performance on STS 2012 tasks.
Best viewed in color. (a) Performance vs weighting parameter a. Three types of word vectors
(PSL, GloVe, SN) are tested using p(w) estimated on the enwiki dataset. The best performance
is usually achieved at a = 10 −3 to a = 10 −4 . (b) Performance vs datasets used for estimating
p(w). Four datasets (enwiki, poliblogs, commoncrawl, text8) are used to estimate p(w) which is
then used in our method. The parameter a is fixed to be 10 −3 . The performance is almost the same
for different settings.
commoncrawl dataset. 6 This is denoted by GloVe+WR in Table 1. We also apply our method on the
PSL vectors, denoted as PSL+WR, which is a semi-supervised method.
Results. The results are reported in Table 1 . Each year there are 4 to 6 STS tasks. For clarity, we
only report the average result for the STS tasks each year; the detailed results are in the appendix.
The unsupervised method GloVe+WR improves upon avg-GloVe significantly by 10% to 30%, and
beats the baselines by large margins. It achieves better performance than LSTM and RNN and is
comparable to DAN, even though the later three use supervision. This demonstrates the power of
this simple method: it can be even stronger than highly-tuned supervisedly trained sophisticated
models. Using TF-IDF weighting scheme also improves over the unweighted average, but not as
much as our method.
The semi-supervised method PSL+WR achieves the best results for four out of the six tasks and is
comparable to the best in the rest of two tasks. Overall, it outperforms the avg-PSL baseline and all
the supervised models initialized with the same PSL vectors. This demonstrates the advantage of
our method over the training for those models.
We also note that the top singular vectors c 0 of the datasets seem to roughly correspond to the
syntactic information or common words. For example, closest words (by cosine similarity) to c 0
in the SICK dataset are 「just」, 「when」, 「even」, 「one」, 「up」, 「little」, 「way」, 「there」, 「while」, and
「but.」
Finally, in the appendix, we showed that our two ideas all contribute to the improvement: for GloVe
vectors, using smooth inverse frequency weighting alone improves over unweighted average by
about 5%, using common component removal alone improves by 10%, and using both improves by
13%.
4.1.1 E FFECT OF W EIGHTING P ARAMETER ON P ERFORMANCE
We study the sensitivity of our method to the weighting parameter a, the method for computing
word vectors, and the estimated word probabilities p(w). First, we test the performance of three
6 It is possible to tune the parameter a to get better results. The effect of a and the corpus for estimating
word frequencies are studied in Section 4.1.1.
7
Page 8
Published as a conference paper at ICLR 2017
PP
DAN
RNN
LSTM (no)
LSTM (og)
skip-thought
Ours
similarity (SICK)
84.9
85.96
73.13
85.45
83.41
85.8
86.03
entailment (SICK)
83.1
84.5
76.4
83.2
82.0
-
84.6
sentiment (SST)
79.4
83.4
86.5
86.6
89.2
-
82.2
Table 2: Results on similarity, entailment, and sentiment tasks. The sentence embeddings are com-
puted unsupervisedly, and then used as features in downstream supervised tasks. The row for sim-
ilarity (SICK) shows Pearson's r × 100 and the last two rows show accuracy. The highest score in
each row is in boldface. Results in Column 2 to 6 are collected from ( Wieting et al. , 2016), and
those in Column 7 for skip-thought are from ( Lei Ba et al. , 2016).
types of word vectors (PSL, GloVe, and SN) on the STS 2012 tasks. SN vectors are trained on the
enwiki dataset (Wikimedia, 2012) using the method in (Arora et al., 2016), while PSL and GloVe
vectors are those used in Table 1. We enumerate a ∈ {10 −i , 3 × 10 −i : 1 ≤ i ≤ 5} and use the
p(w) estimated on the enwiki dataset. Figure 2a shows that for all three kinds of word vectors, a
wide range of a leads to significantly improved performance over the unweighted average. Best
performance occurs from a = 10 −3 to a = 10 −4 .
Next, we fix a = 10 −3 and use four very different datasets to estimate p(w): enwiki (wikipedia, 3
billion tokens), poliblogs (Yano et al ., 2009) (political blogs, 5 million), commoncrawl (Buck et al.,
2014) (Internet crawl, 800 billion), text8 (Mahoney, 2008) (wiki subset, 1 million). Figure 2b shows
performance is almost the same for all four settings.
The fact that our method can be applied on different types of word vectors trained on different
corpora also suggests it should be useful across different domains. This is especially important for
unsupervised methods, since the unlabeled data available may be collected in a different domain
from the target application.
4.2 S UPERVISED TASKS
The sentence embeddings obtained by our method can be used as features for downstream super-
vised tasks. We consider three tasks: the SICK similarity task, the SICK entailment task, and the
Stanford Sentiment Treebank (SST) binary classification task ( Socher et al., 2013). To highlight the
representation power of the sentence embeddings learned unsupervisedly, we fix the embeddings
and only learn the classifier. Setup of supervised tasks mostly follow ( Wieting et al., 2016) to al-
low fair comparison, ie, the classifier a linear projection followed by the classifier in ( Kiros et al.,
2015) . The linear projection maps the sentence embeddings into 2400 dimension (the same as the
skip-thought vectors), and is learned during the training. We compare our method to PP, DAN,
RNN, and LSTM, which are the methods used in Section 4 .1. We also compare to the skip-thought
vectors (with improved training in ( Lei Ba et al., 2016)).
Results. Our method gets better or comparable performance compared to the competitors. It gets
the best results for two of the tasks. This demonstrates the power of our simple method. We em-
phasize that our embeddings are unsupervisedly learned, while DAN, RNN, LSTM are trained with
supervision. Furthermore, skip-thought vectors are much higher dimensional than ours (though pro-
jected into higher dimension, the original 300 dimensional embeddings contain all the information).
The advantage is not as significant as in the textual similarity tasks. This is possibly because sim-
ilarity tasks rely directly upon cosine similarity, which favors our method's approach of removing
the common components (which can be viewed as a form of denoising), while in supervised tasks,
with the cost of some label information, the classifier can pick out the useful components and ignore
the common ones.
Finally, we speculate that our method doesn't outperform RNN's and LSTM's for sentiment tasks
because (a) the word vectors —and more generally the distributional hypothesis of meaning —has
known limitations for capturing sentiment due to the 「antonym problem」, (b) also in our weighted
average scheme, words like 「not」 that may be important for sentiment analysis are downweighted a
lot. To address (a), there is existing work on learning better word embeddings for sentiment analysis
(eg, (Maas et al ., 2011)). To address (b), it is possible to design weighting scheme (or learn weights)
for this specific task.
8
Page 9
Published as a conference paper at ICLR 2017
Dataset
RNN
LSTM (no)
LSTM (og)
similarity (SICK)
original
73.13
85.45
83.41
random
54.50
77.24
79.39
entailment (SICK)
original
76.4
83.2
82.0
random
61.7
78.2
81.0
sentiment (SST)
original
86.5
86.6
89.2
random
84.2
82.9
84.1
Table 3: Comparison of results on the original datasets and the ones with words randomly shuffled
in sentences. The rows with label 「original」 are the results on the original datasets, and those with
label 「random」 are the results on the randomly shuffled datasets. The row for similarity (SICK)
shows Pearson's r × 100 and the other rows show accuracy.
4.3 T HE EFFECT OF THE ORDER OF WORDS IN SENTENCES
A interesting feature of our method is that it ignores the word order. This is in contrast to that RNN’s
and LSTM's can potentially take advantage of the word order. The fact that our method achieves
better or comparable performance on these benchmarks raise the following question: is word order
not important in these benchmarks? We conducted an experiment suggesting that word order does
play some role.
We trained and tested RNN/LSTM on the supervised tasks where the words in each sentence are
randomly shuffled, and the results are reported in Table 3. 7 It can be observed that the performance
drops noticeably. Thus our method —which ignores word order—must be much better at exploit-
ing the semantics than RNN's and LSTM's. An interesting future direction is to explore if some
ensemble idea can combine the advantages of both approaches.
5 C ONCLUSIONS
This work provided a simple approach to sentence embedding, based on the discourse vectors in
the random walk model for generating text ( Arora et al., 2016). It is simple and unsupervised, but
achieves significantly better performance than baselines on various textual similarity tasks, and can
even beat sophisticated supervised methods such as some RNN and LSTM models. The sentence
embeddings obtained can be used as features in downstream supervised tasks, which also leads to
better or comparable results compared to the sophisticated methods.
6 A CKNOWLEDGEMENTS
We thank the reviewers for insightful comments. We also thank the authors of ( Wieting et al. , 2016;
Bowman et al., 2015) for sharing their code or the preprocessed datasets.
This work was supported in part by NSF grants CCF-1527371, DMS-1317308, Simons Investigator
Award, Simons Collaboration Grant, and ONRN00014- 16-1-2329. Tengyu Ma was supported in
addition by Simons Award in Theoretical Computer Science and IBM PhD Fellowship.
R EFERENCES
Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: A pi-
lot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and
Computational Semantics-Volume 1: Proceedings of the main conference and the shared task,
and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pp.
385–393. Association for Computational Linguistics, 2012.
Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. Sem 2013 shared
task: Semantic textual similarity. in second joint conference on lexical and computational seman-
7 On the random order datasets, the hyperparameters are enumerated as in the other experiments, and the
best results are reported.
9
Page 10
Published as a conference paper at ICLR 2017
tics. In Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity,
2013.
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Wei-
wei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. Semeval-2014 task 10: Multilingual
semantic textual similarity. In Proceedings of the 8th international workshop on semantic evalu-
ation (SemEval 2014), pp. 81–91, 2014.
Eneko Agirrea, Carmen Baneab, Claire Cardiec, Daniel Cerd, Mona Diabe, Aitor Gonzalez-Agirrea,
Weiwei Guof, Inigo Lopez-Gazpioa, Montse Maritxalara, Rada Mihalceab, et al. Semeval-2015
task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings
of the 9th international workshop on semantic evaluation (SemEval 2015), pp. 252–263, 2015.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable
model approach to PMI-based word embeddings. Transaction of Association for Computational
Linguistics, 2016.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic
language model. Journal of Machine Learning Research, 2003.
William Blacoe and Mirella Lapata. A comparison of vector-based representations for semantic
composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, 2012.
Phil Blunsom, Edward Grefenstette, and Nal Kalchbrenner. A convolutional neural network for
modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Compu-
tational Linguistics, 2014.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large anno-
tated corpus for learning natural language inference. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Association for Computational
Linguistics, 2015.
Christian Buck, Kenneth Heafield, and Bas van Ooyen. N-gram counts and language models from
the common crawl. In Proceedings of the Language Resources and Evaluation Conference, 2014.
Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep
neural networks with multitask learning. In Proceedings of the 25th International Conference on
Machine Learning, 2008.
Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and Richard A.
Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information
Science, 1990.
Felix A Gers, Nicol N Schraudolph, and Jürgen Schmidhuber. Learning precise timing with lstm
recurrent networks. Journal of machine learning research, 2002.
Tatsunori B. Hashimoto, David Alvarez-Melis, and Tommi S. Jaakkola. Word embeddings as metric
recovery in semantic spaces. Transactions of the Association for Computational Linguistics, 2016.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep unordered com-
position rivals syntactic methods for text classification. In Proceedings of the Association for
Computational Linguistics, 2015.
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Tor-
ralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing
systems, 2015.
Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceed-
ings of The 31st International Conference on Machine Learning, 2014.
J. Lei Ba, JR Kiros, and GE Hinton. Layer Normalization. ArXiv e-prints, 2016.
10
Page 11
Published as a conference paper at ICLR 2017
Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Ad-
vances in Neural Information Processing Systems, 2014.
Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher
Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp.
142–150. Association for Computational Linguistics, 2011.
Matt Mahoney. Wikipedia text preprocess script. http://mattmahoney.net/dc/textdata.html, 2008.
Accessed Mar-2015.
Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto
Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on
full sentences through semantic relatedness and textual entailment. SemEval-2014, 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in Neural Information
Processing Systems, 2013a.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space
word representations. In Proceedings of the Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, 2013b.
Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In Association for
Computational Linguistics, 2008.
Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive
science, 2010.
Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme2, and Chris Callison-
Burch. Ppdb 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings,
and style classification. Proceedings of the Annual Meeting of the Association for Computational
Linguistics, 2015.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word
representation. Proceedings of the Empiricial Methods in Natural Language Processing, 2014.
Stephen Robertson. Understanding inverse document frequency: on theoretical arguments for idf.
Journal of documentation, 2004.
Richard Socher, Eric H Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Y Ng. Dy-
namic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in
Neural Information Processing Systems, 2011.
Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng,
and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of the conference on empirical methods in natural language processing
(EMNLP), 2013.
Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng.
Grounded compositional semantics for finding and describing images with sentences. Trans-
actions of the Association for Computational Linguistics, 2014.
Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval.
Journal of documentation, 1972.
Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations
from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic
classification. In Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics: Short Papers-Volume 2, pp. 90–94. Association for Computational Linguistics, 2012.
11
Page 12
Published as a conference paper at ICLR 2017
Yashen Wang, Heyan Huang, Chong Feng, Qiang Zhou, Jiahui Gu, and Xiong Gao. Cse: Conceptual
sentence embeddings based on attention model. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), 2016.
John Wieting, Mohit Bansal, Kevin Gimpel, Karen Livescu, and Dan Roth. From paraphrase
database to compositional paraphrase model and back. Transactions of the Association for Com-
putational Linguistics, 2015.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic
sentence embeddings. In International Conference on Learning Representations, 2016.
Wikimedia. English Wikipedia dump. http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-
pages-articles.xml.bz2, 2012. Accessed Mar-2015.
Wei Xu, Chris Callison-Burch, and William B Dolan. Semeval-2015 task 1: Paraphrase and semantic
similarity in twitter (pit). Proceedings of SemEval, 2015.
Tae Yano, William W Cohen, and Noah A Smith. Predicting response to political blog posts with
topic models. In Proceedings of Human Language Technologies: The 2009 Annual Conference
of the North American Chapter of the Association for Computational Linguistics, 2009.
12
Page 13
Published as a conference paper at ICLR 2017
AD ETAILS OF E XPERIMENTAL S ETTING
A.1 U NSUPERVISED T ASK : T EXTUAL S IMILARITY
The competitors. We give a brief overview of the competitors. RNN is the classical recurrent
neural network:
h t = f(W x W x t
w + W h h t−1 + b)
where f is the activation, W x ,W h and b are parameters, and x t is the t-th token in the sentence.
The sentence embedding of RNN is just the hidden vector of the last token. iRNN is a special RNN
with the activation being the identity, the weight matrices initialized to identity, and b initialized to
zero. LSTM ( Hochreiter & Schmidhuber, 1997 ) is a recurrent neural network architecture designed
to capture long-distance dependencies. Here, the version from ( Gers et al., 2002) is used.
The supervised methods are initialized with PARAGRAM-SL999 (PSL) vectors, and trained using
the approach of ( Wieting et al. , 2016) on the XL section of the PPDB data (Pavlick et al., 2015)
which contains about 3 million unique phrase pairs. After training, the final models can be used to
generate sentence embeddings on the test data. For hyperparameter tuning they used 100k examples
sampled from PPDB XXL and trained for 5 epochs. Then after finding the hyperparameters that
maximize Spearmans coefficients on the Pavlick et al. PPDB task, they are trained on the entire XL
section of PPDB for 10 epochs. See ( Wieting et al., 2016) and related papers for more details about
these methods.
The tfidf-GloVe method is a weighted average of the GloVe vectors, where the weights are defined
by the TF-IDF scheme. More precisely, the embedding of a sentence s is
v s =
1
|s|
∑
w∈s
IDF w v w
(7)
where IDF w is the inverse document frequency of w, and |s| denotes the number of words in the
sentence. Here, the TF part of the TF-IDF scheme is taken into account by the sum over w ∈ s.
Furthermore, when computing IDF w , each sentence is viewed as a 「document」:
IDF w := log
1 + N
1 + N w
where N is the total number of sentences and N w is the number of sentences containing w, and
1 is added to avoid division by 0. In the experiments, we use all the textual similarity datasets to
compute IDF w .
Detailed experimental results. In the main body we present the average results for STS tasks by
year. Each year there are actually 4 to 6 STS tasks, as shown in Table 4. Note that tasks with the
same name in different years are actually different tasks. Here we provide the results for each tasks
in Table 5 . PSL+WR achieves the best results on 12 out of 22 tasks, PP and PP-proj. achieves on 3,
tfidf-GloVe achieves on 2, and DAN, iRNN, and GloVe+WR achieves on 1. In general, our method
improves the performance significantly compared to the unweighted average, though on rare cases
such as MSRpar it can decrease the performance.
STS'12
STS'13
STS'14
STS'15
MSRpar
headline deft forum
anwsers-forums
MSRvid
OnWN
deft news
answers-students
SMT-eur
FNWN
headline
belief
OnWN
SMT
images
headline
SMT-news
OnWN
images
tweet news
Table 4: The STS tasks by year. Note that tasks with the same name in different years are actually
different tasks.
13
Page 14
Published as a conference paper at ICLR 2017
Results collected from ( Wieting et al. , 2016) except tfidf-GloVe
Our approach
Supervised
Su.
Un.
Se.
Un.
Se.
or not
Tasks
PP
PP
DAN
RNN
iRNN
LSTM
LSTM
ST
avg-
tfidf-
avg-
GloVe
PSL
-proj.
(no)
(og)
GloVe
GloVe
PSL
+WR
+WR
MSRpar
42.6
43.7
40.3
18.6
43.4
16.1
9.3
16.8
47.7
50.3
41.6
35.6
43.3
MSRvid
74.5
74.0
70.0
66.5
73.4
71.3
71.3
41.7
63.9
77.9
60.0
83.8
84.1
SMT-eur
47.3
49.4
43.8
40.9
47.1
41.8
44.3
35.2
46.0
54.7
42.4
49.9
44.8
OnWN
70.6
70.1
65.9
63.1
70.1
65.2
56.4
29.7
55.1
64.7
63.0
66.2
71.8
SMT-news
58.4
62.8
60.0
51.3
58.1
60.8
51.0
30.8
49.6
45.7
57.0
45.6
53.6
STS'12
58.7
60.0
56.0
48.1
58.4
51.0
46.4
30.8
52.5
58.7
52.8
56.2
59.5
headline
72.4
72.6
71.2
59.5
72.8
57.4
48.5
34.6
63.8
69.2
68.8
69.2
74.1
OnWN
67.7
68.0
64.1
54.6
69.4
68.5
50.4
10.0
49.0
72.9
48.0
82.8
82.0
FNWN
43.9
46.8
43.1
30.9
45.3
24.7
38.4
30.4
34.2
36.6
37.9
39.4
52.4
SMT
39.2
39.8
38.3
33.8
39.4
30.1
28.8
24.3
22.3
29.6
31.0
37.9
38.5
STS'13
55.8
56.8
54.2
44.7
56.7
45.2
41.5
24.8
42.3
52.1
46.4
56.6
61.8
deft forum
48.7
51.1
49.0
41.5
49.0
44.2
46.1
12.9
27.1
37.5
37.2
41.2
51.4
deft news
73.1
72.2
71.7
53.7
72.4
52.8
39.1
23.5
68.0
68.7
67.0
69.4
72.6
headline
69.7
70.8
69.2
57.5
70.2
57.5
50.9
37.8
59.5
63.7
65.3
64.7
70.1
images
78.5
78.1
76.9
67.6
78.2
68.5
62.9
51.2
61.0
72.5
62.0
82.6
84.8
OnWN
78.8
79.5
75.7
67.7
78.8
76.9
61.7
23.3
58.4
75.2
61.1
82.8
84.5
tweet news
76.4
75.8
74.2
58.0
76.9
58.7
48.2
39.9
51.2
65.1
64.7
70.1
77.5
STS'14
70.9
71.3
69.5
57.7
70.9
59.8
51.5
31.4
54.2
63.8
59.5
68.5
73.5
answers-forum
68.3
65.1
62.6
32.8
67.4
51.9
50.7
36.1
30.5
45.6
38.8
63.9
70.1
answers-student
78.2
77.8
78.1
64.7
78.2
71.5
55.7
33.0
63.0
63.9
69.2
70.4
75.9
belief
76.2
75.4
72.0
51.9
75.9
61.7
52.6
24.6
40.5
49.5
53.2
71.8
75.3
headline
74.8
75.2
73.5
65.3
75.1
64.0
56.6
43.6
61.8
70.9
69.0
70.7
75.9
images
81.4
80.3
77.5
71.4
81.1
70.4
64.2
17.7
67.5
72.9
69.9
81.5
84.1
STS'15
75.8
74.8
72.7
57.2
75.6
63.9
56.0
31.0
52.7
60.6
60.0
71.7
76.3
SICK'14
71.6
71.6
70.7
61.2
71.2
63.9
59.0
49.8
65.9
69.4
66.4
72.2
72.9
Twitter'15
52.9
52.8
53.7
45.1
52.9
47.6
36.1
24.7
30.3
33.8
36.3
48.0
49.0
Table 5: Experimental results (Pearson's r × 100) on textual similarity tasks. The highest score in
each row is in boldface. The methods can be supervised (denoted as Su.), semi-supervised (Se.),
or unsupervised (Un.). 「GloVe+WR」 stands for the sentence embeddings obtained by applying our
method to the GloVe word vectors; 「PSL+WR」 is for PSL word vectors. See the main text for the
description of the methods.
Effects of smooth inverse frequency and common component removal. There are two key ideas
in our methods: smooth inverse frequency weighting (W) and common component removal (R). It is
instructive to see their effects separately. Let GloVe+W denote the embeddings using only smooth
inverse frequency, and GloVe+R denote that using only common component removal. Similarly
define PSL+W and PSL+R. The results for these methods are shown in Table 6. When using GloVe
vectors, W alone improves the performance of the unweighted average baseline by about 5%, R
alone improves by 10%, W and R together improves by 13%. When using PSL vectors, W gets
10%, R gets 10%, W and R together gets 13%. In summary, both techniques are important for
obtaining significant advantage over the unweighted average.
A.2 S UPERVISED TASKS
Setup of supervised tasks mostly follow ( Wieting et al. , 2016) to allow fair comparison: the sentence
embeddings are fixed and fed into some classifier which are trained. For the SICK similarity task,
given a pair of sentences with embeddings v L and v R , first do a linear projection:
h L = W p v L ,h R = W p v R
where W p is of size 300 × d p and is learned during training. d p = 2400 matches the dimension of
the skip-thought vectors. Then h L and h R are used in the objective function from ( Tai et al. , 2015).
Almost the same approach is used for the entailment task.For the sentiment task, the classifier has a
fully-connected layer with a sigmoid activation followed by a softmax layer.
Recall that our method has two steps: smooth inverse frequency weighting and common component
removal. For the experiments here, we do not perform the common component removal, since it can
be absorbed into the projection step. For the weighted average, the hyperparameter a is enumerated
in {10 −i , 3 × 10 −i : 2 ≤ i ≤ 3}. The other hyperparameters are enumerated as in ( Wieting et al.,
2016) , and the same validation approach is used to select the final values.
14
Page 15
Published as a conference paper at ICLR 2017
Unsupervised
Semi-supervised
Tasks
avg-GloVe
GloVe+W
GloVe+R
GloVe+WR
avg-PSL
PSL+W
PSL+R
PSL+WR
MSRpar
47.7
43.6
36.4
35.6
41.6
40.9
42.5
43.3
MSRvid
63.9
78.7
79.4
83.8
60.0
80.4
76.4
84.1
SMT-eur
46.0
51.1
48.5
49.9
42.4
45.0
45.1
44.8
OnWN
55.1
54.3
68.3
66.2
63.0
67.8
71.0
71.8
SMT-news
49.6
42.2
45.6
45.6
57.0
56.2
50.7
53.6
STS'12
52.5
54.0
55.6
56.2
52.8
58.1
57.2
59.5
headline
63.8
63.8
68.9
69.2
68.8
72.6
72.7
74.1
OnWN
49.0
68.0
75.4
82.8
48.0
69.8
73.5
82.0
FNWN
34.2
23.0
34.9
39.4
37.9
49.3
40.7
52.4
SMT
22.3
29.5
36.4
37.9
31.0
39.2
37.3
38.5
STS'13
42.3
44.0
53.9
56.6
46.4
57.7
56.0
61.8
deft forum
27.1
29.1
39.8
41.2
37.2
45.8
45.3
51.4
deft news
68.0
68.5
66.6
69.4
67.0
75.1
67.4
72.6
headline
59.5
59.3
64.6
64.7
65.3
68.9
68.5
70.1
images
61.0
74.1
78.4
82.6
62.0
82.9
80.2
84.8
OnWN
58.4
68.0
77.6
82.8
61.1
77.6
77.7
84.5
tweet news
51.2
57.3
73.2
70.1
64.7
73.6
77.9
77.5
STS'14
54.2
59.4
66.7
68.5
59.5
70.7
69.5
73.5
answers-forum
30.5
41.4
58.4
63.9
38.8
56.0
61.0
70.1
answers-student
63.0
61.5
73.2
70.4
69.2
73.3
76.8
75.9
belief
40.5
47.7
69.5
71.8
53.2
64.3
71.3
75.3
headline
61.8
64.0
70.1
70.7
69.0
74.5
74.6
75.9
images
67.5
75.4
77.9
81.5
69.9
83.4
79.9
84.1
STS'15
52.7
58.0
69.8
71.7
60.0
70.3
72.7
76.3
SICK'14
65.9
70.5
70.6
72.2
66.4
73.1
70.3
72.9
Twitter'15
30.3
33.8
50.6
48.0
36.3
45.7
51.9
49.0
Table 6: Experimental results (Pearson's r × 100) on textual similarity tasks using only smooth
inverse frequency, using only common component removal, or using both.
A.3 A DDITIONAL S UPERVISED T ASKS
Here we report the experimental results on two more datasets, comparing to known results on them.
SNLI. The first experiment is for the 3-class classification task on the SNLI dataset (Bowman
et al ., 2015). To compare to the results in (Bowman et al., 2015), we used their experimental setup.
In particular, we applied our method to 300 dimensional GloVe vectors and used an additional tanh
neural network layer to map these 300d embeddings into 300 dimensional space, then used the code
provided by the authors of ( Bowman et al., 2015), trained the classifier on our 100 dimensional
sentence embedding for 120 passes over the data, using their default hyperparameters. The results
are shown in Table 7. Our method indeed gets slightly better performance.
Our test accuracy is worse than those using more sophisticated models (eg, using attention mech-
anism), which are typically 83% - 88%; see the website of the SNLI project 8 for a summary. An
interesting direction is to study whether our idea can be combined with these sophisticated models
to get improved performance.
Sentence model
Train Test
100d Sum of words
79.3
75.3
100d RNN
73.1
72.2
100d LSTM RNN
84.8
77.6
Our method
83.9
78.2
Table 7: Accuracy in 3-class classification on the SNLI dataset for each model. The results in
the first three rows are collected from (Bowman et al ., 2015). All methods used 100 dimensional
sentence embeddings.
IMDB. The second experiment is the sentiment analysis task on the IMDB dataset, studied
in (Wang & Manning, 2012). Since the intended application is semi-supervised or transfer learning,
we also compared performance with fewer labeled examples.
8 http://nlp.stanford.edu/projects/snli/
15
Page 16
Published as a conference paper at ICLR 2017
# labeled examples NB-SVM Our method
50k
0.91
0.85
1k
0.84
0.82
200
0.73
0.77
Table 8: Accuracy in sentiment analysis on the IMDB dataset for NB-SVM (Wang & Manning,
2012) and our method.
Our method gets worse performance on the full dataset, but its decrease in performance is better
with less labeled examples, showing the benefit of using word embeddings. Note that our sentence
embeddings are unsupervised, while that in the NB-SVM method takes advantage of the labels.
Another comment is that sentiment analysis appears to be the best case for Bag-Of-Word methods,
whereas it may be the worst case for word embedding methods (See Table 2) due to the well-known
antonymy problem —distributional hypothesis fails for distinguishing 「good」 from 「bad.」
16
html