word2vec Study Notes

Date: 2019-12-14

Preface

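It has been a busy, draining month, and in the days before the New Year I slid into a state of not wanting to do anything. The best way I know out of that state is to read and write, give myself some positive feedback, and break the negative loop. After the holiday I will be working on NLP, so I need a rough understanding of the field, and the first thing I want to understand in depth is word2vec. It is a word embedding model whose main purpose is to find a reasonable vector representation for the words of a language: on the one hand it should preserve semantic properties of the words (such as similarity); on the other hand the vector dimensionality should stay reasonably small. word2vec embeddings have both properties. There is no free lunch, of course: the representation has to be obtained by training. NLP and CV treat features very differently, which is my first impression as a newcomer to NLP.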

word2vec Theory

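Writing this part up carefully is a struggle, and there are already plenty of similar tutorials online, so I will not go into detail and only give an outline. Most of what follows comes from Stanford CS224d lecture 1. NLP first splits a document into tokens and then encodes the tokens. The simplest encoding is the one-hot vector, where each word occupies one slot, but it has two problems: the dimensionality of each word vector is too high (it equals the vocabulary size), and it cannot express relationships between words, since any two distinct one-hot vectors are orthogonal. word2vec has a front end and a back end. The front end offers two models, CBOW and Skip-gram; the back end offers two training schemes, negative sampling and the Huffman tree (hierarchical softmax). Front ends and back ends can be combined freely, but the commonly used efficient implementations are Skip-gram + negative sampling.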
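
To make the one-hot limitation concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration; they are not part of the original notes.

```python
def one_hot(index, size):
    """Return a list of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


vocab = ["cat", "dog", "puddle"]  # hypothetical toy vocabulary
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Every pair of distinct words has dot product 0, so the encoding
# cannot say that "cat" is closer to "dog" than to "puddle".
cat_dog = dot(vectors["cat"], vectors["dog"])
cat_puddle = dot(vectors["cat"], vectors["puddle"])
cat_cat = dot(vectors["cat"], vectors["cat"])
print(cat_dog, cat_puddle, cat_cat)
```

Each vector is also as long as the vocabulary, which is the dimensionality problem mentioned above.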

Skip-gram

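Skip-gram predicts the context of an input word. Take the sentence {"The", "cat", "jumped", "over", "the", "puddle"}: given the center word "jumped", the Skip-gram model predicts its context "The", "cat", "over", "the", "puddle". It sounds almost magical. (Figure: how the Skip-gram model runs.) At its core, Skip-gram is just a logistic regression.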
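
As a rough illustration of what "predicting the context" means in terms of training data, the sketch below lists the (center, context) pairs for the example sentence. The function name `skipgram_pairs` and the window size of 3 are assumptions chosen so that the context of "jumped" covers the whole sentence, as in the example above.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs using `window` words on each side."""
    pairs = []
    for c, center in enumerate(tokens):
        lo = max(0, c - window)
        hi = min(len(tokens), c + window + 1)
        for j in range(lo, hi):
            if j != c:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["The", "cat", "jumped", "over", "the", "puddle"]
pairs = skipgram_pairs(sentence, window=3)

# Context words paired with the center word "jumped".
jumped_context = [ctx for center, ctx in pairs if center == "jumped"]
print(jumped_context)
```

Each pair becomes one training example: the center word is the input and the context word is the label.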

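Skip-gram runs in the following steps:

Generate the one-hot input vector \(x\) for the center word.
Look up the embedded vector of the center word, \(v_c = Vx\).
Generate the \(2m\) score vectors \(u_{c-m},...,u_{c-1},u_{c+1},...,u_{c+m}\) using \(u = Uv_c\).
Turn the score vector into a probability distribution, \(y=softmax(u)\).
Finally, match the predicted probabilities against the true distribution (the one-hot vectors of the actual context words).
The Skip-gram objective/loss function is: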
\[ \begin{eqnarray} minimize\ L &=& -\log P(w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m}|w_c) \\ &=& -\log\prod_{j=0,j\not=m}^{2m}P(w_{c-m+j}|w_c)\\ &=& -\log\prod_{j=0,j\not=m}^{2m}P(u_{c-m+j}|v_c)\\ &=& -\log\prod_{j=0,j\not=m}^{2m}\frac{\exp(u^T_{c-m+j}v_c)}{\sum^{|V|}_{k=1}\exp(u_k^Tv_c)}\\ &=& -\sum_{j=0,j\not=m}^{2m}u^T_{c-m+j}v_c + 2m\log\sum_{k=1}^{|V|}\exp(u_k^Tv_c) \end{eqnarray} \]
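
The forward pass and the loss can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: the sizes, the random initialisation, the word IDs and the names `V`, `U`, `skipgram_loss` are all made up for illustration. It evaluates the loss through the softmax and again through the last line of the derivation, and the two values should agree.

```python
import math
import random

random.seed(0)
vocab_size, dim = 6, 4  # hypothetical toy sizes

# V: input (center) vectors, U: output (context) vectors, one row per word.
V = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
U = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_loss(center, context_ids):
    v_c = V[center]                        # embedding of the center word
    scores = [dot(u_k, v_c) for u_k in U]  # u = U v_c
    probs = softmax(scores)                # y = softmax(u)
    # Negative log-likelihood of the observed context words.
    return -sum(math.log(probs[o]) for o in context_ids)


center, context_ids = 2, [0, 1, 3, 4]  # hypothetical word IDs, m = 2
loss = skipgram_loss(center, context_ids)

# Last line of the derivation: -sum(u_o . v_c) + 2m * log(sum_k exp(u_k . v_c))
v_c = V[center]
log_norm = math.log(sum(math.exp(dot(u_k, v_c)) for u_k in U))
closed_form = -sum(dot(U[o], v_c) for o in context_ids) + len(context_ids) * log_norm
print(loss, closed_form)
```

The `log_norm` term is the part that loops over every word in the vocabulary, which is what makes the full softmax expensive for a real vocabulary.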

Negative Sampling

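The objective/loss above requires a sum over the whole vocabulary \(|V|\), which is very expensive, so negative sampling is introduced. The idea: instead of looping over the whole vocabulary, we sample only a few negative examples, drawn from a distribution that matches the word frequencies in the vocabulary. Consider a word-context pair \((w,c)\). Let \(P(D=1|w,c)\) be the probability that \((w,c)\) came from the corpus; then \(P(D=0|w,c)\) is the probability that it did not. We have: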

\[ P(D=1|w,c,\theta)=\frac{1}{1+e^{-u^T_wv_c}} \]
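We need a new objective function: it should maximize \(P(D=1|w,c)\) when \((w,c)\) really does come from the corpus, and \(P(D=0|w,c)\) when it does not. The model parameters can then be obtained by maximum likelihood estimation.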

\[ \begin{eqnarray} \theta &=&\mathop{argmax}_{\theta}\prod_{(w,c)\in D}P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}}P(D=0|w,c,\theta)\\ &=&\mathop{argmax}_{\theta}\prod_{(w,c)\in D}P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}}(1-P(D=1|w,c,\theta))\\ &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u^T_wv_c)}+\sum_{(w,c)\in \tilde{D}}\log(1-\frac{1}{1+\exp(-u^T_wv_c)}) \\ &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u^T_wv_c)}+\sum_{(w,c)\in \tilde{D}}\log\frac{1}{1+\exp(u^T_wv_c)} \\ &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\sigma(u^T_wv_c)+\sum_{(w,c)\in \tilde{D}}\log\sigma(-u^T_wv_c)\\ \end{eqnarray} \]

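Here \(\theta\) can be read as the matrices \(U,V\) from above, and \(\tilde{D}\) denotes the negative ("false") corpus. For one center word and one observed context word, the objective to maximize can be further written as:

\[ \begin{eqnarray} \log\sigma(u_{c-m+j}^Tv_c) + \sum^{K}_{k=1}\log\sigma(-\tilde{u}^T_kv_c) \end{eqnarray} \]
where the \(\tilde{u}_k\) are obtained by negative sampling.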
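
A small numeric sketch of the per-pair negative sampling objective follows. It assumes the usual sign convention, in which the observed pair contributes \(\log\sigma(u^Tv_c)\) and each sampled word contributes \(\log\sigma(-\tilde{u}^Tv_c)\), so a larger value is better. The 2-D vectors and the function name `neg_sampling_objective` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_objective(v_c, u_pos, u_negs):
    """log sigma(u_pos . v_c) + sum_k log sigma(-u_neg_k . v_c)."""
    value = math.log(sigmoid(dot(u_pos, v_c)))
    for u_neg in u_negs:
        value += math.log(sigmoid(-dot(u_neg, v_c)))
    return value


v_c = [1.0, 0.0]  # hypothetical center vector

# True context vector points the same way as v_c, negative points away.
aligned = neg_sampling_objective(v_c, [2.0, 0.0], [[-2.0, 0.0]])
# The reverse arrangement: the model "believes" the wrong pair.
misaligned = neg_sampling_objective(v_c, [-2.0, 0.0], [[2.0, 0.0]])
print(aligned, misaligned)
```

The cost per training pair is now proportional to K + 1 dot products instead of the vocabulary size.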

A TensorFlow Implementation of word2vec

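The above is only a brief sketch of how word2vec works. For the details, see the article 《word2vec的數學原理》 ("The Mathematics of word2vec") available online. Below I walk through the word2vec example that ships with TensorFlow.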

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
import seaborn as sbn
from matplotlib import pylab
%config InlineBackend.figure_format = 'svg'
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
  
words = read_data(filename)
print('Data size %d' % len(words))

The code above downloads the dataset and reads it; what ends up in memory is one very long sequence of words.

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

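The code above encodes the dataset as integer IDs. Because it uses most_common, words are numbered by how often they occur in the corpus: the more frequent a word, the smaller its ID. The negative sampling step later relies on this ordering.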
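
The frequency-ranked numbering is easy to check on a tiny example. The sketch below mirrors the idea of build_dataset on a made-up sentence; `build_vocab`, the sentence and the vocabulary size of 4 are assumptions for illustration only.

```python
import collections


def build_vocab(tokens, vocabulary_size):
    """Map words to IDs: 0 is UNK, then words in decreasing order of frequency."""
    counts = [("UNK", -1)]
    counts.extend(collections.Counter(tokens).most_common(vocabulary_size - 1))
    return {word: idx for idx, (word, _) in enumerate(counts)}


tokens = "the cat sat on the mat the cat ran".split()  # hypothetical corpus
vocab = build_vocab(tokens, vocabulary_size=4)

# Words outside the vocabulary fall back to ID 0 (UNK).
ids = [vocab.get(word, 0) for word in tokens]
print(vocab)
print(ids)
```

"the" occurs three times and "cat" twice, so they receive the two smallest non-UNK IDs; words that did not make it into the vocabulary are mapped to 0.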

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span) # sliding window of size 2 * skip_window + 1
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):  # two nested loops: a batch holds batch_size // num_skips center words, each paired with num_skips labels
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print(batch)
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

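For data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first'], the code above produces output like the following. The batch stores word IDs. With num_skips = 4 and skip_window = 2, the word "as" has 4 context words, so "as" appears 4 times in the batch, each time paired with one of its context words: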
12 as -> 195 term
12 as -> 5239 anarchism
12 as -> 6 a
12 as -> 3084 originated
6 a -> 12 as
6 a -> 3084 originated
6 a -> 2 of
6 a -> 195 term

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset) # returns the rows of embeddings indexed by train_dataset, in the same order
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
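    # Keyword arguments avoid mixing up `labels` and `inputs`, whose positional
    # order differs between TensorFlow versions.
    # NCE speeds up the loss computation when there are very many classes; see the docs.
    tf.nn.nce_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                   labels=train_labels, num_sampled=num_sampled,
                   num_classes=vocabulary_size))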

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

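The code above is TensorFlow's implementation of the word2vec Skip-gram model. It is essentially a logistic regression, and it differs somewhat from the theory above: it uses nce_loss, which performs the negative sampling internally and is covered in detail below.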
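
The similarity tensor at the end of the graph is a cosine similarity between the validation words and all embeddings, and the training loop uses it to list nearest neighbours. The same idea in plain Python, as a sketch only: the 2-D toy embeddings and the names `normalize` and `nearest` are made up for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest(query_id, embeddings, top_k):
    """Return the IDs of the top_k rows most cosine-similar to row query_id."""
    normed = [normalize(v) for v in embeddings]
    q = normed[query_id]
    # For unit vectors the dot product equals the cosine similarity.
    sims = [sum(a * b for a, b in zip(q, v)) for v in normed]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return [i for i in order if i != query_id][:top_k]  # skip the word itself


toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = nearest(0, toy_embeddings, top_k=2)
print(neighbors)
```

The TensorFlow code computes the same quantity for all validation words at once with a single matrix product of the normalized embeddings.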

num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()


num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])

def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15,15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.savefig('softmax_loss.svg', format='svg')
  pylab.show()
  

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)

The final result is as follows: (Figure: t-SNE plot of the learned embeddings.)

nce_loss

The source of nce_loss is as follows:

def nce_loss(weights, # [num_classes, dim]; dim is embedding_size
             biases,  # [num_classes]; num_classes is the number of distinct words
             inputs,  # [batch_size, dim]
             labels,  # [batch_size, num_true]; num_true is 1 here, i.e. one output per input
             num_sampled, # number of negative samples to draw (per batch)
             num_classes, # number of classes (here, the number of distinct words)
             num_true=1,
             sampled_values=None,
             remove_accidental_hits=False,
             partition_strategy="mod",
             name="nce_loss"):
  logits, labels = _compute_sampled_logits(
      weights,
      biases,
      inputs,
      labels,
      num_sampled,
      num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = sigmoid_cross_entropy_with_logits(
      logits, labels, name="sampled_losses") 
      # This function returns a tensor with the same shape as logits. After _sum_rows we have the cross entropy of each example.
  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)
  # In the word2vec example, reduce_mean() is applied to this return value to get the average cross entropy.

# The source of _compute_sampled_logits is as follows
def _compute_sampled_logits(weights,
                            biases,
                            inputs,
                            labels,
                            num_sampled,
                            num_classes,
                            num_true=1,
                            sampled_values=None,
                            subtract_log_q=True,
                            remove_accidental_hits=False,
                            partition_strategy="mod",
                            name=None):
  if not isinstance(weights, list):
    weights = [weights]

  with ops.op_scope(weights + [biases, inputs, labels], name,
                    "compute_sampled_logits"):
    if labels.dtype != dtypes.int64:
      labels = math_ops.cast(labels, dtypes.int64)
    labels_flat = array_ops.reshape(labels, [-1])

    # Sample the negative labels.
    #   sampled shape: [num_sampled] tensor
    #   true_expected_count shape = [batch_size, 1] tensor
    #   sampled_expected_count shape = [num_sampled] tensor
    if sampled_values is None:
      sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
          true_classes=labels,
          num_true=num_true,
          num_sampled=num_sampled,
          unique=True,
          range_max=num_classes)

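NOTE: this function samples from a log-uniform distribution, \(P(class)=\frac{\log(class+2)-\log(class+1)}{\log(range\_max+1)}\), over the range [0, range_max). Sampling this way requires the words to be numbered in decreasing order of frequency, which is exactly how the words were encoded earlier: the smaller the class ID, the higher the probability of being drawn.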
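
The formula can be evaluated directly to see how strongly it favours small IDs. The sketch below is plain Python, not the TensorFlow sampler itself; `log_uniform_prob` is a hypothetical helper, and range_max = 50000 is taken from vocabulary_size in the example above.

```python
import math


def log_uniform_prob(cls, range_max):
    """P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)."""
    return (math.log(cls + 2) - math.log(cls + 1)) / math.log(range_max + 1)


range_max = 50000  # same as vocabulary_size in the example
probs = [log_uniform_prob(c, range_max) for c in range(range_max)]

# The numerators telescope, so the probabilities over [0, range_max) sum to 1.
total = sum(probs)
print(total)
print(probs[0], probs[1], probs[100], probs[range_max - 1])
```

The probability decreases monotonically with the class ID, roughly like 1 / (class + 1), which resembles the Zipf-like frequency distribution of words. This is why the frequency-ordered IDs matter.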

sampled_softmax_loss

Some versions of TensorFlow's word2vec example use sampled_softmax_loss as the loss function. It is very similar to nce_loss and takes the same set of parameters.

def sampled_softmax_loss(weights,
                         biases,
                         labels,
                         inputs,
                         num_sampled,
                         num_classes,
                         num_true=1,
                         sampled_values=None,
                         remove_accidental_hits=True,
                         partition_strategy="mod",
                         name="sampled_softmax_loss"):
  logits, labels = _compute_sampled_logits(
      weights=weights,
      biases=biases,
      labels=labels,
      inputs=inputs,
      num_sampled=num_sampled,
      num_classes=num_classes,
      num_true=num_true,
      sampled_values=sampled_values,
      subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy,
      name=name)
  sampled_losses = nn_ops.softmax_cross_entropy_with_logits(labels=labels,
                                                            logits=logits)
  # sampled_losses is a [batch_size] tensor.
  return sampled_losses

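The main difference is sigmoid_cross_entropy_with_logits versus softmax_cross_entropy_with_logits: the former does not require the classes to be mutually exclusive, while the latter does. The results obtained with nce_loss are somewhat smoother.

(Figure: result obtained with sampled_softmax_loss.)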
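
To see the difference between the two cross entropies, here is a plain-Python sketch that applies both to the same logits for one example (the first entry plays the role of the true class, the others are sampled classes). The numbers and function names are made up for illustration, and this is not the TensorFlow implementation.

```python
import math


def sigmoid_cross_entropy(logits, labels):
    """Sum of independent binary cross entropies, one per class."""
    total = 0.0
    for x, z in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(z * math.log(p) + (1 - z) * math.log(1 - p))
    return total


def softmax_cross_entropy(logits, labels):
    """Cross entropy against a softmax over all classes (classes compete)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(z * (x - log_norm) for x, z in zip(logits, labels))


logits = [2.0, -1.0, 0.5]  # hypothetical scores: [true class, sampled, sampled]
labels = [1.0, 0.0, 0.0]

nce_style = sigmoid_cross_entropy(logits, labels)
softmax_style = softmax_cross_entropy(logits, labels)
print(nce_style, softmax_style)
```

In the sigmoid version each class is an independent yes/no decision, so every sampled class adds its own penalty term; in the softmax version the classes share one normalisation and only the true class's probability enters the loss.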