





  • Input: a large corpus of tokenized text
  • Output: a dense vector representing each word






In short, word2vec is implemented as a three-layer neural network. To follow the implementation, you need some background in neural networks and Logistic Regression.



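The figure above is a simplified flow diagram of word2vec. Assume that the vocabulary contains 10000 words and that each word vector has length 300 (according to Stanford's CS224d lectures, word vectors typically have 25 to 1000 dimensions, and 300 is a good choice). The parts of the network are described below in turn, using a single training sample as the example.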

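  1. Input layer: the input is the one-hot vector of a single word, with length 10000. Suppose the word is ants and its ID in the vocabulary is i; then the i-th component of the input vector is 1 and all others are 0: [0, 0, ..., 0, 0, 1, 0, 0, ..., 0, 0]
  2. Hidden layer: the number of hidden neurons equals the length of the word vector. The hidden layer's parameters form a [10000, 300] matrix, and this parameter matrix is in fact the set of word vectors. Recall how matrix multiplication works: multiplying a one-hot row vector by a matrix yields row i of that matrix. Passing through the hidden layer therefore maps the 10000-dimensional one-hot vector to the 300-dimensional word vector we ultimately want.


  3. Output layer: the output layer has 10000 neurons, one per vocabulary word, and its parameter matrix has shape [300, 10000]. The word vector is multiplied by this matrix and normalized with softmax, which again gives a 10000-dimensional vector. Each component is the probability that the corresponding vocabulary word appears in the context of the input word (here, ants).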





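  4. Training: each training sample (x, y) has both an input and an output. We know which word actually co-occurs with ants, so y is also a 10000-dimensional one-hot vector. The loss function is similar to that of Logistic Regression: the cross-entropy between the network's final output vector and y. The model is then fitted with stochastic gradient descent.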
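The four steps above can be sketched in a few lines of plain Python. This is only a minimal sketch with toy sizes (5 words and 3 dimensions instead of 10000 and 300), random weights, and made-up word IDs; the names W_in, W_out, input_id and target_id are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3  # toy sizes; the text uses 10000 and 300

# Hidden-layer weights [vocab_size, embedding_dim]: row i is the vector of word i.
W_in = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
        for _ in range(vocab_size)]
# Output-layer weights [embedding_dim, vocab_size].
W_out = [[random.uniform(-1, 1) for _ in range(vocab_size)]
         for _ in range(embedding_dim)]


def one_hot(i, n):
    """Length-n vector with a 1 at position i and 0 elsewhere."""
    return [1.0 if k == i else 0.0 for k in range(n)]


def vec_mat(v, m):
    """Multiply a row vector v (length r) by a matrix m (r rows, c columns)."""
    return [sum(v[r] * m[r][c] for r in range(len(v)))
            for c in range(len(m[0]))]


def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


input_id, target_id = 2, 4  # hypothetical IDs: the input word and a co-occurring word

hidden = vec_mat(one_hot(input_id, vocab_size), W_in)  # selects row input_id of W_in
probs = softmax(vec_mat(hidden, W_out))                # one probability per vocabulary word
loss = -math.log(probs[target_id])                     # cross-entropy against the one-hot y

print('hidden equals row %d of W_in:' % input_id, hidden == W_in[input_id])
print('loss: %.4f' % loss)
```

Because the one-hot product only selects a row, real implementations skip the multiplication and index the matrix directly, which is what tf.nn.embedding_lookup does in the TensorFlow code later in this post.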





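  • skip-gram: the core idea is to predict the surrounding words from the center word. Suppose the center word is cat and the window size is 2; then cat is used to predict the two words to its left and the two words to its right. Here cat is the input to the neural network, and each predicted word is a label. The figure below shows an example: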




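  • CBOW (continuous bag-of-words): once skip-gram is understood, CBOW is simply the reverse: all of the surrounding words are used to predict the center word. Each move of the center word therefore produces only one training sample. For the same example as above, CBOW produces the following 4 training samples: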

    1. ([quick, brown], the)
    2. ([the, brown, fox], quick)
    3. ([the, quick, fox, jumps], brown)
    4. ([quick, brown, jumps, over], fox)
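The difference between the two sampling schemes can be sketched as follows. This is a minimal sketch: the function names skipgram_pairs and cbow_samples are made up for illustration, and the example sentence is assumed to begin "the quick brown fox jumps over", which is what the samples listed above suggest.

```python
def skipgram_pairs(tokens, window):
    """skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_samples(tokens, window):
    """CBOW: one (context_words, center) sample per center position."""
    samples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        samples.append((context, center))
    return samples


# Assumed example sentence (only its first words appear in the samples above).
sentence = "the quick brown fox jumps over".split()

for context, center in cbow_samples(sentence, window=2)[:4]:
    print(context, '->', center)

print(skipgram_pairs(sentence, window=2)[:5])
```

With a window of 2, the first four CBOW samples reproduce the list above, while skip-gram turns the same positions into several (input, label) pairs each.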



Negative Sampling

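The idea behind negative sampling is very simple, almost outrageously so. The network's softmax output is a vector in which only one entry corresponds to the correct word; all the other entries are called negative samples. If we keep only 5 negative samples, the output shrinks to a 6-dimensional vector (1 positive plus 5 negatives), and the number of output-layer parameters to consider per training sample drops from 3 million (300 × 10000) to 1800 (300 × 6). This looks like cutting corners, but it works well in practice and greatly improves computational efficiency.
Negative sampling works because we do not need that many negative samples. Mikolov et al. state in their paper that 5-20 negative samples are appropriate for small datasets, while 2-5 are enough for large datasets.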
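To make the saving concrete, here is a minimal sketch of the negative sampling loss for one (center, context) pair in plain Python. The toy sizes, the random vectors and the name negative_sampling_loss are assumptions for illustration. Negatives are drawn uniformly here to keep the sketch short, whereas word2vec draws them from a smoothed frequency distribution.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center_vec, output_vecs, positive_id, negative_ids):
    """Raise the score of the true context word, lower the scores of the negatives."""
    loss = -math.log(sigmoid(dot(center_vec, output_vecs[positive_id])))
    for neg in negative_ids:
        loss -= math.log(sigmoid(-dot(center_vec, output_vecs[neg])))
    return loss


random.seed(0)
vocab_size, dim, num_negative = 50, 8, 5  # toy sizes, not the 10000 x 300 of the text
center_vec = [random.uniform(-0.5, 0.5) for _ in range(dim)]
output_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)]
               for _ in range(vocab_size)]

positive_id = 7  # hypothetical ID of the word that really co-occurs
candidates = [i for i in range(vocab_size) if i != positive_id]
negative_ids = random.sample(candidates, num_negative)  # simplification: uniform sampling

loss = negative_sampling_loss(center_vec, output_vecs, positive_id, negative_ids)
print('loss: %.4f' % loss)

# Output-layer weights touched per training pair, using the sizes from the text.
full_softmax_weights = 300 * 10000  # every output neuron
sampled_weights = 300 * (1 + 5)     # 1 positive + 5 negatives
print(full_softmax_weights, sampled_weights)
```

Only the output vectors of the 6 selected words take part in the loss, so only those rows receive gradient updates.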






Finally, let's get some hands-on practice with TensorFlow. The code follows an assignment from the Udacity Deep Learning course.


# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

Download the data from the source website if necessary.

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    # Size mismatch: the download is incomplete or the file is wrong.
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)
Found and verified text8.zip

Read the data into a string.

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
words = read_data(filename)
print('Data size %d' % len(words))
Data size 17005207

Build the dictionary and replace rare words with UNK token.

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)  # store the word ID (0 for rare words)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

Function to generate a training batch for the skip-gram model.

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])  # fill the sliding window
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)  # do not pick the same context word twice
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])  # slide the window one word to the right
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']

with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
    labels: ['anarchism', 'as', 'originated', 'a', 'as', 'term', 'a', 'of']

with num_skips = 4 and skip_window = 2:
    batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a']
    labels: ['originated', 'term', 'anarchism', 'a', 'of', 'as', 'originated', 'term']


Train a skip-gram model.

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))

num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001

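with tf.Session(graph=graph) as session:
  # Variables must be initialized before the first training step.
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0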
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15,15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)



data_index_cbow = 0

def get_cbow_batch(batch_size, num_skips, skip_window):
    global data_index_cbow
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index_cbow])  # fill the sliding window
        data_index_cbow = (data_index_cbow + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)  # do not pick the same context word twice
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index_cbow])  # slide the window one word to the right
        data_index_cbow = (data_index_cbow + 1) % len(data)
    cbow_batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    cbow_labels = np.ndarray(shape=(batch_size // (skip_window * 2), 1), dtype=np.int32)
    for i in range(batch_size):
        cbow_batch[i] = labels[i, 0]  # context words become the CBOW inputs
    cbow_batch = np.reshape(cbow_batch, [batch_size // (skip_window * 2), skip_window * 2])
    for i in range(batch_size // (skip_window * 2)):
        # center word
        cbow_labels[i] = batch[2 * skip_window * i]
    return cbow_batch, cbow_labels
# actual batch_size = batch_size // (2 * skip_window)
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))

num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), skip_window * 2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  # Variables.
    embeddings = tf.Variable(
      tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
      tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  # Model.
  # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    # average the context-word embeddings of each example:
    # [batch, 2 * skip_window, embedding_size] -> [batch, embedding_size]
    embed = tf.reduce_mean(embed, 1)
  # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
      tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001

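with tf.Session(graph=graph) as session:
  # Variables must be initialized before the first training step.
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0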
  for step in range(num_steps):
    batch_data, batch_labels = get_cbow_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)



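Parameter     Default     Purpose
sentences     None     Training corpus, an iterable. For large corpora loaded from disk, it is best to build sentences with gensim.models.word2vec.BrownCorpus, gensim.models.word2vec.Text8Corpus or gensim.models.word2vec.LineSentence
size     100     Dimensionality of the word vectors
alpha     0.025     Initial learning rate
window     5     Maximum distance between the current word and the predicted word within a sentence (the context window size)
min_count     5     Words whose total frequency is lower than this value are ignored
max_vocab_size     None     Maximum vocabulary size while building the vocabulary; if there are more unique words than this, the least frequent ones are pruned
sample     1e-3     Threshold for randomly downsampling high-frequency words; the useful range is (0, 1e-5)
seed     1     Seed for the random number generator used to initialize the vectors
workers     3     Number of worker threads (CPU cores) used for training
min_alpha     0.0001     The learning rate drops linearly to this minimum as training progresses
sg     0     Training algorithm. 1: skip-gram, 0: CBOW
hs     0     1: hierarchical softmax; 0: negative sampling is used, provided that negative is non-zero
negative     5     If greater than 0, negative sampling is used and the value sets how many noise words are drawn; if 0, negative sampling is not used
ns_exponent     0.75     Exponent that shapes the negative sampling distribution. 1.0: sample exactly in proportion to word frequency; 0.0: sample all words equally; negative values: sample low-frequency words more often. The popular value is 0.75
cbow_mean     1     0: use the sum of the context word vectors, 1: use their mean; applies only to CBOW
hashfxn     hash     Hash function used to randomly initialize the weights, for better training reproducibility
iter     5     Number of iterations (epochs) over the corpus
null_word     0     Null word used for padding
trim_rule     None     Vocabulary trimming rule that specifies whether certain words should be kept in the vocabulary. By default, words with frequency below min_count are discarded; a custom rule can be supplied
sorted_vocab     1     1: sort the vocabulary by descending frequency, 0: do not sort. Implemented by gensim.models.word2vec.Word2VecVocab.sort_vocab()
batch_words     10000     Target size (in words) of the batches passed to the worker threads; the cython code truncates texts longer than 10000 words
compute_loss     False     If True, the training loss is computed and stored
callbacks     ()     Sequence of callbacks (~gensim.models.callbacks.CallbackAny2Vec) executed at specific stages during training
max_final_vocab     None     Limits the vocabulary to a target size by automatically picking a matching min_count; if the specified min_count is larger than the calculated one, the specified value is used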
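The effect of ns_exponent is easier to see with numbers. The sketch below raises made-up word counts to the given exponent and normalizes them into a sampling distribution. The function name negative_sampling_distribution and the counts are assumptions for illustration, and this is not gensim's internal code.

```python
def negative_sampling_distribution(counts, ns_exponent=0.75):
    """Probability of drawing each word as a negative sample: count ** ns_exponent, normalized."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


counts = {"the": 1000, "cat": 100, "meow": 10}  # made-up word counts

for exponent in (1.0, 0.75, 0.0):
    dist = negative_sampling_distribution(counts, exponent)
    print(exponent, {word: round(p, 3) for word, p in dist.items()})
```

An exponent of 1.0 keeps the raw frequency proportions, 0.0 makes every word equally likely, and 0.75 sits in between, giving rare words a somewhat higher chance of being drawn than their raw frequency would.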

~gensim.models.word2vec.Word2Vec.save: save the model
~gensim.models.word2vec.Word2Vec.load: load the model

gensim.models.keyedvectors.KeyedVectors.save_word2vec_format: save the word vectors in the original word2vec format

gensim.models.keyedvectors.KeyedVectors.load_word2vec_format: load word vectors stored in the original word2vec format

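wv: an object of the class ~gensim.models.keyedvectors.Word2VecKeyedVectors, available as an attribute of the Word2Vec model.
To share the word-vector lookup code across different training algorithms (Word2Vec, FastText, WordRank, VarEmbed), gensim separates the storage and querying of word vectors into a dedicated class, KeyedVectors.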


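model_w2v.wv.most_similar("民生銀行")  # find the most similar words
model_w2v.wv.get_vector("民生銀行")  # look up the vector of a word
model_w2v.wv.syn0  # the full vector matrix, same as model_w2v.wv.vectors
model_w2v.wv.vocab  # dict mapping each word to its vocabulary entry (index, count)
model_w2v.wv.index2word  # the word stored at each index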

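    Note that word2vec stores the vocabulary in a standard hash table: when hash codes collide, the entry is placed in an adjacent slot, and on lookup the stored index is used to fetch the actual word from the vocabulary and compare it with the query.
    syn0: the large word-vector matrix; row i is the vector of the word whose index in vocab is i
    syn1neg: the auxiliary matrix used by the negative sampling algorithm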
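The relationship between vocab, index2word and the vector matrix can be sketched with a tiny stand-in class. TinyKeyedVectors below is hypothetical and only mirrors the attribute names mentioned above; it is not gensim's implementation, and the vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyKeyedVectors:
    """Hypothetical stand-in for illustration only."""

    def __init__(self, words, vectors):
        self.index2word = list(words)                                # index -> word
        self.vocab = {w: i for i, w in enumerate(self.index2word)}   # word -> index
        self.vectors = [list(v) for v in vectors]                    # row i belongs to index2word[i]

    def get_vector(self, word):
        return self.vectors[self.vocab[word]]

    def most_similar(self, word, topn=2):
        query = self.get_vector(word)
        scored = [(other, cosine(query, self.get_vector(other)))
                  for other in self.index2word if other != word]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-dimensional vectors.
kv = TinyKeyedVectors(["cat", "dog", "car"],
                      [[1.0, 0.9], [0.9, 1.0], [-1.0, 0.2]])
print(kv.get_vector("dog"))
print(kv.most_similar("cat", topn=1))
```

Keeping the word-to-index mapping separate from the matrix is what allows the same lookup code to serve vectors produced by different training algorithms.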

vocabulary: an instance of the class ~gensim.models.word2vec.Word2VecVocab
trainables: an instance of the class ~gensim.models.word2vec.Word2VecTrainables



from gensim.models import Word2Vec
# sentences only needs to be an iterable
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)  # the model is trained when this line runs

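Gensim only requires that the input yields sentences sequentially; it does not need the whole input to be held in memory. Put simply, it can read one sentence, process it, discard it, and then load the next one.
gensim.models.word2vec.BrownCorpus: iterates over the Brown Corpus, an American English corpus, so that it can be processed directly
gensim.models.word2vec.Text8Corpus: iterates over a corpus in the text8 format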

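from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence, Text8Corpus

# Using LineSentence(): words separated by spaces, one document per line
sentences = LineSentence('a.txt')
# Using Text8Corpus(): a stream of space-separated words in the text8 format
sentences = Text8Corpus('a.txt')
model = Word2Vec(sentences, min_count=1)  # the model is trained when this line runs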
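The streaming idea can be sketched without gensim. The class StreamingSentences below is a hypothetical helper, similar in spirit to LineSentence: it reads one line at a time and can be iterated more than once, which matters because training makes several passes over the corpus. The temporary file exists only to make the sketch self-contained.

```python
import os
import tempfile


class StreamingSentences:
    """Hypothetical helper: yields one tokenized sentence per line, reading lazily."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every iteration makes the object restartable,
        # unlike a plain generator, which is exhausted after one pass.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


# Write a tiny corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("cat say meow\ndog say woof\n")
    corpus_path = tmp.name

sentences = StreamingSentences(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass works because __iter__ re-opens the file
os.remove(corpus_path)

print(first_pass)
```

An object like this can be passed wherever an iterable of token lists is expected, so only one line needs to be in memory at a time.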
