From Boiler Worker to AI Expert (9)

Unsupervised Learning

We have already introduced the concept of unsupervised learning. In real-world work, unsupervised learning comes up quite often.
In typical applications, supervised learning is mostly used for "classification": given some data, the model makes a decision, choosing one option from a finite set of possibilities. Recognition tasks of all kinds, autonomous driving and so on belong to this category.
Unsupervised learning, by contrast, is about "clustering": the algorithm discovers the regularities of the input data on its own and groups the data accordingly, so that samples with the same characteristics end up in the same cluster. Natural language understanding, recommendation algorithms, user profiling and the like belong here (in practice they are often implemented with semi-supervised learning, but conceptually they were first introduced as unsupervised learning).
Unsupervised learning indeed needs no manual labeling, but the input data must still retain its original, inherent regularities. Preserving those regularities, or picking out representative ones, often still takes some human effort.
Between the two there is also semi-supervised learning, for example when half of the data is labeled and half is not: the labeled data is classified first, and the unlabeled data is then "clustered" into the known classes. In terms of implementation it either combines the two kinds of algorithms or in fact leans toward supervised learning, so we will not treat it separately here.
We have already seen quite a few supervised learning examples, but none for unsupervised learning. Today we dissect one.

Word Vectorization (Vector Representations of Words)

Word vectorization is a fairly typical unsupervised learning task. The original idea is this: in natural language processing (NLP), understanding the meaning of words is an important part of the work. As we have said before, machine learning is in essence mathematics, solving equations. Moreover, words vary in length, so following the principle of normalization, the first step is to digitize the words into a uniform dimension and scale, that is, replace every word with a number. Telegraph codes from decades ago embodied the same idea: common words got shorter codes, so the encoded text was shorter, and because common words sat near the front of the code book they could also be looked up faster.
But this brings a big problem: words carry hidden internal meaning, take man/woman for example. Two clearly related words might be digitized into, say, 56 and 34, and the internal relationship is completely lost. The same holds for words like cat / dog / animal, and for NLP the lost information is actually a very important part.
The solution offered by word vectorization is therefore to embed all words into a continuous vector space. Words with similar meanings, or words that are potentially related, end up close to each other in that space, and the distance between two words can serve as a measure of how similar they are. For this reason word vectorization is also known as "word embedding".
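To make the distance idea concrete, here is a tiny sketch (not part of the official example; the 4-dimensional vectors and the words attached to them are invented purely for illustration) that measures similarity with the cosine of the angle between two vectors:

# A minimal sketch, not from the official example: cosine similarity between
# hypothetical word vectors. The vectors below are made up for illustration only.
import numpy as np

def cosine_similarity(a, b):
  # Cosine of the angle between two vectors; 1.0 means the same direction.
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vec_cat = np.array([0.8, 0.1, 0.6, 0.2])    # pretend embedding for "cat"
vec_dog = np.array([0.7, 0.2, 0.5, 0.3])    # pretend embedding for "dog"
vec_car = np.array([-0.5, 0.9, -0.1, 0.4])  # pretend embedding for "car"

print(cosine_similarity(vec_cat, vec_dog))  # close to 1: similar words
print(cosine_similarity(vec_cat, vec_car))  # much smaller: unrelated words
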
Because word vectorization is such important work, TensorFlow officially provides a whole set of demonstrations and tools, from low level to high level.

  • First there is a simple demonstration implementation, word2vec_basic.py, which is the example this article focuses on.
  • Because word2vec training is very time-consuming, an upgraded version, word2vec_optimized.py, is also provided. It rewrites the slow parts in C++ and uses them as an external TensorFlow module, offering more production-ready functionality. Since this code introduces few new machine learning concepts and mainly exercises the technique of writing Python extension modules in C++, it serves more as an example of writing external modules, so we only mention it briefly and skip it here. Interested readers can study it on their own.
  • There is also an official command-line tool, word2vec, installable with pip, for real word-vectorization work. In NLP projects, word vectorization is usually the very first step, preparing data for everything that follows; with this tool much of that work can start right away without writing new code.

Basic Principles

Almost every algorithm for word vectorization relies on the distributional hypothesis, whose core idea is that words appearing in the same contexts have similar meanings. The concept may sound vague, so here is an example: one sentence is "I ate an apple", another is "I ate a banana". Being unsupervised, neither sentence carries any label, but a trained model should understand that "apple" and "banana" are highly similar; in other words, the two words should lie very close to each other in the vector space.
To turn this idea into an algorithm, there are generally two approaches: count-based methods and inference-based methods.
Count-based methods: in a large corpus, count a word together with its neighboring words, recording statistics such as co-occurrence frequency, and then map all words into a vector space based on those quantities.
Inference-based methods: also called predictive methods. Assume a vector space already exists; using the data already in it, predict a word from its neighboring words, and adjust the word's position in the space whenever the prediction is wrong.
In practice the two approaches are often combined.
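As a rough illustration of the counting idea (a toy sketch only, not how the example program below works), one can tally co-occurrences within a window of one word on each side and later map the count vectors into a lower-dimensional space:

# Toy sketch of the count-based approach (not part of word2vec_basic.py):
# count how often each pair of words co-occurs within a 1-word window.
import collections

sentence = "the quick brown fox jumped over the lazy dog".split()
window = 1
cooccur = collections.Counter()
for i, word in enumerate(sentence):
  for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
    if j != i:
      cooccur[(word, sentence[j])] += 1

# Each word's row of counts could then be reduced (e.g. with SVD) to a dense vector.
print(cooccur.most_common(5))
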
The word2vec example uses a probabilistic language model based on maximum likelihood to predict the association between consecutive words. Material on maximum likelihood can be found in the reference links at the bottom of this article, which include the set of formulas behind the algorithm.
Then, for any given word, we look at its context. How the context is defined is configurable in the program; here we take one word to the left and one word to the right as the context. For example, take the sentence:
the quick brown fox jumped over the lazy dog
Grouping each word with its context gives:
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
We can then use "the brown" to predict "quick", and "quick fox" to predict "brown". This way of predicting is called the Continuous Bag-of-Words model (CBOW). The other direction, for example using "quick" to predict "the brown", is called the Skip-Gram model.
In terms of time complexity, CBOW suits smaller datasets and is more accurate (several words predict one word), while Skip-Gram suits larger datasets (one word predicts several words).
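To make the two sampling schemes concrete, here is a small self-contained sketch (independent of the generate_batch function in the source below) that builds CBOW and Skip-Gram training pairs for the sentence above, with one word of context on each side:

# Toy sketch: build CBOW and Skip-Gram training pairs with a context window of 1.
sentence = "the quick brown fox jumped over the lazy dog".split()

cbow_pairs = []       # ([left, right], target)
skip_gram_pairs = []  # (target, context_word)
for i in range(1, len(sentence) - 1):
  context = [sentence[i - 1], sentence[i + 1]]
  target = sentence[i]
  cbow_pairs.append((context, target))
  for c in context:
    skip_gram_pairs.append((target, c))

print(cbow_pairs[:2])       # [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown')]
print(skip_gram_pairs[:4])  # [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]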

Source Code

#!/usr/bin/env python
# -*- coding=UTF-8 -*-

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Basic word2vec example."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import os
import random
from tempfile import gettempdir
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

# Step 1: Download the data.
url = 'http://mattmahoney.net/dc/'

# Download the given corpus from the URL above.
# To speed things up, the file was downloaded manually and this function is
# bypassed below, so the slow download is not repeated on every run.
# pylint: disable=redefined-outer-name
def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  local_filename = os.path.join(gettempdir(), filename)
  if not os.path.exists(local_filename):
    local_filename, _ = urllib.request.urlretrieve(url + filename,
                                                   local_filename)
  statinfo = os.stat(local_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception('Failed to verify ' + local_filename +
                    '. Can you get to it with a browser?')
  return local_filename

#Actual download path of the word data package: http://mattmahoney.net/dc/text8.zip
#It was downloaded in advance and put in the current directory, so filename is
#changed below and maybe_download is no longer called.
#filename = maybe_download('text8.zip', 31344016)
filename = "./text8.zip"

#Read all the data from the first file in the zip package (there is actually only one text file).
#The data contains only words separated by single spaces, with no punctuation.
#Word order is preserved: it is as if punctuation were stripped from an article
#while keeping the order of the words in every sentence.
#To get a feel for the data you can unzip text8.zip and look at the text file;
#it is large, so it is best to view only part of it with more.
# Read the data into a list of strings.
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data

#Read all the words into an array of strings
vocabulary = read_data(filename)
print('Data size', len(vocabulary))

#The full dataset contains 17005207 words for this download; the 50000 below
#limits the effective vocabulary size so the demonstration runs faster
# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000

def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  #Count occurrences of each word and sort by frequency, common words first; the very
  #first entry is of course UNK, followed by the/of/and/one. The top 5 words are printed later...
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    #The counter grows, giving each word a unique numeric code.
    #UNK is entry 0 with code 0, because the dictionary is empty (len 0) when UNK is added.
    #In the printed output the entries are listed by word, so the order looks scrambled;
    #the reverse table, keyed by number, makes the ordering much more obvious.
    dictionary[word] = len(dictionary) 
  data = list()
  #data ends up being the digitized words, i.e. the digitized original text:
  #following the original text order, each element is the numeric code of that word,
  #looked up in dictionary, i.e. the word-to-number mapping built just above
  unk_count = 0
  for word in words:
    index = dictionary.get(word, 0)
    if index == 0:  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  #The reversed dictionary maps numbers back to words, so lookups work in both directions
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

# Filling 4 global variables:
# data - list of codes (integers from 0 to vocabulary_size-1).
#   This is the original text but words are replaced by their codes
# count - map of words(strings) to count of occurrences
# dictionary - map of words(strings) to their codes(integers)
# reverse_dictionary - maps codes(integers) to words(strings)
#Use build_dataset to fill the 4 global variables;
#their contents were described in the comments inside the function above,
#see also the original English comments above
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
#After the digitization above, the original text is no longer needed; delete it to save memory
del vocabulary  # Hint to reduce memory.
#As noted above, count holds the occurrence statistics; print the 5 most common words
print('Most common words (+UNK)', count[:5])
#The first 10 digitized words and the original words recovered via the reverse table.
#Note that the list-comprehension lookup at the end is Python-specific syntax, rarely seen in other languages.
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0

# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  if data_index + span > len(data):
    data_index = 0
  buffer.extend(data[data_index:data_index + span])
  data_index += span
  for i in range(batch_size // num_skips):
    context_words = [w for w in range(span) if w != skip_window]
    words_to_use = random.sample(context_words, num_skips)
    for j, context_word in enumerate(words_to_use):
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[context_word]
    if data_index == len(data):
      buffer.extend(data[0:span])
      data_index = span
    else:
      buffer.append(data[data_index])
      data_index += 1
  # Backtrack a little bit to avoid skipping words in the end of a batch
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels

#Generate a batch of data for training; here a very small batch is generated first
#and printed below, so we can check by eye whether the generated data looks reasonable
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

# Step 4: Build and train a skip-gram model.
#The constants defined here are the real batch size and other parameters used for training
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
# Consider 1 word on each side
skip_window = 1       # How many words to consider left and right.
# Each input (center) word is reused 2 times, paired with different context words
num_skips = 2         # How many times to reuse an input to generate a label.
num_sampled = 64      # Number of negative examples to sample.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. These 3 variables are used only for
# displaying model accuracy, they don't affect calculation.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)


graph = tf.Graph()

with graph.as_default():

  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  # Explanation of the meaning of NCE loss:
  #   http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Add variable initializer.
  init = tf.global_variables_initializer()

# Step 5: Begin training.
num_steps = 100001

with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  print('Initialized')

  average_loss = 0
  for step in xrange(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    #Data that is fed in batch by batch while TensorFlow runs is declared with tf.placeholder;
    #here everything to be fed is packed into a dict first
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    #Run the graph, feeding the data batch by batch
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
      if step > 0:
        average_loss /= 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      #What is printed here is the loss averaged over the last 2000 batches
      print('Average loss at step ', step, ': ', average_loss)
      average_loss = 0

    # Note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in xrange(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
        log_str = 'Nearest to %s:' % valid_word
        for k in xrange(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log_str = '%s %s,' % (log_str, close_word)
        print(log_str)
  final_embeddings = normalized_embeddings.eval()

# Step 6: Visualize the embeddings.


# pylint: disable=missing-docstring
# Function to draw visualization of distance between embeddings.
def plot_with_labels(low_dim_embs, labels, filename):
  assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
  plt.figure(figsize=(18, 18))  # in inches
  for i, label in enumerate(labels):
    x, y = low_dim_embs[i, :]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

  plt.savefig(filename)

try:
  # pylint: disable=g-import-not-at-top
  from sklearn.manifold import TSNE
  import matplotlib.pyplot as plt

  tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
  plot_only = 500
  low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
  labels = [reverse_dictionary[i] for i in xrange(plot_only)]
  plot_with_labels(low_dim_embs, labels, './tsne.png')

except ImportError as ex:
  print('Please install sklearn, matplotlib, and scipy to show embeddings.')
  print(ex)

The source does not use the main()-style functional structure; it is written largely in a procedural way: define a function for one piece of functionality, then immediately, at module level, run the initialization and call that function, then move on to the next function and its call.
Apart from parts we have seen before, the source carries fairly extensive comments. Below, a few key parts are explained in more detail.
Before the walkthrough, to make things easier to follow, here is the beginning of the corpus:

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist society might work vary considerably especially with respect to economics there is also disagreement about how a free society might be brought about origins and predecessors kropotkin and others argue that before recorded history human society was organized on anarchist principles most anthropologists follow kropotkin and engels in believing that hunter gatherer bands were egalitarian and lacked division of labour accumulated wealth or decreed law and had equal access to resources william godwin anarchists including the the anarchy organisation and rothbard find anarchist attitudes in taoism from ancient china kropotkin found similar ideas in stoic zeno of citium according to kropotkin zeno repudiated the omnipotence of the state its intervention and regimentation and proclaimed the sovereignty of the moral law of the individual the anabaptists of one six th century europe are sometimes considered to be religious forerunners of modern anarchism bertrand russell in his history of western philosophy writes that the anabaptists repudiated all law since they held that the good man will be guided at every moment by the holy spirit from

The corpus is one continuous text file in which the words are separated by single spaces, with no punctuation, no newlines and no other control characters (which is why the excerpt above fills many lines in a terminal but is shown here as a single line).
Following the official walkthrough, we also divide the program into 6 parts:

  1. Check whether the corpus is present locally; if not, download it from http://mattmahoney.net/dc/text8.zip
    As in earlier examples, because the compressed download is over 30 MB, I downloaded the corpus by hand and modified the program slightly to open the text8.zip file directly from the current directory, to save time.
    One advantage over the advanced example later on is that this example uses the zipfile package to read the corpus straight out of the archive, with no need to unpack it first; decompressed it is a text file of well over 100 MB.
    The words are read into the vocabulary array, one word per element, in the order in which they appear in the corpus.

  2. Basic data preparation. For demonstration purposes, only the 50000 most frequent words are kept as the vocabulary for training. The occurrences of each word in the training set are counted, and from those frequencies the dictionary is built: more frequent words come first, and a word's position in the dictionary becomes its numeric code. Words that occur only rarely are replaced with "UNK" (such words appear too seldom to have enough context, so they cannot be trained or predicted; replacing them with UNK effectively drops them). UNK is the first dictionary entry, with code 0; the rest are ordered by frequency. When the program starts it prints the 5 most common words, which should look like this:
    Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
    This also tells us that the code for the is 1, of is 2, and is 3, and one is 4.
    The dictionary is then used to digitize the whole corpus; the result is stored in data, with the rare words already mapped to the UNK code. When finished it looks something like:
    5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156; these numbers stand for the first 10 words of the original text:
    'anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against'
    Finally, since digitized words will later need to be turned back into words, a reversed_dictionary is also built, keyed by number with the word as value, for reverse lookups.

  3. A function that generates the training data. Training data is produced batch by batch, matching how the training loop consumes it. After defining the function, it is tried out once with a very small batch (8 in the program), whose contents are printed for inspection. The key is to understand the three parameters: batch_size is the number of samples generated per batch; num_skips is how many context words are sampled for each center word, i.e. how many times one input word is reused to generate labels; skip_window is how many words on each side of the current word count as its context. As explained earlier, the Skip-Gram algorithm uses the current word to predict that context with the trained model.

  4. Build the skip-gram model used for training. Two points deserve attention:
  • Instead of the familiar softmax classifier, the model uses nce_loss; the exact formulas can be looked up online (they appear in the official word2vec tutorial and are omitted here). The main reason is that softmax is very slow when the number of classes is huge; in addition, the mathematical properties of the NCE loss assign high probability to the words the model predicts correctly (maximum likelihood; the formula ultimately yields a probability) and low probability to the words it misses (noise words), thereby suppressing noise.
  • tf.nn.embedding_lookup is a function we have not met before; to make it easier to understand, a small example is given further below. The rest of the code, although the model itself is unusual, should be readable.
  5. Train the model. The final vectorized result is copied out of TensorFlow into the Python variable final_embeddings. Every 10000 steps, 16 words and their most similar words are printed, which gives a clear view of how the similarity results improve. Of course, as a demonstration program it is still far from a production application. (A small NumPy sketch of this kind of nearest-neighbour lookup follows after this list.)
  6. Take the resulting vectors, plot the first 500 of them, and save the plot as a PNG image. The picture makes the idea of word vectorization much easier to grasp.
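As mentioned in step 5, here is a minimal sketch of such a nearest-neighbour lookup done with plain NumPy, assuming final_embeddings (whose rows are already L2-normalized), dictionary and reverse_dictionary from the program above are still in memory:

# Minimal sketch: nearest-neighbour lookup in the trained embedding space.
# Assumes final_embeddings (L2-normalized rows), dictionary and
# reverse_dictionary from word2vec_basic.py are available.
import numpy as np

def nearest_words(word, top_k=8):
  idx = dictionary[word]
  # For normalized rows, cosine similarity reduces to a dot product.
  sims = np.dot(final_embeddings, final_embeddings[idx])
  nearest = (-sims).argsort()[1:top_k + 1]  # skip the word itself
  return [reverse_dictionary[i] for i in nearest]

print(nearest_words('three'))  # after training, typically other number words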

One more point worth explaining: earlier examples kept stressing the importance of normalization, yet in word2vec there is essentially no normalization beyond the digitization itself. There are many reasons, the main one being that in previous examples we cared about quantities, where fitting to a value reasonably close to the target already counted as a good result. For digitized words, however, every integer corresponds to one word: fractional values make no sense, and a difference of even 1 means a completely different word. That is why this example skips the usual conversion-to-floating-point style of normalization.

What tf.nn.embedding_lookup does

Let's look at a small example:

#!/usr/bin/env python
# coding=utf-8
import tensorflow as tf
import numpy as np

input_ids = tf.placeholder(dtype=tf.int32, shape=[None])

#Define a 5x5 identity matrix; its layout can be seen in the first block of the output below
embedding = tf.Variable(np.identity(5, dtype=np.int32))
#Use embedding_lookup to index into the matrix; the indices to look up come from input_ids
input_embedding = tf.nn.embedding_lookup(embedding, input_ids)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
print(embedding.eval())
print(sess.run(input_embedding, feed_dict={input_ids:[1, 2, 3, 0, 3, 2, 1]}))

The output:

embedding = [[1 0 0 0 0]
             [0 1 0 0 0]
             [0 0 1 0 0]
             [0 0 0 1 0]
             [0 0 0 0 1]]
input_embedding = [[0 1 0 0 0]
                   [0 0 1 0 0]
                   [0 0 0 1 0]
                   [1 0 0 0 0]
                   [0 0 0 1 0]
                   [0 0 1 0 0]
                   [0 1 0 0 0]]

What embedding_lookup does is take each id in input_ids, find the corresponding row of embedding, and stack those rows into a new matrix that is returned. In the example above, rows 1, 2, 3, 0, 3, 2, 1 of embedding are reassembled into a 7-row matrix and returned as input_embedding.
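Put differently, embedding_lookup is just row indexing, which is also equivalent to multiplying a one-hot matrix by the embedding table. A small NumPy sketch of that equivalence (my own illustration, not part of the example):

# Sketch: embedding_lookup behaves like row indexing / one-hot matrix multiplication.
import numpy as np

embedding = np.identity(5, dtype=np.int32)
input_ids = [1, 2, 3, 0, 3, 2, 1]

rows = embedding[input_ids]                     # plain row indexing
one_hot = np.eye(5, dtype=np.int32)[input_ids]  # one-hot encode the ids
matmul = one_hot.dot(embedding)                 # one-hot matrix x embedding table

print(np.array_equal(rows, matmul))  # True: both give the same 7x5 matrix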

Advanced Implementation

The advanced version is an example that is basically usable in practice; the project page explains how to run it, but on macOS there are a few problems, explained here.
First, compilation. The TensorFlow 1.4.1 I used does not have the method tf.sysconfig.get_compile_flags(), so the proper compile flags could not be obtained; in the end I wrote a script to do the compilation:

#!/bin/sh

TF_CFLAGS="-I/usr/local/lib/python2.7/site-packages/tensorflow/include"
TF_LFLAGS="-L/usr/local/lib/python2.7/site-packages/tensorflow"

g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC ${TF_CFLAGS} ${TF_LFLAGS} -O2 -D_GLIBCXX_USE_CXX11_ABI=0 -undefined dynamic_lookup

The approach is simply to locate the INCLUDE and LIB paths by hand, put them into constants and pass them to the compiler directly.
Note that when compiling on macOS you must add -undefined dynamic_lookup, otherwise linking fails.
The resulting .so file is then used from the Python program like this:

word2vec = tf.load_op_library(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'word2vec_ops.so'))
...
    (words, counts, words_per_epoch, current_epoch, total_words_processed,
     examples, labels) = word2vec.skipgram_word2vec(filename=opts.train_data,
                                                    batch_size=opts.batch_size,
                                                    window_size=opts.window_size,
                                                    min_count=opts.min_count,
                                                    subsample=opts.subsample)

Preparing the data files with the officially documented commands works fine:

curl http://mattmahoney.net/dc/text8.zip > text8.zip
unzip text8.zip
curl https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip > source-archive.zip
unzip -p source-archive.zip  word2vec/trunk/questions-words.txt > questions-words.txt
rm text8.zip source-archive.zip

Because the question set used to evaluate the training results is hosted on a Google server, downloading it may require a proxy in some regions.

Finally, the output of word2vec_optimized.py looks like this:

2018-01-16 13:17:14.277603: I word2vec_kernels.cc:200] Data file: /Users/andrew/dev/tensorFlow/word2vec/text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file:  /Users/andrew/dev/tensorFlow/word2vec/text8
Vocab size:  71290  + UNK
Words per epoch:  17005207
Eval analogy file:  /Users/andrew/dev/tensorFlow/word2vec/questions-words.txt
Questions:  17827
Skipped:  1717
Epoch    1 Step   150943: lr = 0.024 words/sec =    31527
Eval 1469/17827 accuracy =  8.2%
Epoch    2 Step   301913: lr = 0.023 words/sec =    25120
Eval 2395/17827 accuracy = 13.4%
Epoch    3 Step   452887: lr = 0.021 words/sec =     8842
Eval 3014/17827 accuracy = 16.9%
Epoch    4 Step   603871: lr = 0.020 words/sec =     6615
Eval 3532/17827 accuracy = 19.8%
Epoch    5 Step   754815: lr = 0.019 words/sec =     3007
Eval 3994/17827 accuracy = 22.4%
Epoch    6 Step   905787: lr = 0.018 words/sec =    26590
Eval 4320/17827 accuracy = 24.2%
Epoch    7 Step  1056767: lr = 0.016 words/sec =    35439
Eval 4714/17827 accuracy = 26.4%
Epoch    8 Step  1207755: lr = 0.015 words/sec =      401
Eval 4965/17827 accuracy = 27.9%
Epoch    9 Step  1358735: lr = 0.014 words/sec =    36991
Eval 5276/17827 accuracy = 29.6%
Epoch   10 Step  1509744: lr = 0.013 words/sec =    25069
Eval 5415/17827 accuracy = 30.4%
Epoch   11 Step  1660729: lr = 0.011 words/sec =    28271
Eval 5649/17827 accuracy = 31.7%
Epoch   12 Step  1811667: lr = 0.010 words/sec =    29973
Eval 5880/17827 accuracy = 33.0%
Epoch   13 Step  1962606: lr = 0.009 words/sec =    10225
Eval 6015/17827 accuracy = 33.7%
Epoch   14 Step  2113546: lr = 0.008 words/sec =    21419
Eval 6270/17827 accuracy = 35.2%
Epoch   15 Step  2264489: lr = 0.006 words/sec =    27059
Eval 6434/17827 accuracy = 36.1%

The program looks considerably more complicated. Its main purpose is to show how to take time-consuming operations, or algorithms TensorFlow does not provide, and implement them in C++ as a TensorFlow extension in order to build a complex machine learning model. So the source is not discussed further here; interested readers can analyze it on their own.
Finally, a look at the format of the question set used for evaluation:

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
Athens Greece Helsinki Finland
Athens Greece Islamabad Pakistan
Athens Greece Kabul Afghanistan
Athens Greece London England
Athens Greece Madrid Spain
Athens Greece Moscow Russia
Athens Greece Oslo Norway
Athens Greece Ottawa Canada
Athens Greece Paris France
Athens Greece Rome Italy
Athens Greece Stockholm Sweden
Athens Greece Tehran Iran
Athens Greece Tokyo Japan
Baghdad Iraq Bangkok Thailand
Baghdad Iraq Beijing China
Baghdad Iraq Berlin Germany
Baghdad Iraq Bern Switzerland
...

Lines beginning with a colon ":" are comment lines and are skipped by the program.
The remaining lines are analogy pairs of the form capital-country capital-country, four words per line. The prediction method is to use the first three words to predict the fourth; each correct prediction adds to the accuracy count. Considering that the training corpus text8 and the evaluation question set questions-words.txt are two completely different, unrelated datasets, reaching 36.1% accuracy is no small feat (and this run did not even complete the full training, so the accuracy could go higher still). The prediction itself is usually done with vector arithmetic in the embedding space, as sketched below.
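Here is a rough sketch of that prediction (assuming emb is an L2-normalized embedding matrix and dictionary / reverse_dictionary map words to ids and back, as in the basic example; whether a given word is in the vocabulary depends on the training run):

# Rough sketch of the analogy evaluation: predict the 4th word from the first three.
# Assumes emb is an L2-normalized embedding matrix and dictionary /
# reverse_dictionary map words <-> ids, as in the basic example.
import numpy as np

def predict_analogy(a, b, c):
  # a : b  is to  c : ?   ->   vec(b) - vec(a) + vec(c)
  target = emb[dictionary[b]] - emb[dictionary[a]] + emb[dictionary[c]]
  sims = np.dot(emb, target)
  for idx in (-sims).argsort():
    word = reverse_dictionary[int(idx)]
    if word not in (a, b, c):  # the question words themselves do not count
      return word

print(predict_analogy('athens', 'greece', 'baghdad'))  # ideally 'iraq'
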
Relying on this property, word vectorization is also often used for intelligent knowledge-base retrieval in call centers, as one building block of automated answering bots.

(To be continued...)

Citations and References

word2vec tutorial, TensorFlow Chinese community
Illustrated word2vec
Maximum likelihood estimation
Dependency-Based Word Embeddings
