In NLP, semantic understanding is still a hard problem!
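Given an article or a sentence, people understand it by searching the surrounding context and associating it with what they already know. In general, when a person works out the meaning of something, the mind looks up knowledge related to it. The creators of the knowledge graph held that the world is made of entities, not strings, and this fundamentally changed the older approach to search. Semantic understanding is in fact built on knowledge, on concepts, and on the relations among those concepts. When people answer a question, they often recount knowledge related to that question; this is semantic understanding at work.

This mechanism is completely different from how people perceive images or speech. It is no surprise that CNNs have done well on images and speech, because biologists already know a great deal about how the brain's neurons handle image recognition. They know very little about the neural mechanism by which the brain understands text, and that is why progress on semantic understanding in NLP has been so slow. Many people have tried bringing CNNs into NLP with poor results, finding that a multi-layer CNN performs almost the same as a single-layer one; the reason goes back to how neurons in the brain actually work. Copying a method over mechanically is bound to fail! The essence of deep learning is not simply having many layers of neurons. Starting from the most basic features, extracting higher-order features layer by layer, and finally classifying: that is the key to deep learning's success.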
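Some people question whether word2vector (word2vec) counts as deep learning, arguing that it has too few layers to be called deep. This is a misunderstanding. word2vector is deep learning through and through, because it extracts higher-order features of words. Its success rests on its core idea: words that appear in the same contexts have similar meanings. Going from the one-hot first layer to the embedding layer is exactly the process of extracting higher-order features. As noted above, more layers do not necessarily give better results. A word embedding is already a higher-order feature, and text is far more complex than images, so the current way of bringing CNNs into NLP may be heading in the wrong direction. We need to study in depth the neural mechanism by which the brain understands text and work out the biological model first; only then can a mathematical model be abstracted from it, as happened with CNNs. Otherwise NLP will not make substantial progress. For now, LSTM and the Attention Model are relatively successful, but they still operate on form, and deep semantics remains unsolved.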
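The step from the one-hot layer to the embedding layer can be illustrated with a small NumPy sketch. The sizes and the random matrix here are made up for illustration; the only point is that multiplying a one-hot vector by the embedding matrix is the same as picking one row of it.

```python
import numpy as np

vocab_size, embedding_size = 5, 3  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
# Stand-in for a trained embedding matrix: one row per word in the vocabulary.
embeddings = rng.uniform(-1.0, 1.0, size=(vocab_size, embedding_size))


def one_hot(index, size):
    """Return a one-hot vector with a 1 at position `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


word_index = 2
via_matmul = one_hot(word_index, vocab_size) @ embeddings  # "first layer" view
via_lookup = embeddings[word_index]                        # table-lookup view
print(np.allclose(via_matmul, via_lookup))  # True
```

This is why frameworks implement the embedding layer as a table lookup rather than as an actual matrix multiplication.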
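For now, deep learning methods such as LSTM and the Attention Model are applied in NLP only to context and to word and sentence vectors. They can compute sentence similarity, do clustering and the like, but they are still far from making a machine truly understand text. In other words, working only at the semantic representation layer is nowhere near enough; the knowledge graph underneath is the key. The knowledge graph proposed by Google is a real change. NLP is a complete ecosystem, from storage at the bottom, as graph database (GDB) triples of the form (entity, relation, entity), up to the semantic representation above it, a stage where deep learning can be trained directly. For example, (head, relation, tail) triples describe a graph that expresses the relations between entities, and deep learning can be used to train a model h + r = t to obtain semantic representations. At prediction time, once we have the representations of two entities, subtracting one from the other tells us their relation. This is different from word2vector, but the two have something in common: word2vector's CBOW trains the model x1 + x2 + … = y. 知網 is currently working on this as well.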
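As a rough illustration of "subtract two entity vectors to get the relation", here is a minimal sketch. The entity names, relation names, and 2-d vectors are all hypothetical and hand-picked so that h + r ≈ t roughly holds; a real system would learn them from triples.

```python
import numpy as np

# Hypothetical, hand-picked embeddings (not trained): chosen so that h + r ≈ t.
entities = {
    "Paris": np.array([1.0, 2.0]),
    "France": np.array([2.0, 4.0]),
    "Beijing": np.array([3.0, 1.0]),
    "China": np.array([4.1, 2.9]),
}
relations = {
    "capital_of": np.array([1.0, 2.0]),
    "located_near": np.array([-2.0, 0.5]),
}


def predict_relation(head, tail):
    """Return the relation whose vector is closest to t - h."""
    diff = entities[tail] - entities[head]
    return min(relations, key=lambda name: np.linalg.norm(diff - relations[name]))


print(predict_relation("Paris", "France"))   # capital_of
print(predict_relation("Beijing", "China"))  # capital_of
```

The second pair does not satisfy h + r = t exactly, which is the usual situation: the nearest relation vector is taken as the prediction.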
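Semantic representation is the most important part of applying deep learning to NLP. word2vector was a huge success for word embeddings, and the main direction now is to move from word embeddings to sentence or document embeddings. Obtaining sentence embeddings was covered in my earlier post on siamese lstm. During 2014~2015, researchers abroad explored many ways of computing sentence or document similarity, such as tree-lstm, convnet, skip-thought, and siamese lstm based on the ma structure. Judging from the reported numbers, siamese lstm based on the ma structure works best and fits the regularities of NLP best. There are already siamese lstm experiments on github, and a further improvement could be built on BiLSTM. Whether adding layers improves accuracy remains to be shown; personally I am neutral on this.

This post is mainly about word2vector. Its core idea, mentioned above, is the level of principle (道). The detailed derivations, such as the optimizations for CBOW and skip-gram, namely negative sampling and Huffman-tree (hierarchical) softmax, are the level of technique (術). Here is word2vector implemented with tensorflow: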
data_helper.py:

```python
import collections
import os
import zipfile
import urllib.request as request

import numpy as np
import tensorflow as tf

url = 'http://mattmahoney.net/dc/'


def maybe_download(filename, expected_bytes):
    """Download `filename` if it is missing, then verify its size."""
    if not os.path.exists(filename):
        filename, _ = request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename +
                        '. Can you get to it with a browser?')
    return filename


def read_data(filename):
    """Read the first file in the zip archive as a list of words."""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data


vocabulary_size = 50000


def build_dataset(words):
    """Map words to indices; everything outside the top words becomes UNK (0)."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # `count` is sorted by descending frequency, so frequent words get small indices.
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    data = list()
    un_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # out-of-vocabulary words share the UNK index
            un_count += 1
        data.append(index)
    count[0][1] = un_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, reverse_dictionary, dictionary, count


data_index = 0


def generate_batch(data, batch_size, num_skips, skip_window):
    """Return one skip-gram batch of (center word, context word) index pairs."""
    # The original re-downloaded and re-read text8.zip on every call here;
    # that result was never used, so the two lines were removed.
    global data_index
    assert num_skips <= 2 * skip_window
    assert batch_size % num_skips == 0
    span = 2 * skip_window + 1  # [ skip_window, center, skip_window ]
    batch = np.ndarray(shape=[batch_size], dtype=np.int32)
    labels = np.ndarray(shape=[batch_size, 1], dtype=np.int32)
    buffer = collections.deque(maxlen=span)
    # Initialize: fill the window with the first `span` words.
    for i in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Slide the window over the data and collect the batch.
    for i in range(batch_size // num_skips):
        target = skip_window  # the center word sits in the middle of the window
        avoid_target = [skip_window]
        for j in range(num_skips):
            # Sample a context position that has not been used yet.
            while target in avoid_target:
                target = np.random.randint(0, span)
            avoid_target.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
```
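To make the windowing in generate_batch easier to follow, here is a simplified, deterministic sketch of skip-gram pair generation. It is not the same procedure: generate_batch samples num_skips random context positions per center word, while this hypothetical helper `skipgram_pairs` simply lists every (center, context) pair inside the window.

```python
def skipgram_pairs(token_ids, skip_window):
    """List every (center, context) pair within `skip_window` positions."""
    pairs = []
    for center in range(len(token_ids)):
        lo = max(0, center - skip_window)
        hi = min(len(token_ids), center + skip_window + 1)
        for ctx in range(lo, hi):
            if ctx != center:
                pairs.append((token_ids[center], token_ids[ctx]))
    return pairs


sentence = ["the", "quick", "brown", "fox"]  # toy corpus
vocab = {word: index for index, word in enumerate(sentence)}
ids = [vocab[word] for word in sentence]

for center_id, context_id in skipgram_pairs(ids, skip_window=1):
    print(sentence[center_id], "->", sentence[context_id])
# the -> quick
# quick -> the
# quick -> brown
# brown -> quick
# brown -> fox
# fox -> brown
```

In the training code the first element of each pair goes into `batch` and the second into `labels`.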
w2vmodel.py:

```python
# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
import math

import numpy as np
import tensorflow as tf

import w2v.data_helper as da

# Load the corpus once at import time and build the vocabulary.
filename = da.maybe_download('text8.zip', 31344016)
words = da.read_data(filename)
assert words is not None
data, reverse_dictionary, dictionary, count = da.build_dataset(words)


class config(object):
    batch_size = 128
    embedding_size = 128   # dimension of the word vectors
    skip_window = 1        # words considered on each side of the center word
    num_skips = 2          # context words sampled per center word
    valid_size = 16        # number of words used to inspect nearest neighbours
    valid_window = 100     # pick them from the 100 most frequent words
    valid_examples = np.random.choice(valid_window, valid_size, replace=False)
    num_sampled = 64       # number of negative samples
    vocabulary_size = 50000
    num_steps = 10001


class w2vModel(object):
    def __init__(self, config):
        self.train_inputs = train_inputs = tf.placeholder(
            tf.int32, shape=[config.batch_size])
        self.train_labels = train_labels = tf.placeholder(
            tf.int32, shape=[config.batch_size, 1])
        self.valid_dataset = valid_dataset = tf.constant(
            config.valid_examples, dtype=tf.int32)
        with tf.device('/cpu:0'):
            # The embedding matrix is the set of word vectors being learned.
            embeddings = tf.Variable(
                tf.random_uniform(shape=[config.vocabulary_size, config.embedding_size],
                                  minval=-1.0, maxval=1.0))
            embed = tf.nn.embedding_lookup(embeddings, train_inputs)
            nce_weights = tf.Variable(
                tf.truncated_normal([config.vocabulary_size, config.embedding_size],
                                    stddev=1.0 / math.sqrt(config.embedding_size)))
            nce_bias = tf.Variable(tf.zeros([config.vocabulary_size]))
        # train.py reads loss, optimizer and similarity from the model,
        # so they are stored on `self` (the original kept them as locals).
        self.loss = loss = tf.reduce_mean(
            tf.nn.nce_loss(weights=nce_weights,
                           biases=nce_bias,
                           labels=train_labels,
                           inputs=embed,
                           num_sampled=config.num_sampled,
                           num_classes=config.vocabulary_size))
        self.optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
        # Cosine similarity between the validation words and every word.
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        self.normalized_embeddings = normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        self.similarity = tf.matmul(valid_embeddings, normalized_embeddings,
                                    transpose_b=True)
        tf.add_to_collection("embedding", embeddings)
        self.saver = saver = tf.train.Saver(tf.global_variables())
```
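The model above relies on tf.nn.nce_loss. As a rough picture of what sampling-based training does, here is a NumPy sketch of the simplified negative-sampling objective for a single (center, context) pair. This is the related negative-sampling loss, not a reimplementation of tf.nn.nce_loss, and all vectors are random placeholders.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)


def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one true (center, context) pair and k sampled negatives."""
    positive_term = log_sigmoid(context_vec @ center_vec)            # pull the true pair together
    negative_term = log_sigmoid(-(negative_vecs @ center_vec)).sum()  # push negatives away
    return -(positive_term + negative_term)


rng = np.random.default_rng(0)
dim, num_sampled = 8, 4  # toy sizes for illustration
center = rng.normal(size=dim)
negatives = rng.normal(size=(num_sampled, dim))

aligned = negative_sampling_loss(center, center, negatives)   # context points the same way
opposed = negative_sampling_loss(center, -center, negatives)  # context points the opposite way
print(aligned < opposed)  # True
```

Only num_sampled + 1 output vectors are touched per training pair, instead of all vocabulary_size rows as a full softmax would require.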
train.py:

```python
# Written against the TensorFlow 1.x graph API.
import tensorflow as tf

import w2v.w2vmodel as model
import w2v.data_helper as da

config = model.config()
with tf.Graph().as_default() as g:
    # The model must expose optimizer, loss, similarity and saver as attributes.
    Model = model.w2vModel(config)
    with tf.Session(graph=g) as session:
        tf.global_variables_initializer().run()
        print("initialized")
        average_loss = 0.0
        for step in range(config.num_steps):
            batch_inputs, batch_labels = da.generate_batch(
                model.data, config.batch_size, config.num_skips, config.skip_window)
            feed_dict = {Model.train_inputs: batch_inputs,
                         Model.train_labels: batch_labels}
            _, loss_val = session.run([Model.optimizer, Model.loss],
                                      feed_dict=feed_dict)
            average_loss += loss_val
            # Report the mean loss over the last 2000 steps.
            if step % 2000 == 0:
                if step > 0:
                    average_loss /= 2000
                print("Average loss at step", step, ":", average_loss)
                average_loss = 0
            # Every 10000 steps, print the nearest neighbours of the validation words.
            if step % 10000 == 0:
                sim = Model.similarity.eval()
                for i in range(config.valid_size):
                    valid_word = model.reverse_dictionary[config.valid_examples[i]]
                    top_k = 8
                    # Position 0 is the word itself, so skip it.
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log_str = "Nearest to %s:" % valid_word
                    for k in range(top_k):
                        close_word = model.reverse_dictionary[nearest[k]]
                        log_str = "%s %s," % (log_str, close_word)
                    print(log_str)
        Model.saver.save(session, "E:/word2vector/models/model.ckpt")
        # final_embeddings = Model.normalized_embeddings.eval()
```
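The nearest-neighbour check in the training loop can also be done outside tensorflow once the embedding matrix has been exported. The sketch below assumes a tiny hand-made vocabulary and made-up 2-d vectors; `nearest_words` is a hypothetical helper, not part of the code above.

```python
import numpy as np

# Toy vocabulary and made-up embeddings, for illustration only.
dictionary = {"UNK": 0, "king": 1, "queen": 2, "apple": 3}
reverse_dictionary = {index: word for word, index in dictionary.items()}
embeddings = np.array([
    [-1.0, -1.0],  # UNK
    [1.0, 0.1],    # king
    [0.9, 0.2],    # queen
    [0.0, 1.0],    # apple
])


def nearest_words(query, top_k=2):
    """Return the top_k words with the highest cosine similarity to `query`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    query_index = dictionary[query]
    sims = normalized @ normalized[query_index]
    order = np.argsort(-sims)  # most similar first
    return [reverse_dictionary[int(i)] for i in order if i != query_index][:top_k]


print(nearest_words("king"))  # ['queen', 'apple']
```

Normalizing the rows first means a plain dot product gives cosine similarity, which is the same trick the model uses with normalized_embeddings.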
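The implementation is fairly simple. First count the words in the corpus and sort them by descending frequency to get dictionary {word: index}, then convert the words in the corpus to indices and train. The word vectors are just the network parameters called embedding; at prediction time you only need embedding and dictionary to look up a word vector, which is much simpler than biLSTM or siamese lstm! It does, however, have a fatal weakness for semantic understanding: every word that is not in the dictionary is mapped to index 0 (UNK), so all unseen words share one representation. That is clearly inadequate, and it limits what the model can do! This is why a small number of researchers in China are now studying how to combine neural probabilistic semantic representations with symbolic ones, which is no easy task!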
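The out-of-vocabulary problem is easy to see in a small sketch. The vocabulary and vectors below are made up; the lookup mirrors what build_dataset does when it sends unknown words to index 0.

```python
import numpy as np

dictionary = {"UNK": 0, "semantic": 1, "graph": 2}  # toy vocabulary
embeddings = np.array([
    [0.0, 0.0],   # UNK
    [0.3, -0.7],  # semantic
    [0.5, 0.2],   # graph
])  # made-up values standing in for trained vectors


def word_vector(word):
    """Look up a word vector, falling back to the UNK row for unseen words."""
    return embeddings[dictionary.get(word, dictionary["UNK"])]


a = word_vector("transformer")  # not in the dictionary
b = word_vector("blockchain")   # not in the dictionary either
print(np.array_equal(a, b))  # True
```

Two unrelated unseen words end up with exactly the same vector, so the model cannot tell them apart.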
Here's hoping for a real change in NLP semantic understanding……