Author | Marcellus Ruben
Compiled by | VK
Source | Towards Data Science
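When you hear the words "tea" and "coffee", what comes to mind? You might say that both are drinks and that both contain some amount of caffeine. The point is that we can easily recognize that the two words are related. However, when we give the words "tea" and "coffee" to a computer, it cannot recognize the relationship between them the way we do.
Words are not something a computer understands naturally. For a computer to grasp the meaning behind a word, the word has to be encoded in numeric form. This is where word embeddings come in.
Word embedding is a technique commonly used in natural language processing that converts words into numeric values in vector form. These vectors occupy an embedding space with a certain number of dimensions.
If two words appear in similar contexts, like "tea" and "coffee", they will lie close to each other in the embedding space and farther from words with different contexts.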
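To make "close in the embedding space" concrete, here is a minimal sketch of cosine similarity, a common closeness measure for word vectors. The three-dimensional vectors are invented for illustration and are not real GloVe values; real embeddings have many more dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors, made up for this example (not taken from GloVe).
toy_vectors = {
    "tea": [0.9, 0.8, 0.1],
    "coffee": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

tea_coffee = cosine_similarity(toy_vectors["tea"], toy_vectors["coffee"])
tea_car = cosine_similarity(toy_vectors["tea"], toy_vectors["car"])
print(f"tea vs coffee: {tea_coffee:.3f}")
print(f"tea vs car:    {tea_car:.3f}")
```

With these toy values, "tea" scores higher against "coffee" than against "car", which is the behaviour we expect from words that share a context.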
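In this article, I will show you step by step how to visualize word embeddings. Since the focus here is not on explaining the theory behind word embeddings in detail, you can read more about that theory in the articles linked from the original post.
To visualize word embeddings, we will use common dimensionality reduction techniques, namely PCA and t-SNE. To map words to their vector representations in the embedding space, we use the pretrained GloVe word embeddings.
Before visualizing word embeddings, we would normally need to train a model first. However, training word embeddings is computationally very expensive, so a pretrained word embedding model is commonly used instead. It contains the words and their associated vector representations in the embedding space.
GloVe is a popular pretrained word embedding model developed by researchers at Stanford University; another is word2vec, developed at Google. This article uses the pretrained GloVe embeddings, which you can download here.
https://nlp.stanford.edu/projects/glove/
We can use the Gensim library to load the pretrained word embedding model. Install it with pip as shown below.
    pip install gensim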
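As a first step, we need to convert the GloVe file format to the word2vec file format. With the word2vec format, we can load the pretrained word embedding model into memory using Gensim. Because loading this file takes some time every time it is run, it is better to keep this step in a separate Python file used only for this purpose.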
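    import pickle

    from gensim.models import KeyedVectors
    from gensim.scripts.glove2word2vec import glove2word2vec
    from gensim.test.utils import datapath, get_tmpfile

    # Path to the downloaded GloVe vectors (adjust to your own location).
    glove_file = datapath('C:/Users/Desktop/glove.6B.100d.txt')
    # Temporary file that will hold the vectors in word2vec text format.
    word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")

    # Convert GloVe format to word2vec format, then load it with Gensim.
    glove2word2vec(glove_file, word2vec_glove_file)
    model = KeyedVectors.load_word2vec_format(word2vec_glove_file)

    # Save the loaded model so other scripts can reuse it without reconverting.
    filename = 'glove2word2vec_model.sav'
    with open(filename, 'wb') as f:
        pickle.dump(model, f)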
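The conversion itself is small: as far as I know, the word2vec text format differs from the GloVe text format only by a first line that states the vocabulary size and the vector dimension. The sketch below illustrates this on a tiny in-memory example. `add_word2vec_header` is a hypothetical helper, not part of Gensim, and the two vectors are made up.

```python
def add_word2vec_header(glove_lines):
    """Prepend the '<vocab_size> <dimension>' header used by word2vec text files."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # The first token on each row is the word; the rest are vector components.
    dimension = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dimension}"] + rows


# Two made-up GloVe-style lines: a word followed by its vector components.
glove_sample = [
    "tea 0.1 0.2 0.3\n",
    "coffee 0.2 0.1 0.4\n",
]
word2vec_sample = add_word2vec_header(glove_sample)
print("\n".join(word2vec_sample))
```

For real files, the Gensim conversion shown above remains the safer route, since it works on the file directly.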
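Now that we have a Python file that loads the pretrained model, we can call it from another Python file to generate the most similar words for an input word. The input word can be any word.
Once we have an input word, the next step is to write code that reads it. Then we specify how many similar words the model should return for each input word. Finally, we store the resulting similar words in a list. The code below does this.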
    import pickle

    filename = 'glove2word2vec_model.sav'
    with open(filename, 'rb') as f:
        model = pickle.load(f)


    def append_list(sim_words, words):
        """Tag each (word, similarity) pair with the input word it came from."""
        list_of_words = []
        for sim_word in sim_words:
            list_of_words.append(tuple(sim_word) + (words,))
        return list_of_words


    input_word = 'school'
    user_input = [x.strip() for x in input_word.split(',')]
    result_word = []

    for words in user_input:
        sim_words = model.most_similar(words, topn=5)
        sim_words = append_list(sim_words, words)
        result_word.extend(sim_words)

    similar_word = [word[0] for word in result_word]
    similarity = [word[1] for word in result_word]
    similar_word.extend(user_input)  # input words go last, after all similar words
    labels = [word[2] for word in result_word]
    label_dict = dict([(y, x + 1) for x, y in enumerate(set(labels))])
    color_map = [label_dict[x] for x in labels]
For example, suppose we want the 5 words most similar to "school", so "school" is our input word. The result is 'college', 'schools', 'elementary', 'students', and 'student'.
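Conceptually, a top-n lookup like this ranks every vocabulary word by its cosine similarity to the input word's vector. Here is a minimal NumPy sketch of that idea. The five-word vocabulary and its vectors are made up, `top_n_similar` is a hypothetical helper, and the code does not claim to mirror Gensim's internals.

```python
import numpy as np

# Toy vocabulary with invented 3-dimensional vectors (not GloVe values).
vocab = ["school", "college", "student", "ball", "food"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.9, 0.2],
    [0.1, 0.1, 0.9],
])


def top_n_similar(word, topn=2):
    """Return the topn (word, cosine similarity) pairs closest to `word`."""
    # Normalising each row makes the dot product equal to cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[vocab.index(word)]
    scores = unit @ query
    order = np.argsort(-scores)  # highest similarity first
    results = [(vocab[i], float(scores[i])) for i in order if vocab[i] != word]
    return results[:topn]


neighbours = top_n_similar("school", topn=2)
print(neighbours)
```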
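Now we have the input word and the similar words generated from it. The next step is to visualize them in the embedding space.
With the pretrained model, every word can be mapped into the embedding space as a vector. However, word embeddings are very high-dimensional, which means the words cannot be visualized directly.
Methods such as principal component analysis (PCA) are commonly used to reduce the dimensionality of word embeddings. In short, PCA is a feature extraction technique: it combines the variables and then drops the least important ones while retaining the valuable parts of the variables. If you want to dig deeper into PCA, I recommend this article.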
https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
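The projection behind PCA can be sketched in a few lines of NumPy: centre the data, take the singular value decomposition, and keep the leading directions. This is an illustrative sketch on random stand-in data with a hypothetical `pca_project` helper; the code in this article uses scikit-learn's PCA instead.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the first n_components principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))  # 20 stand-in "word vectors" of dimension 100
X_3d = pca_project(X, 3)
print(X_3d.shape)
```

Each output column carries no more variance than the one before it, which is why keeping only the first two or three columns preserves as much of the spread as a linear projection can.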
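With PCA, we can visualize word embeddings in 2D or 3D. So let's write code that visualizes the word embeddings using the model we loaded in the code block above. The code below shows only the 3D visualization. Visualizing PCA in 2D takes only minor changes, which are described in the code comments.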
    import numpy as np
    import plotly.graph_objs as go
    from sklearn.decomposition import PCA


    def display_pca_scatterplot_3D(model, user_input=None, words=None, label=None,
                                   color_map=None, topn=5, sample=10):
        if words is None:
            # gensim >= 4.0 exposes the vocabulary as key_to_index
            # (older releases used model.vocab).
            if sample > 0:
                words = np.random.choice(list(model.key_to_index.keys()), sample)
            else:
                words = [word for word in model.key_to_index]

        word_vectors = np.array([model[w] for w in words])

        three_dim = PCA(random_state=0).fit_transform(word_vectors)[:, :3]
        # For 2D, replace three_dim with two_dim, as follows:
        # two_dim = PCA(random_state=0).fit_transform(word_vectors)[:, :2]

        data = []
        count = 0

        # One trace per input word, holding its topn most similar words.
        for i in range(len(user_input)):
            trace = go.Scatter3d(
                x=three_dim[count:count + topn, 0],
                y=three_dim[count:count + topn, 1],
                z=three_dim[count:count + topn, 2],
                text=words[count:count + topn],
                name=user_input[i],
                textposition="top center",
                textfont_size=20,
                mode='markers+text',
                marker={'size': 10, 'opacity': 0.8, 'color': 2}
            )
            # For 2D, use go.Scatter instead of go.Scatter3d and drop z.
            # Also use two_dim in place of three_dim.
            data.append(trace)
            count = count + topn

        # The input words come last in `words`; plot them in black.
        trace_input = go.Scatter3d(
            x=three_dim[count:, 0],
            y=three_dim[count:, 1],
            z=three_dim[count:, 2],
            text=words[count:],
            name='input words',
            textposition="top center",
            textfont_size=20,
            mode='markers+text',
            marker={'size': 10, 'opacity': 1, 'color': 'black'}
        )
        # For 2D, use go.Scatter instead of go.Scatter3d and drop z.
        # Also use two_dim in place of three_dim.
        data.append(trace_input)

        # Configure the layout.
        layout = go.Layout(
            margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
            showlegend=True,
            legend=dict(
                x=1,
                y=0.5,
                font=dict(family="Courier New", size=25, color="black")),
            font=dict(family="Courier New", size=15),
            autosize=False,
            width=1000,
            height=1000
        )

        plot_figure = go.Figure(data=data, layout=layout)
        plot_figure.show()


    display_pca_scatterplot_3D(model, user_input, similar_word, labels, color_map)
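As an example, suppose we want to visualize the 5 words most similar to each of "ball", "school", and "food". Below is the 2D visualization.
Below is the 3D visualization of the same set of words.
Visually, we can now see a pattern in the space these words occupy. The words related to "ball" are close to one another because they have similar contexts. At the same time, they are farther from the words related to "school" and "food" because their contexts differ.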
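Besides PCA, another commonly used dimensionality reduction technique is t-distributed stochastic neighbor embedding (t-SNE). PCA and t-SNE differ in the underlying technique they use to reduce dimensionality.
PCA is a linear dimensionality reduction method: it maps data from the high-dimensional space linearly to a lower-dimensional space while maximizing the variance of the data. t-SNE, by contrast, is a nonlinear method. The algorithm computes similarities between points in both the high-dimensional and the low-dimensional space, then uses an optimization method such as gradient descent to minimize the difference between the similarities in the two spaces.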
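The quantity t-SNE minimizes can be sketched as follows: Gaussian-based similarities P in the high-dimensional space, Student-t-based similarities Q in the low-dimensional space, and the Kullback-Leibler divergence between them. This is a deliberately simplified sketch. It uses one fixed bandwidth `sigma` instead of the per-point perplexity search, it evaluates the cost once rather than running gradient descent, and all function names are hypothetical.

```python
import numpy as np


def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def high_dim_similarities(X, sigma=1.0):
    """Joint Gaussian similarities P (a single fixed sigma is a simplification)."""
    affinities = np.exp(-squared_distances(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point is not its own neighbour
    return affinities / affinities.sum()


def low_dim_similarities(Y):
    """Joint Student-t similarities Q (one degree of freedom)."""
    affinities = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q): the mismatch between the two sets of similarities."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))  # stand-in high-dimensional points
Y = rng.normal(size=(10, 2))   # a random 2D layout, not an optimized one
P = high_dim_similarities(X, sigma=5.0)
Q = low_dim_similarities(Y)
print(kl_divergence(P, Q))
```

An optimizer would move the rows of `Y` to drive this divergence down; scikit-learn's TSNE, used in the code that follows, handles that part.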
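The code for visualizing word embeddings with t-SNE is very similar to the PCA code. The code below shows only the 3D visualization. Visualizing t-SNE in 2D takes only minor changes, which are described in the code comments.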
    import numpy as np
    import plotly.graph_objs as go
    from sklearn.manifold import TSNE


    def display_tsne_scatterplot_3D(model, user_input=None, words=None, label=None,
                                    color_map=None, perplexity=0, learning_rate=0,
                                    iteration=0, topn=5, sample=10):
        if words is None:
            # gensim >= 4.0 exposes the vocabulary as key_to_index
            # (older releases used model.vocab).
            if sample > 0:
                words = np.random.choice(list(model.key_to_index.keys()), sample)
            else:
                words = [word for word in model.key_to_index]

        word_vectors = np.array([model[w] for w in words])

        # Note: newer scikit-learn releases rename n_iter to max_iter.
        three_dim = TSNE(n_components=3, random_state=0, perplexity=perplexity,
                         learning_rate=learning_rate,
                         n_iter=iteration).fit_transform(word_vectors)[:, :3]
        # For 2D, replace three_dim with two_dim, as follows:
        # two_dim = TSNE(n_components=2, random_state=0, perplexity=perplexity,
        #                learning_rate=learning_rate,
        #                n_iter=iteration).fit_transform(word_vectors)[:, :2]

        data = []
        count = 0

        # One trace per input word, holding its topn most similar words.
        for i in range(len(user_input)):
            trace = go.Scatter3d(
                x=three_dim[count:count + topn, 0],
                y=three_dim[count:count + topn, 1],
                z=three_dim[count:count + topn, 2],
                text=words[count:count + topn],
                name=user_input[i],
                textposition="top center",
                textfont_size=20,
                mode='markers+text',
                marker={'size': 10, 'opacity': 0.8, 'color': 2}
            )
            # For 2D, use go.Scatter instead of go.Scatter3d and drop z.
            # Also use two_dim in place of three_dim.
            data.append(trace)
            count = count + topn

        # The input words come last in `words`; plot them in black.
        trace_input = go.Scatter3d(
            x=three_dim[count:, 0],
            y=three_dim[count:, 1],
            z=three_dim[count:, 2],
            text=words[count:],
            name='input words',
            textposition="top center",
            textfont_size=20,
            mode='markers+text',
            marker={'size': 10, 'opacity': 1, 'color': 'black'}
        )
        # For 2D, use go.Scatter instead of go.Scatter3d and drop z.
        # Also use two_dim in place of three_dim.
        data.append(trace_input)

        # Configure the layout.
        layout = go.Layout(
            margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
            showlegend=True,
            legend=dict(
                x=1,
                y=0.5,
                font=dict(family="Courier New", size=25, color="black")),
            font=dict(family="Courier New", size=15),
            autosize=False,
            width=1000,
            height=1000
        )

        plot_figure = go.Figure(data=data, layout=layout)
        plot_figure.show()


    display_tsne_scatterplot_3D(model, user_input, similar_word, labels, color_map,
                                5, 500, 10000)
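The same example as in the PCA visualization, that is, the 5 words most similar to each of "ball", "school", and "food", is visualized below.
Below is the 3D visualization of the same set of words.
As with PCA, notice that words with similar contexts are close together, while words with different contexts are farther apart.
So far, we have successfully created a Python script that visualizes word embeddings in 2D or 3D with PCA or t-SNE. Next, we can create a Python script that builds a web application for a better user experience.
This web app lets us visualize word embeddings with plenty of features and interactivity. For example, users can type their own input words and choose how many of the top-n most similar words are returned for each input word.
The web app can be built with Dash or Streamlit. In this article, I will show you how to build a simple interactive web app for visualizing word embeddings with Streamlit.
First, we take all the Python code created so far and put it into one Python script. Next, we can start creating several user input parameters, as follows: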
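Dimensionality reduction technique: the user can choose between PCA and t-SNE. Since there are only two options, we can use Streamlit's selectbox.
Dimension of the visualization: the user can choose whether the word embeddings are shown in 2D or 3D. As before, we can use selectbox.
Input words: this parameter asks the user to type the input words they want, for example "ball", "school", and "food". For this we can use text_input.
Top-n most similar words: the user specifies how many similar words are returned for each input word. Since this can be any number, a slider is used for it.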
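The comma-separated text input has to be turned into a list of words before it reaches the model. Here is a minimal sketch of that parsing step. `parse_input_words` is a hypothetical helper that, unlike the one-line split used in this article's code, also drops empty entries and lowercases the words. The lowercasing is an assumption on my part, based on my understanding that the glove.6B vocabulary is lowercase.

```python
def parse_input_words(raw_text):
    """Split comma-separated text into a clean list of lowercase words."""
    words = [part.strip().lower() for part in raw_text.split(",")]
    # Drop empty entries left by stray or trailing commas.
    return [word for word in words if word]


user_input = parse_input_words("ball, School , ,food")
print(user_input)
```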
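Next, we need to consider the parameters that come up when we decide to use t-SNE. In t-SNE, we can tune several parameters to get the best visualization: the perplexity, the learning rate, and the number of optimization iterations. No single value of these parameters is best in every case, so it is better to let the user specify them.
Since we are using scikit-learn, we can refer to its documentation to find the default values of these parameters. The default perplexity is 30, and we can adjust it between 5 and 50. The default learning rate is 200 (newer scikit-learn releases default to 'auto'), and we can adjust it between 10 and 1000. Finally, the default number of iterations is 1000, and it can be lowered to a minimum of 250. We can use slider to create these parameter inputs.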
    import streamlit as st

    dim_red = st.sidebar.selectbox(
        'Select dimension reduction method', ('PCA', 'TSNE'))
    dimension = st.sidebar.selectbox(
        "Select the dimension of the visualization", ('2D', '3D'))
    user_input = st.sidebar.text_input(
        "Type the word that you want to investigate. You can type more than one "
        "word by separating one word with other with comma (,)", '')
    top_n = st.sidebar.slider(
        'Select the amount of words associated with the input words you want to '
        'visualize ', 5, 100, 5)
    annotation = st.sidebar.radio(
        "Enable or disable the annotation on the visualization", ('On', 'Off'))

    # The t-SNE parameters are only shown when t-SNE is selected.
    if dim_red == 'TSNE':
        perplexity = st.sidebar.slider(
            'Adjust the perplexity. The perplexity is related to the number of '
            'nearest neighbors that is used in other manifold learning algorithms. '
            'Larger datasets usually require a larger perplexity', 5, 50, 30)
        learning_rate = st.sidebar.slider('Adjust the learning rate', 10, 1000, 200)
        iteration = st.sidebar.slider(
            'Adjust the number of iteration', 250, 100000, 1000)
We have now covered all the parts needed to build our web app. Finally, we can package everything into one complete script, as shown below.
    import pickle

    import numpy as np
    import plotly.graph_objs as go
    import streamlit as st
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    filename = 'glove2word2vec_model.sav'
    with open(filename, 'rb') as f:
        model = pickle.load(f)


    def append_list(sim_words, words):
        """Tag each (word, similarity) pair with the input word it came from."""
        list_of_words = []
        for sim_word in sim_words:
            list_of_words.append(tuple(sim_word) + (words,))
        return list_of_words


    def display_scatterplot_3D(model, user_input=None, words=None, label=None,
                               color_map=None, annotation='On', dim_red='PCA',
                               perplexity=0, learning_rate=0, iteration=0,
                               topn=0, sample=10):
        if words is None:
            # gensim >= 4.0 uses key_to_index; older releases used model.vocab.
            if sample > 0:
                words = np.random.choice(list(model.key_to_index.keys()), sample)
            else:
                words = [word for word in model.key_to_index]

        word_vectors = np.array([model[w] for w in words])

        if dim_red == 'PCA':
            three_dim = PCA(random_state=0).fit_transform(word_vectors)[:, :3]
        else:
            # Note: newer scikit-learn releases rename n_iter to max_iter.
            three_dim = TSNE(n_components=3, random_state=0,
                             perplexity=perplexity, learning_rate=learning_rate,
                             n_iter=iteration).fit_transform(word_vectors)[:, :3]

        # Axis arrows drawn at the origin.
        color = 'blue'
        quiver = go.Cone(
            x=[0, 0, 0],
            y=[0, 0, 0],
            z=[0, 0, 0],
            u=[1.5, 0, 0],
            v=[0, 1.5, 0],
            w=[0, 0, 1.5],
            anchor="tail",
            colorscale=[[0, color], [1, color]],
            showscale=False
        )

        data = [quiver]
        count = 0

        # One trace per input word, holding its topn most similar words.
        for i in range(len(user_input)):
            trace = go.Scatter3d(
                x=three_dim[count:count + topn, 0],
                y=three_dim[count:count + topn, 1],
                z=three_dim[count:count + topn, 2],
                text=words[count:count + topn] if annotation == 'On' else '',
                name=user_input[i],
                textposition="top center",
                textfont_size=30,
                mode='markers+text',
                marker={'size': 10, 'opacity': 0.8, 'color': 2}
            )
            data.append(trace)
            count = count + topn

        # The input words come last in `words`; plot them in black.
        trace_input = go.Scatter3d(
            x=three_dim[count:, 0],
            y=three_dim[count:, 1],
            z=three_dim[count:, 2],
            text=words[count:],
            name='input words',
            textposition="top center",
            textfont_size=30,
            mode='markers+text',
            marker={'size': 10, 'opacity': 1, 'color': 'black'}
        )
        data.append(trace_input)

        # Configure the layout.
        layout = go.Layout(
            margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
            showlegend=True,
            legend=dict(
                x=1,
                y=0.5,
                font=dict(family="Courier New", size=25, color="black")),
            font=dict(family="Courier New", size=15),
            autosize=False,
            width=1000,
            height=1000
        )

        plot_figure = go.Figure(data=data, layout=layout)
        st.plotly_chart(plot_figure)


    def horizontal_bar(word, similarity):
        """Show the similarity scores of `word` as a horizontal bar chart."""
        similarity = [round(elem, 2) for elem in similarity]

        data = go.Bar(
            x=similarity,
            y=word,
            orientation='h',
            text=similarity,
            marker_color=4,
            textposition='auto')

        layout = go.Layout(
            font=dict(size=20),
            xaxis=dict(showticklabels=False, automargin=True),
            yaxis=dict(showticklabels=True, automargin=True, autorange="reversed"),
            margin=dict(t=20, b=20, r=10)
        )

        plot_figure = go.Figure(data=data, layout=layout)
        st.plotly_chart(plot_figure)


    def display_scatterplot_2D(model, user_input=None, words=None, label=None,
                               color_map=None, annotation='On', dim_red='PCA',
                               perplexity=0, learning_rate=0, iteration=0,
                               topn=0, sample=10):
        if words is None:
            # gensim >= 4.0 uses key_to_index; older releases used model.vocab.
            if sample > 0:
                words = np.random.choice(list(model.key_to_index.keys()), sample)
            else:
                words = [word for word in model.key_to_index]

        word_vectors = np.array([model[w] for w in words])

        if dim_red == 'PCA':
            two_dim = PCA(random_state=0).fit_transform(word_vectors)[:, :2]
        else:
            # Note: newer scikit-learn releases rename n_iter to max_iter.
            two_dim = TSNE(random_state=0, perplexity=perplexity,
                           learning_rate=learning_rate,
                           n_iter=iteration).fit_transform(word_vectors)[:, :2]

        data = []
        count = 0

        # One trace per input word, holding its topn most similar words.
        for i in range(len(user_input)):
            trace = go.Scatter(
                x=two_dim[count:count + topn, 0],
                y=two_dim[count:count + topn, 1],
                text=words[count:count + topn] if annotation == 'On' else '',
                name=user_input[i],
                textposition="top center",
                textfont_size=20,
                mode='markers+text',
                marker={'size': 15, 'opacity': 0.8, 'color': 2}
            )
            data.append(trace)
            count = count + topn

        # The input words come last in `words`; plot them in black.
        trace_input = go.Scatter(
            x=two_dim[count:, 0],
            y=two_dim[count:, 1],
            text=words[count:],
            name='input words',
            textposition="top center",
            textfont_size=20,
            mode='markers+text',
            marker={'size': 25, 'opacity': 1, 'color': 'black'}
        )
        data.append(trace_input)

        # Configure the layout.
        layout = go.Layout(
            margin={'l': 0, 'r': 0, 'b': 0, 't': 0},
            showlegend=True,
            hoverlabel=dict(
                bgcolor="white",
                font_size=20,
                font_family="Courier New"),
            legend=dict(
                x=1,
                y=0.5,
                font=dict(family="Courier New", size=25, color="black")),
            font=dict(family="Courier New", size=15),
            autosize=False,
            width=1000,
            height=1000
        )

        plot_figure = go.Figure(data=data, layout=layout)
        st.plotly_chart(plot_figure)


    # Sidebar widgets for the user input parameters.
    dim_red = st.sidebar.selectbox(
        'Select dimension reduction method', ('PCA', 'TSNE'))
    dimension = st.sidebar.selectbox(
        "Select the dimension of the visualization", ('2D', '3D'))
    user_input = st.sidebar.text_input(
        "Type the word that you want to investigate. You can type more than one "
        "word by separating one word with other with comma (,)", '')
    top_n = st.sidebar.slider(
        'Select the amount of words associated with the input words you want to '
        'visualize ', 5, 100, 5)
    annotation = st.sidebar.radio(
        "Enable or disable the annotation on the visualization", ('On', 'Off'))

    if dim_red == 'TSNE':
        perplexity = st.sidebar.slider(
            'Adjust the perplexity. The perplexity is related to the number of '
            'nearest neighbors that is used in other manifold learning algorithms. '
            'Larger datasets usually require a larger perplexity', 5, 50, 30)
        learning_rate = st.sidebar.slider('Adjust the learning rate', 10, 1000, 200)
        iteration = st.sidebar.slider(
            'Adjust the number of iteration', 250, 100000, 1000)
    else:
        perplexity = 0
        learning_rate = 0
        iteration = 0

    if user_input == '':
        # No input yet: the plot functions fall back to a random sample of words.
        similar_word = None
        similarity = None
        labels = None
        color_map = None
    else:
        user_input = [x.strip() for x in user_input.split(',')]
        result_word = []

        for words in user_input:
            sim_words = model.most_similar(words, topn=top_n)
            sim_words = append_list(sim_words, words)
            result_word.extend(sim_words)

        similar_word = [word[0] for word in result_word]
        similarity = [word[1] for word in result_word]
        similar_word.extend(user_input)
        labels = [word[2] for word in result_word]
        label_dict = dict([(y, x + 1) for x, y in enumerate(set(labels))])
        color_map = [label_dict[x] for x in labels]

    st.title('Word Embedding Visualization Based on Cosine Similarity')

    st.header('This is a web app to visualize the word embedding.')
    st.markdown('First, choose which dimension of visualization that you want to see. There are two options: 2D and 3D.')
    st.markdown('Next, type the word that you want to investigate. You can type more than one word by separating one word with other with comma (,).')
    st.markdown('With the slider in the sidebar, you can pick the amount of words associated with the input word you want to visualize. This is done by computing the cosine similarity between vectors of words in embedding space.')
    st.markdown('Lastly, you have an option to enable or disable the text annotation in the visualization.')

    hover_note = ('For more detail about each point (just in case it is difficult '
                  'to read the annotation), you can hover around each points to '
                  'see the words. You can expand the visualization by clicking '
                  'expand symbol in the top right corner of the visualization.')

    if dimension == '2D':
        st.header('2D Visualization')
        st.write(hover_note)
        display_scatterplot_2D(model, user_input, similar_word, labels, color_map,
                               annotation, dim_red, perplexity, learning_rate,
                               iteration, top_n)
    else:
        st.header('3D Visualization')
        st.write(hover_note)
        display_scatterplot_3D(model, user_input, similar_word, labels, color_map,
                               annotation, dim_red, perplexity, learning_rate,
                               iteration, top_n)

    st.header('The Top 5 Most Similar Words for Each Input')
    count = 0
    for i in range(len(user_input)):
        st.write('The most similar words from ' + str(user_input[i]) + ' are:')
        horizontal_bar(similar_word[count:count + 5], similarity[count:count + 5])
        count = count + top_n
You can now run the web app from the Conda prompt. In the prompt, go to the directory of your Python script and type the following command:
    $ streamlit run your_script_name.py
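Next, a browser window will open automatically, where you can access your web app locally. Below is a snapshot of what you can do with the web app.
That's it! You have created a simple web app with plenty of interactivity that visualizes word embeddings with PCA or t-SNE.
If you want to see the full code for this word embedding visualization, you can find it on my GitHub page.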
https://github.com/marcellusruben/Word_Embedding_Visualization
Original article: https://towardsdatascience.com/visualizing-word-embedding-with-pca-and-t-sne-961a692509f5