Word Vectors (Part 3)

Date: 2020-04-06

Tags: vectors

Author: 在線實驗室

Article structure:

Word Vectors

Background
Results
Model Overview
Data Preparation
Implementation
Model Application
Summary
References

Model Application

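Once the model has been trained, we can use it to make predictions. Predicting the next word: given the preceding N-gram context, the trained model predicts the word that follows.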

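# Imports used by the code below. word_dict, train() and the global use_cuda are
# not defined in this part; they are assumed to come from the earlier
# implementation section of the tutorial.
import numpy
import six
import paddle.fluid as fluid


def infer(use_cuda, params_dirname=None):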
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

    exe = fluid.Executor(place)

    inference_scope = fluid.core.Scope()
    with fluid.scope_guard(inference_scope):
        # Use fluid.io.load_inference_model to obtain the inference program,
        # the names of the feed variables (feed_target_names), and the objects
        # to fetch from the scope (fetch_targets)
        [inferencer, feed_target_names,
         fetch_targets] = fluid.io.load_inference_model(params_dirname, exe)

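        # Set up the inputs: four LoDTensors represent the four words. Each word is an id
        # used to look up its word vector in the embedding table, so its shape is [1].
        # recursive_sequence_lengths specifies a length-based LoD, so each one is [[1]].
        # Note that recursive_sequence_lengths is a list of lists.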
        data1 = [[211]]  # 'among'
        data2 = [[6]]  # 'a'
        data3 = [[96]]  # 'group'
        data4 = [[4]]  # 'of'
        lod = [[1]]

        first_word = fluid.create_lod_tensor(data1, lod, place)
        second_word = fluid.create_lod_tensor(data2, lod, place)
        third_word = fluid.create_lod_tensor(data3, lod, place)
        fourth_word = fluid.create_lod_tensor(data4, lod, place)

        assert feed_target_names[0] == 'firstw'
        assert feed_target_names[1] == 'secondw'
        assert feed_target_names[2] == 'thirdw'
        assert feed_target_names[3] == 'fourthw'

        # Build the feed dict {feed_target_name: feed_target_data}.
        # The prediction results are returned in results.
        results = exe.run(
            inferencer,
            feed={
                feed_target_names[0]: first_word,
                feed_target_names[1]: second_word,
                feed_target_names[2]: third_word,
                feed_target_names[3]: fourth_word
            },
            fetch_list=fetch_targets,
            return_numpy=False)

        print(numpy.array(results[0]))
        most_possible_word_index = numpy.argmax(results[0])
        print(most_possible_word_index)
        print([
            key for key, value in six.iteritems(word_dict)
            if value == most_possible_word_index
        ][0])
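
The comments in infer() describe recursive_sequence_lengths as a length-based LoD. As a rough illustration of what that means, the sketch below converts length-based levels into the equivalent offset-based form using only the Python standard library. The helper name lengths_to_offsets is made up for this example and is not part of the Paddle API.

```python
from itertools import accumulate


def lengths_to_offsets(recursive_sequence_lengths):
    """Convert length-based LoD levels into offset-based levels.

    Hypothetical helper for illustration only; not a Paddle function.
    """
    return [[0] + list(accumulate(level)) for level in recursive_sequence_lengths]


# One sequence holding a single word id, as in infer().
single_word = lengths_to_offsets([[1]])
# Two sequences of lengths 2 and 3 packed into one tensor with 5 rows.
two_sentences = lengths_to_offsets([[2, 3]])
print(single_word, two_sentences)
```

Here [[1]] says "one sequence of length 1", which is why every input word in infer() uses the same lod value.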

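Because the word-vector matrix is itself fairly sparse, training it to a given level of accuracy takes a long time. To show a result quickly, the tutorial stops after only a small amount of training and obtains the prediction below. The model predicts that the word following "among a group of" is "the", which is grammatically plausible. With longer training, for example several hours, the predicted next word becomes "workers". The prediction output has the following format: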

[[0.03768077 0.03463154 0.00018074 ... 0.00022283 0.00029888 0.02967956]]
0
the

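The first line is the predicted probability distribution over the dictionary, the second line is the id of the most probable word, and the third line is that word itself.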
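
To make these three output lines more concrete, here is a minimal sketch in plain Python that ranks a probability distribution and maps the top ids back to words. The helper top_k_words, the toy dictionary, and the toy probabilities are all assumptions made for illustration; in the tutorial the distribution comes from results[0] and the vocabulary from word_dict.

```python
def top_k_words(probs, word_dict, k=3):
    """Return the k most probable (word, probability) pairs.

    probs is a flat list of probabilities indexed by word id;
    word_dict maps word -> id, like the tutorial's word_dict.
    """
    # Invert the dictionary once instead of scanning it for every id.
    id_to_word = {idx: word for word, idx in word_dict.items()}
    ranked_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_word[i], probs[i]) for i in ranked_ids[:k]]


# Toy vocabulary and distribution, invented for this example.
toy_dict = {"the": 0, "workers": 1, "of": 2, "<unk>": 3}
toy_probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_words(toy_probs, toy_dict, k=2))
```

Inverting the dictionary up front avoids the linear scan over word_dict that the list comprehension in infer() performs. With the real model output, a single row of the distribution could presumably be obtained with numpy.array(results[0])[0].tolist().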

The entry point of the whole program is simple:

def main(use_cuda, is_sparse):
    if use_cuda and not fluid.core.is_compiled_with_cuda():
        return

    params_dirname = "word2vec.inference.model"

    train(
        if_use_cuda=use_cuda,
        params_dirname=params_dirname,
        is_sparse=is_sparse)

    infer(use_cuda=use_cuda, params_dirname=params_dirname)


main(use_cuda=use_cuda, is_sparse=True)

Summary

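In this chapter we introduced word vectors, the relationship between language models and word vectors, and how to obtain word vectors by training a neural network model. In information retrieval, the cosine of the angle between vectors can be used to judge how relevant a query is to a document's keywords. In syntactic and semantic analysis, trained word vectors can be used to initialize a model and obtain better results. In document classification, word vectors make it possible to group the synonyms in a document by clustering, and an N-gram model can be used to predict the next word. We hope that after this chapter you will be able to apply word vectors in your own research in related fields.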
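
As a small illustration of the cosine-based relevance check mentioned above, the following sketch computes cosine similarity with the Python standard library and ranks candidate keywords against a query vector. The vectors are toy three-dimensional values invented for this example, not trained embeddings, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
query = [1.0, 2.0, 0.0]
doc_keywords = {
    "related": [2.0, 4.0, 0.0],    # same direction as the query
    "unrelated": [0.0, 0.0, 3.0],  # orthogonal to the query
}
ranked = sorted(
    doc_keywords,
    key=lambda word: cosine_similarity(query, doc_keywords[word]),
    reverse=True)
print(ranked)
```

Cosine similarity ignores vector length, so "related" scores close to 1 even though its vector is twice as long as the query.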

References

Bengio Y, Ducharme R, Vincent P, et al. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3(Feb): 1137-1155.
Mikolov T, Kombrink S, Deoras A, et al. Rnnlm-recurrent neural network language modeling toolkit[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
Maaten L, Hinton G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605.
https://en.wikipedia.org/wiki/Singular_value_decomposition

This tutorial was created by PaddlePaddle and is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.