Context-based word embedding learning approaches can model rich semantic and syntactic information.
However, they are problematic for sentiment analysis because words with similar contexts but opposite sentiment polarities, such as good and bad, are mapped to nearby word vectors in the embedding space.
Recently, some sentiment embedding learning methods have been proposed, but most of them are designed to work well on sentence-level texts.
Directly applying those models to document-level texts often leads to unsatisfactory results.
To address this issue, we present a sentiment-specific word embedding learning architecture that utilizes local context information as well as global sentiment representation.
The architecture is applicable to both sentence-level and document-level texts.
We take the global sentiment representation as a simple average of the word embeddings in the text, and use a corruption strategy as a sentiment-dependent regularization.
Extensive experiments conducted on several benchmark datasets demonstrate that the proposed architecture outperforms state-of-the-art methods for sentiment classification.
《Learning Sentiment-Specific Word Embedding via Global Sentiment Representation》
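A minimal sketch of the averaging-plus-corruption idea described in the abstract above, using a toy vocabulary and randomly initialized embeddings (both assumptions for illustration; the paper learns these jointly):

    import numpy as np

    # Toy vocabulary, embedding matrix and corruption rate are illustrative
    # assumptions, not the paper's actual data or hyper-parameters.
    rng = np.random.default_rng(0)
    vocab = {"good": 0, "movie": 1, "bad": 2, "plot": 3}
    dim = 8
    E = rng.normal(size=(len(vocab), dim))          # word embedding matrix

    def global_representation(tokens, corruption_rate=0.3):
        # Randomly corrupt (drop) words, then average the survivors' embeddings.
        kept = [t for t in tokens if t in vocab and rng.random() > corruption_rate]
        if not kept:                                 # fall back to the full text
            kept = [t for t in tokens if t in vocab]
        return E[[vocab[t] for t in kept]].mean(axis=0)

    print(global_representation(["good", "movie", "bad", "plot"]).shape)   # (8,)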
Co-occurrences between two words provide useful insights into the semantics of those words. Consequently, much prior work on word embedding learning has used co-occurrences between two words as the training signal for learning word embeddings. However, in natural language texts it is common for multiple words to be related and co-occurring in the same context. We extend the notion of co-occurrences to cover k(≥2)-way co-occurrences among a set of k words. Specifically, we prove a theoretical relationship between the joint probability of k(≥2) words and the sum of the ℓ2 norms of their embeddings. Next, we propose a learning objective motivated by our theoretical result that utilizes k-way co-occurrences for learning word embeddings. Our experimental results show that the derived theoretical relationship does indeed hold empirically and that, despite data sparsity, for some smaller values of k(≤5), k-way embeddings perform comparably to or better than 2-way embeddings in a range of tasks.
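A small sketch of how k-way co-occurrence statistics, the raw signal this abstract builds its objective on, could be collected with a sliding window; the window size, the value of k and the toy corpus are assumptions, and overlapping windows are counted naively:

    from collections import Counter
    from itertools import combinations

    def kway_cooccurrences(sentences, k=3, window=5):
        # Count sets of k distinct words that fall inside the same sliding window.
        counts = Counter()
        for tokens in sentences:
            for start in range(len(tokens)):
                span = sorted(set(tokens[start:start + window]))
                counts.update(combinations(span, k))
        return counts

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
    print(kway_cooccurrences(corpus, k=3).most_common(3))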
Representing the semantics of words is a fundamental task in text processing.
Several research studies have shown that text and knowledge bases (KBs) are complementary sources for word embedding learning.
When using KBs, most existing methods consider only the relationships within word pairs.
We argue that the structural information of well-organized words within the KBs can convey more effective and stable knowledge for capturing the semantics of words.
In this paper, we propose a semantic structure-based word embedding method, and introduce concept convergence and word divergence to reveal semantic structures in the word embedding learning process.
To assess the effectiveness of our method, we use WordNet for training and conduct extensive experiments on word similarity, word analogy, text classification and query expansion.
The experimental results show that our method outperforms state-of-the-art methods, including the methods trained solely on the corpus, and others trained on the corpus and the KBs.
《Semantic Structure-Based Word Embedding via Concept Convergence and Word Divergence》
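As an illustration of the kind of "well-organized words" the abstract refers to, a short sketch that reads a concept and its direct hyponym lemmas out of WordNet with NLTK; grouping them is only a stand-in for the paper's concept-convergence constraint:

    # Requires NLTK with the WordNet data installed: nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def concept_members(synset_name="bird.n.01"):
        # Gather the lemmas of a concept and of its direct hyponyms, i.e. the
        # kind of structured word group a concept-level constraint could use.
        synset = wn.synset(synset_name)
        members = {lemma.name() for lemma in synset.lemmas()}
        for hyponym in synset.hyponyms():
            members.update(lemma.name() for lemma in hyponym.lemmas())
        return members

    print(sorted(concept_members())[:10])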
In this work, we investigate word embedding algorithms in the context of natural language processing. In particular, we examine the notion of "negative examples", the unobserved or insignificant word-context co-occurrences, in spectral methods. We provide a new formulation of the word embedding problem by proposing a new intuitive objective function that perfectly justifies the use of negative examples. In fact, our algorithm not only learns from the important word-context co-occurrences, but also learns from the abundance of unobserved or insignificant co-occurrences to improve the distribution of words in the latent embedded space. We analyze the algorithm theoretically and provide an optimal solution to the problem using spectral analysis. We have trained various word embedding algorithms on Wikipedia articles with 2.1 billion tokens and show that negative sampling can boost the quality of spectral methods. Our algorithm provides results as good as the state of the art, but in a much faster and more efficient way.
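For orientation, a sketch of a standard spectral construction from the same family, shifted positive PMI followed by a truncated SVD, in which the log(neg) shift plays the role of negative examples; this is a common baseline, not the paper's own objective:

    import numpy as np

    def spectral_embeddings(cooc, dim=2, neg=5):
        # Shifted positive PMI: subtracting log(neg) mimics negative sampling.
        total = cooc.sum()
        row = cooc.sum(axis=1, keepdims=True)
        col = cooc.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore"):
            pmi = np.log(cooc * total / (row * col))
        sppmi = np.maximum(pmi - np.log(neg), 0.0)   # zero co-occurrences stay 0
        u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
        return u[:, :dim] * np.sqrt(s[:dim])

    cooc = np.array([[10., 2., 0.],
                     [ 2., 8., 1.],
                     [ 0., 1., 6.]])
    print(spectral_embeddings(cooc).shape)          # (3, 2)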
Linguistic Inquiry and Word Count (LIWC) is a word counting software tool which has been used for quantitative text analysis in many fields.
Due to its success and popularity, the core lexicon has been translated into Chinese and many other languages.
However, the lexicon contains only a few thousand words, which is insufficient compared with the number of common words in Chinese.
Current approaches often expand the lexicon manually, which takes a great deal of time and requires linguistic experts.
To address this issue, we propose to expand the LIWC lexicon automatically.
Specifically, we consider it as a hierarchical classification problem and utilize a Sequence-to-Sequence model to classify words in the lexicon.
Moreover, we use sememe information with an attention mechanism to capture the exact meanings of a word, so that we can build a more precise and comprehensive expanded lexicon.
The experimental results show that our model has a better understanding of word meanings with the help of sememes and achieves significant and consistent improvements compared with the state-of-the-art methods.
The source code of this paper can be obtained from https://github.com/thunlp/Auto_CLIWC.
《Chinese Lexicon Expansion via Hierarchical Classification of Word Embeddings》
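A toy sketch of the hierarchical-classification framing: each word's category path is flattened into a label sequence that a sequence-to-sequence decoder could be trained to emit; the categories and words below are invented for illustration and are not the real LIWC hierarchy:

    # Toy category paths (assumptions, not the actual LIWC taxonomy).
    toy_hierarchy = {
        "happy": ["Affect", "PosEmo"],
        "angry": ["Affect", "NegEmo", "Anger"],
        "think": ["CogProc", "Insight"],
    }
    labels = sorted({c for path in toy_hierarchy.values() for c in path})
    label_to_id = {c: i for i, c in enumerate(labels)}
    EOS = len(label_to_id)                          # end-of-sequence token id

    def decoder_target(word):
        # Encode the word's coarse-to-fine category path, ending with EOS.
        return [label_to_id[c] for c in toy_hierarchy[word]] + [EOS]

    for word in toy_hierarchy:
        print(word, decoder_target(word))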
Word embedding has been widely used in many natural language processing tasks. In this paper, we focus on learning word embeddings through selective higher-order relationships in sentences, so that the embeddings are less sensitive to local context and more accurate in capturing semantic compositionality. We present a novel multi-order dependency-based strategy to compose and represent the context under several essential constraints. To realize selective learning from word contexts, we automatically assign the strengths of different dependencies between co-occurring words during the stochastic gradient descent process. We evaluate and analyze our proposed approach on several direct and indirect tasks for word embeddings. Experimental results demonstrate that our embeddings are competitive with or better than state-of-the-art methods and significantly outperform other methods in terms of context stability. The dependency weights and representations obtained by our embedding model conform to most linguistic characteristics and are valuable for many downstream tasks.
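A brief sketch of how dependency-based word-context pairs might be extracted, here with spaCy and its en_core_web_sm model (an assumption of this example); learning a strength per dependency type, as the abstract describes, is omitted:

    import spacy

    nlp = spacy.load("en_core_web_sm")               # assumes this model is installed

    def dependency_contexts(sentence):
        # Collect (word, relation_head) pairs from the dependency parse.
        pairs = []
        for token in nlp(sentence):
            if token.head is not token:              # skip the root's self-loop
                pairs.append((token.text, f"{token.dep_}_{token.head.text}"))
        return pairs

    print(dependency_contexts("The quick fox jumps over the lazy dog"))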
Multimodal models have been proven to outperform text-based models on learning semantic word representations. Almost all previous multimodal models typically treat the representations from different modalities equally. However, it is obvious that information from different modalities contributes differently to the meaning of words. This motivates us to build a multimodal model that can dynamically fuse the semantic representations from different modalities according to different types of words. To that end, we propose three novel dynamic fusion methods to assign importance weights to each modality, in which the weights are learned under the weak supervision of word-association pairs. Extensive experiments demonstrate that the proposed methods outperform strong unimodal baselines and state-of-the-art multimodal models.
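One possible form of dynamic fusion, a learned scalar gate over the concatenated modalities; the gate parameters below are random rather than trained against word-association pairs, so this is only a sketch of the mechanism:

    import numpy as np

    rng = np.random.default_rng(1)
    dim = 4
    W = rng.normal(size=2 * dim)                     # gate parameters (untrained here)

    def fuse(text_vec, image_vec):
        # A scalar gate decides how much each modality contributes for this word.
        gate = 1.0 / (1.0 + np.exp(-W @ np.concatenate([text_vec, image_vec])))
        return gate * text_vec + (1.0 - gate) * image_vec

    print(fuse(rng.normal(size=dim), rng.normal(size=dim)))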
Representing the semantics of words is a long-standing problem for the natural language processing community.
Most methods compute word semantics given their textual context in large corpora.
More recently, researchers attempted to integrate perceptual and visual features.
Most of these works consider the visual appearance of objects to enhance word representations, but they ignore the visual environment and context in which the objects appear.
We propose to unify text-based techniques with vision-based techniques by simultaneously leveraging textual and visual context to learn multimodal word embeddings.
We explore various choices for what can serve as a visual context and present an end-to-end method to integrate visual context elements in a multimodal skip-gram model.
We provide experiments and extensive analysis of the obtained results.
《Learning Multimodal Word Representations Based on Visual Context》
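A rough sketch of a skip-gram negative-sampling step in which visual context elements are treated as additional context items alongside textual neighbours; the vocabulary, the "vis:" labels and the hyper-parameters are assumptions for illustration, not the paper's model:

    import numpy as np

    rng = np.random.default_rng(2)
    vocab = ["dog", "ball", "park", "grass", "vis:outdoor", "vis:animal"]
    idx = {w: i for i, w in enumerate(vocab)}
    dim = 8
    W_in = rng.normal(scale=0.1, size=(len(vocab), dim))
    W_out = rng.normal(scale=0.1, size=(len(vocab), dim))

    def sgns_step(target, contexts, lr=0.05, negatives=2):
        # One negative-sampling update; visual labels behave like context words.
        t = idx[target]
        for c in contexts:
            samples = [(idx[c], 1.0)] + [(int(rng.integers(len(vocab))), 0.0)
                                         for _ in range(negatives)]
            for o, y in samples:
                v_t, v_o = W_in[t].copy(), W_out[o].copy()
                score = 1.0 / (1.0 + np.exp(-v_t @ v_o))
                W_out[o] -= lr * (score - y) * v_t
                W_in[t] -= lr * (score - y) * v_o

    # Textual neighbours plus visual context for one occurrence of "dog".
    sgns_step("dog", ["ball", "park", "vis:outdoor", "vis:animal"])
    print(W_in[idx["dog"]][:4])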