Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

Before the main text

An ACL 2013 paper. The content is very easy to understand, concise and to the point; I really rather like it. Systems papers are just too hard to read!!

Pilehvar M T, Jurgens D, Navigli R. Align, disambiguate and walk: A unified approach for measuring semantic similarity[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, 1: 1341-1351.

Main text

Abstract

Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

1 Introduction

Semantic similarity is a core technique for many topics in Natural Language Processing such as Textual Entailment (Berant et al., 2012), Semantic Role Labeling (Fürstenau and Lapata, 2012), and Question Answering (Surdeanu et al., 2011). For example, textual similarity enables relevant documents to be identified for information retrieval (Hliaoutakis et al., 2006), while identifying similar words enables tasks such as paraphrasing (Glickman and Dagan, 2003), lexical substitution (McCarthy and Navigli, 2009), lexical simplification (Biran et al., 2011), and Web search result clustering (Di Marco and Navigli, 2013).

Approaches to semantic similarity have often operated at separate levels: methods for word similarity are rarely applied to documents or even single sentences (Budanitsky and Hirst, 2006; Radinsky et al., 2011; Halawi et al., 2012), while document-based similarity methods require more linguistic features, which often makes them inapplicable at the word or microtext level (Salton et al., 1975; Maguitman et al., 2005; Elsayed et al., 2008; Turney and Pantel, 2010). Despite the potential advantages, few approaches to semantic similarity operate at the sense level due to the challenge in sense-tagging text (Navigli, 2009); for example, none of the top four systems in the recent SemEval-2012 task on textual similarity compared semantic representations that incorporated sense information (Agirre et al., 2012).

We propose a unified approach to semantic similarity across multiple representation levels from senses to documents, which offers two significant advantages. First, the method is applicable independently of the input type, which enables meaningful similarity comparisons across different scales of text or lexical levels. Second, by operating at the sense level, a unified approach is able to identify the semantic similarities that exist independently of the text’s lexical forms and any semantic ambiguity therein. For example, consider the sentences:

  • t1. A manager fired the worker.

  • t2. An employee was terminated from work by his boss.

A surface-based approach would label the sentences as dissimilar due to the minimal lexical overlap. However, a sense-based representation enables detection of the similarity between the meanings of the words, e.g., fire and terminate. Indeed, an accurate, sense-based representation is essential for cases where different words are used to convey the same meaning.

The contributions of this paper are threefold. First, we propose a new unified representation of the meaning of an arbitrarily-sized piece of text, referred to as a lexical item, using a sense-based probability distribution. Second, we propose a novel alignment-based method for word sense disambiguation during semantic comparison. Third, we demonstrate that this single representation can achieve state-of-the-art performance on three similarity tasks, each operating at a different lexical level: (1) surpassing the highest scores on the SemEval-2012 task on textual similarity (Agirre et al., 2012) that compares sentences, (2) achieving a near-perfect performance on the TOEFL synonym selection task proposed by Landauer and Dumais (1997), which measures word pair similarity, and also obtaining state-of-the-art performance in terms of the correlation with human judgments on the RG-65 dataset (Rubenstein and Goodenough, 1965), and finally (3) surpassing the performance of Snow et al. (2007) in a sense-coarsening task that measures sense similarity.

2 A Unified Semantic Representation

We propose a representation of any lexical item as a distribution over a set of word senses, referred to as the item’s semantic signature. We begin with a formal description of the representation at the sense level (Section 2.1). Following this, we describe our alignment-based disambiguation algorithm which enables us to produce sense-based semantic signatures for those lexical items (e.g., words or sentences) which are not sense annotated (Section 2.2). Finally, we propose three methods for comparing these signatures (Section 2.3). As our sense inventory, we use WordNet 3.0 (Fellbaum, 1998).

2.1 Semantic Signatures

The WordNet ontology provides a rich network structure of semantic relatedness, connecting senses directly with their hypernyms, and providing information on semantically similar senses by virtue of their nearby locality in the network. Given a particular node (sense) in the network, repeated random walks beginning at that node will produce a frequency distribution over the nodes in the graph visited during the walk. To extend beyond a single sense, the random walk may be initialized and restarted from a set of senses (seed nodes), rather than just one; this multi-seed walk produces a multinomial distribution over all the senses in WordNet with higher probability assigned to senses that are frequently visited from the seeds. Prior work has demonstrated that multinomials generated from random walks over WordNet can be successfully applied to linguistic tasks such as word similarity (Hughes and Ramage, 2007; Agirre et al., 2009), paraphrase recognition, textual entailment (Ramage et al., 2009), and pseudoword generation (Pilehvar and Navigli, 2013).

Formally, we define the semantic signature of a lexical item as the multinomial distribution generated from the random walks over WordNet 3.0 where the set of seed nodes is the set of senses present in the item. This representation encompasses both when the item is itself a single sense and when the item is a sense-tagged sentence.

To construct each semantic signature, we use the iterative method for calculating topic-sensitive PageRank (Haveliwala, 2002). Let M be the adjacency matrix for the WordNet network, where edges connect senses according to the relations defined in WordNet (e.g., hypernymy and meronymy). We further enrich M by connecting a sense with all the other senses that appear in its disambiguated gloss. Let v(0) denote the probability distribution for the starting location of the random walker in the network. Given the set of senses S in a lexical item, the probability mass of v(0) is uniformly distributed across the senses si ∈ S, with the mass for all sj ∉ S set to zero. The PageRank may then be computed using:

v(t) = (1 − α) M v(t−1) + α v(0)    (1)

where at each iteration the random walker may jump to any node si ∈ S with probability α/|S|. We follow standard convention and set α to 0.15. We repeat the operation in Eq. 1 for 30 iterations, which is sufficient for the distribution to converge. The resulting probability vector v(t) is the semantic signature of the lexical item, as it has aggregated its senses’ similarities over the entire graph. For our semantic signatures we used the off-the-shelf UKB implementation of topic-sensitive PageRank.

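As a concrete sketch, the iteration described above can be reproduced with a few lines of NumPy. The 4-node graph below is a toy stand-in for the WordNet sense graph, not real data; the update rule follows the description in the text (jump back to the seed senses with probability α, follow an edge otherwise):

```python
import numpy as np

def semantic_signature(M, seed_senses, alpha=0.15, iterations=30):
    """Topic-sensitive PageRank by power iteration over a sense graph.

    M is a column-stochastic adjacency matrix (M[i, j] = probability of
    walking from sense j to sense i); seed_senses are the indices of the
    senses present in the lexical item."""
    n = M.shape[0]
    # v(0): probability mass spread uniformly over the seed senses only
    v0 = np.zeros(n)
    v0[list(seed_senses)] = 1.0 / len(seed_senses)
    v = v0.copy()
    for _ in range(iterations):
        # with prob. (1 - alpha) follow an edge, with prob. alpha jump to a seed
        v = (1 - alpha) * (M @ v) + alpha * v0
    return v

# Toy 4-sense graph; every column sums to 1. Node 3 has no incoming edges.
M = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
sig = semantic_signature(M, seed_senses=[0])
```

The resulting `sig` is the multinomial semantic signature: a distribution over all senses, concentrated around the seeds.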
2.2 Alignment-Based Disambiguation

Commonly, semantic comparisons are between word pairs or sentence pairs that do not have their lexical content sense-annotated, despite the potential utility of sense annotation in making semantic comparisons. However, traditional forms of word sense disambiguation are difficult for short texts and single words, because little or no contextual information is present to perform the disambiguation task. Therefore, we propose a novel alignment-based sense disambiguation that leverages the content of the paired item in order to disambiguate each element. Leveraging the paired item enables our approach to disambiguate in cases where traditional sense disambiguation methods cannot, due to insufficient context.

We view sense disambiguation as an alignment problem. Given two arbitrarily ordered texts, we seek the semantic alignment that maximizes the similarity of the senses of the context words in both texts. To find this maximum we use an alignment procedure which, for each word type wi in item T1, assigns wi to the sense that has the maximal similarity to any sense of the word types in the compared text T2. Algorithm 1 formalizes the alignment process, which produces a sense disambiguated representation as a result. Senses are compared in terms of their semantic signatures, denoted as function R. We consider multiple definitions of R, defined later in Section 2.3.

As a part of the disambiguation procedure, we leverage the one sense per discourse heuristic of Yarowsky (1995); given all the word types in two compared lexical items, each type is assigned a single sense, even if it is used multiple times. Additionally, if the same word type appears in both sentences, both will always be mapped to the same sense. Although such a sense assignment is potentially incorrect, assigning both types to the same sense results in a representation that does no worse than a surface-level comparison.

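The alignment procedure can be sketched as follows. The sense inventory `toy_senses` and the similarity function `R` below are hypothetical stand-ins for WordNet senses and signature similarity (Section 2.3), not the paper's actual resources:

```python
def align_disambiguate(types1, types2, senses, R):
    """Assign each word type in types1 the sense with maximal similarity
    to any sense of any word type in types2 (one sense per discourse)."""
    assignment = {}
    for w in types1:
        best_sense, best_score = None, float("-inf")
        for s in senses(w):
            # best similarity of sense s against every sense in the paired text
            score = max(R(s, s2) for w2 in types2 for s2 in senses(w2))
            if score > best_score:
                best_sense, best_score = s, score
        assignment[w] = best_sense
    return assignment

# Toy inventory for the example pair t1/t2 from Section 1
toy_senses = {
    "fire": ["fire#dismiss", "fire#shoot"],
    "manager": ["manager#head"],
    "terminate": ["terminate#dismiss"],
    "boss": ["boss#head"],
}
toy_sims = {
    frozenset(["fire#dismiss", "terminate#dismiss"]): 0.9,
    frozenset(["manager#head", "boss#head"]): 0.8,
}

def R(s1, s2):
    # symmetric toy signature similarity; 0.1 for unrelated sense pairs
    return toy_sims.get(frozenset([s1, s2]), 0.1)

assignment = align_disambiguate(["manager", "fire"], ["boss", "terminate"],
                                lambda w: toy_senses[w], R)
```

With this toy data, "fire" is aligned to its dismissal sense because that sense is maximally similar to a sense of "terminate" in the paired sentence.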
We illustrate the alignment-based disambiguation procedure using the two example sentences t1 and t2 given in Section 1. Figure 1(a) illustrates example alignments of the first sense of manager to the first two senses of the word types in sentence t2, along with the similarity of the two senses’ semantic signatures. Among all the possible pairings of the senses for the word types in sentence t2, the first nominal sense of manager obtains the maximal similarity value with the first nominal sense of boss, and as a result is selected as the sense labeling for manager in sentence t1. Figure 1(b) shows the final, maximally-similar sense alignment of the word types in t1 and t2. The resulting alignment produces the following sets of senses:

2.3 Semantic Signature Similarity

Cosine Similarity. In order to compare semantic signatures, we adopt the Cosine similarity measure as a baseline method. The measure is computed by treating each multinomial as a vector and then calculating the normalized dot product of the two signatures’ vectors.

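A minimal sketch of this baseline, assuming the two signatures are already available as probability vectors over the same sense indices:

```python
import numpy as np

def cosine(sig1, sig2):
    # normalized dot product of the two signature vectors
    return float(np.dot(sig1, sig2) /
                 (np.linalg.norm(sig1) * np.linalg.norm(sig2)))

# Toy signatures over a 4-sense inventory
a = np.array([0.5, 0.3, 0.2, 0.0])
b = np.array([0.4, 0.4, 0.1, 0.1])
sim = cosine(a, b)
```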
However, a semantic signature is, in essence, a weighted ranking of the importance of WordNet senses for each lexical item. Given that the WordNet graph has a non-uniform structure, and also given that different lexical items may be of different sizes, the magnitudes of the probabilities obtained may differ significantly between the two multinomial distributions. Therefore, for computing the similarity of two signatures, we also consider two nonparametric methods that use the ranking of the senses, rather than their probability values, in the multinomial.

Weighted Overlap. Our first measure provides a nonparametric similarity by comparing the rankings of the senses in the intersection of the two semantic signatures. We additionally weight the similarity such that differences in the highest ranks are penalized more than differences in the lower ranks. We refer to this measure as the Weighted Overlap. Let S denote the intersection of all senses with non-zero probability in both signatures, and let r_i^j denote the rank of sense si ∈ S in signature j, where rank 1 denotes the highest rank. The sum of the two ranks r_i^1 and r_i^2 for a sense is then inverted, which (1) weights higher ranks more and (2), when summed, attains its maximal value when a sense has the same rank in both signatures. The unnormalized weighted overlap is then calculated as Σ_{i=1}^{|S|} (r_i^1 + r_i^2)^{-1}. Then, to bound the similarity value in [0, 1], we normalize the sum by its maximum value, Σ_{i=1}^{|S|} (2i)^{-1}, which occurs when each sense has the same rank in both signatures.

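Representing each signature as a dict mapping a sense to its probability, the measure can be sketched as below (a simplified reading of the formula, not the authors' code):

```python
def weighted_overlap(sig1, sig2):
    """Rank-based similarity of two signatures (dicts: sense -> probability)."""
    shared = [s for s in sig1 if s in sig2]
    if not shared:
        return 0.0
    # rank 1 = highest-probability sense, computed over each full signature
    rank1 = {s: r for r, s in enumerate(sorted(sig1, key=sig1.get, reverse=True), 1)}
    rank2 = {s: r for r, s in enumerate(sorted(sig2, key=sig2.get, reverse=True), 1)}
    num = sum(1.0 / (rank1[s] + rank2[s]) for s in shared)
    # maximum value: every shared sense has the same rank in both signatures
    den = sum(1.0 / (2 * i) for i in range(1, len(shared) + 1))
    return num / den

same = weighted_overlap({"a": 0.5, "b": 0.3, "c": 0.2},
                        {"a": 0.5, "b": 0.3, "c": 0.2})    # identical rankings
flipped = weighted_overlap({"a": 0.5, "b": 0.3, "c": 0.2},
                           {"a": 0.2, "b": 0.3, "c": 0.5})  # reversed rankings
```

Identical rankings score exactly 1, and the penalty grows as disagreements move toward the top of the rankings.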
Top-k Jaccard. Our second measure uses the ranking to identify the top-k senses in a signature, which are treated as the best representatives of its conceptual associates. We hypothesize that a specific rank ordering may be attributed to small differences in the multinomial probabilities, which can lower rank-based similarities when one of the compared orderings is perturbed due to slightly different probability values. Therefore, we consider the top-k senses as an unordered set, with equal importance in the signature. To compare two signatures, we compute the Jaccard index of the two signatures’ sets:

Jac(U, V) = |Uk ∩ Vk| / |Uk ∪ Vk|

where Uk denotes the set of k senses with the highest probability in the semantic signature U.

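A sketch of this measure, again assuming signatures as sense-to-probability dicts:

```python
def topk_jaccard(sig1, sig2, k):
    """Jaccard index of the top-k sense sets of two signatures."""
    top1 = set(sorted(sig1, key=sig1.get, reverse=True)[:k])
    top2 = set(sorted(sig2, key=sig2.get, reverse=True)[:k])
    return len(top1 & top2) / len(top1 | top2)

# Toy signatures: they agree on the top-2 senses but diverge below that
sig1 = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}
sig2 = {"a": 0.40, "b": 0.35, "e": 0.20, "c": 0.05}
```

Because only set membership matters, small perturbations of the probabilities that reorder senses within the top k leave the score unchanged.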
3 Experiment 1: Textual Similarity

Measuring the semantic similarity of textual items has applications in a wide variety of NLP tasks. As our benchmark, we selected the recent SemEval-2012 task on Semantic Textual Similarity (STS), which was concerned with measuring the semantic similarity of sentence pairs. The task received considerable interest by facilitating a meaningful comparison between approaches.

3.1 Experimental Setup

Data. We follow the experimental setup used in the STS task (Agirre et al., 2012), which provided five test sets, two of which had accompanying training data sets for tuning system performance. Each sentence pair in the datasets was given a score from 0 to 5 (low to high similarity) by human judges, with a high inter-annotator agreement of around 0.90 when measured using the Pearson correlation coefficient. Table 1 lists the number of sentence pairs in the training and test portions of each dataset.

Comparison Systems. The top-ranking participating systems in the SemEval-2012 task were generally supervised systems utilizing a variety of lexical resources and similarity measurement techniques. We compare our results against the top three systems of the 88 submissions: TLsim and TLsyn, the two systems of Šarić et al. (2012), and the UKP2 system (Bär et al., 2012). UKP2 utilizes extensive resources, among which is a Distributional Thesaurus computed on 10M dependency-parsed English sentences. In addition, the system utilizes techniques such as Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) and makes use of resources such as Wiktionary and Wikipedia, a lexical substitution system based on supervised word sense disambiguation (Biemann, 2013), and a statistical machine translation system. The TLsim system uses the New York Times Annotated Corpus, Wikipedia, and Google Books Ngrams. The TLsyn system also uses Google Books Ngrams, as well as dependency parsing and named entity recognition.

~There is nothing noteworthy in the rest of the paper and it poses no reading difficulty, so I won't translate it.~

After the main text

Off I go to read more papers. This article is a bit old, but it is still quite worth learning from.
