Translation | Placing Search in Context: The Concept Revisited

Original

Abstract

[1] Keyword-based search engines are in widespread use today as a popular means for Web-based information retrieval.

[2] Although such systems seem deceptively simple, a considerable amount of skill is required in order to satisfy non-trivial information needs.

[3] This paper presents a new conceptual paradigm for performing search in context that largely automates the search process, providing even non-professional users with highly relevant results.

[4] This paradigm is implemented in practice in the IntelliZap system, where search is initiated from a text query marked by the user in a document she views, and is guided by the text surrounding the marked query in that document ("the context").

[5] The context-driven information retrieval process involves semantic keyword extraction and clustering to automatically generate new, augmented queries.

[6] The latter are submitted to a host of general and domain-specific search engines.

[7] Search results are then semantically reranked, using context. Experimental results testify that using context to guide search effectively offers even inexperienced users an advanced search tool on the Web.

Model Improvements

Section 1

[1] The core of IntelliZap technology is a semantic network, which provides a metric for measuring distances between pairs of words.

[2] The basic semantic network is implemented using a vector-based approach, where each word is represented as a vector in multi-dimensional space.

[3] To assign each word a vector representation, we first identified 27 knowledge domains (such as computers, business and entertainment) roughly partitioning the whole variety of topics.

[4] We then sampled a large set of documents in these domains on the Internet. Word vectors were obtained by recording the frequencies of each word in each knowledge domain.

[5] Each domain can therefore be viewed as an axis in the multi-dimensional space.
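The construction in sentences [3] to [5] can be illustrated with a small sketch. It is not IntelliZap's implementation: the function name `build_word_vectors`, the three toy domains, whitespace tokenization, and the normalization to relative frequencies are all assumptions made for the example.

```python
from collections import Counter


def build_word_vectors(domain_docs):
    """Map each word to a list of per-domain relative frequencies.

    domain_docs: dict of domain name -> list of document strings
    (a hypothetical input format chosen for this sketch).
    Returns (domains, vectors), where vectors[word][i] is the relative
    frequency of `word` in domains[i]; each domain acts as one axis.
    """
    domains = sorted(domain_docs)
    counts = {d: Counter() for d in domains}
    for d in domains:
        for doc in domain_docs[d]:
            # Naive whitespace tokenization; a real system would do more.
            counts[d].update(doc.lower().split())
    # Guard against empty domains so the division below never fails.
    totals = {d: sum(counts[d].values()) or 1 for d in domains}
    vocab = set().union(*(counts[d].keys() for d in domains))
    vectors = {w: [counts[d][w] / totals[d] for d in domains] for w in vocab}
    return domains, vectors


# Toy corpus with 3 domains instead of the 27 mentioned in the text.
toy_corpus = {
    "business": ["stock market rises", "bank reports profit"],
    "computers": ["keyboard and mouse", "software bank of memory"],
    "entertainment": ["movie star wins award"],
}
domains, vectors = build_word_vectors(toy_corpus)
print(domains)
print(vectors["bank"])
```

In this toy corpus "bank" has non-zero coordinates on the business and computers axes and zero on the entertainment axis, which is the kind of signal a distance metric over these vectors can use.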

[6] The distance measure between word vectors is computed using a correlation-based metric:
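The formula itself is not reproduced in this copy of the text. As a rough illustration only, one common correlation-based choice is the Pearson coefficient between two word vectors; whether IntelliZap uses exactly this form is an assumption, and the names `pearson` and `correlation_distance` are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(u) != len(v) or not u:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        # A constant vector carries no correlation signal; treat as 0.
        return 0.0
    return cov / (norm_u * norm_v)


def correlation_distance(u, v):
    """Assumed distance: 0 for perfectly correlated vectors, 2 for opposite."""
    return 1.0 - pearson(u, v)


print(correlation_distance([1, 2, 3], [2, 4, 6]))
print(correlation_distance([1, 2, 3], [3, 2, 1]))
```

Words whose frequencies rise and fall together across the domain axes get a distance near 0, while words with opposite domain profiles get a distance near 2.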

Section 2

[1] Unfortunately, there are no accepted procedures for evaluating performance of semantic metrics.

[2] Following Resnik [1999], we evaluated different metrics by computing correlation between their scores and human-assigned scores for a list of word pairs.

[3] The intuition behind this approach is that a good metric should approximate human judgments well.

[4] While Resnik used a list of 30 noun pairs from Miller and Charles [1991], we opted for a more comprehensive evaluation.

[5] To this end, we prepared a diverse list of 350 noun pairs representing various degrees of similarity, and employed 16 subjects to estimate the "relatedness" of the words in pairs on a scale from 0 (totally unrelated words) to 10 (very much related or identical words).

[6] The vector-based metric achieved 41% correlation with averaged human scores, and the WordNet-based metric achieved 39% correlation. A linear combination of the two metrics achieved 55% correlation with human scores.

[7] Currently, our semantic network is defined for the English language, though the technology can be adapted for other languages with minimal effort.

[8] This would require training the network using textual data for the desired language, properly partitioned into domains.

[9] Linguistic information can be added, subject to the availability of adequate tools for the target language (e.g., EuroWordNet for European languages [EuroWordNet] or EDR for Japanese [Yokoi 1995]).

Translation

Abstract

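[1] Keyword-based search engines are widely used today as a popular means of Web-based information retrieval.

[2] Although such systems look simple, considerable skill is needed to satisfy non-trivial information needs.

[3] This paper proposes a new conceptual paradigm for performing search in context; it largely automates the search process and returns highly relevant results even for non-professional users.

[4] The paradigm is implemented in the IntelliZap system, in which a search starts from a text query that the user marks in the document being viewed and is guided by the text surrounding the marked query in that document ("the context").

[5] The context-driven information retrieval process includes semantic keyword extraction and clustering, which automatically generate new, augmented queries.

[6] The latter are submitted to a range of general and domain-specific search engines.

[7] The search results are then semantically reranked using the context. Experimental results show that using context to guide search effectively gives even inexperienced users an advanced Web search tool.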

Model Improvements

Section 1

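[1] The core of IntelliZap technology is a semantic network, which provides a metric for measuring the distance between pairs of words.

[2] The basic semantic network is implemented with a vector-based approach, in which each word is represented as a vector in a multi-dimensional space.

[3] To assign each word a vector representation, we first identified 27 knowledge domains (such as computers, business, and entertainment) that roughly partition the full range of topics.

[4] We then sampled a large number of documents from these domains on the Internet, and obtained the word vectors by recording the frequency of each word in each knowledge domain.

[5] Each domain can therefore be regarded as an axis in the multi-dimensional space.

[6] The distance between word vectors is computed using a correlation-based metric: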

Section 2

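[1] Unfortunately, there is no accepted procedure for evaluating the performance of semantic metrics.

[2] Following Resnik [1999], we evaluated different metrics by computing the correlation between their scores and human-assigned scores for a list of word pairs.

[3] The intuition behind this approach is that a good metric should closely approximate human judgment.

[4] Whereas Resnik used a list of 30 noun pairs from Miller and Charles [1991], we chose a more comprehensive evaluation.

[5] To this end, we prepared a diverse list of 350 noun pairs representing different degrees of similarity, and had 16 subjects estimate the "relatedness" of the words in each pair on a scale from 0 (totally unrelated words) to 10 (very closely related or identical words).

[6] The vector-based metric reached a 41% correlation with the averaged human scores, and the WordNet-based metric reached 39%. A linear combination of the two metrics reached a 55% correlation with the human scores.

[7] At present, our semantic network is defined for English, although the technology can be adapted to other languages with minimal effort.

[8] This requires training the network on text data in the target language, properly partitioned into domains.

[9] Linguistic information can be added, depending on the availability of suitable tools for the target language (for example, EuroWordNet for European languages [EuroWordNet] or EDR for Japanese [Yokoi 1995]).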
