[1] Keyword-based search engines are in widespread use today as a popular means for Web-based information retrieval.
[2] Although such systems seem deceptively simple, a considerable amount of skill is required in order to satisfy non-trivial information needs.
[3] This paper presents a new conceptual paradigm for performing search in context that largely automates the search process, providing even non-professional users with highly relevant results.
[4] This paradigm is implemented in practice in the IntelliZap system, where search is initiated from a text query marked by the user in a document she views, and is guided by the text surrounding the marked query in that document ("the context").
[5] The context-driven information retrieval process involves semantic keyword extraction and clustering to automatically generate new, augmented queries.
[6] The latter are submitted to a host of general and domain-specific search engines.
[7] Search results are then semantically reranked, using context. Experimental results testify that using context to guide search effectively offers even inexperienced users an advanced search tool on the Web.
[1] The core of IntelliZap technology is a semantic network, which provides a metric for measuring distances between pairs of words.
[2] The basic semantic network is implemented using a vector-based approach, where each word is represented as a vector in multi-dimensional space.
[3] To assign each word a vector representation, we first identified 27 knowledge domains (such as computers, business and entertainment) roughly partitioning the whole variety of topics.
[4] We then sampled a large set of documents in these domains on the Internet. Word vectors were obtained by recording the frequencies of each word in each knowledge domain.
[5] Each domain can therefore be viewed as an axis in the multi-dimensional space.
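The vector construction described above can be illustrated with a minimal sketch. The toy corpus, the three domain names, the whitespace tokenizer, and the function name `build_word_vectors` are all assumptions made for illustration; the text does not say how documents were tokenized or whether raw counts were normalized.

```python
from collections import Counter

# Hypothetical toy corpus (domain -> sampled documents); not IntelliZap data.
DOMAIN_DOCS = {
    "computers": ["the server runs linux", "install the software on the server"],
    "business": ["the market for software grew", "the bank raised capital"],
    "entertainment": ["the movie opened this week", "the band played a concert"],
}


def build_word_vectors(domain_docs):
    """Return (domains, vectors); vectors[word][i] is the raw count of
    `word` in the documents sampled for domains[i]."""
    domains = sorted(domain_docs)  # fixed axis order
    # Count word occurrences per domain using naive whitespace tokenization.
    counts = {
        d: Counter(w for doc in domain_docs[d] for w in doc.lower().split())
        for d in domains
    }
    vocab = set().union(*(counts[d].keys() for d in domains))
    # One coordinate per domain: each domain acts as an axis of the space.
    vectors = {w: [counts[d][w] for d in domains] for w in vocab}
    return domains, vectors


domains, vectors = build_word_vectors(DOMAIN_DOCS)
print(domains)
print("server ->", vectors["server"])
print("software ->", vectors["software"])
```

With 27 domains instead of three, each word would simply receive a 27-dimensional vector. Dividing each count by the domain's total word count is a plausible refinement when domains are sampled unevenly, but that step is not stated in the text.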
[6] The distance measure between word vectors is computed using a correlation-based metric:
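The formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes the metric is the Pearson correlation between two word vectors taken across the domain axes, and turns it into a distance as one minus the correlation. Both choices, and the function names, are assumptions rather than the authors' stated definition.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(u)
    if n == 0 or n != len(v):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        # A constant vector carries no domain preference; treat as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def correlation_distance(u, v):
    """Assumed distance: 0 for perfectly correlated vectors, 2 for opposite ones."""
    return 1.0 - pearson(u, v)


# Hypothetical 3-domain word vectors (axes: business, computers, entertainment).
server = [0, 2, 0]
software = [1, 1, 0]
movie = [0, 0, 1]
print(correlation_distance(server, software))
print(correlation_distance(server, movie))
```

Under this assumed definition, words whose frequencies rise and fall across the same domains end up close together, regardless of how frequent each word is overall.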
[1] Unfortunately, there are no accepted procedures for evaluating performance of semantic metrics.
[2] Following Resnik [1999], we evaluated different metrics by computing correlation between their scores and human-assigned scores for a list of word pairs.
[3] The intuition behind this approach is that a good metric should approximate human judgments well.
[4] While Resnik used a list of 30 noun pairs from Miller and Charles [1991], we opted for a more comprehensive evaluation.
[5] To this end, we prepared a diverse list of 350 noun pairs representing various degrees of similarity, and employed 16 subjects to estimate the "relatedness" of the words in pairs on a scale from 0 (totally unrelated words) to 10 (very much related or identical words).
[6] The vector-based metric achieved 41% correlation with averaged human scores, and the WordNet-based metric achieved 39% correlation. A linear combination of the two metrics achieved 55% correlation with human scores.
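The evaluation procedure can be sketched as follows: average the subjects' ratings for each word pair, then correlate each metric's scores with those averages. All numbers below are invented placeholders, not the study's data, and the mixing weight of 0.5 is hypothetical, since the text does not say how the linear combination was weighted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def average_ratings(ratings_per_pair):
    """Average the 0-10 ratings that the subjects gave to each word pair."""
    return [sum(r) / len(r) for r in ratings_per_pair]


def combine(scores_a, scores_b, weight):
    """Linear combination of two metrics; `weight` is a hypothetical parameter."""
    return [weight * a + (1 - weight) * b for a, b in zip(scores_a, scores_b)]


# Placeholder data: four word pairs, three subjects each (the study used 350 pairs, 16 subjects).
human_ratings = [[9, 8, 10], [1, 0, 2], [5, 6, 4], [7, 7, 7]]
vector_scores = [0.9, 0.2, 0.3, 0.6]   # made-up vector-based metric scores
wordnet_scores = [0.7, 0.1, 0.6, 0.9]  # made-up WordNet-based metric scores

human = average_ratings(human_ratings)
print("vector-based:", pearson(vector_scores, human))
print("WordNet-based:", pearson(wordnet_scores, human))
print("combined:", pearson(combine(vector_scores, wordnet_scores, 0.5), human))
```

In practice the weight would presumably be fitted, for example by least squares against the human scores, but that is a guess about the procedure rather than something the text states.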
[7] Currently, our semantic network is defined for the English language, though the technology can be adapted for other languages with minimal effort.
[8] This would require training the network using textual data for the desired language, properly partitioned into domains.
[9] Linguistic information can be added, subject to the availability of adequate tools for the target language (e.g., EuroWordNet for European languages [Euro WordNet] or EDR for Japanese [Yokoi 1995]).