AgreementMaker: Efficient Matching for Large Real-World Schemas and Ontologies (Translation)

Before the main text

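This paper is, once again, a previous work cited by a previous work of the framework-based ontology matching paper I read a few days ago. It is admittedly a little basic, but it is still a useful reference, so I downloaded the source code and plan to read the corresponding paper alongside it. I personally like reading Chinese and English side by side: I skim the less important parts in Chinese and go back to the original at the key passages. That is why I end up with so many translated versions.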

Citation: Cruz I F, Antonelli F P, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1586-1589.

(Header image borrowed from Google.)

Main text

Abstract

摘要

We present the AgreementMaker system for matching real world schemas and ontologies, which may consist of hundreds or even thousands of concepts. The end users of the system are sophisticated domain experts whose needs have driven the design and implementation of the system: they require a responsive, powerful, and extensible framework to perform, evaluate, and compare matching methods. The system comprises a wide range of matching methods addressing different levels of granularity of the components being matched (conceptual vs. structural), the amount of user intervention that they require (manual vs. automatic), their usage (stand-alone vs. composed), and the types of components to consider (schema only or schema and instances). Performance measurements (recall, precision, and runtime) are supported by the system, along with the weighted combination of the results provided by those methods. The AgreementMaker has been used and tested in practical applications and in the Ontology Alignment Evaluation Initiative (OAEI) competition. We report here on some of its most advanced features, including its extensible architecture that facilitates the integration and performance tuning of a variety of matching methods, its capability to evaluate, compare, and combine matching results, and its user interface with a control panel that drives all the matching methods and evaluation strategies.

咱們提出了AgreementMaker系統,用於匹配真實世界模式和本體,可能包含數百甚至數千個概念。系統的最終用戶是資深的領域專家,他們的需求推進了系統的設計和實現:他們須要一個響應迅速,功能強大且可擴展的框架來執行,評估和比較匹配方法。該系統包含多種匹配方法,能夠解決匹配的組件(概念與結構)的不一樣粒度級別,他們須要的用戶干預量(手動與自動),它們的使用(獨立與組合),以及要考慮的組件類型(僅架構或架構和實例)。系統支持性能測量(召回率,準確率和運行時性能),以及這些方法提供的結果的加權組合。 AgreementMaker已在實際應用和Ontology Alignment Evaluation Initiative(OAEI)競賽中使用和測試。咱們在此報告其一些最先進的功能,包括其可擴展的體系結構,有助於各類匹配方法的集成和性能調整,評估,比較和組合匹配結果的能力,以及控制全部匹配方法和評估策略的用戶界面和控制面板。

1. Introduction

1. 介紹

The issue of schema matching in databases [11], which has been investigated since the early 80’s, is fundamental to data integration, as is the closely-related issue of ontology alignment or matching [12]. The matching problem consists of defining mappings among schema or ontology elements that are semantically related. Such mappings are typically defined between two schemas or two ontologies at a time, one being called the source and the other being called the target.

自80年代早期以來一直在研究的數據庫[11]中的模式匹配問題是數據集成的基礎,與本體對齊或匹配密切相關的問題也是如此[12]。匹配問題包括定義在語義上相關的模式或本體元素之間的映射。這種映射一般在兩個模式或兩個本體之間定義,一個被稱爲源本體,另外一個被稱爲目標本體。

We have been developing the AgreementMaker matching system, whose name takes after agreement, the encoding of a mapping. The capabilities of our system have been driven by the real-world problems of end users who are sophisticated domain experts. We have considered a variety of domains and applications, including: geospatial [2], environmental [4], and biomedical [13]. The conceptual information for these applications is stored in the form of ontologies. However, as demonstrated by others, the same approach can be used for schema matching [1, 10]. To validate our approach, we competed against seven other systems in the biomedical track of the 2007 Ontology Alignment Evaluation Initiative (OAEI), to match ontologies describing the mouse adult anatomy of the Mouse Gene Expression Database Project (2744 classes) and the human anatomy of the National Cancer Institute (3304 classes). We came in third in terms of accuracy (F-measure) [5].

咱們一直在開發AgreementMaker匹配系統,其名稱源自agreement(協議),即映射的編碼。咱們系統的功能受到最終用戶的現實問題的驅動,這些最終用戶是資深的領域專家。咱們已經考慮了各類領域和應用,包括:地理空間[2],環境[4]和生物醫學[13]。這些應用程序的概念信息以本體的形式存儲。可是,正如其餘人所證實的那樣,相同的方法能夠用於模式匹配[1,10]。爲了驗證咱們的方法,咱們在2007年本體對齊評估計劃(OAEI)的生物醫學賽道中與其餘七個系統進行了競爭,以匹配描述小鼠基因表達數據庫項目(2744類)的成年小鼠解剖學的本體和國家癌症研究所(3304類)的人體解剖學本體。咱們在準確性(F-measure)方面排名第三[5]。

The AgreementMaker, which is currently in its third version, has been evolving to accommodate: (1) user requirements, as expressed by domain experts; (2) a wide range of input (ontology) and output (agreement file) formats; (3) a large choice of matching methods depending on the different granularity of the set of components being matched (local vs. global), on different features considered in the comparison (conceptual vs. structural), on the amount of intervention that they require from users (manual vs. automatic), on usage (stand-alone vs. composed), and on the types of components to consider (schema only or schema and instances); (4) improved performance, that is, accuracy (precision, recall, F-measure) and efficiency (execution time) for the automatic methods; (5) an extensible architecture to incorporate new methods easily and to tune their performance; (6) the capability to evaluate, compare, and combine different strategies and matching results; (7) a comprehensive user interface supporting both advanced visualization techniques and a control panel that drives all the matching methods and evaluation strategies.

目前處於第三版的AgreementMaker正在不斷髮展以適應:(1)領域專家表達的用戶需求; (2)普遍的輸入(本體)和輸出(協議文件)格式; (3)根據不一樣粒度的組件集的匹配選項(本地與全局),在比較中考慮的不一樣特徵(概念與結構),他們須要的來自用戶的干預量(手動與自動),使用(獨立與組合),以及要考慮的組件類型(僅架構或架構和實例); (4)改進性能,即自動方法的準確度(精確度,召回率,F測量值)和效率(執行時間); (5)可擴展的架構,能夠輕鬆地整合新方法並調整其性能; (6)評估,比較和組合不一樣策略和匹配結果的能力; (7)全面的用戶界面,支持高級可視化技術和控制面板,驅動全部匹配方法和評估策略。

In this demo paper, we focus on the most recent developments of the system, which has been almost completely redesigned in the last year. In particular, we describe: (1) the user interface with particular emphasis on the control panel and improved visualization and interaction capabilities; (2) the automatic matching methods and execution capabilities; and (3) the evaluation strategies for determining the efficiency of the matching methods and for performing the combination of results.

在本演示文章中,咱們將重點介紹該系統的最新發展,該系統在去年幾乎徹底從新設計。特別是,咱們描述:(1)用戶界面,特別強調控制面板和改進的可視化和交互功能; (2)自動匹配方法和執行能力; (3)用於肯定匹配方法的效率和執行結果組合的評估策略。

2. RELATED WORK

2.相關工作

There are several notable systems related to ours, including Clio [6], COMA++ [1], Falcon-AO [7], and RiMOM [14] (just to mention a few). Clio stands apart because of its single focus on database-specific constraints and operators (e.g., foreign keys, joins) to infer the mappings, whereas constraints in ontologies (as implemented in the other three systems and in AgreementMaker) are of a different nature [12]. This different emphasis also permeates the remaining components of the various systems, as those that also support ontology matching implement a rich toolbox of string-similarity and structural-based techniques and focus on performance. Consequently, some of these systems do not focus on user interaction: for example, Falcon-AO and RiMOM provide simple interfaces that offer limited user interaction (e.g., no manual manipulation of the ontologies). However, what separates AgreementMaker from these other systems (including from COMA++, which has a more sophisticated user interface than the other two) is the degree to which it integrates the evaluation of the quality of the obtained mappings with the graphical user interface and therefore with the iterative matching process. This tight integration emerged from our work with domain experts, who required that the evaluation be an integral part of the matching process, not an "add-on" capability.

有幾個與咱們相關的著名系統,包括Clio [6],COMA++ [1],Falcon-AO [7]和RiMOM [14](僅舉幾例)。 Clio之因此與衆不一樣,是由於它專一於特定於數據庫的約束和運算符(例如,外鍵,鏈接)來推斷映射,而本體中的約束(在其餘三個系統和AgreementMaker中實現)具備不一樣的性質[12]。這種不一樣的重點也滲透到各類系統的其他組件中,由於那些同時支持本體匹配的系統實現了豐富的字符串類似性和基於結構的技術工具箱,並專一於性能。所以,這些系統中的一些不關注用戶交互:例如,Falcon-AO和RiMOM只提供用戶交互有限的簡單界面(例如,沒有對本體的手動操縱)。然而,將AgreementMaker與其餘系統(包括COMA++,其具備比其餘兩個更復雜的用戶界面)區別開來的是它將得到的映射的質量評估與圖形用戶界面集成、進而與迭代匹配過程集成的程度(大意是能夠直接看到評估結果的改進?)。這種緊密集成源於咱們與領域專家的合做,他們要求評估是匹配過程當中不可或缺的一部分,而不是「附加」功能。

3. ARCHITECTURE

3.架構

The AgreementMaker supports a wide variety of methods or matchers. Our architecture (see Figure 1) allows for serial and parallel composition where, respectively, the output of one or more methods can be used as input to another one, or several methods can be used on the same input and then combined. A set of mappings may therefore be the result of a sequence of steps, called layers.

AgreementMaker支持各類方法或匹配器。咱們的體系結構(參見圖1)容許串行和並行組合,其中一個或多個方法的輸出能夠分別用做另外一個方法的輸入,或者能夠在同一輸入上使用多個方法而後組合。所以,一組映射多是一系列步驟的結果,稱爲層。

The matching process of a generic matcher (see Figure 2) can be divided into two main modules: (1) similarity computation in which each concept of the source ontology is compared with all the concepts of the target ontology, thus producing two similarity matrices (one for classes and the other one for properties), which contain a value for each pair of concepts; (2) mappings selection in which the matrix is scanned to select only the best mappings according to a given threshold and to the cardinality of the correspondences, for example, 1-1, 1-N, N-1, M-N.

通用匹配器的匹配過程(見圖2)能夠分爲兩個主要模塊:(1)類似度計算,其中源本體的每一個概念與目標本體的全部概念進行比較,從而產生兩個類似性矩陣(一個用於類,另外一個用於屬性),其中包含每對概念的值; (2)映射選擇,掃描矩陣以根據給定閾值和對應關係的基數僅選擇最佳映射,例如1-1,1-N,N-1,M-N
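The two modules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AgreementMaker code: the function names `similarity_matrix` and `select_mappings` are hypothetical, the similarity measure is the standard library's `difflib.SequenceMatcher` ratio rather than any matcher from the paper, only one matrix is built (the paper builds one for classes and one for properties), and the selection step simply keeps the best target per source concept.

```python
from difflib import SequenceMatcher


def similarity_matrix(source, target):
    """Module 1: compare every source label with every target label."""
    return [
        [SequenceMatcher(None, s.lower(), t.lower()).ratio() for t in target]
        for s in source
    ]


def select_mappings(matrix, threshold):
    """Module 2: for each source row keep the best target, if it reaches the threshold."""
    mappings = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)  # index of the best score
        if row[j] >= threshold:
            mappings.append((i, j, row[j]))
    return mappings


# Hypothetical toy labels, chosen only for illustration.
source = ["Heart", "Lung", "Tail"]
target = ["heart", "lungs", "kidney"]

matrix = similarity_matrix(source, target)
mappings = select_mappings(matrix, threshold=0.8)
for i, j, score in mappings:
    print(f"{source[i]} -> {target[j]} ({score:.2f})")
```

In this toy run "Tail" has no target above the threshold, so it stays unmapped; lowering the threshold would trade precision for recall.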

To enable extensibility, we adopted the object-oriented template pattern by defining the skeleton of the matching process in a generic matcher, which defers only a few operations to the concrete matcher extensions (see Figure 3). This abstraction minimizes development effort by completely decoupling the structure of a single method from the architecture of the whole system, thus allowing reuse or any possible composition of matching modules.

爲了實現可擴展性,咱們採用了面向對象的模板模式(???不懂):在通用匹配器中定義匹配過程的骨架,僅將少數操做推遲到具體的匹配器擴展(參見圖3)。這種抽象經過將單個方法的結構與整個系統的體系結構徹底解耦來最小化開發工做量,從而容許重用或任何可能的匹配模塊組合。
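For readers unfamiliar with the template (method) pattern mentioned above, here is a minimal Python sketch. The class and method names (`AbstractMatcher`, `similarity`, `select`, `ExactLabelMatcher`) are assumptions made for illustration and are not taken from the AgreementMaker source. The base class fixes the skeleton of the matching process, and a concrete matcher only fills in the deferred operation.

```python
from abc import ABC, abstractmethod


class AbstractMatcher(ABC):
    """Generic matcher: the skeleton of the process lives here."""

    def match(self, source, target, threshold=0.5):
        # Template method: the order of the steps is fixed in the base class.
        matrix = [[self.similarity(s, t) for t in target] for s in source]
        return self.select(matrix, threshold)

    @abstractmethod
    def similarity(self, s, t):
        """Deferred operation: each concrete matcher defines its own measure."""

    def select(self, matrix, threshold):
        # Default selection: every pair whose score reaches the threshold.
        return [
            (i, j)
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value >= threshold
        ]


class ExactLabelMatcher(AbstractMatcher):
    """Concrete extension: only the similarity measure is supplied."""

    def similarity(self, s, t):
        return 1.0 if s.strip().lower() == t.strip().lower() else 0.0


result = ExactLabelMatcher().match(["Heart", "Lung"], ["lung", "heart"])
print(result)
```

Adding a new matcher then means writing one small subclass, which is the reduction in development effort the paragraph refers to.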

A first layer matcher produces the similarity matrices, while the second and third layer matchers extend the first layer matchers. In particular, a second layer matcher improves on the results of a first layer matcher using conceptual or structural information, depending on whether it considers one concept alone or a concept and its neighbors. Finally, a third layer matcher combines the results of two or more matchers from the previous layers, in order to obtain a final matching or alignment, that is, a set of mappings.

第一層匹配器產生類似性矩陣,而第二和第三層匹配器擴展第一層匹配器。特別地,第二層匹配器使用概念或結構信息改進第一層匹配器的結果,這取決於它是單獨考慮一個概念仍是概念及其鄰居。最後,第三層匹配器組合來自先前層的兩個或更多個匹配器的結果,以便得到最終匹配或對齊,即一組映射。

4. USER INTERFACE

4.用戶界面

The source and target ontologies (in XML, RDFS, OWL, or N3) are visualized side by side using the familiar outline tree paradigm (see Figure 4). Agreements can be exported in different formats (e.g., XML, Excel). Because all the matching operations and their results are managed by this interface, we gave special consideration to its design [4]. We describe next two new features of the interface: the control panel and the visualization of non-hierarchical ontologies (e.g., due to multiple inheritance in OWL). The latter feature allows for specific subtrees to be visually duplicated. Because we adopt the Model-View-Control pattern, this duplication does not affect the underlying data structures. The control panel (see Figure 5) allows users to run and manage matching methods and their results. Users can select parameters common to all methods (such as threshold and cardinality) and method-specific parameters. When a method has run, a new row is dynamically added to the table that is part of the control panel at the same time that lines depicting the mappings between the concepts are added (see Figure 4). Each row is color coded and allows for its selection so that the corresponding mappings (of the same color) can be compared visually. Each row also displays the performance values for the associated methods, thus allowing for the comparison with those of other rows. In addition, users can modify at runtime the method parameters by changing directly their values in the table or by selecting previously calculated matchings as input to the methods to be applied next. Multiple matchings can also be combined manually or with an automatic combination matcher.

源和目標本體(在XML,RDFS,OWL或N3中)使用熟悉的大綱樹範例並排顯示(參見圖4)。匹配結果能夠以不一樣的格式導出(例如,XML,Excel)。因爲全部匹配操做及其結果均由此界面管理,所以咱們特別考慮了其設計[4]。接下來咱們介紹界面的兩個新功能:控制面板和非分層本體的可視化(例如,因爲OWL中的多重繼承)。後一特徵容許在視覺上複製特定的子樹。由於咱們採用模型-視圖-控制模式,因此這種複製不會影響底層數據結構。控制面板(參見圖5)容許用戶運行和管理匹配方法及其結果。用戶能夠選擇全部方法共有的參數(例如閾值和基數)和特定於方法的參數。當一個方法運行完畢時,一個新行被動態地添加到做爲控制面板一部分的表中,同時添加描述概念之間映射的連線(參見圖4)。每行都是彩色編碼的,並容許其選擇,以即可以在視覺上比較相應的映射(相同顏色)。每行還顯示相關方法的性能值,從而容許與其餘行的性能值進行比較。此外,用戶能夠在運行時經過直接更改表中的值或經過選擇先前計算的匹配結果做爲下一個要應用的方法的輸入來修改方法的參數。多個匹配也能夠手動組合或與自動組合匹配器組合。

5. MATCHING METHODS

5.匹配方法

First layer matchers compare concept features (e.g., label, comments, annotations, and instances) and use a variety of methods including syntactic and lexical comparison algorithms as well as the use of a lexicon like WordNet. Of those methods some were proposed by others (e.g., edit distance, Jaro-Winkler) and some devised by us, including a substring-based comparison that favors the length of the common substrings and a concept document-based comparison containing a wide range of features. Those features are represented as TF-IDF vectors and use a cosine similarity metric (see Figure 6).

第一層匹配器比較概念特徵(例如,標籤,評論,註釋和實例)並使用各類方法,包括句法和詞彙比較算法以及WordNet等詞典的使用。其中一些方法是由其餘人提出的(例如,編輯距離,Jaro-Winkler),另外一些則由咱們設計,包括偏重公共子串長度的基於子串的比較,以及包含普遍特徵的基於概念文檔的比較。這些特徵表示爲TF-IDF向量並使用餘弦類似性度量(參見圖6)。
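To make the TF-IDF and cosine step concrete, here is a small pure-Python sketch. It assumes each concept has already been flattened into one text "document" (label, comments and so on); the tokenization by whitespace, the smoothed IDF formula, and the names `tfidf_vectors` and `cosine` are choices made for this illustration, not details confirmed by the paper.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each concept document into a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)  # smoothed IDF (assumption)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concept documents.
docs = [
    "heart muscle organ pumping blood",
    "organ that pumps blood heart",
    "bone of the leg",
]
v_heart_a, v_heart_b, v_bone = tfidf_vectors(docs)
print(cosine(v_heart_a, v_heart_b), cosine(v_heart_a, v_bone))
```

The two heart descriptions share several terms and score well above zero, while the bone description shares none and scores exactly zero.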

Second layer matchers use structural properties of the ontologies. Our own methods include the Descendant’s Similarity Inheritance (DSI) and the Sibling’s Similarity Contribution (SSC) matchers [3].

第二層匹配器使用本體的結構屬性。咱們本身的方法包括後代的類似性遺傳(DSI)和兄弟姐妹的類似性貢獻(SSC)匹配[3]。

Finally, third layer matchers combine the results of two or more matchers so as to obtain a unique final matching in two steps. In the first step, a similarity matrix is built for each pair of concepts, using our Linear Weighted Combination (LWC) matcher, which processes the weighted average for the different similarity results (see Figure 7). Weights can be assigned manually or automatically, the latter assignment being determined using our evaluation methods. The second step uses that similarity matrix and takes into account a threshold value and the desired cardinality. When the cardinality is 1-1, we adopt the Shortest Augmenting Path algorithm [9] to find the optimal solution for this optimization problem (namely the assignment problem reduced to the maximum weight matching in a bipartite graph) in polynomial time.

最後,第三層匹配器組合兩個或更多匹配器的結果,以便在兩個步驟中得到惟一的最終匹配。在第一步中,使用咱們的線性加權組合(LWC)匹配器爲每對概念創建類似性矩陣,該匹配器計算不一樣類似性結果的加權平均值(參見圖7)。能夠手動或自動分配權重,後者使用咱們的評估方法肯定。第二步使用該類似性矩陣並考慮閾值和指望的基數。當基數爲1-1時,咱們採用最短增廣路徑算法[9],在多項式時間內找到該優化問題的最優解(即分配問題,可歸約爲二分圖中的最大權重匹配)。
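The two steps can be illustrated as follows. The weighted average follows the description of the LWC matcher; the 1-1 selection, however, is NOT the Shortest Augmenting Path algorithm used in the paper. For brevity it enumerates all assignments by brute force, which gives the same optimal result on tiny matrices but takes factorial time. The function names, the toy matrices and the weights are all hypothetical, and applying the threshold after the assignment is a simplification.

```python
from itertools import permutations


def linear_weighted_combination(matrices, weights):
    """Step 1: cell-wise weighted average of several similarity matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [
            sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def best_one_to_one(matrix, threshold):
    """Step 2: optimal 1-1 assignment by brute force (assumes rows <= columns)."""
    rows, cols = len(matrix), len(matrix[0])
    best = max(
        permutations(range(cols), rows),
        key=lambda perm: sum(matrix[i][j] for i, j in enumerate(perm)),
    )
    # Drop assigned pairs that fall below the threshold.
    return [(i, j) for i, j in enumerate(best) if matrix[i][j] >= threshold]


# Hypothetical outputs of two earlier matchers for 2 source x 2 target concepts.
lexical = [[0.9, 0.8], [0.8, 0.1]]
structural = [[0.7, 0.6], [1.0, 0.3]]

combined = linear_weighted_combination([lexical, structural], weights=[3, 1])
alignment = best_one_to_one(combined, threshold=0.5)
print(combined)
print(alignment)
```

Here both rows prefer target 0 (score 0.85 each), so picking the best cell per row would violate the 1-1 cardinality; the assignment step resolves the conflict by maximizing the total similarity.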

6. EVALUATION

6.評估

The design of optimal methods to find correct and complete mappings between real-world ontologies is a hard task for several reasons. First of all, an algorithm may be effective for a given scenario, but not for others. Even within the same scenario, the use of different parameters can change significantly the outcome. Moreover, in interviewing domain experts in the geospatial domain, we discovered that they do not trust automatic methods unless quality metrics are associated with the matching results. These observations have motivated a variety of evaluation techniques, that determine runtime and accuracy (precision, recall, and F-measure).

因爲幾個緣由,設計在現實世界本體之間找到正確和完整映射的最佳方法是一項艱鉅的任務。首先,算法可能對給定場景有效,但對其餘場景則無效。即便在相同的狀況下,使用不一樣的參數也能夠顯着改變結果。此外,在訪問地理空間域中的域專家時,咱們發現他們不信任自動方法,除非質量度量與匹配結果相關聯。這些觀察結果激發了各類評估技術,這些技術決定了運行時間和準確性(精確度,召回率和F測量值)。

The most effective evaluation technique compares the mappings found by the system between the two ontologies with a reference matching or "gold standard," which is a set of correct and complete mappings as built by domain experts. When a reference matching is available, the AgreementMaker can determine the quality of the found matching analytically or visually. A reference matching can also be used to tune algorithms by using a feedback mechanism provided by a succession of runs.

最有效的評估技術將系統在兩個本體之間發現的映射與參考匹配或「黃金標準」進行比較,後者是由領域專家構建的一組正確和完整的映射。當參考匹配可用時,AgreementMaker能夠分析或直觀地肯定找到的匹配的質量。參考匹配也能夠用於經過使用由一系列運行提供的反饋機制來調整算法。
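Comparing a found matching with a reference matching boils down to the standard precision, recall and F-measure definitions. The sketch below assumes mappings are represented as (source, target) pairs; the function name `evaluate` and the toy data are hypothetical.

```python
def evaluate(found, reference):
    """Return (precision, recall, F-measure) of found mappings vs. a reference."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical mappings, each a (source concept, target concept) pair.
found = [("Heart", "heart"), ("Lung", "lungs"), ("Tail", "toe")]
reference = [("Heart", "heart"), ("Lung", "lungs"), ("Paw", "foot"), ("Eye", "eye")]

precision, recall, f_measure = evaluate(found, reference)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

With two of three found mappings correct and two of four reference mappings recovered, precision is 2/3, recall is 1/2 and the F-measure is 4/7.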

When a gold standard is not available, "inherent" quality measures need to be considered. Quality measures can be defined at two levels as associated with the two main modules of a matcher (see Figure 2): similarity or selection level. We can consider local quality as associated with a correspondence at the similarity level (or mapping at the selection level) or global quality as associated with all the correspondences at the similarity level (or with all possible mappings at the selection level). We have incorporated in our system a global-selection quality measure proposed by others [8] and a local-similarity quality measure that we have devised. Experiments have shown that our quality measure is usually effective in defining weights for the LWC matcher.

若是沒有黃金標準,則須要考慮「固有的」質量度量。質量度量能夠在兩個級別定義,與匹配器的兩個主要模塊相關聯(參見圖2):類似性級別或選擇級別。咱們能夠考慮局部質量,即與類似性級別的單個對應(或選擇級別的單個映射)相關聯的質量;也能夠考慮全局質量,即與類似性級別的全部對應(或選擇級別的全部可能映射)相關聯的質量。咱們已經在咱們的系統中歸入了其餘人提出的全局選擇質量度量[8]以及咱們設計的局部類似性質量度量。實驗代表,咱們的質量度量一般在定義LWC匹配器的權重方面是有效的。

7. DEMONSTRATION

7.演示

Our demo focuses on the matching methods and evaluation strategies for determining the efficiency of ontology matching methods. Due to the tight integration of the evaluation strategies with the graphical user interface, a unique feature of our system, all the steps will be performed through the interface. Users will start by uploading their own ontologies, load our own, or download ontologies from the web, thus taking advantage of the several standard formats supported. Users can then explore the interface freely or follow a walk-through, consisting of browsing the ontologies, expanding and contracting nodes, and customizing the display. They have access to the information associated with each concept to be aligned, including descriptions, annotations, and (context) relations, and they can use them to visually detect mappings.

咱們的演示側重於肯定本體匹配方法的效率的匹配方法和評估策略。因爲評估策略與圖形用戶界面(咱們系統的獨特功能)的緊密集成,全部步驟都將經過界面執行。用戶將首先上傳他們本身的本體(加載咱們提供的本體,或從網上下載的本體)從而利用支持的幾種標準格式。而後,用戶能夠自由地瀏覽界面或按照演練進行瀏覽,包括瀏覽本體,擴展和收縮節點以及自定義顯示。他們能夠訪問與要對齊的每一個概念相關的信息,包括描述,註釋和(上下文)關係,他們可使用它們來直觀地檢測映射。

After the main text

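The first draft was produced by running CAJViewer text recognition on the paper, cleaning the result with Python, passing it through Google's document translation, and then merging everything together. It is therefore probably not very readable; I will correct it bit by bit once I have finished reading the paper.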
