Source code: https://github.com/XBWer/WordSimilarityjava
When classifying code fragments, programmers tend to follow certain conventions when naming variables, so within the code for one particular piece of business logic several variable names may be related or similar in meaning (for example, a "trade" class may also contain "business", "transaction" and "deal"); in some cases these different words express the same thing. To classify code fragments more reliably, we therefore need a way to recognize such synonyms and avoid the misclassification they would otherwise cause. Put simply: we need to judge the similarity between words (i.e. decide whether they are near-synonyms) and find the group of meanings that occurs most often in a code fragment.
In other words, given a code fragment, we want to detect which of its words are synonyms of each other and use that to drive classification.
E.g.:

public static void function() {
    String trade = "money";
    int deal = 5;
    long business = 0xfffffff;
    boolean transaction = true;
    // ...
}
Output: the synonyms are trade, deal, business, transaction.
This code fragment is therefore most likely related to "trade".
Once the problem was pinned down, a web search turned up two relevant terms: WordNet and word2vec. (In hindsight, that search was itself an exercise in finding synonyms.)
First, WordNet. A quick search gives the following introduction:
WordNet is an English lexical database created and maintained by the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George A. Miller. Development began in 1985, and the project has since received more than three million US dollars of funding (mainly from government agencies interested in machine translation).
Because it contains semantic information, it differs from a dictionary in the usual sense. WordNet groups entries by meaning; each group of synonymous entries is called a synset (set of synonyms). WordNet provides a short, general definition for every synset and records the semantic relations between different synsets.
WordNet was developed with two goals:
- to be both a dictionary and a thesaurus, and easier to use than either on its own;
- to support automatic text analysis and artificial intelligence applications.
Internal structure of WordNet
In WordNet, nouns, verbs, adjectives and adverbs are each organized into their own network of synonyms. Every synset represents one basic semantic concept, and the synsets are linked to each other by various relations (a polysemous word appears in a synset for each of its senses). In the first versions of WordNet (labelled 1.x) the networks for the four parts of speech were not connected to one another. The noun network was the first to be developed.
The backbone of the noun network is the hierarchy of hyponymy (hypernym/hyponym relations), which accounts for nearly 80% of all relations. At the top of the hierarchy are 11 abstract concepts called unique beginners, for example entity ("something having concrete existence, living or nonliving") and psychological feature ("a feature of the mental life of a living organism"). The deepest branch of the noun hierarchy is 16 levels.
(wikipedia)
通俗地來講,WordNet是一個結構化很好的知識庫,它不但包括通常的詞典功能,另外還有詞的分類信息。目前,基於WordNet的方法相對來講比較成熟,好比路徑方法 (lch)、基於信息論方法(res)等。(詳見參考文獻)
With WordNet we essentially have the word database we need, so let's put similarity computation aside for the moment and start by downloading WordNet.
Following http://hi.baidu.com/buptyoyo/item/f13dfe463c061e3afb896028, downloading, installing and running the demo went smoothly.
Next, let's look at WordNet's directory layout.
The bin directory contains the executable WordNet 2.1.exe:
As you can see, WordNet classifies all English words and arranges them into a semantic tree. In this example: entity -> abstract entity -> abstraction -> attribute -> state -> feeling -> emotion -> love
(the browser displays the chain from the leaf node up to the root node).
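If you would rather walk this tree programmatically than through the GUI browser, JWNL (the Java WordNet library used in section 7 below) can print the hypernym chain. This is only a minimal sketch, assuming WordNet and the wordnet.xml properties file are already set up as in Test.java; the class name HypernymDemo is mine:

import java.io.FileInputStream;

import net.didion.jwnl.JWNL;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.data.PointerUtils;
import net.didion.jwnl.data.list.PointerTargetTree;
import net.didion.jwnl.dictionary.Dictionary;

public class HypernymDemo
{
    public static void main(String[] args) throws Exception
    {
        // the path to wordnet.xml is machine-specific (copied from Test.java below)
        JWNL.initialize(new FileInputStream("D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\wordnet.xml"));

        // look up the noun "love" and take its first sense
        IndexWord love = Dictionary.getInstance().getIndexWord(POS.NOUN, "love");

        // climb the is-a (hypernym) hierarchy towards the root and print every level
        PointerTargetTree hypernyms = PointerUtils.getInstance().getHypernymTree(love.getSense(1));
        hypernyms.print();
    }
}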
The 25 basic categories in WordNet's noun classification:
The dict directory holds the actual data files; as you can see, they are organized by adjective, adverb, noun and verb:
doc is the folder for the WordNet user manual;
lib holds the function libraries the WordNet software uses for Windows resources;
src is the source code folder.
We first take WordNet's lexical-semantic classification as our basis and extract the synonyms from it, then compute similarity with a vector-space-based method. The workflow is as follows (a rough code sketch of the whole pipeline is given right below):
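Here is a sketch of how this workflow could be wired together. It is my own illustration, not code from the post: it reuses the SimilarityMeasure class analyzed in section 7 and the file paths from Test.java, and the 0.1 threshold is an arbitrary assumption. The idea: collect the identifier words from a code fragment, score every pair, and report the word with the most "similar enough" neighbours.

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

import net.didion.jwnl.JWNL;
import shef.nlp.wordnet.similarity.SimilarityMeasure;

public class SynonymGroupSketch
{
    public static void main(String[] args) throws Exception
    {
        JWNL.initialize(new FileInputStream("D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\wordnet.xml"));

        Map<String,String> params = new HashMap<String,String>();
        params.put("simType", "shef.nlp.wordnet.similarity.JCn");
        params.put("infocontent", "file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\ic-bnc-resnik-add1.dat");
        SimilarityMeasure sim = SimilarityMeasure.newInstance(params);

        // identifier words extracted from the example code fragment at the top of the post
        String[] words = {"trade", "deal", "business", "transaction"};
        double threshold = 0.1; // assumed cut-off, would need tuning on real data

        // for each word, count how many of the other words score above the threshold,
        // and report the word with the most neighbours as the fragment's likely topic
        String best = null;
        int bestCount = -1;
        for (String w1 : words)
        {
            int count = 0;
            for (String w2 : words)
            {
                if (w1.equals(w2)) continue;
                if (sim.getSimilarity(w1, w2).getSimilarity() >= threshold) count++;
            }
            if (count > bestCount)
            {
                bestCount = count;
                best = w1;
            }
        }

        System.out.println("This fragment is most likely about: " + best);
    }
}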
The following is excerpted from 《基於WordNet的英語詞語相似度計算》 (Computing English Word Similarity Based on WordNet):
Similarity scores against "trade":
Analysis:
First pair: trade vs trade
A word compared with itself is of course 100% similar.
Second pair: trade#n#5 vs deal#n#1
Surprisingly, the similarity is the same as in the first pair! According to this result, the 5th sense of the noun "trade" and the 1st sense of the noun "deal" are exactly alike. Let's look both up in the database to see why:
trade#n#5:
deal#n#1:
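To reproduce this particular comparison yourself, the word#pos#sense encoding documented on SimilarityMeasure.getSimilarity(String, String) lets you pin both senses directly (sim is the measure built in Test.java below):

System.out.println(sim.getSimilarity("trade#n#5", "deal#n#1")); // compare exactly these two senses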
Now a pair that is harder to make sense of:
trade#n#7 vs sunshine#n#2
Their similarity comes out at 0.14+, which is fairly high. Why is that?
trade#n#7:
sunshine#n#2:
Comparing the two glosses above, the reason should be clear.
Similarity scores against "cat":
7. Code analysis
Project structure:
Test.java
package JWordNetSim.test;

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

import net.didion.jwnl.JWNL;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.dictionary.Dictionary;
import shef.nlp.wordnet.similarity.SimilarityMeasure;

/**
 * A simple test of this WordNet similarity library.
 * @author Mark A. Greenwood
 */
public class Test
{
    public static void main(String[] args) throws Exception
    {
        // WordNet 2.0 must be installed on this machine before running the code;
        // only 2.0 works -- installing 2.1 causes errors
        JWNL.initialize(new FileInputStream("D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\wordnet.xml"));

        // create a map to hold the configuration parameters
        Map<String,String> params = new HashMap<String,String>();

        // the simType parameter is the class name of the measure to use
        params.put("simType","shef.nlp.wordnet.similarity.JCn");

        // this param should be the URL to an infocontent file (if required
        // by the similarity measure being loaded)
        params.put("infocontent","file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\ic-bnc-resnik-add1.dat");

        // this param should be the URL to a mapping file if the
        // user needs to make synset mappings
        params.put("mapping","file:D:\\JAVAProjectWorkSpace\\jwnl\\JWordNetSim\\test\\domain_independent.txt");

        // create the similarity measure
        SimilarityMeasure sim = SimilarityMeasure.newInstance(params);

        // look up the words directly
//        Dictionary dict = Dictionary.getInstance();
//        IndexWord word1 = dict.getIndexWord(POS.NOUN, "trade"); // trade and dog are treated strictly as nouns here
//        IndexWord word2 = dict.getIndexWord(POS.NOUN, "dog");
//
//        // and get the similarity between the first senses of each word
//        System.out.println(word1.getLemma()+"#"+word1.getPOS().getKey()+"#1 " + word2.getLemma()+"#"+word2.getPOS().getKey()+"#1 " + sim.getSimilarity(word1.getSense(1), word2.getSense(1)));
//
//        // get similarity using the string methods (note this also makes use
//        // of the fake root node)
//        System.out.println(sim.getSimilarity("trade#n","deal#n"));

        // get a similarity that involves a mapping
        System.out.println(sim.getSimilarity("trade", "trade"));
        System.out.println(sim.getSimilarity("trade", "deal"));
        System.out.println(sim.getSimilarity("trade", "commerce"));
        System.out.println(sim.getSimilarity("trade", "transaction"));
        System.out.println(sim.getSimilarity("trade", "finance"));
        System.out.println(sim.getSimilarity("trade", "financial"));
        System.out.println(sim.getSimilarity("trade", "business"));
        System.out.println(sim.getSimilarity("trade", "economy"));
        System.out.println(sim.getSimilarity("trade", "school"));
        System.out.println(sim.getSimilarity("trade", "dog"));
        System.out.println(sim.getSimilarity("trade", "cat"));
        System.out.println(sim.getSimilarity("trade", "book"));
        System.out.println(sim.getSimilarity("trade", "sunshine"));
        System.out.println(sim.getSimilarity("trade", "smile"));
        System.out.println(sim.getSimilarity("trade", "nice"));
        System.out.println(sim.getSimilarity("trade", "hardly"));
        System.out.println(sim.getSimilarity("trade", "beautiful"));
    }
}
SimilarityMeasure.java
package shef.nlp.wordnet.similarity;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import net.didion.jwnl.JWNLException;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.data.Synset;
import net.didion.jwnl.dictionary.Dictionary;

/**
 * An abstract notion of a similarity measure that all provided
 * implementations extend.
 * @author Mark A. Greenwood
 */
public abstract class SimilarityMeasure
{
    /**
     * A mapping of terms to specific synsets. Usually used to map domain
     * terms to a restricted set of synsets but can also be used to map
     * named entity tags to appropriate synsets.
     */
    private Map<String,Set<Synset>> domainMappings = new HashMap<String,Set<Synset>>();

    /**
     * The maximum size the cache can grow to.
     */
    private int cacheSize = 5000;

    /**
     * To speed up computation of the similarity between two synsets
     * we cache each similarity that is computed so we only have to
     * do each one once.
     */
    private Map<String,Double> cache = new LinkedHashMap<String,Double>(16,0.75f,true)
    {
        public boolean removeEldestEntry(Map.Entry<String,Double> eldest)
        {
            //if the size is less than zero then the user is asking us
            //not to limit the size of the cache so return false
            if (cacheSize < 0) return false;

            //if the cache has grown bigger than its max size return true
            return size() > cacheSize;
        }
    };

    /**
     * Get a previously computed similarity between two synsets from the cache.
     * @param s1 the first synset between which we are looking for the similarity.
     * @param s2 the other synset between which we are looking for the similarity.
     * @return The similarity between the two sets or null if it is not in the cache.
     */
    protected final Double getFromCache(Synset s1, Synset s2)
    {
        return cache.get(s1.getKey()+"-"+s2.getKey());
    }

    /**
     * Add a computed similarity between two synsets to the cache so that
     * we don't have to compute it if it is needed in the future.
     * @param s1 one of the synsets between which we are storing a similarity.
     * @param s2 the other synset between which we are storing a similarity.
     * @param sim the similarity between the two supplied synsets.
     * @return the similarity score just added to the cache.
     */
    protected final double addToCache(Synset s1, Synset s2, double sim)
    {
        cache.put(s1.getKey()+"-"+s2.getKey(),sim);

        return sim;
    }

    /**
     * Configures the similarity measure using the supplied parameters.
     * @param params a set of key-value pairs that are used to configure
     *        the similarity measure. See concrete implementations for details
     *        of expected/possible parameters.
     * @throws Exception if an error occurs while configuring the similarity measure.
     */
    protected abstract void config(Map<String,String> params) throws Exception;

    /**
     * Create a new instance of a similarity measure.
     * @param confURL the URL of a configuration file. Parameters are specified
     *        one per line as key:value pairs.
     * @return a new instance of a similarity measure as defined by the
     *         supplied configuration URL.
     * @throws Exception if an error occurs while creating the similarity measure.
     */
    public static SimilarityMeasure newInstance(URL confURL) throws Exception
    {
        //create map to hold the key-value pairs we are going to read from
        //the configuration file
        Map<String,String> params = new HashMap<String,String>();

        //create a reader for the config file
        BufferedReader in = null;

        try
        {
            //open the config file
            in = new BufferedReader(new InputStreamReader(confURL.openStream()));

            String line = in.readLine();
            while (line != null)
            {
                line = line.trim();

                if (!line.equals(""))
                {
                    //if the line contains something then

                    //split the data so we get the key and value
                    String[] data = line.split("\\s*:\\s*",2);

                    if (data.length == 2)
                    {
                        //if the line is valid add the two parts to the map
                        params.put(data[0], data[1]);
                    }
                    else
                    {
                        //if the line isn't valid tell the user but continue on
                        //with the rest of the file
                        System.out.println("Config Line is Malformed: " + line);
                    }
                }

                //get the next line ready to process
                line = in.readLine();
            }
        }
        finally
        {
            //close the config file if it got opened
            if (in != null) in.close();
        }

        //create and return a new instance of the similarity measure specified
        //by the config file
        return newInstance(params);
    }

    /**
     * Creates a new instance of a similarity measure using the supplied parameters.
     * @param params a set of key-value pairs which define the similarity measure.
     * @return the newly created similarity measure.
     * @throws Exception if an error occurs while creating the similarity measure.
     */
    public static SimilarityMeasure newInstance(Map<String,String> params) throws Exception
    {
        //get the class name of the implementation we need to load
        String name = params.remove("simType");

        //if the name hasn't been specified then throw an exception
        if (name == null) throw new Exception("Must specifiy the similarity measure to use");

        //Get hold of the class we need to load
        @SuppressWarnings("unchecked") Class<SimilarityMeasure> c = (Class<SimilarityMeasure>)Class.forName(name);

        //create a new instance of the similarity measure
        SimilarityMeasure sim = c.newInstance();

        //get the cache parameter from the config params
        String cSize = params.remove("cache");

        //if a cache size was specified then set it
        if (cSize != null) sim.cacheSize = Integer.parseInt(cSize);

        //get the url of the domain mapping file
        String mapURL = params.remove("mapping");

        if (mapURL != null)
        {
            //if a mapping file has been provided then

            //open a reader over the file
            BufferedReader in = new BufferedReader(new InputStreamReader((new URL(mapURL)).openStream()));

            //get the first line ready for processing
            String line = in.readLine();

            while (line != null)
            {
                if (!line.startsWith("#"))
                {
                    //if the line isn't a comment (i.e. it doesn't start with #) then...

                    //split the line at the white space
                    String[] data = line.trim().split("\\s+");

                    //create a new set to hold the mapped synsets
                    Set<Synset> mappedTo = new HashSet<Synset>();

                    for (int i = 1 ; i < data.length ; ++i)
                    {
                        //for each synset mapped to get the actual Synsets
                        //and store them in the set
                        mappedTo.addAll(sim.getSynsets(data[i]));
                    }

                    //if we have found some actual synsets then
                    //store them in the domain mappings
                    if (mappedTo.size() > 0) sim.domainMappings.put(data[0], mappedTo);
                }

                //get the next line from the file
                line = in.readLine();
            }

            //we have finished with the mappings file so close it
            in.close();
        }

        //make sure it is configured properly
        sim.config(params);

        //then return it
        return sim;
    }

    /**
     * This is the method responsible for computing the similarity between two
     * specific synsets. The method is implemented differently for each
     * similarity measure so see the subclasses for detailed information.
     * @param s1 one of the synsets between which we want to know the similarity.
     * @param s2 the other synset between which we want to know the similarity.
     * @return the similarity between the two synsets.
     * @throws JWNLException if an error occurs accessing WordNet.
     */
    public abstract double getSimilarity(Synset s1, Synset s2) throws JWNLException;

    /**
     * Get the similarity between two words. The words can be specified either
     * as just the word or in an encoded form including the POS tag and possibly
     * the sense number, i.e. cat#n#1 would specify the 1st sense of the noun cat.
     * @param w1 one of the words to compute similarity between.
     * @param w2 the other word to compute similarity between.
     * @return a SimilarityInfo instance detailing the similarity between the
     *         two words specified.
     * @throws JWNLException if an error occurs accessing WordNet.
     */
    public final SimilarityInfo getSimilarity(String w1, String w2) throws JWNLException
    {
        //Get the (possibly) multiple synsets associated with each word
        Set<Synset> ss1 = getSynsets(w1);
        Set<Synset> ss2 = getSynsets(w2);

        //assume the words are not at all similar
        SimilarityInfo sim = null;

        for (Synset s1 : ss1)
        {
            for (Synset s2 : ss2)
            {
                //for each pair of synsets get the similarity
                double score = getSimilarity(s1, s2);

                if (sim == null || score > sim.getSimilarity())
                {
                    //if the similarity is better than we have seen before
                    //then create and store an info object describing the
                    //similarity between the two synsets
                    sim = new SimilarityInfo(w1, s1, w2, s2, score);
                }
            }
        }

        //return the maximum similarity we have found
        return sim;
    }

    /**
     * Finds all the synsets associated with a specific word.
     * @param word the word we are interested in. Note that this may be encoded
     *        to include information on POS tag and sense index.
     * @return a set of synsets that are associated with the supplied word
     * @throws JWNLException if an error occurs accessing WordNet
     */
    private final Set<Synset> getSynsets(String word) throws JWNLException
    {
        //get a handle on the WordNet dictionary
        Dictionary dict = Dictionary.getInstance();

        //create an empty set to hold any synsets we find
        Set<Synset> synsets = new HashSet<Synset>();

        //split the word on the # characters so we can get at the
        //up to three components that could be present: word, POS tag, sense index
        String[] data = word.split("#");

        //if the word is in the domainMappings then simply return the mappings
        if (domainMappings.containsKey(data[0])) return domainMappings.get(data[0]);

        if (data.length == 1)
        {
            //if there is just the word

            for (IndexWord iw : dict.lookupAllIndexWords(data[0]).getIndexWordArray())
            {
                //for each matching word in WordNet add all its senses to
                //the set we are building up
                synsets.addAll(Arrays.asList(iw.getSenses()));
            }

            //we have finished so return the synsets we found
            return synsets;
        }

        //the calling method specified a POS tag as well so get that
        POS pos = POS.getPOSForKey(data[1]);

        //if the POS tag isn't valid throw an exception
        if (pos == null) throw new JWNLException("Invalid POS Tag: " + data[1]);

        //get the word with the specified POS tag from WordNet
        IndexWord iw = dict.getIndexWord(pos, data[0]);

        if (data.length > 2)
        {
            //if the calling method specified a sense index then
            //add just that synset to the set we are creating
            synsets.add(iw.getSense(Integer.parseInt(data[2])));
        }
        else
        {
            //no sense index was specified so add all the senses of
            //the word to the set we are creating
            synsets.addAll(Arrays.asList(iw.getSenses()));
        }

        //return the set of synsets we found for the specified word
        return synsets;
    }
}
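As a usage note (my own illustration, not from the post): the newInstance(URL) overload above reads the same parameters from a plain text file, one key:value pair per line (blank lines are ignored; lines without a colon are reported as malformed). A file equivalent to the configuration built in Test.java might therefore look like this, where the file name jcn.conf is an assumption:

simType: shef.nlp.wordnet.similarity.JCn
infocontent: file:D:\JAVAProjectWorkSpace\jwnl\JWordNetSim\test\ic-bnc-resnik-add1.dat
mapping: file:D:\JAVAProjectWorkSpace\jwnl\JWordNetSim\test\domain_independent.txt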
Every method has detailed comments, so the code should be easy to follow.
The nested loop in getSimilarity(String w1, String w2) works as follows: for every synset of the first word and every synset of the second, it computes the synset-level similarity and keeps the best-scoring pair.
JCn.java
/*************************************************************************
 * Copyright (C) 2006-2007 The University of Sheffield
 * Developed by Mark A. Greenwood <m.greenwood@dcs.shef.ac.uk>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 ************************************************************************/

package shef.nlp.wordnet.similarity;

import net.didion.jwnl.JWNLException;
import net.didion.jwnl.data.Synset;

/**
 * An implementation of the WordNet similarity measure developed by Jiang and
 * Conrath. For full details of the measure see:
 * <blockquote>Jiang J. and Conrath D. 1997. Semantic similarity based on corpus
 * statistics and lexical taxonomy. In Proceedings of International
 * Conference on Research in Computational Linguistics, Taiwan.</blockquote>
 * @author Mark A. Greenwood
 */
public class JCn extends ICMeasure
{
    /**
     * Instances of this similarity measure should be generated using the
     * factory methods of {@link SimilarityMeasure}.
     */
    protected JCn()
    {
        //A protected constructor to force the use of the newInstance method
    }

    @Override public double getSimilarity(Synset s1, Synset s2) throws JWNLException
    {
        //if the POS tags are not the same then return 0 as this measure
        //only works with 2 nouns or 2 verbs.
        if (!s1.getPOS().equals(s2.getPOS())) return 0;

        //see if the similarity is already cached and...
        Double cached = getFromCache(s1, s2);

        //if it is then simply return it
        if (cached != null) return cached.doubleValue();

        //Get the Information Content (IC) values for the two supplied synsets
        double ic1 = getIC(s1);
        double ic2 = getIC(s2);

        //if either IC value is zero then cache and return a sim of 0
        if (ic1 == 0 || ic2 == 0) return addToCache(s1,s2,0);

        //Get the Lowest Common Subsumer (LCS) of the two synsets
        Synset lcs = getLCSbyIC(s1,s2);

        //if there isn't an LCS then cache and return a sim of 0
        if (lcs == null) return addToCache(s1,s2,0);

        //get the IC value of the LCS
        double icLCS = getIC(lcs);

        //compute the distance between the two synsets
        //NOTE: This is the original JCN measure
        double distance = ic1 + ic2 - (2 * icLCS);

        //assume the similarity between the synsets is 0
        double sim = 0;

        if (distance == 0)
        {
            //if the distance is 0 (i.e. ic1 + ic2 = 2 * icLCS) then...

            //get the root frequency for this POS tag
            double rootFreq = getFrequency(s1.getPOS());

            if (rootFreq > 0.01)
            {
                //if the root frequency has a value then use it to generate a
                //very large sim value
                sim = 1/-Math.log((rootFreq - 0.01) / rootFreq);
            }
        }
        else
        {
            //this is the normal case so just convert the distance
            //to a similarity by taking the multiplicative inverse
            sim = 1/distance;
        }

        //cache and return the calculated similarity
        return addToCache(s1,s2,sim);
    }
}
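To make the arithmetic concrete, here is a small worked example with made-up IC values (the numbers are purely illustrative, not taken from the BNC infocontent file). With IC(s_1) = 5.2, IC(s_2) = 6.1 and IC(LCS) = 4.8:

distance = 5.2 + 6.1 - 2(4.8) = 1.7, so sim = 1 / 1.7 ≈ 0.59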
Lin.java
package shef.nlp.wordnet.similarity;

import net.didion.jwnl.JWNLException;
import net.didion.jwnl.data.Synset;

/**
 * An implementation of the WordNet similarity measure developed by Lin. For
 * full details of the measure see:
 * <blockquote>Lin D. 1998. An information-theoretic definition of similarity. In
 * Proceedings of the 15th International Conference on Machine
 * Learning, Madison, WI.</blockquote>
 * @author Mark A. Greenwood
 */
public class Lin extends ICMeasure
{
    /**
     * Instances of this similarity measure should be generated using the
     * factory methods of {@link SimilarityMeasure}.
     */
    protected Lin()
    {
        //A protected constructor to force the use of the newInstance method
    }

    @Override public double getSimilarity(Synset s1, Synset s2) throws JWNLException
    {
        //if the POS tags are not the same then return 0 as this measure
        //only works with 2 nouns or 2 verbs.
        if (!s1.getPOS().equals(s2.getPOS())) return 0;

        //see if the similarity is already cached and...
        Double cached = getFromCache(s1, s2);

        //if it is then simply return it
        if (cached != null) return cached.doubleValue();

        //Get the Information Content (IC) values for the two supplied synsets
        double ic1 = getIC(s1);
        double ic2 = getIC(s2);

        //if either IC value is zero then cache and return a sim of 0
        if (ic1 == 0 || ic2 == 0) return addToCache(s1,s2,0);

        //Get the Lowest Common Subsumer (LCS) of the two synsets
        Synset lcs = getLCSbyIC(s1,s2);

        //if there isn't an LCS then cache and return a sim of 0
        if (lcs == null) return addToCache(s1,s2,0);

        //get the IC value of the LCS
        double icLCS = getIC(lcs);

        //calculate the similarity score
        double sim = (2*icLCS)/(ic1+ic2);

        //cache and return the calculated similarity
        return addToCache(s1,s2,sim);
    }
}
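Unlike JCn, whose 1/distance scores have no fixed upper bound, Lin's score always lies in [0, 1], because IC(LCS) can never exceed the IC of either synset (an ancestor is at least as probable as its descendants). With the same made-up numbers as above, sim = 2(4.8) / (5.2 + 6.1) = 9.6 / 11.3 ≈ 0.85. Switching Test.java over to this measure only requires changing the simType parameter (as far as I can tell the infocontent file is still needed, since Lin also extends ICMeasure):

params.put("simType", "shef.nlp.wordnet.similarity.Lin"); // use the Lin measure instead of JCn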
References:
盛志超, 陶曉鵬 (復旦大學計算機科學技術學院). 《基於維基百科的語義相似度計算》 [Semantic Similarity Computation Based on Wikipedia].
顏偉, 荀恩東 (北京語言大學語言信息處理研究所). 《基於WordNet的英語詞語相似度計算》 [Computing English Word Similarity Based on WordNet].
Nouns in WordNet: http://ccl.pku.edu.cn/doubtfire/semantics/wordnet/c-wordnet/nouns-in-wordnet.htm
http://jxr19830617.blog.163.com/blog/static/163573067201301985219857/
http://jxr19830617.blog.163.com/blog/static/1635730672013019105255295/