My Architecture Evolution Notes 11: Customizing the ansj Tokenizer for ES: Dynamic Support for Stop Words and Synonyms

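The previous article mentioned this approach; this article covers it as a topic of its own.

The architecture is as follows:

To support adding vocabulary dynamically, the ansj tokenizer uses Redis here.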

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

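First, be clear about what dynamic support means:

1) Entries can be added and removed dynamically in memory

2) Entries can be added and removed dynamically in the file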

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

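Let's solve item 2 first: dynamic support in the file.

From the AddTermRedisPubSub class we can see that file handling is provided by the FileUtils class.

Add the following two methods to FileUtils: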

 

public static void appendStopWord(String content) {
		try {
			// Default location: $ES_HOME/config/ansj/stopLibrary.dic
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_STOP_FILE_LIB_PATH);
			appendFile(content, file);
		} catch (IOException e) {
			// Passing e to the logger already records the stack trace
			logger.error("append stop word exception", e, new Object[0]);
		}
	}

	public static void removeStopWord(String content) {
		try {
			// Default location: $ES_HOME/config/ansj/stopLibrary.dic
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_STOP_FILE_LIB_PATH);
			removeFile(content, file, false);
		} catch (FileNotFoundException e) {
			logger.error("file not found $ES_HOME/config/ansj/stopLibrary.dic",
					e, new Object[0]);
		} catch (IOException e) {
			logger.error("remove stop word exception", e, new Object[0]);
		}
	}
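
The bodies of appendFile and removeFile are not shown in this post. As a rough sketch of what the file-side removal amounts to, the following rewrites a dictionary without the lines equal to a given word. The class name DicLineRemover and both method names are hypothetical, and exact matching on trimmed lines is an assumption; the plugin's own match(content, text, head) logic may behave differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the ansj plugin.
class DicLineRemover {

    // Returns a copy of lines without the entries equal to word (both trimmed).
    static List<String> withoutWord(List<String> lines, String word) {
        String target = word.trim();
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            if (!line.trim().equals(target)) {
                kept.add(line);
            }
        }
        return kept;
    }

    // Rewrites the dictionary in place; Files.write ends every entry with a line separator.
    static void removeFromFile(Path dic, String word) {
        try {
            List<String> lines = Files.readAllLines(dic, StandardCharsets.UTF_8);
            Files.write(dic, withoutWord(lines, word), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the filtering in a pure method makes it easy to check without touching the real dictionary file.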

During testing I found that adding a stop word prints some unnecessary log lines:

[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswill
[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswith
[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswithin
[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswithout

So, in the removeFile method of the FileUtils class, the statement

logger.info("match is {} text is{}",
					new Object[] { Boolean.valueOf(match(content, text, head)),
							text });

 

just needs to be commented out.

Add the following branch to the AddTermRedisPubSub class:

 

else if ("stop".equals(msg[0])) {
			if ("c".equals(msg[1])) {
				// add one stopWord into memory
				AnsjElasticConfigurator.filter.add(msg[2]);
				// add one stopWord into file
				FileUtils.appendStopWord(msg[2]);
			} else if ("d".equals(msg[1])) {
				// remove one stopWord from memory
				AnsjElasticConfigurator.filter.remove(msg[2]);
				// remove one stopWord from file
				FileUtils.removeStopWord(msg[2]);
			}
		}
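
How msg is produced is not shown above. The sketch below assumes the Redis payload is a colon-separated string such as stop:c:the; the delimiter, the class name StopWordDispatcher, and the concurrent set standing in for AnsjElasticConfigurator.filter are all assumptions made for illustration. The file update is left out so that the snippet runs on its own.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the "stop" branch of the pub/sub handler.
class StopWordDispatcher {

    // Plays the role of the in-memory stop word filter.
    private final Set<String> stopWords = ConcurrentHashMap.newKeySet();

    // Applies one message; returns true only if it was a valid stop word command.
    boolean handle(String payload) {
        String[] msg = payload.split(":", 3); // assumed layout: type:operation:word
        if (msg.length != 3 || !"stop".equals(msg[0]) || msg[2].isEmpty()) {
            return false;
        }
        if ("c".equals(msg[1])) {        // create
            stopWords.add(msg[2]);
            return true;
        }
        if ("d".equals(msg[1])) {        // delete
            stopWords.remove(msg[2]);
            return true;
        }
        return false;
    }

    boolean isStopWord(String word) {
        return stopWords.contains(word);
    }
}
```

Rejecting malformed messages up front avoids an ArrayIndexOutOfBoundsException when a payload has fewer than three parts.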

Finally, the last line of stopLibrary.dic must end with a newline; otherwise a word added later ends up on the same line as the previous last word.
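
Instead of relying on the file being prepared by hand, the append step could guard against a missing trailing newline itself. This is a minimal sketch under that assumption; DicAppender and its methods are hypothetical names, and it reads the whole file, which is only reasonable for small dictionaries.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not the plugin's appendFile.
class DicAppender {

    // Returns content with word appended on a line of its own.
    static String appendLine(String content, String word) {
        StringBuilder sb = new StringBuilder(content);
        if (!content.isEmpty() && !content.endsWith("\n")) {
            sb.append('\n'); // repair a missing trailing newline first
        }
        return sb.append(word).append('\n').toString();
    }

    // Reads the dictionary (if present), appends the word, and writes it back.
    static void appendToFile(Path dic, String word) {
        try {
            String content = Files.exists(dic)
                    ? new String(Files.readAllBytes(dic), StandardCharsets.UTF_8)
                    : "";
            Files.write(dic, appendLine(content, word).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```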

With that, dynamically adding stop words through Redis is done.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

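Next, how to add synonym support to ansj.

In Lucene 4.6, implementing Chinese synonyms is very convenient with the SynonymFilterFactory in lucene-analyzers-common-4.6.1.jar: it takes only a few lines of code and a synonym dictionary.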

~~~~~~~~~~~~~~~~~~~

First, modify the startup class, AnsjElasticConfigurator:

public static SynonymFilterFactory factory = null;
	public static String DEFAULT_SYNONYM_FILE_LIB_PATH = "ansj/synonyms.dic";

	public static void loadSynonymFilter(Settings settings) {
		Version ver = Version.LUCENE_46;
		Map<String, String> filterArgs = new HashMap<String, String>();
		filterArgs.put("luceneMatchVersion", ver.toString());
		File path = new File(environment.configFile(), settings.get("synonyms",
				DEFAULT_SYNONYM_FILE_LIB_PATH));
		filterArgs.put("synonyms", path.getAbsolutePath());
		logger.info("synonyms.dic absolute path: " + path.getAbsolutePath());
		filterArgs.put("expand", "true");
		factory = new SynonymFilterFactory(filterArgs);
		try {
			factory.inform(new FilesystemResourceLoader());
		} catch (Exception e) {
			// Loading fails here if synonyms.dic is missing or malformed
			logger.error("load ansj/synonyms.dic fail", e, new Object[0]);
		}
	}

	public static void init(Settings indexSettings, Settings settings) {
		if (isLoaded()) {
			return;
		}
		environment = new Environment(indexSettings);
		initConfigPath(settings);
		loadFilter(settings);
		loadSynonymFilter(settings);
		try {
			preheat();
			logger.info("ansj preheat done! It can be used now!");
		} catch (Exception e) {
			logger.error("ansj preheat fail,please check file path.");
		}
		initRedis(settings);
		setLoaded(true);
	}

 

It compiles successfully.

Put the two compiled class files into elasticsearch-analysis-ansj-0.2.jar, replacing the corresponding files.

Next, modify AnsjIndexAnalysis.java:

@Override
	protected TokenStreamComponents createComponents(String fieldName,
			final Reader reader) {
		// Tokenize with ansj, then pass the tokens through the shared synonym filter
		Tokenizer tokenizer = new AnsjTokenizer(new IndexAnalysis(
				new BufferedReader(reader)), reader, filter, pstemming);
		return new TokenStreamComponents(tokenizer,
				AnsjElasticConfigurator.factory.create(tokenizer));
	}

AnsjAnalysis.java

@Override
	protected TokenStreamComponents createComponents(String fieldName,
			final Reader reader) {
		// Tokenize with ansj, then pass the tokens through the shared synonym filter
		Tokenizer tokenizer = new AnsjTokenizer(new ToAnalysis(
				new BufferedReader(reader)), reader, filter, pstemming);
		// add by smallblack

		return new TokenStreamComponents(tokenizer,
				AnsjElasticConfigurator.factory.create(tokenizer));
	}

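After they compile, put the class files into ansj_lucene4_plug-1.3.jar, replacing the corresponding files.

Then, before starting ES, be sure to add a synonyms.dic file under the ansj directory.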
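
The post does not show synonyms.dic itself. Since no format argument is passed to SynonymFilterFactory, the file is read in the default Solr synonym format, in which a line such as "pc, computer" lists equivalent terms. As a simplified illustration of what expand=true means for such a line (this is not Lucene's parser, and the class name SynonymLineSketch is hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of one comma-separated synonym line; for illustration only.
class SynonymLineSketch {

    // expand=true: every term maps to all terms of the line.
    // expand=false: every term maps to the first term only.
    static Map<String, List<String>> expandLine(String line, boolean expand) {
        List<String> terms = new ArrayList<String>();
        for (String part : line.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        Map<String, List<String>> mapping = new LinkedHashMap<String, List<String>>();
        for (String term : terms) {
            List<String> targets = expand
                    ? new ArrayList<String>(terms)
                    : new ArrayList<String>(terms.subList(0, 1));
            mapping.put(term, targets);
        }
        return mapping;
    }
}
```

With expand set to "true", as in loadSynonymFilter, a search for either term of a line can match documents containing the other.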

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So far, however, this is only static support; we want dynamic support.

First, modify FileUtils.java:

public static void appendSynonymWord(String content) {
		try {
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_SYNONYM_FILE_LIB_PATH);
			// "ansj/synonyms.dic");
			appendFile(content, file);
		} catch (IOException e) {
			logger.error("read ansj/synonyms.dic exception", e, new Object[0]);
			e.printStackTrace();
		}
	}

	public static void removeSynonymWord(String content) {
		try {
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_SYNONYM_FILE_LIB_PATH);
			// "ansj/synonyms.dic");
			removeFile(content, file, false);
		} catch (FileNotFoundException e) {
			logger.error("file not found $ES_HOME/config/ansj/synonyms.dic");
			e.printStackTrace();
		} catch (IOException e) {
			logger.error("read exception", e, new Object[0]);
			e.printStackTrace();
		}
	}

Then modify AddTermRedisPubSub.java:

} else if ("stop".equals(msg[0])) {
			if ("c".equals(msg[1])) {
				AnsjElasticConfigurator.filter.add(msg[2]);
				FileUtils.appendStopWord(msg[2]);
			} else if ("d".equals(msg[1])) {
				AnsjElasticConfigurator.filter.remove(msg[2]);
				FileUtils.removeStopWord(msg[2]);
			}
		} else if ("syn".equals(msg[0])) {
			if ("c".equals(msg[1])) {
				FileUtils.appendSynonymWord(msg[2]);
			} else if ("d".equals(msg[1])) {
				FileUtils.removeSynonymWord(msg[2]);
			}
			AnsjElasticConfigurator.factory
					.inform(new FilesystemResourceLoader());
		}
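
In the "syn" branch the file is updated first and the synonym data is then reloaded from it. One generic way to picture such a reload, separate from what the plugin does, is to build a fresh snapshot and publish it in a single step so that readers see either the old or the new data. The class ReloadableTerms below is hypothetical and uses a plain set of terms in place of Lucene's synonym structures.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical snapshot-and-swap holder; not the plugin's reload mechanism.
class ReloadableTerms {

    private final AtomicReference<Set<String>> current =
            new AtomicReference<Set<String>>(Collections.<String>emptySet());

    // Builds a new immutable snapshot from dictionary lines, then swaps it in.
    void reload(List<String> lines) {
        Set<String> next = new HashSet<String>();
        for (String line : lines) {
            String term = line.trim();
            if (!term.isEmpty()) {
                next.add(term);
            }
        }
        current.set(Collections.unmodifiableSet(next));
    }

    // Readers always work against one complete snapshot.
    boolean contains(String term) {
        return current.get().contains(term);
    }
}
```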

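Compile and add the result to elasticsearch-analysis-ansj-0.2.jar.

Test results:

 

Then add a synonym:

 

Check the effect again:

Now try dynamically removing the synonym:

Check the tokenization result again:

It is back to the original result.

Task solved!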
