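This article describes how ElasticSearch creates and loads Analyzers at startup. The main references are the official Lucene documentation on Analyzer and the relevant classes in the ElasticSearch 6.3.2 source: AnalysisModule, AnalysisPlugin, AnalyzerProvider, the various Tokenizer classes, and their corresponding TokenizerFactory classes. It also draws on a concrete ElasticSearch plugin that uses HanLP for Chinese word segmentation: elasticsearch-analysis-hanlp.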
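The main goal of this article is to understand how AnalysisModule, AnalysisPlugin, AnalyzerProvider, a concrete implementation such as HanLPStandardAnalyzer, and TokenizerFactory relate to one another. One or more design patterns are certainly at work here. Once that is understood, you can follow the same template to develop a custom plugin.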
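Comparing the HanLP Chinese tokenizer with ElasticSearch's built-in standard tokenizer (StandardTokenizer) shows that elasticsearch-analysis-hanlp follows almost exactly the same approach as the standard tokenizer implementation in ElasticSearch.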
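HanLP offers many Chinese segmentation modes, such as standard segmentation, index segmentation, and NLP segmentation. For that reason, HanLPTokenizerFactory implements TokenizerFactory and provides the create() method, which is responsible for creating the various tokenizers.
This is exactly how StandardTokenizerFactory is written in the ElasticSearch source.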
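The factory arrangement described above can be sketched in plain Java with no ElasticSearch or Lucene dependency. Every name below (ToyTokenizer, ToyTokenizerFactory, ToyModeTokenizerFactory, and the mode strings) is hypothetical and only mirrors the shape of the pattern; it is not the plugin's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer: text in, list of tokens out.
interface ToyTokenizer {
    List<String> tokenize(String text);
}

// Hypothetical stand-in for TokenizerFactory: a single create() method.
interface ToyTokenizerFactory {
    ToyTokenizer create();
}

// One factory class that hands out a different tokenizer per mode, loosely
// mirroring how one factory can serve several segmentation modes.
final class ToyModeTokenizerFactory implements ToyTokenizerFactory {
    private final String mode;

    ToyModeTokenizerFactory(String mode) {
        this.mode = mode;
    }

    @Override
    public ToyTokenizer create() {
        switch (mode) {
            case "whitespace":
                // Split on runs of whitespace.
                return text -> splitNonEmpty(text, "\\s+");
            case "char":
                // Drop whitespace, then emit every character as its own token.
                return text -> splitNonEmpty(text.replaceAll("\\s+", ""), "");
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    // Split by the given regex and discard empty pieces.
    private static List<String> splitNonEmpty(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(regex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}

final class ToyFactoryDemo {
    public static void main(String[] args) {
        ToyTokenizer ws = new ToyModeTokenizerFactory("whitespace").create();
        System.out.println(ws.tokenize("hello analysis world"));
        ToyTokenizer ch = new ToyModeTokenizerFactory("char").create();
        System.out.println(ch.tokenize("中文 分詞"));
    }
}
```

The caller only depends on the factory interface, so adding a new segmentation mode means adding a case (or a new factory class) without touching the code that requests a tokenizer.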
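Think of an Analyzer as a machine that produces Tokens: Text goes in, Tokens come out.
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
This machine can produce Tokens in different ways. For English, the spaces in the text generally serve as the delimiter for turning Text into Tokens.
For Chinese, the text contains no spaces, so a Chinese word segmentation algorithm is needed to turn Text into Tokens.
For text such as HTML, the HTML tags serve as the delimiters for turning Text into Tokens.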
The inner class TokenStreamComponents encapsulates how Tokens are produced. Its source comment reads: "This class encapsulates the outer components of a token stream. It provides access to the source Tokenizer and ....". In essence it wraps the Tokenizer:
/**
 * This class encapsulates the outer components of a token stream. It provides
 * access to the source ({@link Tokenizer}) and the outer end (sink), an
 * instance of {@link TokenFilter} which also serves as the
 * {@link TokenStream} returned by
 * {@link Analyzer#tokenStream(String, Reader)}.
 */
public static class TokenStreamComponents {
    /**
     * Original source of the tokens.
     */
    protected final Tokenizer source;
    /**
     * Sink tokenstream, such as the outer tokenfilter decorating
     * the chain. This can be the source if there are no filters.
     */
    protected final TokenStream sink;
To define a custom Analyzer, extend the Analyzer class and override createComponents() so that it supplies a Tokenizer. For example, HanLPStandardAnalyzer overrides the method as follows:
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
    // AccessController.doPrivileged((PrivilegedAction) () -> HanLP.Config.Normalization = true);
    Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment(), configuration);
    return new Analyzer.TokenStreamComponents(tokenizer);
}
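You can also look at StandardAnalyzer.java, the Lucene class behind ElasticSearch's standard analyzer. It implements the standard tokenization used in ElasticSearch's analysis process, and it extends StopwordAnalyzerBase.java so that stop words can be filtered out while Tokens are produced.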
AnalyzerProvider wraps an Analyzer. Its constructor instantiates the Analyzer and supplies it with name and version information:
public class HanLPAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {
    private final Analyzer analyzer;
AbstractIndexAnalyzerProvider holds the name and Version information (Constructs a new analyzer component, with the index name and its settings and the analyzer name.):
public abstract class AbstractIndexAnalyzerProvider<T extends Analyzer> extends AbstractIndexComponent implements AnalyzerProvider<T> {
    private final String name;
    protected final Version version;
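AnalysisHanLPPlugin registers the various analyzers. When defining an index, you specify the Analyzer name for a field. In the mapping below, for example, the text of the name field is analyzed by the analyzer named hanlp_standard before it is written to the ElasticSearch index.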
"name": {
    "type": "text",
    "analyzer": "hanlp_standard",
    "fields": {
        "raw": {
            "type": "keyword"
        }
    }
},
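AnalysisPlugin mainly exposes the three methods below, which supply CharFilters, TokenFilters, and Tokenizers. For the differences among the three, see the later part of this article on the index analysis process.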
/**
 * Override to add additional {@link CharFilter}s. See {@link #requriesAnalysisSettings(AnalysisProvider)}
 * how to on get the configuration from the index.
 */
default Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
    return emptyMap();
}

/**
 * Override to add additional {@link TokenFilter}s. See {@link #requriesAnalysisSettings(AnalysisProvider)}
 * how to on get the configuration from the index.
 */
default Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
    return emptyMap();
}

/**
 * Override to add additional {@link Tokenizer}s. See {@link #requriesAnalysisSettings(AnalysisProvider)}
 * how to on get the configuration from the index.
 */
default Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
    return emptyMap();
}
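This part follows the relevant source code of the ElasticSearch startup process. The various Analyzers are initialized while the PluginsService is being created, in Node.java: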
// load the contents of the modules and plugins directories
this.pluginsService = new PluginsService(tmpSettings, environment.configFile(), environment.modulesFile(),
        environment.pluginsFile(), classpathPlugins);
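It appears that, through the ClassLoader created here, modules and plugins alike are treated as bundles and hooked into the underlying Lucene in an SPI fashion. From PluginsService.java: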
// load modules
if (modulesDirectory != null) {
    Set<Bundle> modules = getModuleBundles(modulesDirectory);
    for (Bundle bundle : modules) {
        modulesList.add(bundle.plugin);
    }
    seenBundles.addAll(modules);
}

// now, find all the ones that are in plugins/
if (pluginsDirectory != null) {
    List<BundleCollection> plugins = findBundles(pluginsDirectory, "plugin");
    for (final BundleCollection plugin : plugins) {
        final Collection<Bundle> bundles = plugin.bundles();
        for (final Bundle bundle : bundles) {
            pluginsList.add(bundle.plugin);
        }
        seenBundles.addAll(bundles);
        pluginsNames.add(plugin.name());
    }
}
Loading the module/plugin jar files:
try (DirectoryStream<Path> jarStream = Files.newDirectoryStream(dir, "*.jar")) {
    for (Path jar : jarStream) {
        // normalize with toRealPath to get symlinks out of our hair
        URL url = jar.toRealPath().toUri().toURL();
        if (urls.add(url) == false) {
            throw new IllegalStateException("duplicate codebase: " + url);
        }
    }
}
//...
// create a child to load the plugin in this bundle
ClassLoader parentLoader = PluginLoaderIndirection.createLoader(getClass().getClassLoader(), extendedLoaders);
ClassLoader loader = URLClassLoader.newInstance(bundle.urls.toArray(new URL[0]), parentLoader);
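The jar-scanning and child-class-loader steps above can be tried in isolation using only the JDK. The sketch below is an assumption-laden simplification: the class name JarDirLoaderSketch, its methods, and the temporary directory with placeholder files are all invented for illustration, and none of ElasticSearch's bundle or security handling is reproduced.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of ElasticSearch.
final class JarDirLoaderSketch {

    // Collect the URL of every *.jar directly inside dir, rejecting duplicates.
    static Set<URL> collectJarUrls(Path dir) {
        Set<URL> urls = new LinkedHashSet<>();
        try (DirectoryStream<Path> jarStream = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jarStream) {
                // toRealPath() resolves symlinks so one jar cannot appear twice.
                URL url = jar.toRealPath().toUri().toURL();
                if (!urls.add(url)) {
                    throw new IllegalStateException("duplicate codebase: " + url);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    // Build a child class loader over those jars; lookups go to parent first.
    static URLClassLoader createLoader(Path dir, ClassLoader parent) {
        return URLClassLoader.newInstance(collectJarUrls(dir).toArray(new URL[0]), parent);
    }

    // Create a throwaway directory with one placeholder jar and one other file,
    // then report how many URLs ended up on the child loader's search path.
    static int runDemo() {
        try {
            Path dir = Files.createTempDirectory("toy-plugin");
            Files.createFile(dir.resolve("a.jar"));      // empty placeholder, never opened
            Files.createFile(dir.resolve("notes.txt"));  // ignored by the *.jar glob
            try (URLClassLoader loader = createLoader(dir, JarDirLoaderSketch.class.getClassLoader())) {
                return loader.getURLs().length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo() + " jar(s) on the plugin class path");
    }
}
```

Giving each bundle its own child loader keeps a plugin's classes separate from those of other plugins while still letting them see the classes of the parent loader.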
Once the PluginsService has loaded all plugins, the Analysis-related plugins are filtered out and used to create the AnalysisModule:
// filter the Analysis-related plugins out of the plugin service
AnalysisModule analysisModule = new AnalysisModule(this.environment, pluginsService.filterPlugins(AnalysisPlugin.class));
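The names of the various tokenizers, filters, and analyzers are then registered. These registered names are the ones you use when you assign an analyzer to a field while creating an index: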
NamedRegistry<AnalysisProvider<CharFilterFactory>> charFilters = setupCharFilters(plugins);
NamedRegistry<AnalysisProvider<TokenFilterFactory>> tokenFilters = setupTokenFilters(plugins, hunspellService);
NamedRegistry<AnalysisProvider<TokenizerFactory>> tokenizers = setupTokenizers(plugins);
NamedRegistry<AnalysisProvider<AnalyzerProvider<?>>> analyzers = setupAnalyzers(plugins);
//....
private NamedRegistry<AnalysisProvider<AnalyzerProvider<?>>> setupAnalyzers(List<AnalysisPlugin> plugins) {
    NamedRegistry<AnalysisProvider<AnalyzerProvider<?>>> analyzers = new NamedRegistry<>("analyzer");
    analyzers.register("default", StandardAnalyzerProvider::new);
    analyzers.register("standard", StandardAnalyzerProvider::new);
    //....
public StandardAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) {
    //....
    standardAnalyzer = new StandardAnalyzer(stopWords);
    standardAnalyzer.setVersion(version);
}
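The idea of registering providers under names and letting plugins contribute their own entries can be illustrated with a small stand-alone sketch. All types here (RegAnalyzer, RegPlugin, RegNamedRegistry, RegModule, RegCharPlugin) and the name "toy_char" are hypothetical; they only imitate the roles played by AnalysisPlugin, NamedRegistry, and AnalysisModule and do not reproduce ElasticSearch's real signatures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical analyzer type for this sketch: text in, tokens out.
interface RegAnalyzer extends Function<String, List<String>> {}

// Hypothetical plugin hook: by default a plugin contributes nothing.
interface RegPlugin {
    default Map<String, Supplier<RegAnalyzer>> getAnalyzers() {
        return Collections.emptyMap();
    }
}

// Simplified name -> provider registry; registering a name twice is an error.
final class RegNamedRegistry {
    private final String kind;
    private final Map<String, Supplier<RegAnalyzer>> providers = new HashMap<>();

    RegNamedRegistry(String kind) {
        this.kind = kind;
    }

    void register(String name, Supplier<RegAnalyzer> provider) {
        if (providers.putIfAbsent(name, provider) != null) {
            throw new IllegalArgumentException(kind + " [" + name + "] already registered");
        }
    }

    // Look up a provider by name and ask it for an instance.
    RegAnalyzer create(String name) {
        Supplier<RegAnalyzer> provider = providers.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("unknown " + kind + " [" + name + "]");
        }
        return provider.get();
    }
}

// Plays the role of the module: built-ins first, then whatever plugins add.
final class RegModule {
    final RegNamedRegistry analyzers = new RegNamedRegistry("analyzer");

    RegModule(List<RegPlugin> plugins) {
        analyzers.register("standard", () -> text -> Arrays.asList(text.trim().split("\\s+")));
        for (RegPlugin plugin : plugins) {
            plugin.getAnalyzers().forEach(analyzers::register);
        }
    }
}

// A plugin that contributes one analyzer under the invented name "toy_char".
final class RegCharPlugin implements RegPlugin {
    @Override
    public Map<String, Supplier<RegAnalyzer>> getAnalyzers() {
        Supplier<RegAnalyzer> provider = () -> text -> Arrays.asList(text.split(""));
        return Collections.singletonMap("toy_char", provider);
    }
}

final class RegDemo {
    public static void main(String[] args) {
        RegModule module = new RegModule(Collections.singletonList(new RegCharPlugin()));
        System.out.println(module.analyzers.create("standard").apply("to sleep perchance to dream"));
        System.out.println(module.analyzers.create("toy_char").apply("中文分詞"));
    }
}
```

In this sketch a mapping would refer to an analyzer purely by its registered name, which is why the name a plugin registers is the name that appears in the index definition.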
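Here is a passage from "An Introduction to Information Retrieval" explaining the concepts of token, type, term, and dictionary. (The type here is not the same thing as the type in an ElasticSearch index, which later ElasticSearch versions will no longer support.)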
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system's dictionary.
For example, if the document to be indexed is to sleep perchance to dream, then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word) then there will be only 3 terms: sleep, perchance, and dream.
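The counts in the quoted example can be reproduced with a few lines of Java. This is only a sketch: the class TokenTypeTermDemo is invented here, tokens are obtained by splitting on whitespace, and "normalization" is reduced to removing stop words.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TokenTypeTermDemo {

    // Tokens: every occurrence counts, so duplicates are kept.
    static List<String> tokens(String document) {
        return Arrays.asList(document.trim().split("\\s+"));
    }

    // Types: the distinct character sequences among the tokens.
    static Set<String> types(List<String> tokens) {
        return new LinkedHashSet<>(tokens);
    }

    // Terms: the types that survive into the dictionary (here: minus stop words).
    static Set<String> terms(Set<String> types, Set<String> stopWords) {
        Set<String> kept = new LinkedHashSet<>(types);
        kept.removeAll(stopWords);
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokens("to sleep perchance to dream");
        Set<String> types = types(tokens);
        Set<String> terms = terms(types, Collections.singleton("to"));
        System.out.println(tokens.size()); // prints 5
        System.out.println(types.size());  // prints 4
        System.out.println(terms);         // prints [sleep, perchance, dream]
    }
}
```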
Personally, I feel that Tokenization and Analysis overlap. In Lucene, Analysis is defined as the process of converting strings into Tokens, and it is built on four main classes:
The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are four main classes in the package from which all analysis processes are derived. These are:
These four (Analyzer, CharFilter, Tokenizer, and TokenFilter) differ as follows, using Chinese text as the example:
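Take the Chinese sentence 「這是一篇關於ElasticSearch Analyzer的文章」 ("This is an article about ElasticSearch Analyzer"). A CharFilter filters out particular characters in it. The Tokenizer segments the sentence into words: 這是, 一篇, 關於, ElasticSearch, Analyzer, 的, 文章; each resulting piece is a Token. A TokenFilter then filters out certain Tokens.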
The Analyzer is a factory for analysis chains. Analyzers don't process text, Analyzers construct CharFilters, Tokenizers, and/or TokenFilters that process text. An Analyzer has two tasks: to produce TokenStreams that accept a reader and produces tokens, and to wrap or otherwise pre-process Reader objects.
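For details, see the Lucene 7.6.0 documentation. In Lucene, an Analyzer does not process text itself; it only constructs the CharFilters, the Tokenizer, and the TokenFilters, and then lets them process the text.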
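The division of labour between the analyzer and the chain it builds can be sketched with plain functions. The classes PipeChain and PipeAnalyzer are hypothetical, the tokenizer simply splits on whitespace (so it would not segment Chinese), and the char filter is a naive tag stripper; the point is only that the analyzer assembles the stages and the stages do the work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical analysis chain: char filter -> tokenizer -> token filter.
final class PipeChain {
    private final UnaryOperator<String> charFilter;
    private final Function<String, List<String>> tokenizer;
    private final UnaryOperator<List<String>> tokenFilter;

    PipeChain(UnaryOperator<String> charFilter,
              Function<String, List<String>> tokenizer,
              UnaryOperator<List<String>> tokenFilter) {
        this.charFilter = charFilter;
        this.tokenizer = tokenizer;
        this.tokenFilter = tokenFilter;
    }

    // The chain, not the analyzer, is what actually touches the text.
    List<String> run(String text) {
        return tokenFilter.apply(tokenizer.apply(charFilter.apply(text)));
    }
}

// Plays the role of the Analyzer: it only constructs the chain.
final class PipeAnalyzer {
    private final Set<String> stopWords;

    PipeAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    PipeChain createChain() {
        // Char filter: works on raw characters, here replacing tags with a space.
        UnaryOperator<String> stripTags = s -> s.replaceAll("<[^>]*>", " ");
        // Tokenizer: turns the filtered text into tokens (whitespace split).
        Function<String, List<String>> whitespace = s -> {
            List<String> out = new ArrayList<>();
            for (String piece : s.trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    out.add(piece);
                }
            }
            return out;
        };
        // Token filter: works on whole tokens, here dropping stop words.
        UnaryOperator<List<String>> dropStops = tokens -> {
            List<String> kept = new ArrayList<>(tokens);
            kept.removeAll(stopWords);
            return kept;
        };
        return new PipeChain(stripTags, whitespace, dropStops);
    }
}

final class PipeDemo {
    public static void main(String[] args) {
        PipeAnalyzer analyzer = new PipeAnalyzer(new HashSet<>(Arrays.asList("is", "an")));
        System.out.println(analyzer.createChain()
                .run("<p>This is an article about ElasticSearch Analyzer</p>"));
    }
}
```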
ElasticSearch 6.3.2 source code
Plugin that uses HanLP for Chinese word segmentation: elasticsearch-analysis-hanlp
Original article: https://www.cnblogs.com/hapjin/p/10151887.html