Lucene.net (4.8.0) Learning Notes, Part 1: Constructing the Analyzer and Its Internal Member ReuseStrategy

Date: 2019-12-01


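Foreword: I am currently working on full-text search with Lucene.net and the PanGu tokenizer, although what I am actually doing is migrating a project that someone else built. The whole project is moving to ASP.NET Core 2.0, while the Lucene version it uses is 3.6.0, and the PanGu tokenizer also targets Lucene 3.6.0. Fortunately, Lucene.net already has a release that supports Core 2.0, the 4.8.0 beta. As for PanGu, someone is working on a port that appears to be finished but has not been tested yet. I will mark the changes introduced by the Lucene upgrade in bold.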

Lucene.net 4.8.0 github

https://github.com/apache/lucenenet

PanGu tokenizer (ready to use)

https://github.com/SilentCC/Lucene.Net.Analysis.PanGu

JIEba tokenizer (ready to use)

https://github.com/SilentCC/JIEba-netcore2.0

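Lucene.net 4.8.0 changed quite a lot compared with the earlier Lucene.net 3.6.0. Here I record the problems I ran into during development, in the hope of helping others who need to upgrade Lucene.net as I did. This is also my first time working with Lucene, so I hope these notes help beginners too.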

I. The Lucene tokenizer: Analyzer

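Here is a brief description of Lucene's Analyzer; I will write more detailed notes on it later. An Analyzer in Lucene is a tokenizer: it takes text (both the documents to be written to the index and the query conditions) and performs tokenization to produce a sequence of tokens. Third-party tokenizers such as PanGu all inherit from Analyzer, subclassing the related classes and overriding the related methods. So how does an Analyzer take part in the search process?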

1. When writing to the index:

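We need an IndexWriter. A note on constructing it: the Lucene 3.6.0 constructors have been dropped, and the new constructor depends on an IndexWriterConfig, which records the IndexWriter's properties and settings (not covered in detail here). The IndexWriterConfig constructor takes an Analyzer: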

IndexWriterConfig(LuceneVersion matchVersion, Analyzer analyzer)

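So an Analyzer is used whenever we write to the index. The index is structured as follows: content is stored as Document objects, and each Document contains many Field (name, value) pairs. You can think of a Document as a row in a database table and a Field as a column of that table. Take an article that we want to store in the index so that people can search for it later.

An article has many attributes: Title: xxx; Author: xxxx; Content: xxxx;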

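// TextField values are run through the Analyzer; Field.Store.YES also keeps the original text
document.Add(new TextField("Title", "Lucene", Field.Store.YES));
document.Add(new TextField("Author", "dacc", Field.Store.YES));
document.Add(new TextField("Content", "xxxxxx", Field.Store.YES));
// indexWriter is an IndexWriter instance built from an IndexWriterConfig
indexWriter.AddDocument(document);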

That is roughly the process. The Analyzer's job is to tokenize each Field's value, splitting long content into individual tokens.
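The write path described above can be sketched end to end as follows. This is a minimal sketch rather than the project's actual code: it assumes the Lucene.Net and Lucene.Net.Analysis.Common 4.8.0 packages are referenced, and the class name `IndexingSketch`, the method name, and the sample field values are made up for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Util;
// Alias avoids a clash with System.IO.Directory when both namespaces are in scope.
using LuceneDirectory = Lucene.Net.Store.Directory;

public static class IndexingSketch
{
    // Hypothetical helper: writes one article and returns the number of documents in the index.
    public static int IndexOneArticle(LuceneDirectory directory, Analyzer analyzer)
    {
        // The Analyzer is handed to the writer through IndexWriterConfig.
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

        using (var writer = new IndexWriter(directory, config))
        {
            var document = new Document();
            // TextField values are tokenized by the Analyzer.
            document.Add(new TextField("Title", "Lucene", Field.Store.YES));
            // StringField values are indexed as a single, untokenized term.
            document.Add(new StringField("Author", "dacc", Field.Store.YES));
            document.Add(new TextField("Content", "full text search with an analyzer", Field.Store.YES));

            writer.AddDocument(document);
            writer.Commit();
        }

        // Re-open the index to confirm what was written.
        using (DirectoryReader reader = DirectoryReader.Open(directory))
        {
            return reader.NumDocs;
        }
    }
}
```

For a quick experiment, it could be called with a `RAMDirectory` and a `StandardAnalyzer(LuceneVersion.LUCENE_48)`.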

2. When querying and searching:

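We also need an Analyzer here, although it is not strictly required, unlike with IndexWriter. Its job is to tokenize the query text. For example, if the query is 「全文檢索和分詞」 ("full-text search and tokenization"), the Analyzer first splits it into 「全文檢索」 ("full-text search") and 「分詞」 ("tokenization"). The index is then searched for Fields containing those tokens, and the Documents that hold those Fields are returned. The details of searching are not covered here; I will write detailed notes on them later.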
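The query side can be sketched in the same spirit. The example below is an illustration under assumptions, not the project's code: it assumes the Lucene.Net.QueryParser 4.8.0 package is referenced in addition to the core package, that the index has stored `Title` and `Content` fields, and the names `SearchSketch` and `FindTitles` are hypothetical. The classic `QueryParser` is where the Analyzer tokenizes the query text; a query built by hand, such as a `TermQuery`, skips the Analyzer entirely, which is why it is optional when searching.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;
using LuceneDirectory = Lucene.Net.Store.Directory;

public static class SearchSketch
{
    // Hypothetical helper: returns the stored Title of up to ten matching documents.
    public static List<string> FindTitles(LuceneDirectory directory, Analyzer analyzer, string queryText)
    {
        var titles = new List<string>();

        // The parser runs queryText through the Analyzer, targeting the "Content" field by default.
        var parser = new QueryParser(LuceneVersion.LUCENE_48, "Content", analyzer);
        Query query = parser.Parse(queryText);

        using (DirectoryReader reader = DirectoryReader.Open(directory))
        {
            var searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.Search(query, 10);

            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                // Load the stored Document behind each hit and read its Title field.
                titles.Add(searcher.Doc(hit.Doc).Get("Title"));
            }
        }

        return titles;
    }
}
```

Using the same Analyzer for indexing and querying keeps the tokens produced on both sides comparable.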

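II. The problem:

With a rough understanding of Analyzer in place, here is the problem I ran into:

1. After calling the Analyzer's GetTokenStream, this exception is thrown: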

Object reference not set to an instance of an object

This exception means that an object whose value is null was dereferenced. So I went digging through the source and found:

  public TokenStream GetTokenStream(string fieldName, TextReader reader)
        {
            TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);
            TextReader r = InitReader(fieldName, reader);
            if (components == null)
            {
                components = CreateComponents(fieldName, r);
                reuseStrategy.SetReusableComponents(this, fieldName, components);
            }
            else
            {
                components.SetReader(r);
            }
            return components.TokenStream;
        }

The error was thrown on this statement:

    TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);

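reuseStrategy was null, so this statement failed. This is a good place to look inside Analyzer. GetTokenStream returns the Analyzer's TokenStream, which is a sequence of Tokens. I will not go into what TokenStream does in detail, since that would take a lot of space. The key to obtaining the TokenStream is reuseStrategy. In the new version of Lucene, the TokenStream inside an Analyzer can be reused: an Analyzer instance caches its components per thread, so repeated calls on the same thread share the same TokenStream.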

 internal DisposableThreadLocal<object> storedValue = new DisposableThreadLocal<object>();

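The Analyzer member storedValue is a thread-local slot that holds the cached TokenStream components. reuseStrategy, which did not exist in Lucene 3.6.0, decides how that slot is used, for example one set of components shared by all fields. The ReuseStrategy class has the member functions GetReusableComponents and SetReusableComponents, which get and set the TokenStreamComponents (the TokenStream together with its Tokenizer).

Here is the source of the ReuseStrategy class. It is an abstract class and an inner class of Analyzer: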

 public abstract class ReuseStrategy
    {
        /// <summary>
        /// Gets the reusable <see cref="TokenStreamComponents"/> for the field with the given name.
        /// </summary>
        /// <param name="analyzer"> <see cref="Analyzer"/> from which to get the reused components. Use
        ///        <see cref="GetStoredValue(Analyzer)"/> and <see cref="SetStoredValue(Analyzer, object)"/>
        ///        to access the data on the <see cref="Analyzer"/>. </param>
        /// <param name="fieldName"> Name of the field whose reusable <see cref="TokenStreamComponents"/>
        ///        are to be retrieved </param>
        /// <returns> Reusable <see cref="TokenStreamComponents"/> for the field, or <c>null</c>
        ///         if there was no previous components for the field </returns>
        public abstract TokenStreamComponents GetReusableComponents(Analyzer analyzer, string fieldName);

        /// <summary>
        /// Stores the given <see cref="TokenStreamComponents"/> as the reusable components for the
        /// field with the give name.
        /// </summary>
        /// <param name="analyzer"> Analyzer </param>
        /// <param name="fieldName"> Name of the field whose <see cref="TokenStreamComponents"/> are being set </param>
        /// <param name="components"> <see cref="TokenStreamComponents"/> which are to be reused for the field </param>
        public abstract void SetReusableComponents(Analyzer analyzer, string fieldName, TokenStreamComponents components);

        /// <summary>
        /// Returns the currently stored value.
        /// </summary>
        /// <returns> Currently stored value or <c>null</c> if no value is stored </returns>
        /// <exception cref="ObjectDisposedException"> if the <see cref="Analyzer"/> is closed. </exception>
        protected internal object GetStoredValue(Analyzer analyzer)
        {
            if (analyzer.storedValue == null)
            {
                throw new ObjectDisposedException(this.GetType().GetTypeInfo().FullName, "this Analyzer is closed");
            }
            return analyzer.storedValue.Get();
        }

        /// <summary>
        /// Sets the stored value.
        /// </summary>
        /// <param name="analyzer"> Analyzer </param>
        /// <param name="storedValue"> Value to store </param>
        /// <exception cref="ObjectDisposedException"> if the <see cref="Analyzer"/> is closed. </exception>
        protected internal void SetStoredValue(Analyzer analyzer, object storedValue)
        {
            if (analyzer.storedValue == null)
            {
                throw new ObjectDisposedException("this Analyzer is closed");
            }
            analyzer.storedValue.Set(storedValue);
        }
    }

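GlobalReuseStrategy is another inner class of Analyzer, and it inherits from the abstract ReuseStrategy class. Together these two classes implement setting and getting the TokenStreamComponents of an Analyzer, which makes the GetTokenStream flow clear: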

    public sealed class GlobalReuseStrategy : ReuseStrategy
        {
            /// <summary>
            /// Sole constructor. (For invocation by subclass constructors, typically implicit.) </summary>
            [Obsolete("Don't create instances of this class, use Analyzer.GLOBAL_REUSE_STRATEGY")]
            public GlobalReuseStrategy()
            { }


            public override TokenStreamComponents GetReusableComponents(Analyzer analyzer, string fieldName)
            {
                return (TokenStreamComponents)GetStoredValue(analyzer);
            }


            public override void SetReusableComponents(Analyzer analyzer, string fieldName, TokenStreamComponents components)
            {
                SetStoredValue(analyzer, components);
            }
        }

Analyzer's GetTokenStream both retrieves and sets the TokenStream components:

        public TokenStream GetTokenStream(string fieldName, TextReader reader)
        {
            // First fetch the components stored by a previous call
            TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);
            TextReader r = InitReader(fieldName, reader);
            // If there are none, create them ourselves
            if (components == null)
            {
                components = CreateComponents(fieldName, r);
                // and store the new reusable components so the next call can use them
                reuseStrategy.SetReusableComponents(this, fieldName, components);
            }
            else
            {
                // If the components were created earlier, just point them at the new reader
                components.SetReader(r);
            }
            // Return the TokenStream
            return components.TokenStream;
        }
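From the caller's side, consuming the TokenStream returned by GetTokenStream might look like the sketch below. It is an illustration only: it assumes the Lucene.Net 4.8.0 package, and the names `TokenDump` and `Tokenize` are hypothetical. Calling it twice with the same Analyzer on the same thread goes through the reuse path described above, with the second call taking the `SetReader` branch.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public static class TokenDump
{
    // Hypothetical helper: collects the term text of every token the Analyzer produces.
    public static List<string> Tokenize(Analyzer analyzer, string fieldName, string text)
    {
        var tokens = new List<string>();

        // Disposing the stream releases it so the cached components can be reused.
        using (TokenStream stream = analyzer.GetTokenStream(fieldName, new StringReader(text)))
        {
            // The attribute object is updated in place on every IncrementToken call.
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();

            stream.Reset();                 // required before the first IncrementToken
            while (stream.IncrementToken())
            {
                tokens.Add(term.ToString());
            }
            stream.End();                   // signals that the end of the stream was reached
        }

        return tokens;
    }
}
```

If the Analyzer was constructed with a null ReuseStrategy, the NullReferenceException discussed here surfaces on the `GetTokenStream` call.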

So when we use an Analyzer, note that it has a constructor

  public Analyzer(ReuseStrategy reuseStrategy)
        {
            this.reuseStrategy = reuseStrategy;
        }

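that sets the Analyzer's ReuseStrategy. I then found that the constructor used in the PanGu tokenizer did not pass in a ReuseStrategy, so we need to supply a ReuseStrategy instance ourselves.

The PanGuAnalyzer constructors: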

 public PanGuAnalyzer(bool originalResult)
          : this(originalResult, null, null)
        {
        }

        public PanGuAnalyzer(MatchOptions options, MatchParameter parameters)
            : this(false, options, parameters)
        {
        }

      
        public PanGuAnalyzer(bool originalResult, MatchOptions options, MatchParameter parameters)
            : base()
        {
            this.Initialize(originalResult, options, parameters);
        }

       
       
        public PanGuAnalyzer(bool originalResult, MatchOptions options, MatchParameter parameters, ReuseStrategy reuseStrategy)
            : base(reuseStrategy)
        {
            this.Initialize(originalResult, options, parameters);
        }

        protected virtual void Initialize(bool originalResult, MatchOptions options, MatchParameter parameters)
        {
            _originalResult = originalResult;
            _options = options;
            _parameters = parameters;
        }

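I was calling the second constructor, and the ReuseStrategy that reached Analyzer was null, so we need to provide an instance. In fact, Analyzer already supplies one for us:

public static readonly ReuseStrategy GLOBAL_REUSE_STRATEGY = new GlobalReuseStrategy();

So a small change to the PanGuAnalyzer constructor is enough: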

        public PanGuAnalyzer(MatchOptions options, MatchParameter parameters)
            : this(false, options, parameters, Lucene.Net.Analysis.Analyzer.GLOBAL_REUSE_STRATEGY)
        {
        }
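The same idea applies to any custom Analyzer: make sure every constructor path hands a non-null ReuseStrategy to the base class. The sketch below shows one way to do that. `SketchAnalyzer` is a hypothetical class, not part of PanGu or Lucene.net, and using `WhitespaceTokenizer` (from the Lucene.Net.Analysis.Common 4.8.0 package) as the tokenizer is an assumption made only to keep the example short.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical analyzer used only to illustrate the constructor chaining.
public sealed class SketchAnalyzer : Analyzer
{
    // Convenience constructor: passes the strategy explicitly instead of relying on a default.
    public SketchAnalyzer()
        : this(Analyzer.GLOBAL_REUSE_STRATEGY)
    {
    }

    // Falls back to the global strategy when a caller passes null,
    // so GetTokenStream never dereferences a null reuseStrategy.
    public SketchAnalyzer(ReuseStrategy reuseStrategy)
        : base(reuseStrategy ?? Analyzer.GLOBAL_REUSE_STRATEGY)
    {
    }

    // Called by GetTokenStream only when no reusable components are stored yet.
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        var tokenizer = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(tokenizer);
    }
}
```

The null fallback is a defensive design choice: it trades strictness for robustness, whereas throwing an ArgumentNullException in the constructor would report the mistake earlier.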
