1. What is Solr?
Why Solr:
Solr is a search engine system (an enterprise-grade search product) that packages the complete set of index operations.
Solr can be deployed on its own server as a web service; it handles the search workload while our business systems only send requests and receive responses, which lowers the load on the business systems.
Because Solr is deployed on dedicated servers, its index store is not constrained by the storage capacity of the business systems' servers.
Solr supports distributed clusters, so the capacity and throughput of the index service scale linearly.
How Solr works:
Solr is a wrapper built on top of the Lucene toolkit that exposes indexing functionality as a web service.
When a business system needs index functionality (building or querying an index), it simply issues an HTTP request and parses the returned data.
Solr is a top-level Apache open-source project developed in Java; it is a full-text search server built on Lucene. Solr provides a richer query language than Lucene, is configurable and extensible, and optimizes indexing and search performance.
Solr can run standalone in a Servlet container such as Jetty or Tomcat. Indexing with Solr is simple: send the Solr server an XML document describing the Fields and their content with the POST method, and Solr adds, deletes, or updates the index from that document. Searching only requires an HTTP GET request; the client then parses the results Solr returns in XML, JSON, or other formats and lays out the page. Solr does not provide UI-building functionality, but it does ship with an admin interface through which you can inspect Solr's configuration and runtime state.
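To make that request/response cycle concrete, here is a minimal sketch over raw HTTP, assuming a local Solr instance with a core named mycore (the core name, field names, and port are illustrative, not taken from any particular setup):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SolrHttpSketch {
    public static void main(String[] args) throws Exception {
        // Index: POST an XML document describing Fields to the update handler.
        String xml = "<add><doc>"
                + "<field name=\"id\">1</field>"
                + "<field name=\"title\">hello solr</field>"
                + "</doc></add>";
        URL update = new URL("http://localhost:8983/solr/mycore/update?commit=true");
        HttpURLConnection post = (HttpURLConnection) update.openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = post.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("update status: " + post.getResponseCode());

        // Search: a plain HTTP GET; the response body (JSON here) is then
        // parsed by the business system.
        URL select = new URL("http://localhost:8983/solr/mycore/select?q=title:solr&wt=json");
        HttpURLConnection get = (HttpURLConnection) select.openConnection();
        System.out.println("query status: " + get.getResponseCode());
    }
}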
2. What is Lucene?
As an open-source project, Lucene drew an enormous response from the open-source community after its release. Programmers not only use it to build concrete full-text search applications, they also integrate it into all kinds of system software and web applications, and even some commercial software has adopted Lucene as the core of its internal full-text search subsystem. The Apache Software Foundation's own website uses Lucene as its full-text search engine; version 2.1 of Eclipse, IBM's open-source IDE, adopted Lucene as the full-text indexing engine of its help subsystem; and IBM's commercial WebSphere product uses Lucene as well. Its open-source nature, excellent index structure, and well-designed architecture have won Lucene ever wider adoption.
As a full-text search engine, Lucene has the following notable strengths (a small usage sketch follows the list):
(1) The index file format is independent of the application platform. Lucene defines a byte-based (8-bit) index file format, so compatible systems and applications on different platforms can share the index files they build.
(2) On top of the inverted index used by traditional full-text engines, Lucene implements block indexing: it can quickly build a small index for newly added files, then merge it with the existing index to optimize.
(3) An excellent object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.
(4) A text-analysis interface that is independent of language and file format: the indexer builds the index files by consuming a Token stream, so to support a new language or file format a user only needs to implement the text-analysis interface.
(5) A powerful query engine is implemented by default, so users gain strong query capability without writing any code; out of the box, Lucene's query implementation supports Boolean operators, fuzzy search, grouped queries, and more.
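As promised above, the sketch below indexes a single document and runs a fuzzy query through the classic Lucene API (roughly the 6.x/7.x line; the field name and query string are invented for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class LuceneMiniSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: the indexer consumes the Token stream the analyzer produces.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "hello lucene", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Querying: the default parser already understands boolean and fuzzy
        // syntax, e.g. "hello AND lucene" or the fuzzy term below.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("title", analyzer).parse("lucen~");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}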
3. The relationship between Lucene and Solr
Solr is the front door and Lucene is the underlying foundation; their relationship is like that between Hadoop and HDFS.
4. What is Jetty?
Jetty is an open-source servlet container that provides a runtime environment for Java-based web components such as JSPs and servlets. Jetty is written in Java and its API ships as a set of JAR packages. Developers can instantiate a Jetty container as an object and quickly give standalone Java applications network and web connectivity.
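To illustrate "instantiating the container as an object", here is a minimal embedded-Jetty sketch, independent of Solr (Jetty 9-style API; the handler, message, and port are invented for the example):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class HelloJetty extends AbstractHandler {
    @Override
    public void handle(String target, Request baseRequest,
                       HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        // Every request accepted by a Connector eventually lands in handle().
        response.setContentType("text/plain; charset=utf-8");
        response.getWriter().println("hello from embedded jetty");
        baseRequest.setHandled(true);
    }

    public static void main(String[] args) throws Exception {
        Server server = new Server(8080); // the Server is itself a Handler container
        server.setHandler(new HelloJetty());
        server.start();
        server.join();
    }
}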
5. Process overview
6. Jetty receives and handles the request
For setting up local debugging, see the earlier post <How to debug lucene-solr locally>.
StartSolrJetty.java
public static void main( String[] args ) {
    //System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();
    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
    // Set some timeout options to make debugging easier.
    connector.setIdleTimeout(1000 * 60 * 60);
    connector.setSoLingerTime(-1);
    connector.setPort(8983);
    server.setConnectors(new Connector[] { connector });

    WebAppContext bb = new WebAppContext();
    bb.setServer(server);
    bb.setContextPath("/solr");
    bb.setWar("solr/webapp/web");

    // // START JMX SERVER
    // if( true ) {
    //   MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
    //   MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
    //   server.getContainer().addEventListener(mBeanContainer);
    //   mBeanContainer.start();
    // }

    server.setHandler(bb);

    try {
      System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
      server.start();
      while (System.in.available() == 0) {
        Thread.sleep(5000);
      }
      server.stop();
      server.join();
    } catch (Exception e) {
      e.printStackTrace();
      System.exit(100);
    }
}
Here, Server is the HTTP server: it aggregates Connectors (HTTP request receivers) and request Handlers. The Server is itself a Handler and a ThreadPool, and Connectors use the thread pool to run jobs that eventually call the handle method.
/** Jetty HTTP Servlet Server.
 * This class is the main class for the Jetty HTTP Servlet server.
 * It aggregates Connectors (HTTP request receivers) and request Handlers.
 * The server is itself a handler and a ThreadPool. Connectors use the ThreadPool methods
 * to run jobs that will eventually call the handle method.
 */
Its workflow is shown in the figure below.
Since Jetty itself is not the focus of this article, it is not covered further here.
7. How Solr calls Lucene
The previous post <Source-code analysis of how solr calls lucene to build the inverted index> already covers this step; it can be read against the overall flow diagram above, so it is omitted here.
8. The Lucene invocation process
As the diagram above shows, the process splits into two phases: creating the Weight (8.1) and running the query (8.2).
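In IndexSearcher terms, the two phases look roughly like the following sketch (hedged against the 7.x-era API used in the snippets below; the index path, field, and term are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.FSDirectory;

public class TwoPhaseSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("title", "solr"));

            // Phase 1: build the Weight tree; this is where term statistics are
            // gathered and the Similarity's computeWeight runs (sections 8.1.x).
            Weight weight = searcher.createWeight(searcher.rewrite(query), true, 1f);

            // Phase 2: walk each segment, pull a BulkScorer from the Weight, and
            // push matching docs into a Collector (section 8.2). searcher.search
            // performs both phases internally; the explicit createWeight above
            // only makes phase 1 visible.
            TopDocs top = searcher.search(query, 10);
            System.out.println("hits: " + top.totalHits);
        }
    }
}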
8.1 Creating the Weight
8.1.1 Creating BooleanWeight
BooleanWeight.java
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    // Recursively create one child Weight per clause of the BooleanQuery.
    for (BooleanClause c : query) {
      Query q = c.getQuery();
      Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);
      weights.add(w);
    }
}
8.1.2 Synonym weight analysis
SynonymQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    if (needsScores) {
      return new SynonymWeight(this, searcher, boost);
    } else {
      // if scores are not needed, let BooleanWeight deal with optimizing that case.
      BooleanQuery.Builder bq = new BooleanQuery.Builder();
      for (Term term : terms) {
        bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
      }
      return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);
    }
}
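For context, a SynonymQuery bundles several terms on the same field and scores them as if they were one term; a hedged construction sketch (field and terms invented; this uses the 7.x-era varargs constructor, newer versions use a Builder):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.SynonymQuery;

public class SynonymSketch {
    public static void main(String[] args) {
        // All terms must be on the same field; they are scored as a single pseudo-term.
        SynonymQuery q = new SynonymQuery(
            new Term("body", "tv"),
            new Term("body", "television"));
        System.out.println(q);
    }
}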
8.1.3 TermQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
        || perReaderTermState.wasBuiltFor(context) == false) {
      if (needsScores) {
        // make TermQuery single-pass if we don't have a PRTS or if the context
        // differs!
        termState = TermContext.build(context, term);
      } else {
        // do not compute the term state, this will help save seeks in the terms
        // dict on segments that have a cache entry for this query
        termState = null;
      }
    } else {
      // PRTS was pre-build for this IS
      termState = this.perReaderTermState;
    }
    return new TermWeight(searcher, needsScores, boost, termState);
}
This calls the TermWeight constructor, which computes the CollectionStatistics (field-level statistics such as maxDoc) and the TermStatistics (the term's document frequency and total frequency) that scoring needs:
public TermWeight(IndexSearcher searcher, boolean needsScores,
    float boost, TermContext termStates) throws IOException {
  super(TermQuery.this);
  if (needsScores && termStates == null) {
    throw new IllegalStateException("termStates are required when scores are needed");
  }
  this.needsScores = needsScores;
  this.termStates = termStates;
  this.similarity = searcher.getSimilarity(needsScores);

  final CollectionStatistics collectionStats;
  final TermStatistics termStats;
  if (needsScores) {
    termStates.setQuery(this.getQuery().getKeyword());
    collectionStats = searcher.collectionStatistics(term.field());
    termStats = searcher.termStatistics(term, termStates);
  } else {
    // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1
    final int maxDoc = searcher.getIndexReader().maxDoc();
    collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
    termStats = new TermStatistics(term.bytes(), maxDoc, -1, term.bytes());
  }
  this.stats = similarity.computeWeight(boost, collectionStats, termStats);
}
TermWeight in turn calls the Similarity's computeWeight:
BM25Similarity.java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1
        ? idfExplain(collectionStats, termStats[0])
        : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    // Precompute the length-dependent factor of the BM25 denominator for all
    // 256 possible encoded document lengths (old and new norm encodings).
    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
      oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
      cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}
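For reference, what the cache arrays precompute is the length-dependent part of Lucene's BM25 formula (sketched from the code above for the 7.x-era implementation; $k_1$ and $b$ are the usual BM25 parameters, $N$ the document count, $\mathit{df}$ the term's document frequency):

$$\mathrm{score}(t, D) = \mathrm{idf}(t) \cdot \frac{tf \cdot (k_1 + 1)}{tf + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{idf}(t) = \ln\left(1 + \frac{N - \mathit{df} + 0.5}{\mathit{df} + 0.5}\right)$$

cache[i] stores the factor $k_1 (1 - b + b \cdot |D|/\mathrm{avgdl})$ for each of the 256 possible encoded document lengths, so per-document scoring becomes a table lookup instead of a recomputation.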
8.2 The query process
The complete flow: IndexSearcher's search method iterates over the index's leaves (segments); for each leaf it (1) obtains a LeafCollector, (2) asks the Weight for a BulkScorer, and (3) has the scorer push matching documents into the collector. The numbered comments in the code mark these three steps.
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {

  // TODO: should we make this
  // threaded...?  the Collector could be sync'd?
  // always use single thread:
  for (LeafReaderContext ctx : leaves) { // search each subreader
    final LeafCollector leafCollector;
    try {
      leafCollector = collector.getLeafCollector(ctx); // 1
    } catch (CollectionTerminatedException e) {
      // there is no doc of interest in this reader context
      // continue with the following leaf
      continue;
    }
    BulkScorer scorer = weight.bulkScorer(ctx); // 2
    if (scorer != null) {
      try {
        scorer.score(leafCollector, ctx.reader().getLiveDocs()); // 3
      } catch (CollectionTerminatedException e) {
        // collection was terminated prematurely
        // continue with the following leaf
      }
    }
  }
}
8.2.1 Obtaining the Collector
TopScoreDocCollector.java#SimpleTopScoreDocCollector
@Override
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
  final int docBase = context.docBase;
  return new ScorerLeafCollector() {

    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();
      /* Document document = context.reader().document(doc); */

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }

  };
}
8.2.2 Invoking the scorer
/**
 * Optional method, to return a {@link BulkScorer} to
 * score the query and send hits to a {@link Collector}.
 * Only queries that have a different top-level approach
 * need to override this; the default implementation
 * pulls a normal {@link Scorer} and iterates and
 * collects the resulting hits which are not marked as deleted.
 *
 * @param context
 *          the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.
 *
 * @return a {@link BulkScorer} which scores documents and
 *         passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

  Scorer scorer = scorer(context);
  if (scorer == null) {
    // No docs match
    return null;
  }

  // This impl always scores docs in order, so we can
  // ignore scoreDocsInOrder:
  return new DefaultBulkScorer(scorer);
}

/** Just wraps a Scorer and performs top scoring using it.
 *  @lucene.internal */
protected static class DefaultBulkScorer extends BulkScorer {
  private final Scorer scorer;
  private final DocIdSetIterator iterator;
  private final TwoPhaseIterator twoPhase;

  /** Sole constructor. */
  public DefaultBulkScorer(Scorer scorer) {
    if (scorer == null) {
      throw new NullPointerException();
    }
    this.scorer = scorer;
    this.iterator = scorer.iterator();
    this.twoPhase = scorer.twoPhaseIterator();
  }

  @Override
  public long cost() {
    return iterator.cost();
  }

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    collector.setScorer(scorer);
    if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
      scoreAll(collector, iterator, twoPhase, acceptDocs);
      return DocIdSetIterator.NO_MORE_DOCS;
    } else {
      int doc = scorer.docID();
      if (doc < min) {
        if (twoPhase == null) {
          doc = iterator.advance(min);
        } else {
          doc = twoPhase.approximation().advance(min);
        }
      }
      return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
    }
  }
}
scoreAll then iterates over the matching documents and invokes SimpleTopScoreDocCollector's collect method, which contains the scoring logic (see the SimpleTopScoreDocCollector code above).
/** Specialized method to bulk-score all hits; we
 *  separate this from {@link #scoreRange} to help out
 *  hotspot.
 *  See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> */
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
  if (twoPhase == null) {
    for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
    }
  } else {
    // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm
    final DocIdSetIterator approximation = twoPhase.approximation();
    for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
      if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
        collector.collect(doc);
      }
    }
  }
}
Summary:
Tracing and writing up the whole flow was exhausting.