1. What is Solr?
Why Solr:
Solr is a search engine system (an enterprise-grade search product) that packages the complete set of index operations.
Solr can be deployed on its own server as a web service; it handles the search workload while our business systems only send requests and receive responses, which lowers the load on the business systems.
Because Solr is deployed on dedicated servers, its index store is not constrained by the storage capacity of the business systems' servers.
Solr supports distributed clusters, so the capacity and throughput of the index service scale linearly.
How Solr works:
Solr is a wrapper built on top of the Lucene toolkit that exposes indexing functionality as a web service.
When a business system needs index functionality (building or querying an index), it simply issues an HTTP request and parses the returned data.
Solr is a top-level Apache open-source project developed in Java; it is a full-text search server built on Lucene. Solr provides a richer query language than Lucene, is configurable and extensible, and optimizes indexing and search performance.
Solr can run standalone in a Servlet container such as Jetty or Tomcat. Indexing with Solr is simple: send the Solr server an XML document describing the Fields and their content with the POST method, and Solr adds, deletes, or updates the index from that document. Searching only requires an HTTP GET request; the client then parses the results Solr returns in XML, JSON, or other formats and lays out the page. Solr does not provide UI-building functionality, but it does ship with an admin interface through which you can inspect Solr's configuration and runtime state.
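To make that request/response cycle concrete, here is a minimal sketch over raw HTTP, assuming a local Solr instance with a core named mycore (the core name, field names, and port are illustrative, not taken from any particular setup):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SolrHttpSketch {
    public static void main(String[] args) throws Exception {
        // Index: POST an XML document describing Fields to the update handler.
        String xml = "<add><doc>"
                + "<field name=\"id\">1</field>"
                + "<field name=\"title\">hello solr</field>"
                + "</doc></add>";
        URL update = new URL("http://localhost:8983/solr/mycore/update?commit=true");
        HttpURLConnection post = (HttpURLConnection) update.openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = post.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("update status: " + post.getResponseCode());

        // Search: a plain HTTP GET; the response body (JSON here) is then
        // parsed by the business system.
        URL select = new URL("http://localhost:8983/solr/mycore/select?q=title:solr&wt=json");
        HttpURLConnection get = (HttpURLConnection) select.openConnection();
        System.out.println("query status: " + get.getResponseCode());
    }
}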
2. What is Lucene?
As an open-source project, Lucene drew an enormous response from the open-source community after its release. Programmers not only use it to build concrete full-text search applications, they also integrate it into all kinds of system software and web applications, and even some commercial software has adopted Lucene as the core of its internal full-text search subsystem. The Apache Software Foundation's own website uses Lucene as its full-text search engine; version 2.1 of Eclipse, IBM's open-source IDE, adopted Lucene as the full-text indexing engine of its help subsystem; and IBM's commercial WebSphere product uses Lucene as well. Its open-source nature, excellent index structure, and well-designed architecture have won Lucene ever wider adoption.
As a full-text search engine, Lucene has the following notable strengths (a small usage sketch follows the list):
(1) The index file format is independent of the application platform. Lucene defines a byte-based (8-bit) index file format, so compatible systems and applications on different platforms can share the index files they build.
(2) On top of the inverted index used by traditional full-text engines, Lucene implements block indexing: it can quickly build a small index for newly added files, then merge it with the existing index to optimize.
(3) An excellent object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new functionality.
(4) A text-analysis interface that is independent of language and file format: the indexer builds the index files by consuming a Token stream, so to support a new language or file format a user only needs to implement the text-analysis interface.
(5) A powerful query engine is implemented by default, so users gain strong query capability without writing any code; out of the box, Lucene's query implementation supports Boolean operators, fuzzy search, grouped queries, and more.
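As promised above, the sketch below indexes a single document and runs a fuzzy query through the classic Lucene API (roughly the 6.x/7.x line; the field name and query string are invented for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class LuceneMiniSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Indexing: the indexer consumes the Token stream the analyzer produces.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "hello lucene", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Querying: the default parser already understands boolean and fuzzy
        // syntax, e.g. "hello AND lucene" or the fuzzy term below.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("title", analyzer).parse("lucen~");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}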
3. The relationship between Lucene and Solr
Solr is the front door and Lucene is the underlying foundation; their relationship is like that between Hadoop and HDFS.
4. What is Jetty?
Jetty is an open-source servlet container that provides a runtime environment for Java-based web components such as JSPs and servlets. Jetty is written in Java and its API ships as a set of JAR packages. Developers can instantiate a Jetty container as an object and quickly give standalone Java applications network and web connectivity.
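To illustrate "instantiating the container as an object", here is a minimal embedded-Jetty sketch, independent of Solr (Jetty 9-style API; the handler, message, and port are invented for the example):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class HelloJetty extends AbstractHandler {
    @Override
    public void handle(String target, Request baseRequest,
                       HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        // Every request accepted by a Connector eventually lands in handle().
        response.setContentType("text/plain; charset=utf-8");
        response.getWriter().println("hello from embedded jetty");
        baseRequest.setHandled(true);
    }

    public static void main(String[] args) throws Exception {
        Server server = new Server(8080); // the Server is itself a Handler container
        server.setHandler(new HelloJetty());
        server.start();
        server.join();
    }
}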
5. Process overview
6. Jetty receives and handles the request
For setting up local debugging, see the earlier post <How to debug lucene-solr locally>.
StartSolrJetty.java
public static void main( String[] args ) {
    //System.setProperty("solr.solr.home", "../../../example/solr");

    Server server = new Server();
    ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory());
    // Set some timeout options to make debugging easier.
    connector.setIdleTimeout(1000 * 60 * 60);
    connector.setSoLingerTime(-1);
    connector.setPort(8983);
    server.setConnectors(new Connector[] { connector });

    WebAppContext bb = new WebAppContext();
    bb.setServer(server);
    bb.setContextPath("/solr");
    bb.setWar("solr/webapp/web");

    // // START JMX SERVER
    // if( true ) {
    //   MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
    //   MBeanContainer mBeanContainer = new MBeanContainer(mBeanServer);
    //   server.getContainer().addEventListener(mBeanContainer);
    //   mBeanContainer.start();
    // }

    server.setHandler(bb);

    try {
      System.out.println(">>> STARTING EMBEDDED JETTY SERVER, PRESS ANY KEY TO STOP");
      server.start();
      while (System.in.available() == 0) {
        Thread.sleep(5000);
      }
      server.stop();
      server.join();
    } catch (Exception e) {
      e.printStackTrace();
      System.exit(100);
    }
}
Here, Server is the HTTP server: it aggregates Connectors (HTTP request receivers) and request Handlers. The Server is itself a Handler and a ThreadPool, and Connectors use the thread pool to run jobs that eventually call the handle method.
/** Jetty HTTP Servlet Server.
 * This class is the main class for the Jetty HTTP Servlet server.
 * It aggregates Connectors (HTTP request receivers) and request Handlers.
 * The server is itself a handler and a ThreadPool. Connectors use the ThreadPool methods
 * to run jobs that will eventually call the handle method.
 */
Its workflow is shown in the figure below.
Since Jetty itself is not the focus of this article, it is not covered further here.
7. How Solr calls Lucene
The previous post <Source-code analysis of how solr calls lucene to build the inverted index> already covers this step; it can be read against the overall flow diagram above, so it is omitted here.
8. The Lucene invocation process
As the diagram above shows, the process splits into two phases: creating the Weight (8.1) and running the query (8.2).
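In IndexSearcher terms, the two phases look roughly like the following sketch (hedged against the 7.x-era API used in the snippets below; the index path, field, and term are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.FSDirectory;

public class TwoPhaseSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("title", "solr"));

            // Phase 1: build the Weight tree; this is where term statistics are
            // gathered and the Similarity's computeWeight runs (sections 8.1.x).
            Weight weight = searcher.createWeight(searcher.rewrite(query), true, 1f);

            // Phase 2: walk each segment, pull a BulkScorer from the Weight, and
            // push matching docs into a Collector (section 8.2). searcher.search
            // performs both phases internally; the explicit createWeight above
            // only makes phase 1 visible.
            TopDocs top = searcher.search(query, 10);
            System.out.println("hits: " + top.totalHits);
        }
    }
}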
8.1 Creating the Weight
8.1.1 Creating BooleanWeight
BooleanWeight.java
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    super(query);
    this.query = query;
    this.needsScores = needsScores;
    this.similarity = searcher.getSimilarity(needsScores);
    weights = new ArrayList<>();
    // Recursively create one child Weight per clause of the BooleanQuery.
    for (BooleanClause c : query) {
      Query q = c.getQuery();
      Weight w = searcher.createWeight(q, needsScores && c.isScoring(), boost);
      weights.add(w);
    }
}
8.1.2 Synonym weight analysis
SynonymQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    if (needsScores) {
      return new SynonymWeight(this, searcher, boost);
    } else {
      // if scores are not needed, let BooleanWeight deal with optimizing that case.
      BooleanQuery.Builder bq = new BooleanQuery.Builder();
      for (Term term : terms) {
        bq.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
      }
      return searcher.rewrite(bq.build()).createWeight(searcher, needsScores, boost);
    }
}
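For context, a SynonymQuery bundles several terms on the same field and scores them as if they were one term; a hedged construction sketch (field and terms invented; this uses the 7.x-era varargs constructor, newer versions use a Builder):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.SynonymQuery;

public class SynonymSketch {
    public static void main(String[] args) {
        // All terms must be on the same field; they are scored as a single pseudo-term.
        SynonymQuery q = new SynonymQuery(
            new Term("body", "tv"),
            new Term("body", "television"));
        System.out.println(q);
    }
}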
8.1.3 TermQuery.java
@Override
public Weight createWeight(IndexSearcher searcher, boolean needsScores, float boost) throws IOException {
    final IndexReaderContext context = searcher.getTopReaderContext();
    final TermContext termState;
    if (perReaderTermState == null
        || perReaderTermState.wasBuiltFor(context) == false) {
      if (needsScores) {
        // make TermQuery single-pass if we don't have a PRTS or if the context
        // differs!
        termState = TermContext.build(context, term);
      } else {
        // do not compute the term state, this will help save seeks in the terms
        // dict on segments that have a cache entry for this query
        termState = null;
      }
    } else {
      // PRTS was pre-build for this IS
      termState = this.perReaderTermState;
    }
    return new TermWeight(searcher, needsScores, boost, termState);
}
This calls the TermWeight constructor, which computes the CollectionStatistics (field-level statistics such as maxDoc) and the TermStatistics (the term's document frequency and total frequency) that scoring needs:
public TermWeight(IndexSearcher searcher, boolean needsScores,
    float boost, TermContext termStates) throws IOException {
  super(TermQuery.this);
  if (needsScores && termStates == null) {
    throw new IllegalStateException("termStates are required when scores are needed");
  }
  this.needsScores = needsScores;
  this.termStates = termStates;
  this.similarity = searcher.getSimilarity(needsScores);

  final CollectionStatistics collectionStats;
  final TermStatistics termStats;
  if (needsScores) {
    termStates.setQuery(this.getQuery().getKeyword());
    collectionStats = searcher.collectionStatistics(term.field());
    termStats = searcher.termStatistics(term, termStates);
  } else {
    // we do not need the actual stats, use fake stats with docFreq=maxDoc and ttf=-1
    final int maxDoc = searcher.getIndexReader().maxDoc();
    collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
    termStats = new TermStatistics(term.bytes(), maxDoc, -1, term.bytes());
  }
  this.stats = similarity.computeWeight(boost, collectionStats, termStats);
}
TermWeight in turn calls the Similarity's computeWeight:
BM25Similarity.java
@Override
public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
    Explanation idf = termStats.length == 1
        ? idfExplain(collectionStats, termStats[0])
        : idfExplain(collectionStats, termStats);
    float avgdl = avgFieldLength(collectionStats);

    // Precompute the length-dependent factor of the BM25 denominator for all
    // 256 possible encoded document lengths (old and new norm encodings).
    float[] oldCache = new float[256];
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
      oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
      cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
    }
    return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
}
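For reference, what the cache arrays precompute is the length-dependent part of Lucene's BM25 formula (sketched from the code above for the 7.x-era implementation; $k_1$ and $b$ are the usual BM25 parameters, $N$ the document count, $\mathit{df}$ the term's document frequency):

$$\mathrm{score}(t, D) = \mathrm{idf}(t) \cdot \frac{tf \cdot (k_1 + 1)}{tf + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{idf}(t) = \ln\left(1 + \frac{N - \mathit{df} + 0.5}{\mathit{df} + 0.5}\right)$$

cache[i] stores the factor $k_1 (1 - b + b \cdot |D|/\mathrm{avgdl})$ for each of the 256 possible encoded document lengths, so per-document scoring becomes a table lookup instead of a recomputation.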
8.2 The query process
The complete flow: IndexSearcher's search method iterates over the index's leaves (segments); for each leaf it (1) obtains a LeafCollector, (2) asks the Weight for a BulkScorer, and (3) has the scorer push matching documents into the collector. The numbered comments in the code mark these three steps.
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector) throws IOException {

  // TODO: should we make this
  // threaded...?  the Collector could be sync'd?
  // always use single thread:
  for (LeafReaderContext ctx : leaves) { // search each subreader
    final LeafCollector leafCollector;
    try {
      leafCollector = collector.getLeafCollector(ctx); // 1
    } catch (CollectionTerminatedException e) {
      // there is no doc of interest in this reader context
      // continue with the following leaf
      continue;
    }
    BulkScorer scorer = weight.bulkScorer(ctx); // 2
    if (scorer != null) {
      try {
        scorer.score(leafCollector, ctx.reader().getLiveDocs()); // 3
      } catch (CollectionTerminatedException e) {
        // collection was terminated prematurely
        // continue with the following leaf
      }
    }
  }
}
8.2.1 Obtaining the Collector
TopScoreDocCollector.java#SimpleTopScoreDocCollector
@Override
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
  final int docBase = context.docBase;
  return new ScorerLeafCollector() {

    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();
      /* Document document = context.reader().document(doc); */

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }

  };
}
8.2.2 Invoking the scorer
/**
 * Optional method, to return a {@link BulkScorer} to
 * score the query and send hits to a {@link Collector}.
 * Only queries that have a different top-level approach
 * need to override this; the default implementation
 * pulls a normal {@link Scorer} and iterates and
 * collects the resulting hits which are not marked as deleted.
 *
 * @param context
 *          the {@link org.apache.lucene.index.LeafReaderContext} for which to return the {@link Scorer}.
 *
 * @return a {@link BulkScorer} which scores documents and
 *         passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

  Scorer scorer = scorer(context);
  if (scorer == null) {
    // No docs match
    return null;
  }

  // This impl always scores docs in order, so we can
  // ignore scoreDocsInOrder:
  return new DefaultBulkScorer(scorer);
}

/** Just wraps a Scorer and performs top scoring using it.
 *  @lucene.internal */
protected static class DefaultBulkScorer extends BulkScorer {
  private final Scorer scorer;
  private final DocIdSetIterator iterator;
  private final TwoPhaseIterator twoPhase;

  /** Sole constructor. */
  public DefaultBulkScorer(Scorer scorer) {
    if (scorer == null) {
      throw new NullPointerException();
    }
    this.scorer = scorer;
    this.iterator = scorer.iterator();
    this.twoPhase = scorer.twoPhaseIterator();
  }

  @Override
  public long cost() {
    return iterator.cost();
  }

  @Override
  public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
    collector.setScorer(scorer);
    if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
      scoreAll(collector, iterator, twoPhase, acceptDocs);
      return DocIdSetIterator.NO_MORE_DOCS;
    } else {
      int doc = scorer.docID();
      if (doc < min) {
        if (twoPhase == null) {
          doc = iterator.advance(min);
        } else {
          doc = twoPhase.approximation().advance(min);
        }
      }
      return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
    }
  }
}
scoreAll then iterates over the matching documents and invokes SimpleTopScoreDocCollector's collect method, which contains the scoring logic (see the SimpleTopScoreDocCollector code above).
/** Specialized method to bulk-score all hits; we
 *  separate this from {@link #scoreRange} to help out
 *  hotspot.
 *  See <a href="https://issues.apache.org/jira/browse/LUCENE-5487">LUCENE-5487</a> */
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
  if (twoPhase == null) {
    for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
    }
  } else {
    // The scorer has an approximation, so run the approximation first, then check acceptDocs, then confirm
    final DocIdSetIterator approximation = twoPhase.approximation();
    for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
      if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
        collector.collect(doc);
      }
    }
  }
}
Summary:
Tracing and writing up the whole flow was exhausting.