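Solr Research Summary (Repost)

Date: 2019-11-24

Tags: solr, research summary

Development type	Full-text search development
Solr version	4.2
Document content	This document describes how to use Solr's features and what to watch out for. It covers environment setup and debugging, the two core configuration files, Chinese tokenizer configuration, index maintenance, and index queries, plus how to use highlighting, spell checking, search suggestions, grouped statistics (faceting), automatic clustering, similarity matching, and pinyin search.
Version	Author/Editor	Date
V1.0	gzk	2013-06-04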

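1. What is Solr?

Solr is an open-source search server built on the Lucene Java library that is easy to add to web applications. Solr provides faceted search (that is, counting by category), hit highlighting, and multiple output formats (including XML/XSLT and JSON). It is easy to install and configure and ships with an HTTP-based administration interface. You can use Solr's solid basic search as it is, or extend it to meet enterprise needs. Solr's features include:

Advanced full-text search
Optimization for high-volume web traffic
Standards based on open interfaces (XML and HTTP)
A comprehensive HTML administration interface
Scalability: an index can be replicated efficiently to another Solr search server
Flexibility and adaptability through XML configuration
An extensible plugin architecture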

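2. What is Lucene?

Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; it provides indexing and search capabilities for your application. Lucene started out in the Apache Jakarta family and is now a top-level Apache open-source project. It is also the most popular Java-based open-source full-text retrieval toolkit. Many applications already build their search on Lucene, for example the search in the Eclipse help system. Lucene indexes text data, so once you convert the data you want to index into text, Lucene can index and search your documents.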

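3. Solr vs Lucene

Solr and Lucene are not competitors. On the contrary, Solr depends on Lucene, because Solr's core technology is implemented with Lucene. The essential differences between them come down to three points: search server, enterprise focus, and management. Lucene is essentially a search library, not a standalone application, whereas Solr is. Lucene concentrates on low-level search infrastructure, whereas Solr concentrates on enterprise applications. Lucene does not take care of the management a search service needs, whereas Solr does. In one sentence: Solr is an extension of Lucene aimed at enterprise search applications.

Solr and Lucene architecture diagram:

Solr uses Lucene and extends it:

A real data schema with dynamic fields (Dynamic Field) and a unique key (Unique Key)
Powerful extensions to the Lucene query language
Dynamic grouping and filtering of results
Advanced, configurable text analysis
Highly configurable and extensible caching
Performance optimizations
External configuration through XML
An administration interface
Monitorable logging
Fast incremental updates and snapshot distribution

Querying for 中國 against the same index with Solr and with Lucene:

Solr took 0.073 seconds

Lucene took 1.071 seconds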

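4. Setting up and debugging Solr

4.1 Installing the Java virtual machine

Solr must run on a Java virtual machine of version 1.6 or later. Running the standard Solr service only requires a JRE, but extending features or compiling the source requires a JDK. The JDK or JRE can be downloaded from:

OpenJDK ( http://java.sun.com/j2se/downloads.html )
Sun ( http://java.sun.com/j2se/downloads.html )
IBM ( http://www.ibm.com/developerworks/java/jdk/ )
Oracle ( http://www.oracle.com/technology/products/jrockit/index.html )

See the corresponding documentation for installation steps.

4.2 Downloading Solr

This research was done against Solr 4.2, and everything below refers to that version. Where it differs from the latest Solr release, the official website takes precedence. Official download page: http://lucene.apache.org/solr/

4.3 Downloading and setting up Apache Ant

The Solr source is managed with Ant, a Java-based build tool that is conceptually similar to Maven, or to make for C. After downloading and unpacking it, set the environment variables:

ANT_HOME: E:\Work\apache-ant\1.9.1 (the directory you unpacked Ant into)
PATH: %ANT_HOME%\bin (so that ant can be run conveniently from a command prompt)

To check the installation, type ant in a command prompt. If the following result appears:

Ant is installed. Ant runs build.xml by default, and that file has to be provided by the project being built. You can now build the Solr source: in the command prompt, go to your Solr source directory and type ant to see the usage information for its build.xml.

Ignore the other targets for now and build only for the IDE you use. For Eclipse, run: ant eclipse. For IntelliJ IDEA, run: ant idea. That starts the build.

The console then showed this:

The build failed. The cause turned out to be a jar missing from the downloaded Ant: apache-ivy (download: http://ant.apache.org/ivy/). Ivy is what Ant uses to manage jar dependencies. On the first build, Ivy automatically downloads the dependencies the build is missing, so the first build can take a long time on a slow connection.

Download the Ivy jar and put it in Ant's lib directory (E:\Work\apache-ant\1.9.1\lib). Running ant again then succeeds, and only now can the Solr code be debugged.

4.4 Configuring and running the Solr code

Whichever IDE you use, first set Solr Home: in the IDE's JVM settings, add -Dsolr.solr.home=solr/example/solr to the VM arguments. That usually works; if it does not, use an absolute path.

Solr uses the StartSolrJetty class as the entry point for debugging. The server port and Solr's webapps directory can be set there, but the defaults normally work. Solr Home can equally be set in code:

System.setProperty("solr.solr.home", "E:\\Work\\solr-4.2.0-src-idea\\solr\\example\\solr");

The bundled example is used here as the root of the Solr configuration; if you have another Solr configuration directory, point to it instead. Click run (debug works the same way) and, if nothing else is wrong, Solr starts. Watch the port used by the servlet container. If you see:

FAILED SocketConnector@0.0.0.0:8983: java.net.BindException: Address already in use: JVM_Bind

the port is already in use, so change it. If startup succeeds without errors, open http://localhost:8983/solr/ in a browser to see the following page:

At this point Solr is configured and running. To step through the code at startup, set a breakpoint in the initialize() method of Initializer. To debug requests coming from the browser, set a breakpoint in the doFilter method of SolrDispatchFilter.

Note: IE9 has a bug in compatibility mode, so compatibility mode must be turned off.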

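5. Solr basics

Because Solr wraps and extends Lucene, the two share much of their terminology. More importantly, indexes created by Solr are fully compatible with the Lucene search library. With appropriate configuration, and in some cases a little coding, Solr can read and use indexes built by other Lucene applications. In both Solr and Lucene, an index is built from one or more Documents. A Document contains one or more Fields. A Field consists of a name, content, and metadata that tells Solr how to handle the content.

For example, a Field can contain a string, a number, a boolean, or a date, or any other type you want to add, as long as it is configured in Solr's configuration file. A Field can be described with a large number of options that tell Solr how to treat the content during indexing and searching.

The table below lists a subset of the important attributes:

Attribute	Description
Indexed	An indexed Field can be searched and sorted. You can also run Solr's analysis process on an indexed Field, which modifies the content to improve or change the results.
Stored	The content of a stored Field is saved in the index. This is useful for retrieving and highlighting content but is not required for the search itself. For example, many applications store a pointer to the content's location rather than the actual file content.

5.1 Schema configuration: schema.xml

The schema.xml configuration file is located in \solr\example\solr\collection1\conf under the directory where the Solr package was unpacked. It is the file that defines Solr's schema, and it contains detailed comments. The schema is organized into three main parts.

5.1.1. The types section

This section holds common reusable definitions that determine how Solr (and Lucene) handle a Field, that is, the types such as int, text, and date that can be assigned to the fields added to the index.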

</analyzer>

</analyzer>

</fieldType>

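Parameters:

Attribute	Description
name	An identifier only
class	Together with the other attributes, determines the actual behavior of this fieldType.
sortMissingLast	When true, documents without this field sort after documents that have it, regardless of the requested sort order. Defaults to false.
sortMissingFirst	The reverse of the above. Defaults to false.
analyzer	The analyzer assigned to the field type
type	The operation the analyzer applies to: index is the analyzer used when building the index, query is the analyzer used at query time
tokenizer	The tokenizer class
filter	Filters applied after tokenization. Filters run in the order in which they are configured.

5.1.2. fields

These are the names of the fields that appear in the documents you add to the index. Their types are declared using the types section above.

field: a fixed field definition
dynamicField: a dynamic field definition, used for fields defined later, with * as a wildcard. For example, test_i is a dynamic field of type int.

There is also a special element, copyField, typically used to build the field that searches run against, so that only this one field needs to be indexed and tokenized for search. If the dest field of a copyField has more than one source, it must be set to multiValued=true, otherwise an error is raised.

Field attributes:

Attribute	Description
name	The name of the field type
class	The Java class name
indexed	Defaults to true. The data can be searched and sorted. If the data is not indexed, stored should be true.
stored	Defaults to true. The field can be included in search results. If the data is not stored, indexed should be true.
omitNorms	Set to true when the field's length should not affect scoring and no index-time boost is applied. Ordinary text fields usually do not set it to true.
termVectors	Should be true if the field is used for the more like this and highlight features.
compressed	The field is compressed. This can slow down indexing and searching but reduces storage. Only StrField and TextField can be compressed, and it is usually appropriate for fields longer than 200 characters.
multiValued	Can be set to true when the field holds more than one value.
positionIncrementGap	Used with multiValued; sets the size of the virtual gap between the multiple values

Note: _version_ is a special field that must not be removed. It records the version number of the current index entry.

5.1.3. Other settings

uniqueKey: the unique key. It refers to one of the fields defined above, usually a non-repeating one such as id or url. It is used for updates and deletes.

defaultSearchField: the default search field, that is, the field searched by a query such as q=solr.

solrQueryParser: the default query operator, either AND or OR (AND/OR must be uppercase).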

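5.2. Solr configuration: solrconfig.xml

The solrconfig.xml configuration file is located in E:\Work\solr-4.2.0-src-idea\solr\example\solr\collection1\conf under the directory where the Solr package was unpacked. The file is fairly large. Its main contents are the lib configuration (dependent jars and Solr plugins), component configuration, index configuration, and query configuration. The index and query configuration are described in detail below.

5.2.1 Index configuration: indexConfig

The following Solr performance factors help in understanding the trade-offs of various changes. The table below summarizes the factors that control Solr's index processing:

Attribute	Description
useCompoundFile	Reduces the number of files in use by combining many Lucene internal files into one. This helps reduce the number of file handles Solr uses, at the cost of performance. Unless the application runs out of file handles, the default of false should be sufficient.
ramBufferSizeMB	When documents are added or deleted, Solr first buffers the changes in memory to avoid updating the index too often, and flushes to the index only when the buffered data exceeds this size. Larger values speed up indexing at the cost of more memory.
maxBufferedDocs	The same buffering limit expressed as a number of documents. If both values are set, the index is flushed as soon as either limit is reached.
mergeFactor	Determines how often low-level Lucene segments are merged. Smaller values (minimum 2) use less memory but make indexing slower. Larger values speed up indexing at the cost of more memory.
maxIndexingThreads	The maximum number of threads the indexWriter uses when building the index
unlockOnStartup	Tells Solr to ignore the locking mechanism that protects the index in a multithreaded environment. In some cases the index can remain locked after an unclean shutdown or another error, which blocks adds and updates. Setting this to true disables the startup lock so that adds and updates can proceed.
lockType	single: use for a read-only index or when no other process modifies the index. native: uses the operating system's native file locking; do not use it when several Solr instances in the same JVM share one index. simple: uses a plain file to lock the index.

5.2.2 Query configuration: query

Attribute	Description
maxBooleanClauses	The maximum number of clauses in a BooleanQuery. Exceeding it throws TooManyClausesException. Note that this is a global setting: all SolrCores share one value, and if the cores set different values, the last one wins.
filterCache	Stores unordered sets of Lucene document ids. 1) It stores the document id sets produced by filter queries (the fq parameter). 2) It can also be used for facet queries. 3) If useFilterForSortedQuery is configured, queries that have a filter use the filterCache.
queryResultCache	Caches search results as an ordered list of document ids
documentCache	Caches Lucene Document objects; it is not autowarmed
fieldValueCache	A field cache that gives fast access by document id. A fieldValueCache is created by default even if it is not configured here.
enableLazyFieldLoading	Set to true if the application expects to retrieve only a few Fields of a Document. A common scenario for lazy loading is an application that returns and displays a list of search results, where the user then clicks one to view the original document stored in the index. The initial display often needs only a short piece of information. Given the cost of retrieving a large Document, loading the whole document should be avoided unless necessary.
queryResultWindowSize	The number of document ids collected for one query. For example, if each page shows 10 results and this is set to 20, the cache always holds 10 more results than requested, so the next page is served quickly.
queryResultMaxDocsCached	The maximum number of result documents cached per query
listener	Defines the newSearcher and firstSearcher events, which specify the queries to run when a new searcher or the first searcher is instantiated. If the application expects certain queries, uncomment these sections and supply the appropriate queries.
useColdSearcher	Whether to use a cold searcher. When false, requests wait for a warmed searcher.
maxWarmingSearchers	The maximum number of searchers warming at the same time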

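5.3 Adding a Chinese tokenizer to Solr

Chinese tokenization is not enabled by default in Solr, so you have to configure a Chinese tokenizer yourself. Available tokenizers include smartcn, IK, Jeasy, and Paoding (庖丁). They fall into two main kinds. One is based on the hidden Markov model (HMM) algorithm of ICTCLAS from the Chinese Academy of Sciences, such as smartcn and ictclas4j; its advantage is high segmentation accuracy and its drawback is that user-defined dictionaries are not supported. The other is based on maximum matching, such as IK, Jeasy, and Paoding; its advantage is that the dictionary can be customized and new words added, and its drawback is that it produces more junk tokens. Each has strengths and weaknesses, so choose according to your use case.

Installation of two tokenizers is described next; pick either one. The first is recommended, because smartcn ships with the Solr distribution under contrib/analysis-extras/lucene-libs/ as lucene-analyzers-smartcn-4.2.0.jar. First add a line to solrconfig.xml that references analysis-extras, so that the tokenizer we add is loaded into Solr.

5.3.1. Installing the smartcn tokenizer

First copy contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-4.2.0.jar from the distribution to \solr\contrib\analysis-extras\lib. Then, in the local Solr application folder, open /solr/conf/schema.xml and edit the text field type: find the section where the fieldType elements are defined and add the following definition after them.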

</analyzer>

</analyzer>

</fieldType>

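To search a particular field, you also need to add that field to the field section of schema.xml with text_smartcn as its type, so that it is tokenized as Chinese. For example, to make the text field searchable in Chinese, configure it as follows:

5.3.2. Installing the IK tokenizer

First download the IKAnalyzer distribution from: http://ik-analyzer.googlecode.com/files/IK%20Analyzer%202012FF_hf1.zip.

After downloading, unpack it and copy the following three files to \solr\contrib\analysis-extras\lib:

IKAnalyzer2012FF_u1.jar: the tokenizer jar

IKAnalyzer.cfg.xml: the tokenizer configuration file

Stopword.dic: the tokenizer's stop word dictionary, to which you can add your own entries

After copying them, configure schema.xml in the same way as for smartcn.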

</fieldType>

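Now verify that the tokenizer was added. Start the Solr service with StartSolrJetty. If the configuration fails during startup, there are usually two causes. Either the configured tokenizer jar cannot be found, meaning the jar was not copied to \solr\contrib\analysis-extras\lib, or the tokenizer version does not match and its API differs from what Solr expects. In the second case, check the tokenizer's documentation for the versions it supports.

If startup reports no errors, the configuration succeeded. Go to http://localhost:8983/solr to test the newly added Chinese tokenizer. In the Core Selector on the home page, choose the core you configured and click Analysis beneath it. Under Analyse Fieldname / FieldType, select the field name or field type you just set up, enter 中國人 in Field Value (Index), and click the analyse button on the right.

6. Using Solr's features

This section uses SolrJ to introduce some of Solr's basic functions. SolrJ is combined with EmbeddedSolrServer (an embedded server), which makes it easy to step through the code. Functionally it is the same as the other servers, since they all extend SolrServer to provide the service API. The advantage of EmbeddedSolrServer is that it does not go through HTTP: it loads the SolrCore directly, so it should be the fastest option and is convenient for embedding a single-node Solr service in a project. The sections below describe the individual features. EmbeddedSolrServer is initialized as follows: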

System.setProperty("solr.solr.home", "E:\\Work\\solr-4.2.0-src\\solr\\example\\solr");

CoreContainer.Initializer initializer = new CoreContainer.Initializer();

CoreContainer coreContainer = initializer.initialize();

SolrServer server = new EmbeddedSolrServer(coreContainer, "");

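6.1 Maintaining the index

In most systems, maintenance means create, delete, and update. In Solr, maintenance means add, delete, and optimize; an update in Solr is a delete followed by an add. Before maintaining the index, first configure schema.xml, setting up the field information (name, type, indexed, stored, tokenization, and so on) as described in the sections above. This is roughly like creating a table in a database. Once schema.xml is set up, index operations can begin.

6.1.1 Adding to the index

Before adding to the index, build a SolrInputDocument object. The main task is to add fields and values to the document: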

SolrInputDocument doc = new SolrInputDocument();

doc.setField("id", "ABC");

doc.setField("content", "中華人民共和國");

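After building the document, add it to the server initialized above.

server.add(doc);

server.commit(); // usually unnecessary, because autoCommit in the

// configuration file can be used instead to improve performance

When Solr adds a document, it adds the document directly if it does not exist yet; if it already exists, Solr deletes it and then adds the new one, which is how updates work. Whether a document exists is decided by the uniqueKey field defined in the schema.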
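The add-or-replace behavior described above can be pictured with a plain map keyed by the unique key. This is only a sketch of the semantics under that assumption, not Solr's implementation; the class UniqueKeyIndex and its methods are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an index keyed by the uniqueKey field "id". */
final class UniqueKeyIndex {
    private final Map<String, Map<String, Object>> docs = new LinkedHashMap<>();

    /** Adds a document; returns true if a document with the same id was replaced. */
    boolean add(Map<String, Object> doc) {
        String id = String.valueOf(doc.get("id"));
        // put() overwrites any previous entry, mirroring "delete, then add".
        return docs.put(id, doc) != null;
    }

    int size() {
        return docs.size();
    }

    Object field(String id, String name) {
        Map<String, Object> doc = docs.get(id);
        return doc == null ? null : doc.get(name);
    }

    public static void main(String[] args) {
        UniqueKeyIndex index = new UniqueKeyIndex();
        Map<String, Object> first = new HashMap<>();
        first.put("id", "ABC");
        first.put("content", "old text");
        Map<String, Object> second = new HashMap<>();
        second.put("id", "ABC");
        second.put("content", "new text");
        System.out.println(index.add(first));   // first add: nothing replaced
        System.out.println(index.add(second));  // same id: previous document replaced
        System.out.println(index.size());
    }
}
```

Adding two documents with the same id leaves a single document holding the latest field values.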

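6.1.2 Deleting from the index

Documents can be deleted in two ways: by document ID, or by the results of a query.

Deleting by ID:

server.deleteById(id);

// or delete in bulk

server.deleteById(ids);

Deleting by query:

server.deleteByQuery("*:*"); // deletes every document in the index

// "*:*" matches all documents; query syntax is covered in the query section.

6.1.3 Optimizing the index

Optimizing rewrites Lucene's index files to improve search performance. It is usually a good idea to optimize once indexing has finished. If updates are frequent, schedule optimization for periods of low usage. An index works correctly without being optimized, and optimization is a time-consuming process.

server.optimize(); // do not call this often; call it when the system is idle if possible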

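6.2 Text search

Text search works in Solr without changing any configuration. A web project can query the Solr server directly by URL, for example:

http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&indent=true

This retrieves all content of the SolrCore named collection1 and returns it as indented XML.

The result looks like this: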

<?xml version="1.0" encoding="UTF-8"?>

</lst>

<doc>

<str name="path">E:\Reduced\軍事\1539.txt</str>

<str name="content"> [俄羅斯lenta網站2006年2月9日報道]俄空軍副總司令比熱耶夫中將稱，2006年春天獨聯體國家防空系統打擊範圍向西推動150公里，偵察範圍向西推動400公里。　　2006年3月白俄羅斯4個S-300PS防空導彈營擔負戰鬥任務，使獨聯體防空系統做戰範圍得以向西推動。比熱耶夫中將還宣布，近期烏茲別克斯坦可能加入獨聯體防空系統。　　獨聯體國家防空系統建於9年前，共有9個國家參加該組織。目前只有亞美尼亞、白俄羅斯、哈薩克斯坦、吉爾吉斯、俄羅斯和塔吉克斯坦支持該體系。　　烏克蘭、烏茲別克斯坦與俄羅斯在雙邊基礎上合做，格魯吉亞和土庫曼最近7年不參加獨聯體國家對空防御。</str>

…

</result>

</response>

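What you see above is the query result in XML format. Each doc is one document, and the elements inside a doc are the fields defined earlier in schema.xml.

With SolrJ the call looks like this:

SolrQuery query = new SolrQuery();

query.set("q", "*:*");

QueryResponse rsp = server.query(query);

SolrDocumentList list = rsp.getResults();

The results are returned in a SolrDocumentList; iterate over it to read the values: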

for (int i = 0; i < list.size(); i++) {

SolrDocument sd = list.get(i);

String id = (String) sd.getFieldValue("id");

System.out.println(id);

}

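6.2.1 Query parameters

Name	Description
q	The query string. Required.
fq	Filter query. Filter queries make full use of the filter query cache and improve search performance. Results must match both q and fq. For example, q=mm&fq=date_time:[20081001 TO 20091031] finds the keyword mm where date_time is between 20081001 and 20091031.
fl	Field list. Specifies the fields returned in the results, separated by spaces or commas.
start	The offset of the first result, used for paging. Defaults to 0.
rows	The number of results returned per page, used for paging. Defaults to 10.
sort	Sort order, in the form sort=<field name>+<desc|asc>[,<field name>+<desc|asc>]… . Example: (inStock desc, price asc) sorts by inStock descending, then by price ascending. The default is descending relevance.
df	The default search field. Usually set as a default in the configuration.
q.op	Overrides defaultOperator in schema.xml (whether terms separated by spaces are combined with "AND" or "OR"). Usually set as a default in the configuration. Must be uppercase.
wt	Writer type. Specifies the output format, "xml" by default. The output formats defined in solrconfig.xml are xml, json, python, ruby, php, phps, and custom.
qt	Query type. Specifies the query handler to use, "standard" by default.
explainOther	When debugQuery=true, shows explanations for other queries.
defType	Sets the name of the query parser.
timeAllowed	Sets the query timeout.
omitHeader	Whether to omit the response header from the result. Defaults to "false".
indent	Whether to indent the result. Off by default; enable with indent=true|on. It is generally only needed when debugging json, php, phps, or ruby output.
version	The version of the query syntax. Not recommended; let the server supply the default.
debugQuery	Whether to include debug information in the result.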
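To see how these parameters combine into a request, here is a minimal sketch that assembles a /select URL with the standard library only. SelectUrlBuilder is a hypothetical helper, not part of SolrJ, and the base URL is the example core used throughout this document. Because it stores parameters in a map, it cannot repeat a parameter such as fq; that simplification is an assumption of the sketch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles a Solr /select URL from query parameters. */
final class SelectUrlBuilder {
    private final String baseUrl;
    // LinkedHashMap keeps parameters in insertion order; one value per name (sketch limitation).
    private final Map<String, String> params = new LinkedHashMap<>();

    SelectUrlBuilder(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    SelectUrlBuilder param(String name, String value) {
        params.put(name, value);
        return this;
    }

    String build() {
        StringBuilder sb = new StringBuilder(baseUrl).append("/select");
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator).append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            // Form encoding: ':' becomes %3A and a space becomes '+'.
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String url = new SelectUrlBuilder("http://localhost:8983/solr/collection1")
                .param("q", "mm")
                .param("fq", "date_time:[20081001 TO 20091031]")
                .param("start", "0")
                .param("rows", "10")
                .param("sort", "inStock desc,price asc")
                .param("wt", "xml")
                .build();
        System.out.println(url);
    }
}
```

Encoding each value separately keeps characters such as the colon, brackets, and spaces in q, fq, and sort from being mistaken for URL syntax.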

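6.2.2 Query syntax

1. Match all documents: *:*

2. Mandatory, prohibited, and optional clauses:

1) Mandatory: terms the results must contain (for example, only entry names containing the word make)

Solr/Lucene Statement: +make, +make +up, +make +up +kiss

2) Prohibited: terms the results must not contain (for example, all documents except those with the word believe)

Solr/Lucene Statement: +make +up -kiss

3) Optional:

Solr/Lucene Statement: +make +up kiss

3. Boolean operators: AND, OR, and NOT (must be uppercase) behave much like mandatory, optional, and prohibited.

1) make AND up = +make +up: both operands of AND are mandatory

2) make || up = make OR up = make up: both operands of OR are optional

3) +make +up NOT kiss = +make +up -kiss

4) make AND up OR french AND Kiss does not give the expected result, because the operands on both sides of AND are mandatory.

4. Subexpressions (subqueries): use parentheses "()" to build a subquery.

Example: (make AND up) OR (french AND Kiss)

5. Limitation of prohibited clauses inside subexpressions:

Example: make (-up) returns only the results for make. Use make (-up *:*) to find documents that match make or do not contain up.

6. Fielded queries: query a specific field by writing the field name followed by a colon (fieldName:query)

Example: entryNm:make AND entryId:3cdc86e8e0fb4da8ab17caed42f6760c

7. Wildcard queries (wildCard Query):

1) The wildcards ? and *: "*" matches any sequence of characters; "?" matches a single character at its position.

Examples: ma?* (one position after ma must be filled), ma??* (two positions after ma must be filled)

2) Wildcard terms must be lowercase: +Ma +be** returns results; +Ma +Be** returns none.

3) Wildcard queries are slow, especially when the wildcard comes first. One reason is that every term in the field has to be iterated and tested for a match. The other is that each matching term is added to an internal query, and the query fails once the number of terms reaches 1024.

4) By default, Solr does not allow a wildcard at the start of a term (this can be changed by modifying the QueryParser and setting setAllowLeadingWildcard to true).

8. Fuzzy and similarity queries: these are not exact matches. Terms obtained from the query term through insertions, deletions, and substitutions are matched, and the best-scoring results are returned (backed by the Levenshtein distance algorithm).

1) Plain fuzzy query. Example: make-believ~

2) Fuzzy query with a threshold: a threshold between 0 and 1 can be set for a fuzzy query; the higher the threshold, the higher the required similarity. Examples: make-believ~0.5, make-believ~0.8, make-believ~0.9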
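The edit distance behind fuzzy queries can be sketched as follows. The levenshtein method is the standard dynamic-programming algorithm; the similarity method, which normalizes by the longer string's length, is an assumption made here to relate the distance to a 0..1 threshold and is not claimed to be Lucene's exact scoring formula. EditDistance is a hypothetical class name.

```java
/** Sketch of Levenshtein distance and a simple normalized similarity (assumed formula). */
final class EditDistance {

    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1 - distance / length of the longer string. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("beijink", "beijing"));
        System.out.println(similarity("beijink", "beijing"));
    }
}
```

Under this assumed normalization, a term one edit away from a seven-letter query term would pass a threshold of 0.8 but not 0.9.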

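9. Range queries (Range Query): Lucene supports range queries on numbers, dates, and even text. An open end of the range can be written with the "*" wildcard.

Examples:

1) Date range (ISO-8601, GMT): sa_type:2 AND a_begin_date:[1990-01-01T00:00:00.000Z TO 1999-12-31T23:59:59.999Z]

2) Number: salary:[2000 TO *]

3) Text: entryNm:[a TO a]

10. Date math: YEAR, MONTH, DAY, DATE (synonymous with DAY), HOUR, MINUTE, SECOND, MILLISECOND, and MILLI (synonymous with MILLISECOND) can be used as date units.

Examples:

1) r_event_date:[* TO NOW-2YEAR]: everything up to this exact moment two years ago

2) r_event_date:[* TO NOW/DAY-2YEAR]: everything up to the start of today's date two years ago (NOW/DAY rounds down to the beginning of the current day before the two years are subtracted)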
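The difference between NOW-2YEAR and NOW/DAY-2YEAR can be reproduced with java.time. This is only an illustration of the rounding rule, assuming Java 8 or later and UTC; DateMathSketch and its method names are hypothetical, and the sample instant is arbitrary.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

/** Sketch of two date-math expressions evaluated against a given "now" (UTC assumed). */
final class DateMathSketch {

    /** NOW-2YEAR: the same instant, two years earlier. */
    static ZonedDateTime nowMinusTwoYears(ZonedDateTime now) {
        return now.minusYears(2);
    }

    /** NOW/DAY-2YEAR: round down to the start of the day, then go back two years. */
    static ZonedDateTime startOfDayMinusTwoYears(ZonedDateTime now) {
        return now.truncatedTo(ChronoUnit.DAYS).minusYears(2);
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.of(2013, 6, 4, 15, 30, 0, 0, ZoneOffset.UTC);
        System.out.println(nowMinusTwoYears(now));
        System.out.println(startOfDayMinusTwoYears(now));
    }
}
```

Rounding to a coarser unit such as the day also makes the range bound identical for every request issued on the same day, which helps when the query is used as a cached filter.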

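6.2.3 Function queries (Function Query)

A function query scores documents using a function of a numeric field's value, or of some specific value associated with a field.

1. Ways to use a function query

There are three main ways to use function queries, all of them through Solr's HTTP interface.

1) Use FunctionQParserPlugin, e.g. q={!func}log(foo)

2) Use the embedded "_val_" form

The function is embedded in a normal Solr query expression. That is, the function query is written inside the q parameter, and "_val_" distinguishes the function from the rest of the query.

e.g. entryNm:make && _val_:ord(entryNm)

3) Use the bf parameter of dismax

Use a parameter that is explicitly a function query, such as bf (boost function) in dismax. Note that bf accepts several function queries separated by spaces, each optionally carrying a weight. When using bf, therefore, make sure no single function contains a space, otherwise it may be read as two functions.

Example:

q=dismax&bf="ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3"

2. Function syntax (Function Query Syntax)

Function queries currently do not support the form a+b; it has to be written as a function call, sum(a,b).

3. Things to note when using function queries

1) A field used in a function query must be indexed;

2) The field must not be multi-valued (multi-value)

4. Available functions (available function)

1) constant: a constant, decimal points allowed. Example: 1.5; SolrQuerySyntax: _val_:1.5

2) fieldvalue: returns the value of a numeric field, which must be indexed and not multiValued. The syntax is simply the field name. If the document has no value for the field, 0 is returned.

3) ord: all values of a field are ordered lexicographically, and this function returns the rank of a document's value in that order. The field must not be multiValued, and 0 is returned when there is no value. For example, if a field can only take the three values "apple", "banana", and "pear", then ord("apple")=1, ord("banana")=2, ord("pear")=3. Note that ord() depends on the position of the value in the index, so its result changes when documents are deleted or added. With a MultiSearcher the value is also undefined.

4) rord: returns the reverse rank corresponding to ord.

Syntax: rord(myIndexedField).

5) sum: returns the sum of its arguments.

Syntax: sum(x,1), sum(x,y), sum(sqrt(x),log(y),z,0.5)

6) product: product(x,y,...) returns the product of several functions. Syntax: product(x,2), product(x,y)

7) div: div(x,y) is x divided by y. Syntax: div(1,x), div(sum(x,100),max(y,1))

8) pow: exponentiation, pow(x,y) = x^y. Examples: pow(x,0.5) (square root), pow(x,log(y))

9) abs: abs(x) returns the absolute value of the expression. Syntax: abs(-5), abs(x)

10) log: log(x) returns the base-10 logarithm of x. Syntax: log(x), log(sum(x,100))

11) sqrt: sqrt(x) returns the square root of a number. Syntax: sqrt(2), sqrt(sum(x,100))

12) map: if x>=min and x<=max, then map(x,min,max,target)=target. If x is outside the interval [min,max], then map(x,min,max,target)=x.

Syntax: map(x,0,0,1)

13) scale: scale(x,minTarget,maxTarget) scales the values of x so that they fall within [minTarget,maxTarget].

14) query: query(subquery,default) returns the score of the given subquery, or the default value if the subquery does not match the document. Any query type is supported. The subquery can be passed by reference or given directly as a query string.

Example: q=product(popularity, query({!dismax v='solr rocks'})) returns the product of popularity and the score obtained from the dismax query.

q=product(popularity, query($qq))&qq={!dismax}solr rocks has the same effect as the previous example, but passes the subquery by reference.

q=product(popularity, query($qq,0.1))&qq={!dismax}solr rocks adds a default value to the previous example.

15) linear: linear(x,m,c) is m*x+c, where m and c are constants and x is a variable or a function. Example: linear(x,2,4)=2*x+4.

16) recip: recip(x,m,a,b)=a/(m*x+b), where m, a, and b are constants and x is a variable or a function. When a=b and x>=0, the maximum of this function is 1 and its value decreases as x grows. Example: recip(rord(creationDate),1,1000,1000)

17) max: max(x,c) returns the larger of a function and a constant.

Example: max(myfield,0)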
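The formulas for recip, linear, and map given in the list above translate directly into code. The sketch below only evaluates those three formulas on plain numbers to show how they behave; FunctionQueryMath is a hypothetical class and does not call Solr.

```java
/** Plain-number versions of three function-query formulas from the list above. */
final class FunctionQueryMath {

    /** recip(x,m,a,b) = a / (m*x + b) */
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    /** linear(x,m,c) = m*x + c */
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    /** map(x,min,max,target) = target if min <= x <= max, otherwise x */
    static double map(double x, double min, double max, double target) {
        return (x >= min && x <= max) ? target : x;
    }

    public static void main(String[] args) {
        // With a = b the value is 1 at x = 0 and decays as x grows.
        for (double x : new double[] {0, 1000, 9000}) {
            System.out.println("recip(" + x + ",1,1000,1000) = " + recip(x, 1, 1000, 1000));
        }
        System.out.println("linear(3,2,4) = " + linear(3, 2, 4));
        System.out.println("map(0,0,0,1) = " + map(0, 0, 0, 1));
    }
}
```

With m=1 and a=b=1000, recip halves at x=1000, which is why it is a common choice for boosting recent or highly ranked documents without letting the boost dominate.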

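6.3 Highlighting

Everyone uses search engines. Searching for java on Baidu, for instance, gives results like the following, in which the text matching the keyword is shown in red to set it apart from the rest.

Solr comes with the highlight component configured by default (see SOLR_HOME/conf/solrconfig.xml). Usually a request like this is all that is needed: http://localhost:8983/solr/collection1/select?q=%E4%B8%AD%E5%9B%BD&start=0&rows=1&fl=content+path+&wt=xml&indent=true&hl=true&hl.fl=content

Compared with an ordinary request it has two extra parameters, "hl=true" and "hl.fl=content".

"hl=true" turns highlighting on, and "hl.fl=content" tells Solr to highlight the content field (to highlight several fields, add more field names separated by commas, for example "hl.fl=name,name2,name3"). Within the highlighted content, the text matching the keyword is wrapped in "<em>" and "</em>" by default. The "hl.simple.pre" and "hl.simple.post" parameters set the opening and closing tags.

The query result is as follows: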

<?xml version="1.0" encoding="UTF-8"?>

<response>

<lst name="responseHeader">

<int name="status">0</int>

<int name="QTime">2</int>

<lst name="params">

<str name="fl">content path</str>

<str name="indent">true</str>

<str name="start">0</str>

<str name="q">中國</str>

<str name="hl.simple.pre">&lt;em&gt;</str>

<str name="hl.simple.post">&lt;/em&gt;</str>

<str name="hl.fl">content</str>

<str name="wt">xml</str>

<str name="hl">true</str>

<str name="rows">1</str>

</lst>

</lst>

<result name="response" numFound="6799" start="0">

<doc>

<str name="path">E:\Reduced\IT\630.txt</str>

<str name="content">　　本報訊 中國銀聯股份有限公司和中國電信集團日前在北京簽署全面戰略合做協議。這標誌着中國銀聯和中國電信將在通訊服務、信息增值服務、新型支付產品合做開發等領域創建全面合做夥伴關係。　　據悉，雙方簽署的全面戰略合做協議主要內容是：中國銀聯將選擇中國電信做爲通訊信息服務的主要提供商，雙方圍繞提升中國銀聯內部通訊的水平和銷售網絡的服務水平開展全面、深刻的合做；中國電信選擇中國銀聯做爲銀行卡轉接支付服務的主要提供商，並圍繞開發、推廣新型支付終端產品和增值服務開展全面合做。（辛華）</str>

</doc>

</result>

<lst name="highlighting">

<lst name="7D919C61-03B3-4B6F-2D10-9E3CC92D2852">

<arr name="content">

<str>　　本報訊<em>中國</em>銀聯股份有限公司和<em>中國</em>電信集團日前在北京簽署全面戰略合做協議。這標誌着<em>中國</em>銀聯和<em>中國</em>電信將在通訊服務、信息增值服務、新型支付產品合做開發等領域創建全面合做夥伴關係。　　據悉，雙方簽署</str>

</arr>

</lst>

</lst>

</response>

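With SolrJ the approach is basically the same: the same parameters are set, only wrapped in SolrJ methods. The code is as follows:

SolrQuery query = new SolrQuery();

query.set("q", "*:*");

query.setHighlight(true); // enable the highlight component

query.addHighlightField("content"); // field to highlight

query.setHighlightSimplePre(PRE_TAG); // tag placed before each match

query.setHighlightSimplePost(POST_TAG); // tag placed after each match

QueryResponse rsp = server.query(query);

// ... read the results as in the earlier code

// read the highlighting results

if (rsp.getHighlighting() != null) {

if (rsp.getHighlighting().get(id) != null) { // look up the document's highlighting by the id taken from the results

Map<String, List<String>> map = rsp.getHighlighting().get(id); // highlighted snippets per field

if (map.get(name) != null) { // name is the highlighted field, e.g. "content"

for (String s : map.get(name)) {

System.out.println(s);

}

}

}

}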

6.4 Spell checking

First configure solrconfig.xml. The file may already contain these two elements (add them if not); adjust them as needed for your own environment.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

     <str name="queryAnalyzerFieldType">text_spell</str>

     <lst name="spellchecker">

       <str name="name">direct</str>

       <str name="field">spell</str>

       <str name="classname">solr.DirectSolrSpellChecker</str>

       <str name="distanceMeasure">internal</str>

       <float name="accuracy">0.5</float>

       <int name="maxEdits">2</int>

       <int name="minPrefix">1</int>

       <int name="maxInspections">5</int>

       <int name="minQueryLength">2</int>

       <float name="maxQueryFrequency">0.01</float>

</lst>

  </searchComponent>

  <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">

     <lst name="defaults">

         <str name="spellcheck.dictionary">direct</str>

          <str name="spellcheck">on</str>

          <str name="spellcheck.collate">true</str>

         <str name="spellcheck.collateExtendedResults">true</str>

     </lst>

     <arr name="last-components">

       <str>spellcheck</str>

     </arr>

   </requestHandler>

Once configured, test it. Restart Solr and open the following link:

http://localhost:8983/solr/collection1/spell?wt=xml&indent=true&spellcheck=true&spellcheck.q=%E4%B8%AD%E5%9B%BD

<?xml version="1.0" encoding="UTF-8"?>

  <response>

    <lst name="responseHeader">

      <int name="status">0</int>

      <int name="QTime">0</int>

    </lst>

    <result name="response" numFound="0" start="0"/>

    <lst name="spellcheck">

      <lst name="suggestions">

        <lst name="beijink">

          <int name="numFound">1</int>

          <int name="startOffset">0</int>

          <int name="endOffset">3</int>

          <arr name="suggestion">

            <str>beijing</str>

          </arr>

        </lst>

      </lst>

    </lst>

  </response>

With SolrJ, add the same parameters:

SolrQuery query = new SolrQuery();

query.set("q", "beijink"); // the text to spell-check

query.set("qt", "/spell");

QueryResponse rsp = server.query(query);

// ... read the results as in the earlier code

SpellCheckResponse spellCheckResponse = rsp.getSpellCheckResponse();

if (spellCheckResponse != null) {

String collation = spellCheckResponse.getCollatedResult();

}

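6.5 Search suggestions

Search suggestions are now standard in all major search engines. Their main purpose is to keep users from entering wrong search terms while steering them toward suitable keywords. Solr has this feature built in as the Suggest module. The module can base its suggestions on a text file of suggestion terms, or on a dictionary built from a field of the index. Much of the documentation recommends index-based suggestions, so that is the approach used here.

To configure the Suggest module, first configure the SpellChecker module it depends on in solrconfig.xml, and then the Suggest module itself. Both must be configured.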

<str name="queryAnalyzerFieldType">string</str>

<str name="name">suggest</str>

<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>

<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>

<str name="spellcheckIndexDir">spellchecker</str>

</lst>

</searchComponent>

<str name="spellcheck.dictionary">suggest</str>

<str name="spellcheck.extendedResults">false</str>

</lst>

<str>suggest</str>

</arr>

</requestHandler>

Once configured, test it. Restart Solr and open the following link:

http://localhost:8983/solr/collection1/suggest?wt=xml&indent=true&spellcheck=true&spellcheck.q=%E4%B8%AD%E5%9B%BD

<?xml version="1.0" encoding="UTF-8"?>

<response>

<lst name="responseHeader">

<int name="status">0</int>

<int name="QTime">4</int>

</lst>

<lst name="spellcheck">

<lst name="suggestions">

<lst name="中國">

<int name="numFound">4</int>

<int name="startOffset">0</int>

<int name="endOffset">2</int>

<arr name="suggestion">

<str>中國隊</str>

<str>中國證監會</str>

<str>中國足協</str>

<str>中國銀行</str>

</arr>

</lst>

</lst>

</lst>

</response>

With SolrJ, add the same parameters:

SolrQuery query = new SolrQuery();

query.set("q", token);

query.set("qt", "/suggest");

query.set("spellcheck.count", "10");

QueryResponse response = server.query(query);

SpellCheckResponse spellCheckResponse = response.getSpellCheckResponse();

List<String> results = new ArrayList<String>(); // collected suggestion words

if (spellCheckResponse != null) {

List<SpellCheckResponse.Suggestion> suggestionList = spellCheckResponse.getSuggestions();

for (SpellCheckResponse.Suggestion suggestion : suggestionList) {

results.addAll(suggestion.getAlternatives()); // the suggested words for this token

}

}

return results;

The threshold parameter keeps rarely used terms out of the suggestion list. Be careful: setting it too high can leave too few results. The main remaining problem is the freq sorting algorithm, which ranks results purely by how often the terms occur in the index and ignores how often users actually search for them, so popular search terms cannot be ranked higher. This could be addressed with a custom SuggestWordScoreComparator, which has not been attempted yet.

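6.6 Grouped statistics

Grouped statistics are implemented here with Solr's Facet component, which is integrated into Solr by default.

6.6.1 About Facet

Facet is one of Solr's advanced search features and gives users a friendlier search experience. While searching for keywords, results can be grouped and counted by the Facet fields.

6.6.2 Facet fields

1. Fields suited to faceting

Typically fields that represent some shared attribute of the entities, such as a product's category, a product's manufacturer, or a book's publisher.

2. Requirements for Facet fields

A Facet field must be indexed. In general it does not need to be tokenized or stored.

It needs no tokenization because its value stands for a single concept. The computer brand "聯想" (Lenovo), for example, is one concept, and splitting it into the characters "聯" and "想" would be meaningless. The value also needs no processing such as case conversion; it should be kept as it is.

It needs no storage because users usually do not care about the field's concrete value. The field serves as a means of grouping the query results, and users typically drill down along these groups.

3. Special cases

For ordinary queries, tokenization and storage are both necessary. Take the CPU type "Intel 酷睿2雙核 P7570": splitting it into keywords such as "Intel", "酷睿", and "P7570" and indexing each may give a better search experience. But if CPU is used as a Facet field, it is best left untokenized. To resolve this conflict, define the CPU field as untokenized and unstored, create another field as a copy of it, and tokenize and store the copy.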

<types>

……

</analyzer>

</fieldType>

</types>

</fields>

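6.6.2 The Facet component

Solr's default requestHandler already includes the Facet component (solr.FacetComponent). If you define your own requestHandler, or customize the component list of the default requestHandler, Facet must be added to the component list.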

……

<str>custom component name</str>

<str>facet</str>

……

</arr>

</requestHandler>

6.6.2 Facet queries

A Facet query requires facet=on or facet=true in the request parameters; only then does the Facet component take effect.

1. Field Facet

Facet fields are declared with the facet.field parameter in the request. To facet on several fields, repeat the parameter. For example:

http://localhost:8983/solr/collection1/select?q=*%3A*&start=0&rows=1&wt=xml&indent=true&facet=true&facet.field=category_s&facet.field=modified_l

Result:

<?xml version="1.0" encoding="UTF-8"?>

<response>

<lst name="responseHeader">

<int name="status">0</int>

<int name="QTime">1</int>

<lst name="params">

<str name="facet">true</str>

<str name="indent">true</str>

<str name="start">0</str>

<str name="q">*:*</str>

<arr name="facet.field">

   <str>category_s</str>

   <str>modified_l</str>

</arr>

<str name="wt">xml</str>

<str name="rows">0</str>

</lst>

</lst>

<result name="response" numFound="17971" start="0">

</result>

<lst name="facet_counts">

<lst name="facet_queries"/>

<lst name="facet_fields">

<lst name="category_s">

<int name="0">5991</int>

<int name="1">5990</int>

<int name="2">5990</int>

</lst>

<lst name="modified_l">

<int name="1162438554000">951</int>

<int name="1162438556000">917</int>

<int name="1162438548000">902</int>

<int name="1162438546000">674</int>

</lst>

</lst>

<lst name="facet_dates"/>

<lst name="facet_ranges"/>

</lst>

</response>

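Facet fields do not affect one another, and query parameters can be set per Facet field. The parameters described below can be applied to all Facet fields or to a single one. To apply a parameter to a single field, use the form

f.<field name>.<parameter name>=<parameter value>

For example, to apply the facet.prefix parameter to the cpu field:

f.cpu.facet.prefix=Intel

1.1 facet.prefix

The prefix that Facet field values must have. With facet.field=cpu&facet.prefix=Intel, for example, the Facet query on the cpu field returns only values starting with Intel, and CPU models starting with AMD are not counted.

1.2 facet.sort

The order in which Facet field values are returned. Accepted values are true (count) and false (index, lex). true (count) sorts by count from highest to lowest. false (index, lex) sorts by the natural order of the field values (alphabetical or numerical). The default is true (count). When facet.limit is negative, the default becomes facet.sort=false (index, lex).

1.3 facet.limit

Limits the number of values returned per Facet field. The default is 100. A negative value means no limit.

1.4 facet.offset

The offset into the returned list, 0 by default. Combined with facet.limit it provides paging.

1.5 facet.mincount

The minimum count a Facet field value must have to be returned, 0 by default. Setting it sensibly focuses the user's attention on a few popular values.

1.6 facet.missing

Defaults to false. If set to true or on, documents that have no value for the Facet field are counted as well.

1.7 facet.method

Either enum or fc, with fc as the default. The parameter selects between two Facet algorithms and affects execution efficiency.

enum suits fields with few distinct values, such as a boolean field or a field holding all the provinces of China. Solr iterates over all values of the field and allocates a filter from the filterCache for each value (which requires the filterCache configured in solrconfig.xml to be large enough), then computes the intersection of each filter with the main query.

fc (for Field Cache) suits fields with many distinct values that each occur only a few times per document. Solr iterates over all documents and looks up each document's values in the cache, incrementing the count of a value whenever it is found.

1.8 facet.enum.cache.minDf

This parameter takes effect when facet.method=enum. minDf stands for minimum document frequency, that is, the minimum number of documents a term must occur in before the filterCache is used for it. The default is 0. Setting it reduces the filterCache's memory consumption but increases the total query time (more time is spent computing intersections). If you set it, the official documentation suggests trying values between 25 and 50 first.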

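6.6.3 Date Facet

Date fields are common in documents, for example a product's release date, a shipment's dispatch time, or the date a book went on sale. Sometimes these fields need to be faceted. Time values are effectively unlimited, however, and users are usually interested in statistics for a period rather than for a single point in time. Solr therefore provides a more convenient way to query statistics on date fields. The field type must of course be DateField (or a subtype).

Note that a Date Facet requires all four of these parameters: field name, start time, end time, and gap. Like a Field Facet, a Date Facet can be applied to several fields, and parameters can be set for each field separately.

facet.date: the name of the field to apply the Date Facet to. Like facet.field, it can be given several times to facet on several date fields.

facet.date.start: the start time. The general time format is 1995-12-31T23:59:59Z; NOW, YEAR, MONTH, and so on can also be used. See the Javadoc of DateField for the exact format.

facet.date.end: the end time.

facet.date.gap: the interval. If start is 2009-1-1 and end is 2010-1-1, a gap of +1MONTH means one-month intervals, which divides the period into 12 intervals.

Note that + is a special character in URLs and should be written as %2B.

facet.date.hardend: true or false, false by default. It determines how the last interval is handled when the gap iteration reaches end. For example, with start 2009-1-1, end 2009-12-25, and gap +1MONTH:

with hardend false, the last interval runs from 2009-12-1 to 2010-1-1;

with hardend true, the last interval runs from 2009-12-1 to 2009-12-25.

facet.date.other: one of before, after, between, none, or all; the default is none. before counts the values before start. after counts the values after end. between counts all values between start and end; if hardend is true, this equals the sum of the counts of the individual intervals. none disables the option. all counts before, after, and between.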
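How start, end, gap, and hardend determine the intervals can be sketched with java.time. This is an illustration of the bucketing rule described above for a +1MONTH gap, assuming Java 8 or later; DateFacetBuckets is a hypothetical class, and it works on calendar dates rather than on Solr's date type.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Sketch: splits [start, end) into one-month intervals, honoring hardend. */
final class DateFacetBuckets {

    /** Returns {lowerBound, upperBound} pairs for a +1MONTH gap. */
    static List<LocalDate[]> monthly(LocalDate start, LocalDate end, boolean hardend) {
        List<LocalDate[]> buckets = new ArrayList<>();
        LocalDate lower = start;
        while (lower.isBefore(end)) {
            LocalDate upper = lower.plusMonths(1);
            if (hardend && upper.isAfter(end)) {
                upper = end; // hardend=true cuts the last interval off at end
            }
            buckets.add(new LocalDate[] {lower, upper});
            lower = upper;
        }
        return buckets;
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2009, 1, 1);
        LocalDate end = LocalDate.of(2009, 12, 25);
        for (boolean hardend : new boolean[] {false, true}) {
            List<LocalDate[]> buckets = monthly(start, end, hardend);
            LocalDate[] last = buckets.get(buckets.size() - 1);
            System.out.println("hardend=" + hardend + ": " + buckets.size()
                    + " intervals, last = " + last[0] + " to " + last[1]);
        }
    }
}
```

With hardend=false the last interval extends past end to keep every interval the same width; with hardend=true the last interval is shortened so nothing after end is counted.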

Example:

&facet=on

&facet.date=date

&facet.date.start=2009-01-01T00:00:00Z

&facet.date.end=2010-01-01T00:00:00Z

&facet.date.gap=%2B1MONTH

&facet.date.other=all

Result:

<str name="gap">+1MONTH</str>

</lst>

6.6.4 Facet Query

A Facet Query uses syntax similar to a filter query to provide more flexible faceting. With the facet.query parameter, any field can be used for counting.

Example 1:

&facet=on

&facet.query=date:[2009-01-01T00:00:00Z TO 2009-02-01T00:00:00Z]

&facet.query=date:[2009-04-01T00:00:00Z TO 2009-05-01T00:00:00Z]

Result:

</lst>

</lst>

Example 2:

&facet=on

&facet.query=date:[2009-01-01T00:00:00Z TO 2009-02-01T00:00:00Z]

&facet.query=price:[* TO 5000]

Result:

</lst>

</lst>

Example 3:

&facet=on

&facet.query=cpu:[A TO G]

Result:

</lst>

</lst>

6.6.5 The key operator

The key operator gives a Facet field an alias.

Example:

&facet=on

&facet.field={!key=中央處理器}cpu

&facet.field={!key=顯卡}videoCard

Result:

</lst>

</lst>

</lst>

6.6.6 The tag and ex operators

When a query uses a filter query and the filtered field is also a Facet field, the results tend to be confined to a single value.

Example:

&fq=screenSize:14

&facet=on

&facet.field=screenSize

Result:

</lst>

</lst>

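As you can see, there are 107 products with a 14-inch screen (screenSize), while the counts for all other sizes are 0, because the filter already restricts the results to screenSize:14. In this result every entry other than screenSize=14 is meaningless. Sometimes users want to restrict the results to a certain range and still see an overview of what lies outside it. In the case above, they want the results limited to 14-inch laptops but also want to know how many laptops there are with other screen sizes. This is what the tag and ex operators are for. tag labels a filter, and ex (exclude) leaves the labelled filter out when computing the Facet.

Example:

&fq={!tag=aa}screenSize:14

&facet=on

&facet.field={!ex=aa}screenSize

Result: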

</lst>

</lst>

This makes the counts for the other screen sizes meaningful.
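The effect of excluding a tagged filter can be imitated over an in-memory list. The sketch below counts screen sizes once with the filter applied and once with it left out, which is what {!ex=aa} does for the facet while the document list itself stays filtered. TagExcludeFacet and the sample data are hypothetical; unlike Solr with facet.mincount=0, this sketch simply omits values whose count is zero.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: facet counts on screenSize with and without the fq filter applied. */
final class TagExcludeFacet {

    /**
     * Counts documents per screen size. If filterSize is non-null, only documents
     * matching the filter are counted (plain fq); if null, the filter is excluded (ex).
     */
    static Map<Integer, Integer> facet(List<Integer> screenSizes, Integer filterSize) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Integer size : screenSizes) {
            if (filterSize != null && !filterSize.equals(size)) {
                continue; // document removed by the filter query
            }
            counts.merge(size, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(14, 14, 15, 13, 14); // one screenSize per document
        System.out.println("fq applied:  " + facet(docs, 14));
        System.out.println("fq excluded: " + facet(docs, null));
    }
}
```

The count for the filtered value is the same in both cases; only the other values gain meaningful counts once the filter is excluded from the facet.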

6.6.7 Facet support in SolrJ

// initialize the query object

String q = "*:*";

SolrQuery query = new SolrQuery(q);

query.setIncludeScore(false); // do not return the score with each document

query.setFacet(true); // enable faceting

query.setRows(0); // number of documents to return; set it to 0 if only the facet counts are needed

query.addFacetField("modified_l"); // add a facet field

query.addFacetQuery("category_s:[0 TO 1]"); // add a facet query

QueryResponse rsp = server.query(query);

…

// read the results

List<FacetField.Count> fieldCounts = rsp.getFacetField("modified_l").getValues();

Map<String, Integer> queryCounts = rsp.getFacetQuery();

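6.7 Automatic clustering

Solr implements clustering with Carrot2, which can automatically group the retrieved content into categories. Carrot2 clustering example:

To enable clustering in Solr, first copy dist/solr-clustering-4.2.0.jar from the Solr distribution to \solr\contrib\analysis-extras\lib. Then open solrconfig.xml and add the following configuration: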

<searchComponent name="clustering"

enable="${solr.clustering.enabled:true}"

class="solr.clustering.ClusteringComponent" >

<str name="name">default</str>

<str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>

<str name="LingoClusteringAlgorithm.labelAssigner">org.carrot2.clustering.lingo.SimpleLabelAssigner</str>

<!--

org.carrot2.matrix.factorization.PartialSingularValueDecompositionFactory

org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory

org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationKLFactory

org.carrot2.matrix.factorization.LocalNonnegativeMatrixFactorizationFactory

org.carrot2.matrix.factorization.KMeansMatrixFactorizationFactory

-->

<str name="TermDocumentMatrixReducer.factorizationFactory">org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory</str>

<str name="TermDocumentMatrixBuilder.termWeighting">org.carrot2.text.vsm.TfTermWeighting</str>

<str name="MultilingualClustering.defaultLanguage">CHINESE_SIMPLIFIED</str>

<str name="MultilingualClustering.languageAggregationStrategy">org.carrot2.text.clustering.MultilingualClustering.LanguageAggregationStrategy.FLATTEN_MAJOR_LANGUAGE </str>

<str name="DocumentAssigner.exactPhraseAssignment">false</str>

<str name="carrot.lexicalResourcesDir">clustering/carrot2</str>

</lst>

</searchComponent>

With the clustering component configured, configure the requestHandler next:

<requestHandler name="/clustering"

startup="lazy"

enable="${solr.clustering.enabled:true}"

class="solr.SearchHandler">

<str name="echoParams">explicit</str>

<str name="clustering.engine">default</str>

<str name="carrot.title">category_s</str>

<str name="carrot.snippet">content</str>

</lst>

<str>clustering</str>

</arr>

</requestHandler>

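Two parameters need attention: carrot.title and carrot.snippet name the fields that clustering is computed on, and both fields must be stored="true". carrot.title carries more weight than carrot.snippet. If only one field is used for the computation, carrot.snippet can be removed (removed entirely, not left with an empty value). Once this is set up, query with the following URL: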

http://localhost:8983/skyCore/clustering?q=*%3A*&wt=xml&indent=true

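6.8 Similar matching

When using a web search engine, you may notice that each result carries a "similar pages" link; clicking it issues another search request that finds documents similar to the original result. Solr provides the same capability through MoreLikeThisComponent (MLT) and MoreLikeThisHandler. MLT is integrated with the standard SolrRequestHandler; MoreLikeThisHandler builds on MLT and adds a few other options, but it requires issuing a single request. This section focuses on MLT, since it is more likely to be used. Fortunately, it can be queried without any setup, so you can start querying right away.

MLT requires fields to be stored or to use term vectors, which store information in a document-centric way. MLT computes the key terms of a document from its content, then builds a new query from the original query terms and these new terms. Submitting the new query returns additional results. All of this can be done with term vectors: just add termVectors="true" to the <field> declaration in schema.xml.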

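MoreLikeThisComponent parameters:

Parameter	Description	Value range
mlt	Boolean that turns MoreLikeThisComponent on or off at query time.	true|false
mlt.count	Optional. Number of similar documents to retrieve for each result.	> 0
mlt.fl	Fields used to build the MLT query.	Any field that is stored or has term vectors.
mlt.maxqt	Optional. Maximum number of query terms. Long documents may have many key terms, so the MLT query can grow large and cause slow responses or the dreaded TooManyClausesException; this parameter keeps only the most important terms.	> 0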

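To use similar matching, first register MoreLikeThisHandler (class solr.MoreLikeThisHandler) in solrconfig.xml; the request below assumes it is mapped to /mlt.

The following request can then be issued:

http://localhost:8983/skyCore/mlt?q=id%3A6F398CCD-2DE0-D3B1-9DD6-D4E532FFC531&mlt=true&mlt.fl=content&wt=xml&indent=true

This request looks up the document whose id is 6F398CCD-2DE0-D3B1-9DD6-D4E532FFC531 and returns other documents that are similar to it on the content field. Note that the field named in mlt.fl needs termVectors="true" for this to take effect.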
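As a rough illustration of how such a request is put together, the sketch below assembles an MLT URL from a parameter map using only the JDK. The class name MltUrlBuilder, the helper names, and the mlt.maxqt value of 20 are assumptions made for this example; the host, core name, and document id are the ones used in the request above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class MltUrlBuilder {

    /** URL-encodes one name or value as UTF-8. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Appends the parameters to the handler URL in insertion order. */
    static String buildUrl(String handlerUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(handlerUrl);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append(separator)
              .append(encode(p.getKey()))
              .append('=')
              .append(encode(p.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "id:6F398CCD-2DE0-D3B1-9DD6-D4E532FFC531"); // source document
        params.put("mlt.fl", "content");                            // field to compare on
        params.put("mlt.maxqt", "20");                              // hypothetical cap on query terms
        params.put("wt", "xml");
        System.out.println(buildUrl("http://localhost:8983/skyCore/mlt", params));
    }
}
```

Encoding each value separately is what turns the colon in the id query into %3A, so the query can be written in its natural form in code.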

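The same parameters can be passed when using SolrJ:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

// Assumption: server points at the same core as the URL above.
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/skyCore");
SolrQuery query = new SolrQuery();
query.set("qt", "/mlt");                 // send the request to MoreLikeThisHandler
query.set("mlt.fl", "content");          // field used to find similar documents
query.set("fl", "id");                   // fields to return
query.set("q", "id:6F398CCD-2DE0-D3B1-9DD6-D4E532FFC531"); // source document
query.setStart(0);
query.setRows(5);                        // return at most five similar documents
QueryResponse rsp = server.query(query); // throws SolrServerException on failure
SolrDocumentList list = rsp.getResults();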

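6.9 Pinyin search

Pinyin search is specific to Chinese users. For example, for the Chinese content 中國, typing zhongguo, zg, or zhonggu (the full pinyin, the initials, or a contiguous part of the pinyin) should all retrieve 中國.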

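Pinyin search needs two things. The first is pinyin conversion, for which I use pinyin4j. The second is N-Gram tokenization: since users may type something that is neither a prefix nor a suffix, N-Gram is the technique chosen here. It differs from the usual N-Gram, however. I use a one-directional N-Gram that starts from one edge, which Solr implements as EdgeNGramTokenFilter. A general N-Gram splits the text too finely, and that level of complexity is not needed.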

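Taking the same example, the 2-gram that EdgeNGramTokenFilter produces from the front is zh. Normally all grams between min and max are taken, so EdgeNGramTokenFilter with a range of 2-20 yields zh, zho, zhon, zhong, zhongg, zhonggu, zhongguo. This example also shows why I chose EdgeNGramTokenFilter over a general N-Gram. Because users may type a suffix rather than a prefix, I apply EdgeNGramTokenFilter twice, once from the front and once from the back, so that not only prefixes and suffixes but arbitrary substrings are covered, which improves the search experience considerably.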
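The gram generation described above can be sketched in plain Java. This is not Solr's EdgeNGramTokenFilter, only a stand-alone illustration of the idea; the class and method names are made up for this example, and the chained front-then-back pass is an assumption about how the two filter passes combine to cover arbitrary substrings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class EdgeNGramSketch {

    /** Grams of length min..max anchored at the start of the term (prefixes). */
    static List<String> front(String term, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    /** Grams of length min..max anchored at the end of the term (suffixes). */
    static List<String> back(String term, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, term.length()); len++) {
            grams.add(term.substring(term.length() - len));
        }
        return grams;
    }

    /** Chains the two passes: every suffix of every prefix, i.e. every substring of length >= min. */
    static Set<String> frontThenBack(String term, int min, int max) {
        Set<String> grams = new LinkedHashSet<>();
        for (String prefix : front(term, min, max)) {
            grams.addAll(back(prefix, min, max));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prefix grams: zh, zho, zhon, zhong, zhongg, zhonggu, zhongguo
        System.out.println(front("zhongguo", 2, 20));
        // Suffix grams: uo, guo, gguo, ngguo, ongguo, hongguo, zhongguo
        System.out.println(back("zhongguo", 2, 20));
        // A fragment from the middle of the term is covered by the chained pass.
        System.out.println(frontThenBack("zhongguo", 2, 20).contains("ongg"));
    }
}
```

The trade-off is index size: the chained pass emits every substring of at least min characters, so a larger min keeps the number of indexed grams down.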

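With the approach settled, the next step is to integrate it into Solr. For convenience I wrote two filters to handle pinyin tokenization: a pinyin conversion filter (PinyinTransformTokenFilter) and a pinyin N-Gram filter (PinyinNGramTokenFilter). With these, no pinyin conversion is needed before documents are added to the index. PinyinTransformTokenFilter has a further advantage: it only processes the tokens produced by the Chinese tokenizer, so the words it converts are all useful and non-repeated. Stop words are not converted and no conversion is repeated, which speeds up pinyin conversion considerably.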

To make Solr support pinyin search, first copy the pinyin tokenizer (PinyinAnalyzer) jar into \solr\contrib\analysis-extras\lib, then configure a pinyin field type in schema.xml:

</analyzer>

</analyzer>

</fieldType>

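minTermLenght: minimum length of a Chinese word; words shorter than this value are not converted to pinyin.

minGram: minimum pinyin gram length.
To use initials (abbreviated pinyin), set isFirstChar="true" on the pinyin conversion filter.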
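To make the effect of these settings concrete, here is a minimal sketch of full-pinyin versus initials conversion. It does not use pinyin4j or the author's filters: the three-entry dictionary, the class name, and the exact handling of short words (returning an empty string) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class PinyinSketch {

    // Hypothetical toy dictionary standing in for pinyin4j.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('\u4e2d', "zhong"); // U+4E2D
        PINYIN.put('\u570b', "guo");   // U+570B
        PINYIN.put('\u4eba', "ren");   // U+4EBA
    }

    /**
     * Converts a tokenized word to pinyin. Words shorter than minTermLength are
     * skipped (empty result); firstCharOnly keeps only the initial of each syllable.
     */
    static String toPinyin(String word, int minTermLength, boolean firstCharOnly) {
        if (word.length() < minTermLength) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            String syllable = PINYIN.get(c);
            if (syllable == null) {
                continue; // character not in the toy dictionary
            }
            sb.append(firstCharOnly ? syllable.substring(0, 1) : syllable);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u570b"; // the two-character word used in the example above
        System.out.println(toPinyin(word, 2, false)); // full pinyin
        System.out.println(toPinyin(word, 2, true));  // initials only
        System.out.println(toPinyin("\u4eba", 2, false).isEmpty()); // single character is skipped
    }
}
```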

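This pinyin field type uses the smartcn Chinese tokenizer; swap in another one if you prefer. Now add a pinyin field to the existing index. Since it is only used for indexing, it can be configured like this: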

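After adding it, restart Solr and run a test.

Because of the minTermLenght and minGram values set above, 人 is not converted to pinyin, and the shortest pinyin gram starts at length 1.

The configuration is not finished yet: a few copyField entries still need to be added so that the new pinyin field does not have to be populated separately.

Pinyin search can now be used.

The pinyin jar I wrote: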

http://pan.baidu.com/share/link?shareid=2579170560&uk=4077294790

6.10 SolrCloud

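SolrCloud is a distributed search solution built on Solr and Zookeeper, and one of the core components introduced with Solr 4.0. Its main idea is to use Zookeeper as the configuration center of the cluster. Its distinguishing features are centralized configuration, automatic fault tolerance, near-real-time search, and automatic load balancing at query time.

The architecture diagram for this section depicts a cluster of four Solr nodes whose index is spread over two shards. Each shard contains two Solr nodes, one leader and one replica. The cluster also has an Overseer node, a master controller responsible for maintaining cluster state. All cluster state is kept in the Zookeeper ensemble. Any node can accept an index update request; it forwards the request to the leader of the shard the document belongs to, and once the leader has finished the update it forwards the version number and the document to the replicas of the same shard. SolrCloud is not covered further here; it will get a separate document once it has been studied more thoroughly.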
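The update path described above can be modeled in a few lines of Java. This is a toy model, not SolrCloud code: the node names, the class, and the use of String.hashCode() to pick a shard are assumptions for illustration, and Solr's own router and hash function differ.

```java
import java.util.ArrayList;
import java.util.List;

class ShardRoutingSketch {

    /** Picks the shard for a document id. String.hashCode() is only a stand-in hash. */
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /**
     * Lists the nodes an update passes through. shards[i][0] is the leader of
     * shard i; the remaining entries are its replicas.
     */
    static List<String> route(String receivingNode, String docId, String[][] shards) {
        String[] shard = shards[shardFor(docId, shards.length)];
        List<String> hops = new ArrayList<>();
        hops.add(receivingNode);
        if (!receivingNode.equals(shard[0])) {
            hops.add(shard[0]); // forward to the leader of the owning shard
        }
        for (int i = 1; i < shard.length; i++) {
            hops.add(shard[i]); // the leader forwards the update to its replicas
        }
        return hops;
    }

    public static void main(String[] args) {
        String[][] shards = {
            {"node1", "node2"}, // shard 0: leader, replica
            {"node3", "node4"}  // shard 1: leader, replica
        };
        // Any node may receive the update; the route depends on the document id.
        System.out.println(route("node4", "doc-42", shards));
    }
}
```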

Appendix 1: schema.xml

<?xml version="1.0" encoding="UTF-8" ?>

</fields>

<types>

</analyzer>

</fieldType>

</analyzer>

</analyzer>

</fieldType>

</fieldType>

</fieldType>

</analyzer>

</analyzer>

</fieldType>

</analyzer>

</analyzer>

</fieldType>

</analyzer>

</analyzer>

</fieldType>

<!-- in this example, we will only use synonyms at query time

-->

<!-- Case insensitive stop word removal.

add enablePositionIncrements=true in both the index and query

analyzers to leave a 'gap' for more accurate phrase queries.

-->

<filter class="solr.StopFilterFactory"

ignoreCase="true"

words="lang/stopwords_en.txt"

enablePositionIncrements="true"/>

<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:

-->

</analyzer>

<filter class="solr.StopFilterFactory"

ignoreCase="true"

words="lang/stopwords_en.txt"

enablePositionIncrements="true"/>

<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:

-->

</analyzer>

</fieldType>

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100"

autoGeneratePhraseQueries="true">

<!-- in this example, we will only use synonyms at query time

-->

<!-- Case insensitive stop word removal.

add enablePositionIncrements=true in both the index and query

analyzers to leave a 'gap' for more accurate phrase queries.

-->

<filter class="solr.StopFilterFactory"

ignoreCase="true"

words="lang/stopwords_en.txt"

enablePositionIncrements="true"/>

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"

catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

</analyzer>

<filter class="solr.StopFilterFactory"

ignoreCase="true"

words="lang/stopwords_en.txt"

enablePositionIncrements="true"/>

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0"

catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>

</analyzer>

</fieldType>

<fieldType name="text_en_splitting_tight" class="solr.TextField" positionIncrementGap="100"

autoGeneratePhraseQueries="true">

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1"

catenateNumbers="1" catenateAll="0"/>

<!-- this filter can remove any duplicate tokens that appear at the same position - sometimes

possible with WordDelimiterFilter in conjunction with stemming. -->

</analyzer>

</fieldType>

<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"

maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>

</analyzer>

</analyzer>

</fieldType>

<!-- KeywordTokenizer does no actual tokenizing, so the entire

input string is preserved as a single token

-->

<!-- The LowerCase TokenFilter does what you expect, which can be

when you want your sorting to be case insensitive

-->

<!-- The PatternReplaceFilter gives you the flexibility to use

Java Regular expression to replace any sequence of characters

matching a pattern with an arbitrary replacement string,

which may include back references to portions of the original

string matched by the pattern.

See the Java Regular Expression documentation for more

information on pattern and replacement string syntax.

http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html

-->

<filter class="solr.PatternReplaceFilterFactory"

pattern="([^a-z])" replacement="" replace="all"/>

</analyzer>

</fieldType>

</analyzer>

</fieldtype>

<!--

The DelimitedPayloadTokenFilter can put payloads on tokens... for example,

a token of "foo|1.4" would be indexed as "foo" with a payload of 1.4f

Attributes of the DelimitedPayloadTokenFilterFactory :

"delimiter" - a one character delimiter. Default is | (pipe)

"encoder" - how to encode the following value into a payload

float -> org.apache.lucene.analysis.payloads.FloatEncoder,

integer -> o.a.l.a.p.IntegerEncoder

identity -> o.a.l.a.p.IdentityEncoder

Fully Qualified class name implementing PayloadEncoder, Encoder must have a no arg constructor.

-->

</analyzer>

</fieldtype>

</analyzer>

</fieldType>

</analyzer>

</analyzer>

</fieldType>

</analyzer>

</analyzer>

</fieldType>

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"

geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>

<fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD"

currencyConfig="currency.xml"/>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_bg.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ca.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_cz.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_da.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="false" words="lang/stopwords_el.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_eu.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fi.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/hyphenations_ga.txt"

enablePositionIncrements="false"/>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ga.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_gl.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_hi.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_hu.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_hy.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_id.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_it.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<!-- Japanese using morphological analysis (see text_cjk for a configuration using bigramming)

NOTE: If you want to optimize search for precision, use default operator AND in your query

parser config with <solrQueryParser defaultOperator="AND"/> further down in this file. Use

OR if you would like to optimize for recall (default).

-->

<!-- Kuromoji Japanese morphological analyzer/tokenizer (JapaneseTokenizer)

Kuromoji has a search mode (default) that does segmentation useful for search. A heuristic

is used to segment compounds into its parts and the compound itself is kept as synonym.

Valid values for attribute mode are:

normal: regular segmentation

search: segmentation useful for search with synonyms compounds (default)

extended: same as search mode, but unigrams unknown words (experimental)

For some applications it might be good to use search mode for indexing and normal mode for

queries to reduce recall and prevent parts of compounds from being matched and highlighted.

Use <analyzer type="index"> and <analyzer type="query"> for this and mode normal in query.

Kuromoji also has a convenient user dictionary feature that allows overriding the statistical

model with your own entries for segmentation, part-of-speech tags and readings without a need

to specify weights. Notice that user dictionaries have not been subject to extensive testing.

User dictionary attributes are:

userDictionary: user dictionary filename

userDictionaryEncoding: user dictionary encoding (default is UTF-8)

See lang/userdict_ja.txt for a sample user dictionary file.

Punctuation characters are discarded by default. Use discardPunctuation="false" to keep them.

See http://wiki.apache.org/solr/JapaneseLanguageSupport for more on Japanese language support.

-->

<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"

enablePositionIncrements="true"/>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_lv.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_no.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ro.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_sv.txt" format="snowball"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

<filter class="solr.StopFilterFactory" ignoreCase="false" words="lang/stopwords_tr.txt"

enablePositionIncrements="true"/>

</analyzer>

</fieldType>

</types>

</schema>

Appendix 2: solrconfig.xml

<?xml version="1.0" encoding="UTF-8" ?>

<luceneMatchVersion>LUCENE_42</luceneMatchVersion>

<directoryFactory name="DirectoryFactory"

class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

<!-- maxFieldLength was removed in 4.0. To get similar behavior, include a

LimitTokenCountFilterFactory in your fieldType definition. E.g.

-->

<!-- The maximum number of simultaneous threads that may be

indexing documents at once in IndexWriter; if more than this

many threads arrive they will wait for others to finish.

Default in Solr/Lucene is 8. -->

<!-- Expert: Enabling compound file will use less files for the index,

using fewer file descriptors on the expense of performance decrease.

Default in Lucene is "true". Default in Solr is "false" (since 3.6) -->

<!-- ramBufferSizeMB sets the amount of RAM that may be used by Lucene

indexing for buffering added documents and deletions before they are

flushed to the Directory.

maxBufferedDocs sets a limit on the number of documents buffered

before flushing.

If both ramBufferSizeMB and maxBufferedDocs is set, then

Lucene will flush based on whichever limit is hit first. -->

<!-- Expert: Merge Policy

The Merge Policy in Lucene controls how merging of segments is done.

The default since Solr/Lucene 3.3 is TieredMergePolicy.

The default since Lucene 2.3 was the LogByteSizeMergePolicy,

Even older versions of Lucene used LogDocMergePolicy.

</mergePolicy>

-->

<!-- Merge Factor

The merge factor controls how many segments will get merged at a time.

For TieredMergePolicy, mergeFactor is a convenience parameter which

will set both MaxMergeAtOnce and SegmentsPerTier at once.

For LogByteSizeMergePolicy, mergeFactor decides how many new segments

will be allowed before they are merged into one.

Default is 10 for both merge policies.

-->

<!-- Expert: Merge Scheduler

The Merge Scheduler in Lucene controls how merges are

performed. The ConcurrentMergeScheduler (Lucene 2.3 default)

can perform merges in the background using separate threads.

The SerialMergeScheduler (Lucene 2.2 default) does not.

-->

<!--

-->

<!-- LockFactory

This option specifies which Lucene LockFactory implementation

to use.

single = SingleInstanceLockFactory - suggested for a

read-only index or when there is no possibility of

another process trying to modify the index.

native = NativeFSLockFactory - uses OS native file locking.

Do not use when multiple solr webapps in the same

JVM are attempting to share a single index.

simple = SimpleFSLockFactory - uses a plain file for locking

Defaults: 'native' is default for Solr3.6 and later, otherwise

'simple' is the default

More details on the nuances of each LockFactory...

http://wiki.apache.org/lucene-java/AvailableLockFactories

-->

<lockType>${solr.lock.type:native}</lockType>

<!-- Unlock On Startup

If true, unlock any held write or commit locks on startup.

This defeats the locking mechanism that allows multiple

processes to safely access a lucene index, and should be used

with care. Default is "false".

This is not needed if lock type is 'single'

-->

<!--

<unlockOnStartup>false</unlockOnStartup>

-->

<!-- Expert: Controls how often Lucene loads terms into memory

Default is 128 and is likely good for most everyone.

-->

<!-- If true, IndexReaders will be reopened (often more efficient)

instead of closed and then opened. Default: true

-->

<!--

-->

<!-- Commit Deletion Policy

Custom deletion policies can be specified here. The class must

implement org.apache.lucene.index.IndexDeletionPolicy.

The default Solr IndexDeletionPolicy implementation supports

deleting index commit points on number of commits, age of

commit point and optimized status.

The latest commit point should always be preserved regardless

of the criteria.

-->

<!--

-->

<!--

Delete all commit points once they have reached the given age.

Supports DateMathParser syntax e.g.

-->

<!--

<str name="maxCommitAge">30MINUTES</str>

-->

<!--

</deletionPolicy>

-->

<!-- Lucene Infostream

To aid in advanced debugging, Lucene provides an "InfoStream"

of detailed information when indexing.

Setting The value to true will instruct the underlying Lucene

IndexWriter to write its debugging info the specified file

-->

</indexConfig>

<jmx />

</updateLog>

<openSearcher>false</openSearcher>

</autoCommit>

</updateHandler>

<query>

<!-- Max Boolean Clauses

Maximum number of clauses in each BooleanQuery, an exception

is thrown if exceeded.

** WARNING **

This option actually modifies a global Lucene property that

will affect all SolrCores. If multiple solrconfig.xml files

disagree on this property, the value at any given moment will

be based on the last SolrCore to be initialized.

-->

<!-- Solr Internal Query Caches

There are two implementations of cache available for Solr,

LRUCache, based on a synchronized LinkedHashMap, and

FastLRUCache, based on a ConcurrentHashMap.

FastLRUCache has faster gets and slower puts in single

threaded operation and thus is generally faster than LRUCache

when the hit ratio of the cache is high (> 75%), and may be

faster under other scenarios on multi-cpu systems.

-->

<!-- Filter Cache

Cache used by SolrIndexSearcher for filters (DocSets),

unordered sets of *all* documents that match a query. When a

new searcher is opened, its caches may be prepopulated or

"autowarmed" using data from caches in the old searcher.

autowarmCount is the number of items to prepopulate. For

LRUCache, the autowarmed items will be the most recently

accessed items.

Parameters:

class - the SolrCache implementation LRUCache or

(LRUCache or FastLRUCache)

size - the maximum number of entries in the cache

initialSize - the initial capacity (number of entries) of

the cache. (see java.util.HashMap)

autowarmCount - the number of entries to prepopulate from

an old cache.

-->

<filterCache class="solr.FastLRUCache"

size="512"

initialSize="512"

autowarmCount="0"/>

<!-- Query Result Cache

Caches results of searches - ordered lists of document ids

(DocList) based on a query, a sort, and the range of documents requested.

-->

<queryResultCache class="solr.LRUCache"

size="512"

initialSize="512"

autowarmCount="0"/>

<!-- Document Cache

Caches Lucene Document objects (the stored fields for each

document). Since Lucene internal document ids are transient,

this cache will not be autowarmed.

-->

<documentCache class="solr.LRUCache"

size="512"

initialSize="512"

autowarmCount="0"/>

<!-- Field Value Cache

Cache used to hold field values that are quickly accessible

by document id. The fieldValueCache is created by default

even if not configured here.

-->

<!--

<fieldValueCache class="solr.FastLRUCache"

size="512"

autowarmCount="128"

showItems="32" />

-->

<!-- Custom Cache

Example of a generic cache. These caches may be accessed by

name through SolrIndexSearcher.getCache(),cacheLookup(), and

cacheInsert(). The purpose is to enable easy caching of

user/application level data. The regenerator argument should

be specified as an implementation of solr.CacheRegenerator

if autowarming is desired.

-->

<!--

<cache name="myUserCache"

class="solr.LRUCache"

size="4096"

initialSize="1024"

autowarmCount="1024"

regenerator="com.mycompany.MyRegenerator"/>

-->

<!-- Lazy Field Loading

If true, stored fields that are not requested will be loaded

lazily. This can result in a significant speed improvement

if the usual case is to not load all stored fields,

especially if the skipped fields are large compressed text

fields.

-->

<!-- Use Filter For Sorted Query

A possible optimization that attempts to use a filter to

satisfy a search. If the requested sort does not include

score, then the filterCache will be checked for a filter

matching the query. If found, the filter will be used as the

source of document ids, and then the sort will be applied to

that.

For most situations, this will not be useful unless you

frequently get the same search repeatedly with different sort

options, and none of them ever use "score"

-->

<!--

-->

<!-- Result Window Size

An optimization for use with the queryResultCache. When a search

is requested, a superset of the requested number of document ids

are collected. For example, if a search for a particular query

requests matching documents 10 through 19, and queryWindowSize is 50,

then documents 0 through 49 will be collected and cached. Any further

requests in that range can be satisfied via the cache.

-->

<!-- Maximum number of documents to cache for any entry in the

queryResultCache.

-->

<!-- Query Related Event Listeners

Various IndexSearcher related events can trigger Listeners to

take actions.

newSearcher - fired whenever a new searcher is being prepared

and there is a current searcher handling requests (aka

registered). It can be used to prime certain caches to

prevent long request times for certain requests.

firstSearcher - fired whenever a new searcher is being

prepared but there is no current registered searcher to handle

requests or to gain autowarming data from.

-->

<!-- QuerySenderListener takes an array of NamedList and executes a

local query request for each NamedList in sequence.

-->

<!--

<lst><str name="q">solr</str><str name="sort">price asc</str></lst>

<lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>

-->

</arr>

</listener>

<lst>

<str name="q">static firstSearcher warming in solrconfig.xml</str>

</lst>

</arr>

</listener>

<!-- Use Cold Searcher

If a search request comes in and there is no current

registered searcher, then immediately register the still

warming searcher and use it. If "false" then all requests

will block until the first searcher is done warming.

-->

<useColdSearcher>false</useColdSearcher>

<!-- Max Warming Searchers

Maximum number of searchers that may be warming in the

background concurrently. An error is returned if this limit

is exceeded.

Recommend values of 1-2 for read-only slaves, higher for

masters w/o cache warming.

-->

</query>

<requestParsers enableRemoteStreaming="true"

multipartUploadLimitInKB="2048000"

formdataUploadLimitInKB="2048"/>

</requestDispatcher>

<str name="echoParams">explicit</str>

</lst>

</requestHandler>

<str name="echoParams">explicit</str>

</lst>

</requestHandler>

</lst>

</requestHandler>

<str name="echoParams">explicit</str>

<str name="wt">velocity</str>

<str name="v.template">browse</str>

<str name="v.layout">layout</str>

<str name="title">Solritas</str>

<str name="defType">edismax</str>

text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4

title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0

</str>

<str name="fl">*,score</str>

text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4

title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0

</str>

<str name="mlt.fl">text,features,name,sku,id,manu,cat,title,description,keywords,author,resourcename</str>

<str name="facet.field">manu_exact</str>

<str name="facet.field">content_type</str>

<str name="facet.field">author_s</str>

<str name="facet.pivot">cat,inStock</str>

<str name="facet.range.other">after</str>

<str name="facet.range">price</str>

<str name="facet.range">popularity</str>

<str name="facet.range">manufacturedate_dt</str>

<str name="f.manufacturedate_dt.facet.range.start">NOW/YEAR-10YEARS</str>

<str name="f.manufacturedate_dt.facet.range.other">before</str>

<str name="f.manufacturedate_dt.facet.range.other">after</str>

<str name="hl.fl">content features title name</str>

<str name="f.title.hl.alternateField">title</str>

<str name="f.content.hl.alternateField">content</str>

<str name="spellcheck.extendedResults">false</str>

</lst>

<str>spellcheck</str>

</arr>

</requestHandler>

</requestHandler>

<str name="stream.contentType">application/json</str>

</lst>

</requestHandler>

<str name="stream.contentType">application/csv</str>

</lst>

</requestHandler>

<requestHandler name="/update/extract"

startup="lazy"

class="solr.extraction.ExtractingRequestHandler" >

<str name="uprefix">ignored_</str>

<str name="fmap.a">links</str>

<str name="fmap.div">ignored_</str>

</lst>

</requestHandler>

<requestHandler name="/analysis/field"

startup="lazy"

class="solr.FieldAnalysisRequestHandler" />

<requestHandler name="/analysis/document"

class="solr.DocumentAnalysisRequestHandler"

startup="lazy" />

<requestHandler name="/admin/"

class="solr.admin.AdminHandlers" />

<str name="q">solrpingquery</str>

</lst>

</lst>

</requestHandler>

<str name="echoParams">explicit</str>

</lst>

</requestHandler>

</requestHandler>

<str name="name">direct</str>

<str name="field">spell</str>

<str name="classname">solr.DirectSolrSpellChecker</str>

<str name="distanceMeasure">internal</str>

</lst>

<!--

Optional, it is required when more than one spellchecker is configured.

Select non-default name with spellcheck.dictionary in request handler.

In short: with a single spellchecker the name can be omitted;

with several, the request handler must pick one via spellcheck.dictionary.

-->

<str name="name">default</str>

<str name="classname">solr.IndexBasedSpellChecker</str>

<!--

Load tokens from the following field for spell checking,

analyzer for the field's type as defined in schema.xml are used

The field named below is the basis for spell checking, i.e. the field against which user input is checked.

-->

<str name="field">spell</str>

<!-- Optional, by default use in-memory index (RAMDirectory)

Location of the spellcheck index files. It is optional; when omitted, the in-memory RAMDirectory is used.

./spellchecker1 resolves to corex\data\spellchecker1

-->

<str name="spellcheckIndexDir">./spellchecker1</str>

</lst>

<str name="name">jarowinkler</str>

<str name="classname">solr.IndexBasedSpellChecker</str>

<str name="field">spell</str>

<str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>

<str name="spellcheckIndexDir">./spellchecker2</str>

</lst>

<!-- Another spellchecker, which uses the contents of a file as its basis

<str name="classname">solr.FileBasedSpellChecker</str>

<str name="sourceLocation">spellings.txt</str>

<str name="spellcheckIndexDir">./spellcheckerFile</str>

</lst>-->

<str name="queryAnalyzerFieldType">text_spell</str>

</searchComponent>

<str name="spellcheck.dictionary">default</str>

<str name="spellcheck.extendedResults">false</str>

</lst>

<str>spellcheck</str>

</arr>

</requestHandler>

<str name="queryAnalyzerFieldType">string</str>

<str name="name">suggest</str>

<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>

<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>

</lst>

</searchComponent>

<str name="spellcheck.dictionary">suggest</str>

<str name="spellcheck.extendedResults">false</str>

</lst>

<str>suggest</str>

</arr>

</requestHandler>

</requestHandler>

</lst>

<str>tvComponent</str>

</arr>

</requestHandler>

<searchComponent name="clustering"

enable="${solr.clustering.enabled:true}"

class="solr.clustering.ClusteringComponent" >

<str name="name">default</str>

<str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>

<str name="LingoClusteringAlgorithm.labelAssigner">org.carrot2.clustering.lingo.SimpleLabelAssigner</str>

<!--

org.carrot2.matrix.factorization.PartialSingularValueDecompositionFactory

org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory

org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationKLFactory

org.carrot2.matrix.factorization.LocalNonnegativeMatrixFactorizationFactory

org.carrot2.matrix.factorization.KMeansMatrixFactorizationFactory

-->

<str name="TermDocumentMatrixReducer.factorizationFactory">org.carrot2.matrix.factorization.NonnegativeMatrixFactorizationEDFactory</str>

<str name="TermDocumentMatrixBuilder.termWeighting">org.carrot2.text.vsm.TfTermWeighting</str>

<str name="MultilingualClustering.defaultLanguage">CHINESE_SIMPLIFIED</str>

<str name="DocumentAssigner.exactPhraseAssignment">false</str>

<str name="carrot.lexicalResourcesDir">clustering/carrot2</str>

</lst>

</searchComponent>

<requestHandler name="/clustering"

startup="lazy"

enable="${solr.clustering.enabled:true}"

class="solr.SearchHandler">

<str name="echoParams">explicit</str>

<str name="clustering.engine">default</str>

<str name="carrot.title">category_s</str>

<str name="carrot.snippet">content</str>

</lst>

<str>clustering</str>

</arr>

</requestHandler>

<bool name="distrib">false</bool>

</lst>

<str>terms</str>

</arr>

</requestHandler>

<str name="queryFieldType">string</str>

<str name="config-file">elevate.xml</str>

</searchComponent>

<str name="echoParams">explicit</str>

</lst>

<str>elevator</str>

</arr>

</requestHandler>

<fragmenter name="gap"

default="true"

class="solr.highlight.GapFragmenter">

</lst>

</fragmenter>

<!-- A regular-expression-based fragmenter

(for sentence extraction)

-->

<fragmenter name="regex"

class="solr.highlight.RegexFragmenter">

</lst>

</fragmenter>

<formatter name="html"

default="true"

class="solr.highlight.HtmlFormatter">

</lst>

</formatter>

<encoder name="html"

class="solr.highlight.HtmlEncoder" />

<fragListBuilder name="simple"

class="solr.highlight.SimpleFragListBuilder"/>

<fragListBuilder name="single"

class="solr.highlight.SingleFragListBuilder"/>

<fragListBuilder name="weighted"

default="true"

class="solr.highlight.WeightedFragListBuilder"/>

<fragmentsBuilder name="default"

default="true"

class="solr.highlight.ScoreOrderFragmentsBuilder">

</fragmentsBuilder>

<fragmentsBuilder name="colored"

class="solr.highlight.ScoreOrderFragmentsBuilder">

<str name="hl.tag.pre"><![CDATA[

,,

,,

,,

,,

,]]></str>

</lst>

</fragmentsBuilder>

<boundaryScanner name="default"

default="true"

class="solr.highlight.SimpleBoundaryScanner">

</lst>

</boundaryScanner>

<boundaryScanner name="breakIterator"

class="solr.highlight.BreakIteratorBoundaryScanner">

</lst>

</boundaryScanner>

</highlighting>

</searchComponent>

<str name="content-type">text/plain; charset=UTF-8</str>

</queryResponseWriter>

<!--

Custom response writers can be declared as needed...

-->

</queryResponseWriter>

<admin>

</admin>

</config>

References

http://wiki.apache.org/solr/ Documents all of the configuration options

http://doc.carrot2.org/ Documents the clustering settings
