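Implementing Lucene text analyzers
An analyzer splits a piece of text into multiple tokens. Search engines retrieve documents by matching tokens, so the quality of the analyzer directly determines both the precision and the speed of a search.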
1. A simple demo
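import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.junit.Test;

// Imports assume the Lucene 6.x package layout. The class name is illustrative:
// the original listing omitted the class header.
public class AnalyzerDemoTest {
    private static final String[] examples = {
            "The quick brown 1234 fox jumped over the lazy dog!",
            "XY&Z 15.6 Corporation - xyz@example.com",
            "北京市北京大學" };

    private static final Analyzer[] ANALYZERS = new Analyzer[] {
            new WhitespaceAnalyzer(),     // splits on whitespace
            new SimpleAnalyzer(),         // splits on non-letters
            new StopAnalyzer(),           // splits on non-letters and removes stop words
            new StandardAnalyzer(),       // Unicode text segmentation
            new CJKAnalyzer(),            // Chinese/Japanese/Korean segmentation
            new SmartChineseAnalyzer() }; // Simplified Chinese segmentation

    @Test
    public void testAnalyzer() throws IOException {
        for (int i = 0; i < ANALYZERS.length; i++) {
            String simpleName = ANALYZERS[i].getClass().getSimpleName();
            for (int j = 0; j < examples.length; j++) {
                // TokenStream is the intermediate data format of the analysis chain. It reads text
                // from a Reader; both Tokenizer and TokenFilter extend TokenStream.
                TokenStream contents = ANALYZERS[i].tokenStream("contents", new StringReader(examples[j]));
                // Attributes expose details about each token after analysis.
                // CharTermAttribute holds the token text (the original printed offsetAttribute.toString(),
                // which only yields the term when one object happens to implement both attributes).
                CharTermAttribute termAttribute = contents.addAttribute(CharTermAttribute.class);
                // OffsetAttribute holds the token's start and end offsets in the original text.
                OffsetAttribute offsetAttribute = contents.addAttribute(OffsetAttribute.class);
                // TypeAttribute holds the lexical type of the token; the default is "word".
                TypeAttribute typeAttribute = contents.addAttribute(TypeAttribute.class);
                contents.reset();
                System.out.println(" " + simpleName + " analyzing : " + examples[j]);
                while (contents.incrementToken()) {
                    String s1 = termAttribute.toString();
                    int i1 = offsetAttribute.startOffset(); // start offset
                    int i2 = offsetAttribute.endOffset();   // end offset
                    System.out.println(" " + s1 + "[" + i1 + "," + i2 + ":" + typeAttribute.type() + "]" + " ");
                }
                // Once incrementToken() has returned false, call end() and then close(). end() lets the
                // stream do its end-of-stream bookkeeping; close() releases the resources the stream
                // used during analysis.
                contents.end();
                contents.close();
                System.out.println();
            }
        }
    }
}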
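To make the difference between the first two analyzers concrete, here is a minimal plain-Java sketch of the two splitting rules. It does not use Lucene at all: `SplitSketch`, `splitOnWhitespace` and `splitOnNonLetters` are hypothetical names, and the methods only approximate what WhitespaceAnalyzer and SimpleAnalyzer do.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of two splitting rules; not Lucene code.
class SplitSketch {
    // Split on whitespace only, keeping punctuation and case (roughly WhitespaceAnalyzer).
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Split at every non-letter and lowercase the pieces (roughly SimpleAnalyzer).
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z 15.6 Corporation - xyz@example.com";
        System.out.println(splitOnWhitespace(text)); // [XY&Z, 15.6, Corporation, -, xyz@example.com]
        System.out.println(splitOnNonLetters(text)); // [xy, z, corporation, xyz, example, com]
    }
}
```

The second rule discards digits and punctuation entirely, which is why a part number or an e-mail address does not survive it intact.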
2. Understanding TokenStream attributes
After calling tokenStream(), add attributes to the stream to get detailed information about each token. For example, CharTermAttribute holds the token text and TypeAttribute holds the token type.
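CharTermAttribute: the text of the token itself.
PositionIncrementAttribute: the position of the current token relative to the previous one, that is, how many word positions separate them. In "text for attribute", where "for" is removed as a stop word, getPositionIncrement() for "attribute" is 2. If no stop word was removed between two tokens, the value is the default, 1.
OffsetAttribute: the start and end character offsets of the token in the original text (the end offset is one past the token's last character).
TypeAttribute: the lexical type of the token. The default is word; other values include <ALPHANUM> <APOSTROPHE> <ACRONYM> <COMPANY> <EMAIL> <HOST> <NUM> <CJ> <ACRONYM_DEP>.
FlagsAttribute: similar to TypeAttribute. If you need to attach extra information to a token and want it to travel through the analysis chain, pass it in the flags.
PayloadAttribute: a payload (arbitrary key data) stored at each indexed position. It is useful for scoring when payload-based queries are used.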
// Imports needed in addition to those of the first demo (Lucene 6.x package layout assumed):
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.BytesRef;

@Test
public void testAttribute() throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    String input = "This is a test text for attribute! Just add-some word.";
    TokenStream tokenStream = analyzer.tokenStream("text", new StringReader(input));
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute positionIncrementAttribute = tokenStream.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
    PayloadAttribute payloadAttribute = tokenStream.addAttribute(PayloadAttribute.class);
    // Note: tokenizers clear all attributes before emitting each token, so a payload set here
    // is not expected to survive into the loop below. Payloads are normally set in a TokenFilter.
    payloadAttribute.setPayload(new BytesRef("Just"));
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + charTermAttribute
                + " increment:" + positionIncrementAttribute.getPositionIncrement()
                + " start:" + offsetAttribute.startOffset()
                + " end:" + offsetAttribute.endOffset()
                + " type:" + typeAttribute.type()
                + " payload:" + payloadAttribute.getPayload() + "]\n");
    }
    tokenStream.end();
    tokenStream.close();
}
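How position increments and offsets relate to stop-word removal can be sketched in plain Java. This is an illustration under assumptions, not Lucene's implementation: `PositionSketch`, `Token` and `analyze` are hypothetical names, tokens are taken to be runs of letters, and the stop-word set is passed in by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of position increments and offsets; not Lucene code.
class PositionSketch {
    /** One emitted token with the values the corresponding attributes would report. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int startOffset;
        final int endOffset; // exclusive: one past the last character

        Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "[" + term + " increment:" + positionIncrement
                    + " start:" + startOffset + " end:" + endOffset + "]";
        }
    }

    // Tokens are runs of letters, lowercased. Removed stop words widen the next increment.
    static List<Token> analyze(String text, Set<String> stopWords) {
        List<Token> tokens = new ArrayList<>();
        int skipped = 0; // stop words dropped since the last emitted token
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetter(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (stopWords.contains(term)) {
                skipped++;
            } else {
                tokens.add(new Token(term, skipped + 1, start, i));
                skipped = 0;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("for"));
        for (Token token : analyze("text for attribute", stopWords)) {
            System.out.println(token);
        }
    }
}
```

With "for" as the only stop word, the sketch prints `[text increment:1 start:0 end:4]` and `[attribute increment:2 start:9 end:18]`: the offsets still refer to the original text, while the increment records the gap left by the removed word.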
3. Lucene's Tokenizer and TokenFilter
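An analyzer consists of one tokenizer and any number of filters. The Tokenizer reads character data from a Reader and turns it into a TokenStream. A TokenFilter filters that TokenStream, processing the output of the Tokenizer or of the previous TokenFilter. To extend or modify the behaviour of an existing tokenizer, write a custom TokenFilter, which must implement the abstract incrementToken() method: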
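import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;

public class TestTokenFilter {
    @Test
    public void test() throws IOException {
        String text = "Hi, Dr Wang, Mr Liu asks if you stay with Mrs Liu yesterday!";
        Analyzer analyzer = new WhitespaceAnalyzer();
        CourtesyTitleFilter filter = new CourtesyTitleFilter(analyzer.tokenStream("text", text));
        CharTermAttribute charTermAttribute = filter.addAttribute(CharTermAttribute.class);
        filter.reset();
        while (filter.incrementToken()) {
            System.out.print(charTermAttribute + " ");
        }
        // Finish and release the stream, as in the earlier examples.
        filter.end();
        filter.close();
    }
}

/**
 * Custom filter that expands courtesy titles.
 */
class CourtesyTitleFilter extends TokenFilter {
    Map<String, String> courtesyTitleMap = new HashMap<>();
    private CharTermAttribute termAttribute;

    protected CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttribute = addAttribute(CharTermAttribute.class);
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "miss");
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // Pull the next token from the wrapped stream; stop when it is exhausted.
        if (!input.incrementToken()) {
            return false;
        }
        String small = termAttribute.toString();
        // Replace the term in place when it is a known courtesy title.
        if (courtesyTitleMap.containsKey(small)) {
            termAttribute.setEmpty().append(courtesyTitleMap.get(small));
        }
        return true;
    }
}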
The output is as follows:
Hi, doctor Wang, mister Liu asks if you stay with miss Liu yesterday!
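The idea that each filter consumes the output of the tokenizer or of the previous filter is the decorator pattern. The following is a minimal plain-Java sketch of such a chain; it is not Lucene's API, and `FilterChainSketch`, `TokenSource`, `whitespaceSource`, `replaceFilter`, `lowerCaseFilter` and `drain` are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of a tokenizer wrapped by filters; not Lucene code.
class FilterChainSketch {
    /** Pull-style token source: next() returns null once the stream is exhausted. */
    interface TokenSource {
        String next();
    }

    // The "tokenizer": the only stage that reads the raw text.
    static TokenSource whitespaceSource(String text) {
        Iterator<String> it = Arrays.asList(text.trim().split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // A filter that rewrites known tokens and passes everything else through.
    static TokenSource replaceFilter(TokenSource input, Map<String, String> replacements) {
        return () -> {
            String token = input.next();
            return token == null ? null : replacements.getOrDefault(token, token);
        };
    }

    // A second filter, applied to the output of whatever stage it wraps.
    static TokenSource lowerCaseFilter(TokenSource input) {
        return () -> {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        };
    }

    // Consume a source until it reports exhaustion.
    static List<String> drain(TokenSource source) {
        List<String> tokens = new ArrayList<>();
        for (String token = source.next(); token != null; token = source.next()) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> titles = new HashMap<>();
        titles.put("Dr", "doctor");
        titles.put("Mr", "mister");
        TokenSource chain = lowerCaseFilter(replaceFilter(whitespaceSource("Dr Wang meets Mr Liu"), titles));
        System.out.println(drain(chain)); // [doctor, wang, meets, mister, liu]
    }
}
```

Order matters in such a chain: if the lowercase stage ran first, the keys "Dr" and "Mr" would no longer match.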
4. A custom Analyzer with extended stop words
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Package locations below are those of Lucene 6.0 (an assumption to check against your
// version: several of these classes moved to other packages in later releases).
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseTokenizer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

class StopAnalyzerExtend extends Analyzer {
    private CharArraySet stopWordSet; // stop-word dictionary

    public CharArraySet getStopWordSet() {
        return this.stopWordSet;
    }

    public void setStopWordSet(CharArraySet stopWordSet) {
        this.stopWordSet = stopWordSet;
    }

    public StopAnalyzerExtend() {
        super();
        setStopWordSet(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    /**
     * @param stops additional stop words
     */
    public StopAnalyzerExtend(List<String> stops) {
        this();
        // StopAnalyzer builds ENGLISH_STOP_WORDS_SET with CharArraySet.unmodifiableSet(stopSet),
        // so adding to it directly throws UnsupportedOperationException.
        // Work on a modifiable copy instead.
        stopWordSet = CharArraySet.copy(getStopWordSet());
        stopWordSet.addAll(StopFilter.makeStopSet(stops));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new LowerCaseTokenizer();
        return new TokenStreamComponents(source, new StopFilter(source, stopWordSet));
    }

    public static void main(String[] args) throws IOException {
        List<String> strings = Arrays.asList("小鬼子", "美國佬");
        Analyzer analyzer = new StopAnalyzerExtend(strings);
        String content = "小鬼子 and 美國佬 are playing together!";
        TokenStream tokenStream = analyzer.tokenStream("myfield", content);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            // The custom stop words have already been filtered out.
            // Output: playing together
            System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
        tokenStream.close();
    }
}
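The copy-before-extend step is the part that is easy to get wrong. The same behaviour can be reproduced with the standard collections alone; the sketch below uses `java.util` sets in place of CharArraySet, and `StopSetSketch`, `DEFAULT_STOP_WORDS`, `extend` and `addingDirectlyFails` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Plain-Java illustration of extending an unmodifiable default set; not Lucene code.
class StopSetSketch {
    // Stand-in for a shared, read-only default stop-word set.
    static final Set<String> DEFAULT_STOP_WORDS =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("and", "are", "the")));

    // Returns true when a direct add on the shared set is rejected.
    static boolean addingDirectlyFails() {
        try {
            DEFAULT_STOP_WORDS.add("extra");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Copy first, then extend the copy; the shared default stays untouched.
    static Set<String> extend(Collection<String> extraWords) {
        Set<String> copy = new HashSet<>(DEFAULT_STOP_WORDS);
        copy.addAll(extraWords);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(addingDirectlyFails());                 // true
        Set<String> extended = extend(Arrays.asList("foo", "bar"));
        System.out.println(extended.contains("foo"));              // true
        System.out.println(DEFAULT_STOP_WORDS.contains("foo"));    // false
    }
}
```

Copying also keeps the extension local to one analyzer instance, so other users of the shared default set see no change.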
5. A custom Analyzer with token-length filtering
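import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class LongFilterAnalyzer extends Analyzer {
    private int len;

    public int getLen() {
        return this.len;
    }

    public void setLen(int len) {
        this.len = len;
    }

    public LongFilterAnalyzer() {
        super();
    }

    public LongFilterAnalyzer(int len) {
        super();
        setLen(len);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new WhitespaceTokenizer();
        // Drop tokens whose length is below len or above 20 (both bounds are inclusive).
        TokenStream tokenStream = new LengthFilter(source, len, 20);
        return new TokenStreamComponents(source, tokenStream);
    }

    public static void main(String[] args) {
        // Drop tokens shorter than 2 characters; tokens of exactly 2 characters are kept.
        Analyzer analyzer = new LongFilterAnalyzer(2);
        String words = "I am a java coder! Testingtestingtesting!";
        TokenStream stream = analyzer.tokenStream("myfield", words);
        try {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(termAtt.toString());
            }
            stream.end();
            stream.close();
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
        }
    }
}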
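All tokens shorter than two characters are filtered out, and so is the 22-character token "Testingtestingtesting!", which exceeds the upper bound of 20.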
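The inclusive bounds are easy to misread, so here is the same rule as a minimal plain-Java sketch. It assumes whitespace tokenization and a "keep if min <= length <= max" test; `LengthFilterSketch` and `keepByLength` are hypothetical names, and no Lucene class is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of length filtering with inclusive bounds; not Lucene code.
class LengthFilterSketch {
    // Keep whitespace-separated tokens whose length lies in [min, max].
    static List<String> keepByLength(String text, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            int length = token.length();
            if (length >= min && length <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String words = "I am a java coder! Testingtestingtesting!";
        System.out.println(keepByLength(words, 2, 20)); // [am, java, coder!]
    }
}
```

"am" survives because the lower bound of 2 is inclusive, and punctuation counts toward the length because the tokens are split on whitespace only.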
6. PerFieldAnalyzerWrapper: a different Analyzer for each field
PerFieldAnalyzerWrapper can be used like any other Analyzer, for both indexing and query analysis.
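import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Assert;
import org.junit.Test;

@Test
public void testPerFieldAnalyzerWrapper() throws IOException, ParseException {
    Map<String, Analyzer> fields = new HashMap<>();
    fields.put("partnum", new KeywordAnalyzer());
    // The partnum field uses KeywordAnalyzer; every other field falls back to SimpleAnalyzer.
    PerFieldAnalyzerWrapper perFieldAnalyzerWrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(), fields);

    Directory directory = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(perFieldAnalyzerWrapper);
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    Document document = new Document();
    FieldType fieldType = new FieldType();
    fieldType.setStored(true);
    fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    document.add(new Field("partnum", "Q36", fieldType));
    document.add(new Field("description", "Illidium Space Modulator", fieldType));
    indexWriter.addDocument(document);
    indexWriter.close();

    IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(directory));
    // A TermQuery bypasses analysis, so it finds the document directly.
    TopDocs search = indexSearcher.search(new TermQuery(new Term("partnum", "Q36")), 10);
    Assert.assertEquals(1, search.totalHits);

    // A QueryParser must be given the PerFieldAnalyzerWrapper. With a plain SimpleAnalyzer,
    // as here, nothing is found.
    Query description = new QueryParser("description", new SimpleAnalyzer()).parse("partnum:Q36 AND SPACE");
    search = indexSearcher.search(description, 10);
    Assert.assertEquals(0, search.totalHits);
    // Prints +partnum:q +description:space, because SimpleAnalyzer strips non-letter
    // characters and lowercases the letters.
    System.out.println("SimpleAnalyzer :" + description.toString());

    // With the PerFieldAnalyzerWrapper the document is found.
    // "partnum:Q36 AND SPACE" means: Q36 occurs in partnum and SPACE occurs in description.
    description = new QueryParser("description", perFieldAnalyzerWrapper).parse("partnum:Q36 AND SPACE");
    search = indexSearcher.search(description, 10);
    Assert.assertEquals(1, search.totalHits);
    // Prints +partnum:Q36 +description:space
    System.out.println("(SimpleAnalyzer,KeywordAnalyzer) :" + description.toString());
}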
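The dispatch that makes the two queries behave differently can be sketched without Lucene as a map lookup with a default. In this illustration under assumptions, `PerFieldSketch`, `keyword` and `simple` are hypothetical names; `keyword` keeps the whole input as one token and `simple` splits on non-letters and lowercases, which only approximates KeywordAnalyzer and SimpleAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration of per-field analyzer selection; not Lucene code.
class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    // Use the analyzer registered for the field, or the default when none is registered.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Whole input as a single token (roughly KeywordAnalyzer).
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Split on non-letters and lowercase (roughly SimpleAnalyzer).
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> fields = new HashMap<>();
        fields.put("partnum", PerFieldSketch::keyword);
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::simple, fields);
        System.out.println(wrapper.analyze("partnum", "Q36"));                          // [Q36]
        System.out.println(wrapper.analyze("description", "Illidium Space Modulator")); // [illidium, space, modulator]
        System.out.println(simple("Q36"));                                              // [q]
    }
}
```

The last line shows why the SimpleAnalyzer-only query fails: the part number is reduced to "q", which no longer matches the indexed term "Q36".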
Reference: http://www.codepub.cn/2016/05/23/Lucene-6-0-in-action-4-The-text-analyzer/