Recently I needed pinyin search for a project at work. Based on examples I found online, I put together a working solution, and I'd like to share it here.
This code can recognize strings that mix Chinese characters with pinyin, including pinyin initials (for example: xiug手機h --> 修改手機號, i.e. "change mobile number").
Without further ado, let's get started.
PinyinWord table
CREATE TABLE "public"."pinyinword" ( "id" text COLLATE "default" NOT NULL, "word" text COLLATE "default" NOT NULL, "whole" text COLLATE "default" NOT NULL, "acronym" text COLLATE "default" NOT NULL, "wordlength" int4 NOT NULL, "wholelength" int4 NOT NULL, "acronymlength" int4 NOT NULL )
WordClick table
CREATE TABLE "public"."wordclick" ( "wordcontent" text COLLATE "default", "id" text COLLATE "default" NOT NULL )
Initialize the data in these tables yourself.
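The post does not show how the rows are generated, but the meaning of the columns can be illustrated with a small example (the values below are my own illustration for the word 修改, not code from the original project):

// One pinyinword row for the word "修改":
// word = the Chinese word, whole = its full pinyin, acronym = its pinyin initials,
// plus the length of each, which the table stores explicitly.
String id      = java.util.UUID.randomUUID().toString();
String word    = "修改";
String whole   = "xiugai";  // full pinyin
String acronym = "xg";      // pinyin initials
int wordlength    = word.length();     // 2
int wholelength   = whole.length();    // 6
int acronymlength = acronym.length();  // 2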
/**
 * Lexeme
 */
public class Lexeme {
    private String content;         // lexeme content
    private LexemeType lexemeType;  // lexeme type
}

public enum LexemeType {
    CHINESE,  // Chinese character
    WHOLE,    // full pinyin
    ACRONYM   // pinyin initial (acronym)
}
/**
 * Chinese sentence (wraps the user's input)
 */
public class ChineseSentence {
    private String content;             // the user's input
    private List<Lexeme> sentenceUnits; // lexemes contained in content
    private SentenceType sentenceType;  // lowest-level type of the sentence (no setter; see initSentenceType())

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public List<Lexeme> getSentenceUnits() {
        return sentenceUnits;
    }

    public SentenceType getSentenceType() {
        return sentenceType;
    }

    public void setSentenceUnits(List<Lexeme> sentenceUnits) {
        this.sentenceUnits = sentenceUnits;
        initSentenceType();
    }

    private void initSentenceType() {
        sentenceType = SentenceType.CHINESE_SENTENCE;
        for (Lexeme lexeme : sentenceUnits) {
            if (lexeme.getLexemeType() == LexemeType.ACRONYM) {
                sentenceType = SentenceType.ACRONYM_SENTENCE;
                break;
            } else if (lexeme.getLexemeType() == LexemeType.WHOLE
                    && sentenceType == SentenceType.CHINESE_SENTENCE) {
                sentenceType = SentenceType.WHOLE_SENTENCE;
            }
        }
    }
}
// Regex (copied from the internet with some modifications; matches Chinese characters and suspected pinyin)
private static final String SUSPECTED_PINYIN_REGEX =
        "[\\u4e00-\\u9fa5]|(sh|ch|zh|[^aoeiuv])?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";
Note that this regex may extract pinyin combinations that do not actually exist, for example jvao.
In that case, simply split it into j, v, a, o (check each extracted syllable against a dictionary of valid pinyin combinations to decide).
After segmentation, each lexeme is assigned a lexemeType, giving the result below.
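As a rough sketch of how that segmentation could be done with the regex above (the pinyinDict set of valid syllables and the Lexeme(content, type) constructor are assumptions of mine, not code from the original post):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern SUSPECTED_PINYIN_PATTERN = Pattern.compile(SUSPECTED_PINYIN_REGEX);

// pinyinDict: the set of all valid pinyin syllables ("a", "ai", ..., "zhuang")
public static List<Lexeme> segment(String input, Set<String> pinyinDict) {
    List<Lexeme> lexemes = new ArrayList<>();
    Matcher matcher = SUSPECTED_PINYIN_PATTERN.matcher(input.toLowerCase());
    while (matcher.find()) {
        String token = matcher.group();
        if (token.isEmpty()) {
            continue;  // every part of the regex is optional, so skip empty matches
        }
        if (token.matches("[\\u4e00-\\u9fa5]")) {
            lexemes.add(new Lexeme(token, LexemeType.CHINESE));  // a single Chinese character
        } else if (pinyinDict.contains(token)) {
            lexemes.add(new Lexeme(token, LexemeType.WHOLE));    // a valid full pinyin syllable
        } else {
            // a single letter, or an impossible combination such as "jvao":
            // fall back to single letters, each treated as an acronym
            for (char c : token.toCharArray()) {
                lexemes.add(new Lexeme(String.valueOf(c), LexemeType.ACRONYM));
            }
        }
    }
    return lexemes;
}

With this, "xiug手機h" comes out as xiu (WHOLE), g (ACRONYM), 手 (CHINESE), 機 (CHINESE), h (ACRONYM), which matches the behavior described above.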
First, a brief explanation of how the analysis works.
Let's start with the query conditions.
A few words about the query parameters. First, lexemeType: this field specifies the lexeme level to search at, and the search must be done at the lowest lexeme level present in the sentence.
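The query condition class itself is not listed in the text; judging from how it is used in the mapper and in the analysis code further down, it presumably looks roughly like this (my reconstruction, not the original class):

public class PinyinWordAnalyzeSearchOptions {
    private String chineseSearch;   // Chinese prefix used in the LIKE search
    private String wholeSearch;     // full-pinyin prefix used in the LIKE search
    private String acronymSearch;   // acronym prefix used in the LIKE search
    private String chineseFilter;   // Chinese filter pattern, e.g. "%修%"
    private String pinyinFilter;    // pinyin filter pattern, e.g. "%xiu%"
    private String lexemeType;      // lowest lexeme level of the sentence, stored as a String
                                    // so the lexemeType.equals('CHINESE') test in the mapper works
    private boolean paging = true;  // whether to limit the result set to 5 rows

    public PinyinWordAnalyzeSearchOptions(String chineseSearch, String wholeSearch,
                                          String acronymSearch, String chineseFilter,
                                          String pinyinFilter, LexemeType lexemeType) {
        this.chineseSearch = chineseSearch;
        this.wholeSearch = wholeSearch;
        this.acronymSearch = acronymSearch;
        this.chineseFilter = chineseFilter;
        this.pinyinFilter = pinyinFilter;
        this.lexemeType = lexemeType == null ? null : lexemeType.name();
    }

    // getters omitted; MyBatis reads the fields through them
}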
For example: '修改' --> LexemeType.CHINESE
'xiu改' --> LexemeType.WHOLE
'修g' --> LexemeType.ACRONYM
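The lowest level is accumulated with LexemeType.changeDown(...) in the analysis code further down. The original post does not show that helper; given the ordering CHINESE > WHOLE > ACRONYM it could plausibly be implemented like this:

// Assumed helper on LexemeType: returns the lower of the two levels,
// where CHINESE > WHOLE > ACRONYM (null means "no lexeme seen yet").
public static LexemeType changeDown(LexemeType last, LexemeType current) {
    if (last == null) {
        return current;
    }
    if (last == LexemeType.ACRONYM || current == LexemeType.ACRONYM) {
        return LexemeType.ACRONYM;
    }
    if (last == LexemeType.WHOLE || current == LexemeType.WHOLE) {
        return LexemeType.WHOLE;
    }
    return LexemeType.CHINESE;
}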
The three parameters ending in Search are what the search actually runs on; in the SQL they are used in LIKE prefix matches, so the index can still be hit:
and pinyinword.acronym like #{acronymSearch} || '%'
Because the user's input may contain Chinese characters or full pinyin, these two cases need additional filtering.
For example, if the user enters '修g' and we search at the lowest lexeme level, the condition is like 'xg%', which could also match '鞋櫃' (xie gui). So I added chineseFilter and pinyinFilter for filtering (appending '%修%' to chineseFilter), and the query condition becomes:
and pinyinword.acronym like #{acronymSearch} || '%' and pinyinword.word like #{chineseFilter}
With this, '鞋櫃' is no longer matched.
Now the MyBatis SQL. The wordclick table is joined here to get each word's click count, which is used for ordering.
<select id="searchByClickCount" resultType="model.value.WordClickCount" parameterType="model.options.PinyinWordAnalyzeSearchOptions"> select pw.word word, count(wc.id) clickCount from PinyinWord pw left join wordclick wc on wc.wordcontent = pw.word where 1=1 <choose> <when test="lexemeType.equals('CHINESE')"> <if test="chineseSearch!=null"> and pw.word like #{chineseSearch} || '%' </if> group by pw.word order by clickCount desc <if test="paging"> limit 5 offset 0 </if> </when> <when test="lexemeType.equals('WHOLE')"> <if test="wholeSearch!=null"> and pw.whole like #{wholeSearch} || '%' </if> <if test="chineseFilter!=null"> and pw.word like #{chineseFilter} </if> group by pw.word order by clickCount desc <if test="paging"> limit 5 offset 0 </if> </when> <otherwise> <if test="acronymSearch!=null"> and pw.acronym like #{acronymSearch} || '%' </if> <if test="chineseFilter!=null"> and pw.word like #{chineseFilter} </if> <if test="pinyinFilter!=null"> and pw.whole like #{pinyinFilter} </if> group by pw.word order by clickCount desc <if test="paging"> limit 5 offset 0 </if> </otherwise> </choose> </select>
Next comes analyzing the user's input and generating the query conditions.
This is only part of the code, but it is enough to show how the query conditions are generated.
LexemeType currentLexemeType;       // type of the current lexeme
LexemeType lastLexemeType = null;   // lowest lexeme level seen so far
List<Lexeme> lexemes = sentence.getSentenceUnits();
// chineseSearch, wholeSearch, acronymSearch, chineseFilter and pinyinFilter
// are StringBuilders declared earlier (this is only part of the code)

for (int i = 0; i < lexemes.size(); i++) {
    Lexeme lexeme = lexemes.get(i);
    currentLexemeType = lexeme.getLexemeType();
    String content = lexeme.getContent();
    switch (currentLexemeType) {
        case CHINESE:                                   // Chinese lexeme
            String pinyin = convertSmartAll(content);   // convert to pinyin (pinyin4j)
            chineseSearch.append(content);              // append to chineseSearch
            wholeSearch.append(pinyin);                 // append to wholeSearch
            acronymSearch.append(pinyin.charAt(0));     // append to acronymSearch
            chineseFilter.append(content).append("%");  // append to chineseFilter
            break;
        case WHOLE:                                     // full pinyin, same idea as Chinese
            wholeSearch.append(content);
            acronymSearch.append(content.charAt(0));
            pinyinFilter.append(content).append("%");
            break;
        case ACRONYM:                                   // acronym, likewise
            acronymSearch.append(content);
            break;
    }
    // Downgrade lastLexemeType to the lowest of the current lexeme type and the
    // previous lastLexemeType, because the search must use the lowest lexeme level
    lastLexemeType = LexemeType.changeDown(lastLexemeType, currentLexemeType);
}

// build the search options
PinyinWordAnalyzeSearchOptions options = new PinyinWordAnalyzeSearchOptions(
        chineseSearch.toString(),
        wholeSearch.toString(),
        acronymSearch.toString(),
        chineseFilter.toString(),
        pinyinFilter.toString(),
        lastLexemeType
);

// and here come the results...
List<WordClickCount> wordClickCounts = mapper.searchByClickCount(options);
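convertSmartAll is not listed in the post either; a minimal pinyin4j-based sketch (without tones, taking the first reading of polyphonic characters) could look like this:

import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

// Sketch of convertSmartAll: converts a Chinese string to unaccented lowercase pinyin.
public static String convertSmartAll(String chinese) {
    HanyuPinyinOutputFormat format = new HanyuPinyinOutputFormat();
    format.setCaseType(HanyuPinyinCaseType.LOWERCASE);
    format.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
    StringBuilder sb = new StringBuilder();
    for (char c : chinese.toCharArray()) {
        try {
            String[] candidates = PinyinHelper.toHanyuPinyinStringArray(c, format);
            if (candidates != null && candidates.length > 0) {
                sb.append(candidates[0]);  // take the first reading for polyphonic characters
            } else {
                sb.append(c);              // keep non-Chinese characters as-is
            }
        } catch (BadHanyuPinyinOutputFormatCombination e) {
            sb.append(c);
        }
    }
    return sb.toString();
}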
@Test
public void analyzeAndSearchTest() throws Exception {
    // warm-up call to initialize pinyin4j
    List<List<WordClickCount>> results = pinyinWordService.analyzeSearch("xiugaishoujihao");

    long start1 = System.currentTimeMillis();
    for (int i = 0; i < 100; i++) {
        long start = System.currentTimeMillis();
        List<List<WordClickCount>> results1 = pinyinWordService.analyzeSearch("xiugais機haoqyxgai修改");
        long end = System.currentTimeMillis();
        System.out.println(end - start + " ms");
    }
    long end1 = System.currentTimeMillis();
    System.out.println(end1 - start1 + " ms");
}
Test results:
The test analyzed 100 user inputs of 11 lexemes each and took 22.1 seconds in total, about 221 ms per input, which is acceptable.
The SQL uses a join and a count(). When the data volume grows, consider adding a click-count column to pinyinword and running a daily scheduled job that updates the counts into it, so that the query can hit the pinyinword table alone.