What is Jcseg?
Jcseg is a lightweight Chinese word tokenizer based on the mmseg algorithm. It also integrates keyword extraction, key-phrase extraction, key-sentence extraction, and automatic document summarization; it ships with a Jetty-based web server so that any language can call it directly over HTTP, and it provides tokenizer interfaces for the latest versions of Lucene, Solr, and Elasticsearch. Jcseg comes with a jcseg.properties file for quickly configuring the tokenizer for different scenarios, for example: the maximum match word length, whether to enable Chinese personal-name recognition, whether to append pinyin, whether to append synonyms, and so on.
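All of Jcseg's modes are built on mmseg-style dictionary matching. As a toy illustration of the kind of ambiguity these modes must resolve (e.g. the classic sentence 研究生命起源), here is a minimal forward-maximum-matching sketch. This is illustrative only, not Jcseg's code; the class name and the tiny dictionary are made up.

```java
import java.util.*;

// Toy forward maximum matching (FMM), NOT Jcseg's implementation.
// mmseg's complex mode scores three-word "chunks" to resolve exactly
// the kind of ambiguity FMM gets wrong below.
public class FmmDemo {
    // Greedily take the longest dictionary word (up to maxLen chars)
    // starting at each position; fall back to a single character.
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String token = text.substring(i, i + 1); // single-char fallback
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { token = cand; break; }
            }
            tokens.add(token);
            i += token.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        // FMM greedily picks "研究生" and mis-segments the rest:
        System.out.println(segment("研究生命起源", dict, 3)); // [研究生, 命, 起源]
    }
}
```

mmseg's complex mode would prefer the segmentation 研究/生命/起源 here by comparing chunk lengths and variances rather than committing to the first greedy match.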
Six segmentation modes:
+--------Jcseg chinese word tokenizer demo---------------+
|- @Author chenxin<chenxin619315@gmail.com>              |
|- :seg_mode : switch to specified tokenizer mode.       |
|- (:complex,:simple,:search,:detect,:delimiter,:NLP)    |
|- :keywords : switch to keywords extract mode.          |
|- :keyphrase : switch to keyphrase extract mode.        |
|- :sentence : switch to sentence extract mode.          |
|- :summary : switch to summary extract mode.            |
|- :help : print this help menu.                         |
|- :quit : to exit the program.                          |
+--------------------------------------------------------+
jcseg~tokenizer:complex>>
歧義和同義詞:研究生命起源,混合詞:作B超檢查身體,x射線本質是什麼,今天去奇都ktv唱卡拉ok去,哆啦a夢是一個動漫中的主角,單位和全角: 2009年8月6日開始大學之旅,岳陽今天的氣溫爲38.6℃,也就是101.48℉,中文數字/分數:你分三十分之二,小陳拿三十分之五,剩下的三十分之二十三所有是個人,那是一九九八年前的事了,四川麻辣燙很好吃,五四運動留下的五四精神。筆記本五折包郵虧本大甩賣。人名識別:我是陳鑫,也是jcseg的做者,三國時期的諸葛亮是個天才,咱們一塊兒給劉翔加油,羅志高興奮極了由於老吳送了他一臺筆記本。外文名識別:冰島時間7月1日,正在當地拍片的湯姆·克魯斯經過發言人認可,他與第三任妻子凱蒂·赫爾墨斯(第一二任妻子分別爲咪咪·羅傑斯、妮可·基德曼)的婚姻即將結束。配對標點:本次『暢想杯』黑客技術大賽的得主爲電信09-2BF的張三,獎勵C++程序設計語言一書和【暢想網絡】的『PHP教程』一套。特殊字母:【Ⅰ】(Ⅱ),英文數字: bug report chenxin619315@gmail.com or visit http://code.google.com/p/jcseg, we all admire the hacker spirit!特殊數字: ① ⑩⑽㈩.
歧義/n和/o同義詞/n :/w研究/vn琢磨/vn研討/vn鑽研/vn生命/n起源/n,/w混合詞:/w作/v b超/n檢查/vn身體/n,/w x射線/n x光線/n本質/n是/a什麼/n,/w今天/t去/q奇都ktv/nz唱/n卡拉ok/nz去/q,/w哆啦a夢/nz是/a一個/q動漫/n中/q的/u主角/n,/w單位/n和/o全角/nz :/w 2009年/m 8月/m 6日/m開始/n大學/n之旅,/w岳陽/ns今天/t的/u氣溫/n爲/u 38.6℃/m ,/w也就是/v 101.48℉/m ,/w中文/n國語/n數字/n //w分數/n :/w你/r分/h三十分之二/m ,/w小陳/nr拿/nh三十分之五/m ,/w剩下/v的/u三十分之二十三/m所有/a是/a個人/nt,/w那是/c一九九八年/m 1998年/m前/v的/u事/i了/i,/w四川/ns麻辣燙/n很/m好吃/v,/w五四運動/nz留下/v的/u五四/m 54/m精神/n。/w筆記本/n五折/m 5折/m包郵虧本/v大甩賣甩賣。/w人名/n識別/v :/w我/r是/a陳鑫/nr,/w也/e是/a jcseg/en的/u做者/n,/w三國/mq時期/n的/u諸葛亮/nr是個天才/n,/w咱們/r一塊兒/d給/v劉翔/nr加油/v,/w羅志高/nr興奮/v極了/u由於/c老吳/nr送了他/r一臺筆記本/n。/w外文/n名/j識別/v:/w冰島/ns時間/n 7月/m 1日/m,/w正在/u當地/s拍片/vi的/u湯姆·克魯斯/nr阿湯哥/nr經過/v發言人/n認可/v,/w他/r與/u第三/m任/q妻子/n凱蒂·赫爾墨斯/nr(/w第一/a二/j任/q妻子/n分別爲咪咪·羅傑斯/nr、/w妮可·基德曼/nr)/w的/u婚姻/n即將/d結束/v。/w配對/v標點/n :/w本次/r『/w暢想杯/nz』/w黑客/n技術/n大賽/vn的/u得主/n爲/u電信/nt 09/en -/w bf/en 2bf/en的/u張三/nr,/w獎勵/vn c++/en程序設計/gi語言/n一書/ns和/o【/w暢想網絡/nz】/w的/u『/w PHP教程/nz』/w一套/m。/w特殊/a字母/n :/w【/wⅠ/nz】/w(/wⅡ/m)/w,/w英文/n英語/n數字/n :/w bug/en report/en chenxin/en 619315/en gmail/en com/en chenxin619315@gmail.com/en or/en visit/en http/en :/w //w //w code/en google/en com/en code.google.com/en //w p/en //w jcseg/en ,/w we/en all/en admire/en appreciate/en like/en love/en enjoy/en the/en hacker/en spirit/en mind/en !/w特殊/a數字/n :/w ①/m ⑩/m⑽/m㈩/m ./w
Jcseg has been available in the Maven repository since version 1.9.8:
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-core</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-analyzer</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-elasticsearch</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.lionsoul</groupId>
    <artifactId>jcseg-server</artifactId>
    <version>2.2.0</version>
</dependency>
//lucene 5.x
//Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);
//available constructor: since 1.9.8
//1, JcsegAnalyzer5X(int mode)
//2, JcsegAnalyzer5X(int mode, String proFile)
//3, JcsegAnalyzer5X(int mode, JcsegTaskConfig config)
//4, JcsegAnalyzer5X(int mode, JcsegTaskConfig config, ADictionary dic)
//lucene 4.x
//Analyzer analyzer = new JcsegAnalyzer4X(JcsegTaskConfig.COMPLEX_MODE);
//lucene 6.3.0 and above
Analyzer analyzer = new JcsegAnalyzer(JcsegTaskConfig.COMPLEX_MODE);
//available constructor:
//1, JcsegAnalyzer(int mode)
//2, JcsegAnalyzer(int mode, String proFile)
//3, JcsegAnalyzer(int mode, JcsegTaskConfig config)
//4, JcsegAnalyzer(int mode, JcsegTaskConfig config, ADictionary dic)
//Optional (for overriding the default configuration): get the tokenizer task config instance
JcsegAnalyzer jcseg = (JcsegAnalyzer) analyzer;
JcsegTaskConfig config = jcseg.getTaskConfig();
//append synonyms; also requires jcseg.loadsyn=1 in jcseg.properties
config.setAppendCJKSyn(true);
//append pinyin; also requires jcseg.loadpinyin=1 in jcseg.properties
config.setAppendCJKPinyin(true);
//for more options, see org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig
<!-- Complex mode: -->
<fieldtype name="textComplex" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="complex"/>
</analyzer>
</fieldtype>
<!-- Simple mode: -->
<fieldtype name="textSimple" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="simple"/>
</analyzer>
</fieldtype>
<!-- Detect mode: -->
<fieldtype name="textDetect" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="detect"/>
</analyzer>
</fieldtype>
<!-- Search mode: -->
<fieldtype name="textSearch" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="search"/>
</analyzer>
</fieldtype>
<!-- NLP mode: -->
<fieldtype name="textNLP" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="nlp"/>
</analyzer>
</fieldtype>
<!-- Delimiter (whitespace) mode: -->
<fieldtype name="textDelimiter" class="solr.TextField">
<analyzer>
<tokenizer class="org.lionsoul.jcseg.analyzer.JcsegTokenizerFactory" mode="delimiter"/>
</analyzer>
</fieldtype>
Notes:
Available analyzer names:
jcseg: Jcseg's search-mode segmentation algorithm
jcseg_complex: Jcseg's complex-mode segmentation algorithm
jcseg_simple: Jcseg's simple-mode segmentation algorithm
jcseg_detect: Jcseg's detect-mode segmentation algorithm
jcseg_search: Jcseg's search-mode segmentation algorithm
jcseg_nlp: Jcseg's NLP-mode segmentation algorithm
jcseg_delimiter: Jcseg's delimiter-mode segmentation algorithm
Test URL for the configuration:
http://localhost:9200/_analyze?analyzer=jcseg_search&text=一百美圓等於多少人民幣
Sample results:
GET _analyze?pretty
{
"analyzer": "jcseg_complex",
"text": "中達廣場浦發銀行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "達",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "廣場",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "發",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "銀行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_simple",
"text": "中達廣場浦發銀行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "達",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "廣場",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "發",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "銀行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_detect",
"text": "中達廣場浦發銀行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "達",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "廣場",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "發",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "銀行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_search",
"text": "中達廣場浦發銀行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "達",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "廣",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 2
},
{
"token": "廣場",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 3
},
{
"token": "場",
"start_offset": 3,
"end_offset": 4,
"type": "word",
"position": 4
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 5
},
{
"token": "發",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 6
},
{
"token": "銀",
"start_offset": 6,
"end_offset": 7,
"type": "word",
"position": 7
},
{
"token": "銀行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 8
},
{
"token": "行",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 9
},
{
"token": "信",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 10
},
{
"token": "信用",
"start_offset": 8,
"end_offset": 10,
"type": "word",
"position": 11
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 12
},
{
"token": "用",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 13
},
{
"token": "卡",
"start_offset": 10,
"end_offset": 11,
"type": "word",
"position": 14
},
{
"token": "中",
"start_offset": 11,
"end_offset": 12,
"type": "word",
"position": 15
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 16
},
{
"token": "心",
"start_offset": 12,
"end_offset": 13,
"type": "word",
"position": 17
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_nlp",
"text": "中達廣場浦發銀行信用卡中心"
}
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "word",
"position": 0
},
{
"token": "達",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "廣場",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 2
},
{
"token": "浦",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 3
},
{
"token": "發",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 4
},
{
"token": "銀行",
"start_offset": 6,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "信用卡",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "中心",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
GET _analyze?pretty
{
"analyzer": "jcseg_delimiter",
"text": "中達廣場浦發銀行信用卡中心"
}
{
"tokens": [
{
"token": "中達廣場浦發銀行信用卡中心",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
}
]
}
You can also use elasticsearch-jcseg, a prebuilt Elasticsearch distribution that bundles jcseg and works out of the box.
The jcseg-server module embeds Jetty to implement a high-performance server that exposes all of Jcseg's APIs through RESTful interfaces and standardizes the JSON output format, so any language can call them with a plain HTTP client.
# pass the path of the jcseg-server.properties configuration file as the last argument
java -jar jcseg-server-{version}.jar ./jcseg-server.properties
The configuration options are explained by the inline comments below:
# jcseg server configuration file with standard json syntax
{
# jcseg server configuration
"server_config": {
# server port
"port": 1990,
# default communication charset
"charset": "utf-8",
# http idle timeout in ms
"http_connection_idle_timeout": 60000,
# jetty maximum thread pool size
"max_thread_pool_size": 200,
# thread idle timeout in ms
"thread_idle_timeout": 30000,
# http output buffer size
"http_output_buffer_size": 32768,
# request header size
"http_request_header_size": 8192,
# response header size
"http_response_header_size": 8192
},
# global setting for jcseg, yet another copy of the old
# configuration file jcseg.properties
"jcseg_global_config": {
# maximum match length. (5-7)
"jcseg_maxlen": 7,
# whether to recognize Chinese personal names
# (true to open it and false to close it)
"jcseg_icnname": true,
# maximum length for pair punctuation text.
# set it to 0 to close this function
"jcseg_pptmaxlen": 7,
# maximum length for chinese last name andron.
"jcseg_cnmaxlnadron": 1,
# Whether to clear the stopwords.
# (set true to clear stopwords and false to close it)
"jcseg_clearstopword": false,
# Whether to convert Chinese numerics to Arabic numbers,
# e.g. '三万' to 30000. (set it to true to open it and false to close it)
"jcseg_cnnumtoarabic": true,
# Whether to convert Chinese fractions to Arabic fractions.
# @Note: for lucene, solr, elasticsearch etc., close it.
"jcseg_cnfratoarabic": false,
# Whether to keep the unrecognized word.
# (set true to keep unrecognized word and false to clear it)
"jcseg_keepunregword": true,
# Whether to start the secondary segmentation for the complex english words.
"jcseg_ensencondseg": true,
# min length of the secondary simple token.
# (better larger than 1)
"jcseg_stokenminlen": 2,
# threshold for Chinese name recognition.
# better not change it unless you know what you are doing.
"jcseg_nsthreshold": 1000000,
# punctuations that will be kept inside a token
# (not at the end of the token).
"jcseg_keeppunctuations": "@#%.&+"
},
# dictionary instance setting.
# add yours here with standard json syntax
"jcseg_dict": {
"master": {
"path": [
"{jar.dir}/lexicon"
# absolute path here
#"/java/JavaSE/jcseg/lexicon"
],
# Whether to load the part of speech of the words
"loadpos": true,
# Whether to load the pinyin of the words.
"loadpinyin": true,
# Whether to load the synonyms of the words.
"loadsyn": true,
# whether to load the entity of the words.
"loadentity": true,
# Whether to auto-reload modified lexicon files.
"autoload": true,
# Poll time for auto load. (in seconds)
"polltime": 300
}
# add more of yours here
# ,"name" : {
# "path": [
# "absolute jcseg standard lexicon path 1",
# "absolute jcseg standard lexicon path 2"
# ...
# ],
# "autoload": 0,
# "polltime": 300
# }
},
# JcsegTaskConfig instance setting.
# @Note:
# all config instances here extend the global settings above;
# an empty instance inherits everything from jcseg_global_config.
"jcseg_config": {
"master": {
# extends and Override the global setting
"jcseg_pptmaxlen": 0,
"jcseg_cnfratoarabic": true,
"jcseg_keepunregword": false
}
# this one is for keywords, keyphrase, sentence and summary extraction
# @Note: do not delete this instance if you want jcseg to
# offer you the extractor service
,"extractor": {
"jcseg_pptmaxlen": 0,
"jcseg_clearstopword": true,
"jcseg_cnnumtoarabic": false,
"jcseg_cnfratoarabic": false,
"jcseg_keepunregword": false,
"jcseg_ensencondseg": false
}
# well, this one is for NLP only
,"nlp" : {
"jcseg_ensencondseg": false,
"jcseg_cnfratoarabic": true,
"jcseg_cnnumtoarabic": true
}
# add more of yours here
# ,"name": {
# ...
# }
},
# jcseg tokenizer instance setting.
# You can access a tokenizer instance defined here at:
# http://jcseg_server_host:port/tokenizer/instance_name
# where instance_name is the name of the instance defined here.
"jcseg_tokenizer": {
"master": {
# jcseg tokenizer algorithm, could be:
# 1: SIMPLE_MODE
# 2: COMPLEX_MODE
# 3: DETECT_MODE
# 4: SEARCH_MODE
# 5: DELIMITER_MODE
# 6: NLP_MODE
# see org.lionsoul.jcseg.tokenizer.core.JcsegTaskConfig for more info
"algorithm": 2,
# dictionary instance name
# choose one of your defines above in the dict scope
"dict": "master",
# JcsegTaskConfig instance name
# choose one of your defines above in the config scope
"config": "master"
}
# this tokenizer instance is for the extractor service
# do not delete it if you want jcseg to offer you the extractor service
,"extractor": {
"algorithm": 2,
"dict": "master",
"config": "extractor"
}
# this tokenizer instance is for NLP analysis
# keep it for your NLP projects
,"nlp" : {
"algorithm": 6,
"dict": "master",
"config": "nlp"
}
# add more of yours here
# ,"name": {
# ...
# }
}
}
API endpoint: http://jcseg_server_host:port/extractor/keywords?text=&number=&autoFilter=true|false
API parameters:
text: the document text, sent via POST or GET
number: the number of keywords to extract
autoFilter: whether to automatically filter out low-score keywords
API response:
{
//API error code: 0 for success, 1 for a parameter error, -1 for an internal error
"code": 0,
//API response data
"data": {
//keyword array
"keywords": [],
//time taken by the operation
"took": 0.001
}
}
For more options, see org.lionsoul.jcseg.server.controller.KeywordsController
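As a sketch of how a client might call the keywords endpoint: the snippet below builds the request URL (the host localhost and port 1990 come from the sample server configuration above; the class and method names are made up for illustration). The actual HTTP call is left commented out since it needs a running jcseg-server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Build the request URL for the /extractor/keywords API.
// The query text must be URL-encoded; the rest is a plain query string.
public class KeywordsClient {
    public static String buildUrl(String host, int port, String text,
                                  int number, boolean autoFilter) throws UnsupportedEncodingException {
        return "http://" + host + ":" + port + "/extractor/keywords"
            + "?text=" + URLEncoder.encode(text, "UTF-8")
            + "&number=" + number
            + "&autoFilter=" + autoFilter;
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("localhost", 1990, "分詞", 10, true);
        System.out.println(url);
        // To actually fetch the JSON response you could use java.net.http.HttpClient:
        // HttpResponse<String> resp = HttpClient.newHttpClient().send(
        //     HttpRequest.newBuilder(URI.create(url)).build(),
        //     HttpResponse.BodyHandlers.ofString());
    }
}
```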
API endpoint: http://jcseg_server_host:port/extractor/keyphrase?text=&number=
API parameters:
text: the document text, sent via POST or GET
number: the number of key phrases to extract
API response:
{
"code": 0,
"data": {
"took": 0.0277,
//key-phrase array
"keyphrase": []
}
}
For more options, see org.lionsoul.jcseg.server.controller.KeyphraseController
API endpoint: http://jcseg_server_host:port/extractor/sentence?text=&number=
API parameters:
text: the document text, sent via POST or GET
number: the number of key sentences to extract
API response:
{
"code": 0,
"data": {
"took": 0.0277,
//key-sentence array
"sentence": []
}
}
For more options, see org.lionsoul.jcseg.server.controller.SentenceController
API endpoint: http://jcseg_server_host:port/extractor/summary?text=&length=
API parameters:
text: the document text, sent via POST or GET
length: the length of the summary to extract
API response:
{
"code": 0,
"data": {
"took": 0.0277,
//document summary
"summary": ""
}
}
For more options, see org.lionsoul.jcseg.server.controller.SummaryController
API endpoint: http://jcseg_server_host:port/tokenizer/tokenizer_instance?text=&ret_pinyin=&ret_pos=...
API parameters:
tokenizer_instance: the name of a tokenizer instance defined in jcseg-server.properties
text: the document text, sent via POST or GET
ret_pinyin: whether to return the pinyin of each token (removed since version 2.0.1)
ret_pos: whether to return the part of speech of each token (removed since version 2.0.1)
API response:
{
"code": 0,
"data": {
"took": 0.00885,
//token object array
"list": [
{
word: "哆啦a夢", //token text
position: 0, //index of the token in the source text
length: 4, //number of words in the token (not bytes)
pinyin: "duo la a meng", //pinyin of the token
pos: "nz", //part-of-speech tag of the token
entity: null //entity tag of the token
}
]
}
}
For more options, see org.lionsoul.jcseg.server.controller.TokenizerController
jcseg.properties lookup procedure:
By default, then, you can put a jcseg.properties file in the same directory as jcseg-core-{version}.jar to customize the configuration.
The JcsegTaskConfig constructors are:
JcsegTaskConfig(); //initialize without any configuration-file lookup
JcsegTaskConfig(boolean autoLoad); //autoLoad=true looks up a configuration file automatically
JcsegTaskConfig(java.lang.String proFile); //initialize the config object from the given configuration file
JcsegTaskConfig(InputStream is); //initialize the config object from the given input stream
Demo code:
//create a JcsegTaskConfig with the default settings, without any configuration-file lookup
JcsegTaskConfig config = new JcsegTaskConfig();
//this constructor follows the "jcseg.properties lookup procedure" above to find and load jcseg.properties:
JcsegTaskConfig config = new JcsegTaskConfig(true);
//create and initialize a JcsegTaskConfig from the given jcseg.properties file
JcsegTaskConfig config = new JcsegTaskConfig("absolute or relative jcseg.properties path");
//call JcsegTaskConfig#load(String proFile) to load options from the given configuration file
config.load("absolute or relative jcseg.properties path");
The ADictionary constructor is:
ADictionary(JcsegTaskConfig config, java.lang.Boolean sync)
//config: a JcsegTaskConfig instance as above
//sync: whether to create a thread-safe dictionary; pass true if you need to modify the dictionary at runtime.
// If autoload=1 is set in jcseg.properties, a synchronized dictionary is created automatically.
Demo code:
//Jcseg provides org.lionsoul.jcseg.tokenizer.core.DictionaryFactory for creating dictionaries and for forward compatibility.
//You can usually create a dictionary object and load the lexicon files through:
// DictionaryFactory#createDefaultDictionary(JcsegTaskConfig)
// DictionaryFactory#createSingletonDictionary(JcsegTaskConfig)
//createSingletonDictionary, which creates a singleton dictionary, is recommended.
//config is the JcsegTaskConfig object created above.
//If the lexicon path in the given JcsegTaskConfig is correct,
//ADictionary loads all valid lexicons according to the config;
//this method also uses config.isAutoload() to decide whether the dictionary is synchronized:
//true creates a synchronized dictionary, false a non-synchronized one.
//config.isAutoload() corresponds to lexicon.autoload in jcseg.properties.
//If config.getLexiconPath() == null, DictionaryFactory automatically loads the lexicons on the classpath;
//if you do not want it to load those lexicons automatically,
//call DictionaryFactory.createSingletonDictionary(config, false) instead.
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
//create a non-synchronized ADictionary that loads lexicons from config.lexPath
ADictionary dic = DictionaryFactory.createDefaultDictionary(config, false);
//create a synchronized ADictionary that loads lexicons from config.lexPath
ADictionary dic = DictionaryFactory.createDefaultDictionary(config, true);
//let config.isAutoload() decide the synchronization; loads lexicons from config.lexPath
ADictionary dic = DictionaryFactory.createDefaultDictionary(config, config.isAutoload());
//make the ADictionary load the entries of all lexicon files under the given directories;
//config.getLexiconPath() is an array of valid lexicon directories.
for ( String path : config.getLexiconPath() ) {
    dic.loadDirectory(path);
}
//make the ADictionary load the entries of a given lexicon file
dic.load("/java/lex-main.lex");
dic.load(new File("/java/lex-main.lex"));
//make the ADictionary load the entries from a given input stream
dic.load(new FileInputStream("/java/lex-main.lex"));
//read "How to use custom lexicons" below for more information
Core segmentation method of the ISegment interface:
public IWord next();
//returns the next segmented token
Demo code:
//create an ISegment from the given ADictionary and JcsegTaskConfig;
//SegmentFactory#createJcseg is the usual way to create an ISegment object:
//pass config and dic to SegmentFactory.createJcseg as an Object array
//JcsegTaskConfig.COMPLEX_MODE creates a ComplexSeg complex-mode ISegment object
//JcsegTaskConfig.SIMPLE_MODE creates a SimpleSeg simple-mode ISegment object
//JcsegTaskConfig.DETECT_MODE creates a DetectSeg ISegment object
//JcsegTaskConfig.SEARCH_MODE creates a SearchSeg ISegment object
//JcsegTaskConfig.DELIMITER_MODE creates a DelimiterSeg ISegment object
//JcsegTaskConfig.NLP_MODE creates an NLPSeg ISegment object
ISegment seg = SegmentFactory.createJcseg(
    JcsegTaskConfig.COMPLEX_MODE,
    new Object[]{config, dic}
);
//set the text to segment
String str = "研究生命起源。";
seg.reset(new StringReader(str));
//fetch the segmentation results
IWord word = null;
while ( (word = seg.next()) != null ) {
    System.out.println(word.getValue());
}
//create a JcsegTaskConfig tokenizer configuration, automatically looking up and loading jcseg.properties
JcsegTaskConfig config = new JcsegTaskConfig(true);
//create the default singleton dictionary and load the lexicons according to config
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
//create an ISegment from the given ADictionary and JcsegTaskConfig;
//for forward API compatibility, use SegmentFactory to create ISegment objects
String str = "研究生命起源。";
ISegment seg = SegmentFactory.createJcseg(
    JcsegTaskConfig.COMPLEX_MODE,
    new Object[]{new StringReader(str), config, dic}
);
//Note: the code below can be called repeatedly; seg is not thread-safe.
//set the text to segment
seg.reset(new StringReader(str));
//fetch the segmentation results
IWord word = null;
while ( (word = seg.next()) != null ) {
    System.out.println(word.getValue());
}
Since version 1.9.9, Jcseg has bundled jcseg.properties and the whole lexicon into jcseg-core-{version}.jar by default. If the JcsegTaskConfig is constructed with JcsegTaskConfig(true), or JcsegTaskConfig#autoLoad() is called, and no custom configuration file is found, Jcseg automatically loads the configuration file from the classpath; likewise, if config.getLexiconPath() == null, DictionaryFactory automatically loads the lexicons from the classpath.
//1. construct JcsegTaskConfig with defaults, without any configuration-file lookup
JcsegTaskConfig config = new JcsegTaskConfig();
//2. set the custom lexicon paths
config.setLexiconPath(new String[]{
    "relative or absolute lexicon path1",
    "relative or absolute lexicon path2"
    //add more here
});
//3. construct the dictionary from config; DictionaryFactory automatically loads all lexicons from the paths set above
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
//1. construct a default JcsegTaskConfig, without any configuration-file lookup
JcsegTaskConfig config = new JcsegTaskConfig();
//2. construct the ADictionary object;
//note the second argument is false, preventing DictionaryFactory from loading lexicons from config.getLexiconPath() automatically
ADictionary dic = DictionaryFactory.createSingletonDictionary(config, false);
//3. load the lexicons manually
dic.load(new File("absolute or relative lexicon file path")); //load all entries of the given lexicon file
dic.load("absolute or relative lexicon file path"); //load all entries of the given lexicon file
dic.load(new FileInputStream("absolute or relative lexicon file path")); //load all entries from the given InputStream
dic.loadDirectory("absolute or relative lexicon directory"); //load all entries of all lexicon files under the given directory
dic.loadClassPath(); //load all entries of all lexicon files on the classpath (default path: /lexicon)
TextRankKeywordsExtractor(ISegment seg);
//seg: a Jcseg ISegment tokenizer object
//1. create the Jcseg ISegment tokenizer object
JcsegTaskConfig config = new JcsegTaskConfig(true);
config.setClearStopwords(true); //filter out stop words
config.setAppendCJKSyn(false); //disable synonym appending
config.setKeepUnregWords(false); //drop unrecognized tokens
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
ISegment seg = SegmentFactory.createJcseg(
    JcsegTaskConfig.COMPLEX_MODE,
    new Object[]{config, dic}
);
//2. build the TextRankKeywordsExtractor keyword extractor
TextRankKeywordsExtractor extractor = new TextRankKeywordsExtractor(seg);
extractor.setMaxIterateNum(100); //maximum iterations of the PageRank algorithm; optional, the default is fine
extractor.setWindowSize(5); //TextRank window size; optional, the default is fine
extractor.setKeywordsNum(10); //maximum number of keywords to return, default 10
//3. extract keywords from a Reader input stream
String str = "現有的分詞算法可分爲三大類:基於字符串匹配的分詞方法、基於理解的分詞方法和基於統計的分詞方法。按照是否與詞性標註過程相結合,又能夠分爲單純分詞方法和分詞與標註相結合的一體化方法。";
List<String> keywords = extractor.getKeywords(new StringReader(str));
//4. output:
//"分詞","方法","分爲","標註","相結合","字符串","匹配","過程","大類","單純"
TextRankSummaryExtractor(ISegment seg, SentenceSeg sentenceSeg);
//seg: a Jcseg ISegment tokenizer object
//sentenceSeg: a Jcseg SentenceSeg sentence splitter object
//1. create the Jcseg ISegment tokenizer object
JcsegTaskConfig config = new JcsegTaskConfig(true);
config.setClearStopwords(true); //filter out stop words
config.setAppendCJKSyn(false); //disable synonym appending
config.setKeepUnregWords(false); //drop unrecognized tokens
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
ISegment seg = SegmentFactory.createJcseg(
    JcsegTaskConfig.COMPLEX_MODE,
    new Object[]{config, dic}
);
//2. construct the TextRankSummaryExtractor automatic-summary extractor
SummaryExtractor extractor = new TextRankSummaryExtractor(seg, new SentenceSeg());
//3. extract a summary of the given length from a Reader input stream
String str = "Jcseg是基於mmseg算法的一個輕量級開源中文分詞器,同時集成了關鍵字提取,關鍵短語提取,關鍵句子提取和文章自動摘要等功能,而且提供了最新版本的lucene, solr, elasticsearch的分詞接口。Jcseg自帶了一個 jcseg.properties文件用於快速配置而獲得適合不一樣場合的分詞應用。例如:最大匹配詞長,是否開啓中文人名識別,是否追加拼音,是否追加同義詞等!";
String summary = extractor.getSummary(new StringReader(str), 64);
//4. output:
//Jcseg是基於mmseg算法的一個輕量級開源中文分詞器,同時集成了關鍵字提取,關鍵短語提取,關鍵句子提取和文章自動摘要等功能,而且提供了最新版本的lucene, solr, elasticsearch的分詞接口。
//-----------------------------------------------------------------
//5. extract n key sentences from a Reader input stream
String str = "your source string here";
extractor.setSentenceNum(6); //number of key sentences to return
List<String> keySentences = extractor.getKeySentence(new StringReader(str));
TextRankKeyphraseExtractor(ISegment seg);
//seg: a Jcseg ISegment tokenizer object
//1. create the Jcseg ISegment tokenizer object
JcsegTaskConfig config = new JcsegTaskConfig(true);
config.setClearStopwords(false); //do not filter out stop words
config.setAppendCJKSyn(false); //disable synonym appending
config.setKeepUnregWords(false); //drop unrecognized tokens
config.setEnSecondSeg(false); //disable the secondary segmentation of English words
ADictionary dic = DictionaryFactory.createSingletonDictionary(config);
ISegment seg = SegmentFactory.createJcseg(
    JcsegTaskConfig.COMPLEX_MODE,
    new Object[]{config, dic}
);
//2. build the TextRankKeyphraseExtractor key-phrase extractor
TextRankKeyphraseExtractor extractor = new TextRankKeyphraseExtractor(seg);
extractor.setMaxIterateNum(100); //maximum iterations of the PageRank algorithm; optional, the default is fine
extractor.setWindowSize(5); //TextRank window size; optional, the default is fine
extractor.setKeywordsNum(15); //maximum number of keywords to return, default 10
extractor.setMaxWordsNum(4); //maximum phrase length in words, default 5
//3. extract key phrases from a Reader input stream
String str = "支持向量機普遍應用於文本挖掘,例如,基於支持向量機的文本自動分類技術研究一文中很詳細的介紹支持向量機的算法細節,文本自動分類是文本挖掘技術中的一種!";
List<String> keyphrases = extractor.getKeyphrase(new StringReader(str));
//4. output:
//支持向量機, 自動分類
(The part-of-speech tag set: noun n, time word t, locative s, direction word f, numeral m, measure word q, distinguishing word b, pronoun r, verb v, adjective a, stative word z, adverb d, preposition p, conjunction c, particle u, modal word y, interjection e, onomatopoeia o, idiom i, idiomatic phrase l, abbreviation j, prefix component h, suffix component k, morpheme g, non-morpheme character x, punctuation w.) In addition, from the perspective of corpus applications, proper-noun tags were added: person name nr, place name ns, organization name nt, other proper noun nz.
Format:
root word,synonym 1[/optional pinyin],synonym 2[/optional pinyin],...,synonym n[/optional pinyin]
For example:
Single-line definition:
研究,研討,鑽研,研磨/yan mo,研發
Multi-line definition (as long as the root word is the same, all of the defined synonyms belong to the same set):
中央一臺,央視一臺,中央第一臺
中央一臺,中央第一頻道,央視第一臺,央視第一頻道
1. The first word is the root of the synonym set. It must exist in the CJK_WORD lexicon; if it does not, the synonym definition is ignored.
2. The root word distinguishes synonym sets across lines; if two lines define synonyms with the same root word, they are automatically merged into one synonym set.
3. Jcseg manages synonym sets with org.lionsoul.jcseg.tokenizer.core.SynonymsEntry; every IWord token object has a SynonymsEntry attribute pointing to its own synonym set.
4. SynonymsEntry.rootWord stores the root word of the synonym set; when merging synonyms, it is recommended to uniformly replace them with the root word.
5. For synonyms other than the root, Jcseg automatically detects them, creates the corresponding IWord objects, and adds them to the CJK_WORD lexicon; that is, the other synonyms do not have to already exist in the CJK_WORD lexicon.
6. The other synonyms automatically inherit the part of speech and entity definition of the root word, and also inherit the pinyin definition of the same word in the CJK_WORD lexicon (if the word exists there); you can also define the pinyin separately by appending "/pinyin" after the word.
7. All IWord tokens in one synonym definition set point to the same SynonymsEntry object; that is, synonyms automatically reference each other.
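The merge-by-root behavior described in rule 2 can be modeled with a small sketch. This is not Jcseg's SynonymsEntry implementation, just a toy model of the rule, using the multi-line example above:

```java
import java.util.*;

// Toy model of rule 2: lines whose first (root) word is identical
// are merged into one synonym set. Not Jcseg's implementation.
public class SynMerge {
    public static Map<String, Set<String>> merge(List<String> lines) {
        Map<String, Set<String>> sets = new LinkedHashMap<>();
        for (String line : lines) {
            String[] words = line.split(",");
            String root = words[0];
            Set<String> set = sets.computeIfAbsent(root, k -> new LinkedHashSet<>());
            // strip an optional "/pinyin" suffix from each synonym
            for (String w : words) set.add(w.split("/")[0]);
        }
        return sets;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> sets = merge(Arrays.asList(
            "中央一臺,央視一臺,中央第一臺",
            "中央一臺,中央第一頻道,央視第一臺,央視第一頻道"
        ));
        // both lines share the root word, so they form one set of 6 synonyms
        System.out.println(sets.get("中央一臺").size()); // 6
    }
}
```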
Copy it into the {ESHOME}/plugins/jcseg directory and restart Elasticsearch.
In Kibana, first fetch the mapping that needs modifying:
GET _template/logstash
Copy the returned mapping and set the analyzer to jcseg_search:
PUT _template/logstash
{
"order": 0,
"version": 50001,
"template": "logstash-*",
"settings": {
"index": {
"refresh_interval": "5s"
}
},
"mappings": {
"_default_": {
"dynamic_templates": [
{
"message_field": {
"path_match": "message",
"mapping": {
"norms": false,
"analyzer": "jcseg_search",
"search_analyzer": "jcseg_search",
"type": "text"
},
"match_mapping_type": "string"
}
},
{
"string_fields": {
"mapping": {
"norms": false,
"analyzer": "jcseg_search",
"search_analyzer": "jcseg_search",
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"match_mapping_type": "string",
"match": "*"
}
}
],
"_all": {
"norms": false,
"analyzer": "jcseg_search",
"search_analyzer": "jcseg_search",
"enabled": true
},
"properties": {
"@timestamp": {
"include_in_all": false,
"type": "date"
},
"geoip": {
"dynamic": true,
"properties": {
"ip": {
"type": "ip"
},
"latitude": {
"type": "half_float"
},
"location": {
"type": "geo_point"
},
"longitude": {
"type": "half_float"
}
}
},
"@version": {
"include_in_all": false,
"type": "keyword"
}
}
}
},
"aliases": {}
}
This change only takes effect for newly created indices.