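There are many ways to stem English words; in practice the task can draw on machine learning, data mining, and related fields.
This post covers a few classic algorithms that are easy to implement:
Lovins (1968)
Porter (1980)
Porter2 (2000)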
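Lovins is the earliest of the three.
The algorithm uses the following components:
ending: a word suffix. There are 294 endings; the full list appears below.
condition: the condition under which an ending may be removed. Each ending is tied to one condition, and there are 29 conditions; the full list appears below.
transformation: a rule that rewrites the tail of the remaining stem. There are 35 transformations; the full list appears below.
The algorithm has two steps:
Scan the ending list from longest to shortest and pick the first ending that matches the end of the word and whose condition is satisfied.
Apply the transformations to the remaining stem to bring its tail into the proper form.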
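Take the word nationally. Scanning the ending list from longest to shortest, the first match is .09. ationally B.
Its condition is B (Minimum stem length = 3): the part left after removing the ending must be at least 3 letters long.
Removing ationally from nationally leaves only n, so the condition fails.
Scanning continues and finds .07. ionally A, whose condition is A (No restrictions on stem), so nothing blocks the removal.
ionally is therefore chosen as the ending.
The stem of nationally is nat. No transformation matches it, so it is output unchanged.
As another example, take sitting. The first step yields the stem sitt (ending ing, condition N); the second step applies the first transformation (remove one of a doubled letter), and the final output is sit.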
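The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables below hold a handful of entries copied from the full lists (not all 294 endings, 29 conditions, and 35 transformations), and the names ENDINGS, CONDITIONS, strip_ending, undouble, and lovins_sketch are made up for this example.

```python
# Small subset of the ending list: ending -> condition code.
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ally": "B",
    "ing": "N",
    "ly": "B",
    "y": "B",
}

# Subset of the conditions, each a predicate on the remaining stem.
CONDITIONS = {
    "A": lambda stem: True,                      # no restrictions on stem
    "B": lambda stem: len(stem) >= 3,            # minimum stem length = 3
    # N: minimum stem length = 4 after s**, elsewhere = 3
    "N": lambda stem: len(stem) >= (4 if stem[-3:-2] == "s" else 3),
}

LONGEST_ENDING = 11  # the longest endings in the full list have 11 letters


def strip_ending(word):
    """Step 1: try endings from longest to shortest, keep the first valid one."""
    for length in range(min(len(word) - 1, LONGEST_ENDING), 0, -1):
        ending, stem = word[-length:], word[:-length]
        code = ENDINGS.get(ending)
        if code is not None and CONDITIONS[code](stem):
            return stem
    return word


def undouble(stem):
    """Step 2, transformation 1 only: remove one of a doubled b, d, g, l, m, n, p, r, s, t."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in "bdglmnprst":
        return stem[:-1]
    return stem


def lovins_sketch(word):
    return undouble(strip_ending(word))


print(lovins_sketch("nationally"))  # nat
print(lovins_sketch("sitting"))     # sit
```

For nationally the scan rejects ationally (stem n is too short) and accepts ionally, matching the walk-through above.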
Endings, grouped by length (.NN. is the ending length; each ending is followed by its condition code):
.11. alistically B arizability A izationally B
.10. antialness A arisations A arizations A entialness A
.09. allically C antaneous A antiality A arisation A arization A ationally B ativeness A eableness E entations A entiality A entialize A entiation A ionalness A istically A itousness A izability A izational A
.08. ableness A arizable A entation A entially A eousness A ibleness A icalness A ionalism A ionality A ionalize A iousness A izations A lessness A
.07. ability A aically A alistic B alities A ariness E aristic A arizing A ateness A atingly A ational B atively A ativism A elihood E encible A entally A entials A entiate A entness A fulness A ibility A icalism A icalist A icality A icalize A ication G icianry A ination A ingness A ionally A isation A ishness A istical A iteness A iveness A ivistic A ivities A ization F izement A oidally A ousness A
.06. aceous A acious B action G alness A ancial A ancies A ancing B ariser A arized A arizer A atable A ations B atives A eature Z efully A encies A encing A ential A enting C entist A eously A ialist A iality A ialize A ically A icance A icians A icists A ifully A ionals A ionate D ioning A ionist A iously A istics A izable E lessly A nesses A oidism A
.05. acies A acity A aging B aical A alist A alism B ality A alize A allic BB anced B ances B antic C arial A aries A arily A arity B arize A aroid A ately A ating I ation B ative A ators A atory A ature E early Y ehood A eless A elity A ement A enced A ences A eness E ening E ental A ented C ently A fully A ially A icant A ician A icide A icism A icist A icity A idine I iedly A ihood A inate A iness A ingly B inism J inity CC ional A ioned A ished A istic A ities A itous A ively A ivity A izers F izing F oidal A oides A otide A ously A
.04. able A ably A ages B ally B ance B ancy B ants B aric A arly K ated I ates A atic B ator A ealy Y edly E eful A eity A ence A ency A ened E enly E eous A hood A ials A ians A ible A ibly A ical A ides L iers A iful A ines M ings N ions B ious A isms B ists A itic H ized F izer F less A lily A ness A ogen A ward A wise A ying B yish A
.03. acy A age B aic A als BB ant B ars O ary F ata A ate A eal Y ear Y ely E ene E ent C ery E ese A ful A ial A ian A ics A ide L ied A ier A ies P ily A ine M ing N ion Q ish C ism B ist A ite AA ity A ium A ive A ize F oid A one R ous A
.02. ae A al BB ar X as B ed E en F es E ia A ic A is A ly B on S or T um U us V yl R s' A 's A
.01. a A e A i A o A s W y B
Conditions:
A No restrictions on stem
B Minimum stem length = 3
C Minimum stem length = 4
D Minimum stem length = 5
E Do not remove ending after e
F Minimum stem length = 3 and do not remove ending after e
G Minimum stem length = 3 and remove ending only after f
H Remove ending only after t or ll
I Do not remove ending after o or e
J Do not remove ending after a or e
K Minimum stem length = 3 and remove ending only after l, i or u*e
L Do not remove ending after u, x or s, unless s follows o
M Do not remove ending after a, c, e or m
N Minimum stem length = 4 after s**, elsewhere = 3
O Remove ending only after l or i
P Do not remove ending after c
Q Minimum stem length = 3 and do not remove ending after l or n
R Remove ending only after n or r
S Remove ending only after dr or t, unless t follows t
T Remove ending only after s or t, unless t follows o
U Remove ending only after l, m, n or r
V Remove ending only after c
W Do not remove ending after s or u
X Remove ending only after l, i or u*e
Y Remove ending only after in
Z Do not remove ending after f
AA Remove ending only after d, f, ph, th, l, er, or, es or t
BB Minimum stem length = 3 and do not remove ending after met or ryst
CC Remove ending only after l
Transformations:
1 remove one of double b, d, g, l, m, n, p, r, s, t
2 iev -> ief
3 uct -> uc
4 umpt -> um
5 rpt -> rb
6 urs -> ur
7 istr -> ister
7a metr -> meter
8 olv -> olut
9 ul -> l except following a, o, i
10 bex -> bic
11 dex -> dic
12 pex -> pic
13 tex -> tic
14 ax -> ac
15 ex -> ec
16 ix -> ic
17 lux -> luc
18 uad -> uas
19 vad -> vas
20 cid -> cis
21 lid -> lis
22 erid -> eris
23 pand -> pans
24 end -> ens except following s
25 ond -> ons
26 lud -> lus
27 rud -> rus
28 her -> hes except following p, t
29 mit -> mis
30 ent -> ens except following m
31 ert -> ers
32 et -> es except following n
33 yt -> ys
34 yz -> ys
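The transformation list lends itself to a table-driven implementation. The sketch below covers only five of the rules (2, 3, 8, 24, 30) and reads "except following X" as "unless the letter right before the matched tail is X", which is an interpretation made for this example. The names TRANSFORMS and recode are made up here.

```python
# (old tail, new tail, letters that block the rule when they precede the tail)
TRANSFORMS = [
    ("iev", "ief", ""),    # rule 2
    ("uct", "uc", ""),     # rule 3
    ("olv", "olut", ""),   # rule 8
    ("end", "ens", "s"),   # rule 24: except following s
    ("ent", "ens", "m"),   # rule 30: except following m
]


def recode(stem):
    """Rewrite the tail of a stem using the first applicable rule, if any."""
    for old, new, blocked in TRANSFORMS:
        if not stem.endswith(old):
            continue
        prefix = stem[:-len(old)]
        before = prefix[-1:]          # "" when nothing precedes the tail
        if before and before in blocked:
            continue                  # the "except following ..." clause applies
        return prefix + new
    return stem


print(recode("believ"))   # belief
print(recode("resolv"))   # resolut
print(recode("extend"))   # extens
print(recode("cement"))   # cement (blocked: ent follows m)
```

Keeping the exceptions as data rather than as separate branches makes it straightforward to extend the table toward the full list.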
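The Porter algorithm defines vowels and consonants slightly differently from the usual convention:
Vowel: A, E, I, O, U, and Y when it follows a consonant
Consonant: any letter other than A, E, I, O, U, and other than Y following a consonant (so Y after a vowel counts as a consonant)
A run of consecutive vowels is treated as one vowel group V and a run of consecutive consonants as one consonant group C, so any word can be written as an alternating sequence of C and V, for example: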
segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> VCV
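Grouping each adjacent VC pair together, every word can be written in the form: $$ [C](VC)^m[V] $$
Square brackets mark an optional part, and $(VC)^m$ means VC repeated m times. The parameter m plays a role similar to the stem length in the Lovins conditions and is used in the rule conditions that follow.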
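As a rough illustration, the C/V form and the measure m can be computed as follows. The function names is_consonant, cv_form, and measure are made up for this sketch; the Y handling follows the definition given above.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse runs of consonants and vowels into a string such as 'CVCVC'."""
    word = word.lower()
    groups = []
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not groups or groups[-1] != tag:
            groups.append(tag)
    return "".join(groups)


def measure(word):
    """m in [C](VC)^m[V]: the number of VC pairs in the collapsed form."""
    return cv_form(word).count("VC")


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, cv_form(w), measure(w))
# segmentfault CVCVCVC 3
# porter CVCVC 2
# application VCVCVCVC 4
# apple VCV 1
```

Because the collapsed form strictly alternates, counting the occurrences of "VC" gives m directly.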
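The Porter algorithm is rule-driven. Each rule has the form:
(condition) S1 -> S2
The condition is tested on the stem that remains after removing S1. Besides m, it can use other features:
m - the number of VC groups in the stem
* - stands for any sequence of characters; it is combined with a substring, v, d, or o
uppercase letter - a literal substring, e.g. *S means the stem ends with S
v - a vowel; *v* means the stem contains a vowel
d - a double consonant; *d means the stem ends with two identical consonants
o - a consonant-vowel-consonant sequence whose second consonant is not W, X, or Y; *o means the stem ends this way
S1 is the suffix of the word, and S2 is the suffix that replaces it.
Unlike Lovins, which produces its output in a single pass, Porter runs a word through a chain of rules in sequence to produce the final stem.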
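Take hopping as an example. First the rule (*v*) ING -> applies, because the stem hopp contains a vowel, and the word becomes hopp.
Then the rule (*d and not (*L or *S or *Z)) -> single letter applies, turning hopp into hop.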
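The conditions *v*, *d, and *o translate naturally into small predicates, and the hopping example can then be traced in code. This is a partial sketch: step_1b_ing handles only the ING rule and the double-consonant follow-up, not the rest of Step 1b, and all function names are made up for this example.

```python
def is_consonant(s, i):
    ch = s[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(s, i - 1)
    return True


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with two identical consonants."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, the last not being w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_1b_ing(word):
    """(*v*) ING ->, followed by (*d and not (*L or *S or *Z)) -> single letter."""
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    if not contains_vowel(stem):
        return word                      # condition fails, e.g. "sing"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]                 # hopp -> hop
    return stem


print(step_1b_ing("hopping"))  # hop
print(step_1b_ing("falling"))  # fall
print(step_1b_ing("sing"))     # sing
```

Note that the condition is always evaluated on the stem left after removing S1, which is why sing is left alone: its stem s has no vowel.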
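The algorithm applies the rules from top to bottom. A few rules are special: when they fire, additional rules must be processed.
Because there are many rules, they are grouped into steps. The grouping is mainly a logical division (although an implementation can also use it for optimization). The whole algorithm runs from start to finish in this order:
do Step_1a
do Step_1b (if the second or third rule of Step 1b fires, some extra work is done)
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5a
do Step_5b
The details of each step are listed in the appendix below.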
Step 1a
SSES -> SS
IES -> I
SS -> SS
S ->
Step 1b
(m>0) EED -> EE
(*v*) ED ->
(*v*) ING ->
If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE
BL -> BLE
IZ -> IZE
(*d and not (*L or *S or *Z)) -> single letter
(m=1 and *o) -> E
Step 1c
(*v*) Y -> I
Step 2
(m>0) ATIONAL -> ATE
(m>0) TIONAL -> TION
(m>0) ENCI -> ENCE
(m>0) ANCI -> ANCE
(m>0) IZER -> IZE
(m>0) ABLI -> ABLE
(m>0) ALLI -> AL
(m>0) ENTLI -> ENT
(m>0) ELI -> E
(m>0) OUSLI -> OUS
(m>0) IZATION -> IZE
(m>0) ATION -> ATE
(m>0) ATOR -> ATE
(m>0) ALISM -> AL
(m>0) IVENESS -> IVE
(m>0) FULNESS -> FUL
(m>0) OUSNESS -> OUS
(m>0) ALITI -> AL
(m>0) IVITI -> IVE
(m>0) BILITI -> BLE
Step 3
(m>0) ICATE -> IC
(m>0) ATIVE ->
(m>0) ALIZE -> AL
(m>0) ICITI -> IC
(m>0) ICAL -> IC
(m>0) FUL ->
(m>0) NESS ->
Step 4
(m>1) AL ->
(m>1) ANCE ->
(m>1) ENCE ->
(m>1) ER ->
(m>1) IC ->
(m>1) ABLE ->
(m>1) IBLE ->
(m>1) ANT ->
(m>1) EMENT ->
(m>1) MENT ->
(m>1) ENT ->
(m>1 and (*S or *T)) ION ->
(m>1) OU ->
(m>1) ISM ->
(m>1) ATE ->
(m>1) ITI ->
(m>1) OUS ->
(m>1) IVE ->
(m>1) IZE ->
Step 5a
(m>1) E ->
(m=1 and not *o) E ->
Step 5b
(m > 1 and *d and *L) -> single letter
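To show how a single step's rule table is applied, here is a minimal sketch of Step 1a (SSES -> SS, IES -> I, SS -> SS, S -> ). In Porter's original description, only one rule per step is obeyed: the one with the longest matching S1. The sketch gets the same effect by trying the rules in table order, since the longer suffixes are listed first. The names STEP_1A and step_1a are made up for this example.

```python
# Step 1a rules as (S1, S2) pairs; none of them has a condition.
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step_1a(word):
    """Apply the first rule whose S1 matches the end of the word, then stop."""
    for s1, s2 in STEP_1A:
        if word.endswith(s1):
            return word[:-len(s1)] + s2
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The seemingly redundant rule SS -> SS matters: it matches caress before the S -> rule can, which keeps the word from being cut down to cares.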