English Stemming Algorithms - Lovins, Porter

Date: 2019-12-04

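There are many ways to stem English words, and in practice the task can draw on machine learning, data mining, and other fields.
This article covers a few classic algorithms that are easy to implement: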

Lovins (1968)
Porter (1980)
Porter2 (2000)

1. Lovins

The Lovins stemmer is the earliest of these algorithms.

1.1. Overview

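The algorithm involves the following components:

ending: a word suffix; there are 294, listed in Appendix A
condition: a constraint that must hold before an ending may be removed; every ending is tied to one condition, and there are 29, listed in Appendix B
transformation: a rule that rewrites the end of the stem left after the ending is removed; there are 35, listed in Appendix C

The algorithm has two steps:

Scan the ending list from longest to shortest and pick the first ending that matches the word and whose condition is satisfied
Apply the transformations to the remaining stem so that it ends in the appropriate form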

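1.2. Example

Step 1

Take the word nationally. Scanning the ending list from longest to shortest, the first match is .09. ationally B.
Its condition is B Minimum stem length = 3, which requires the part left after removing the ending to be at least 3 letters long.
Removing ationally from nationally leaves only n, so the condition fails.

The scan continues and finds .07. ionally A, whose condition is A No restrictions on stem.
ionally is therefore chosen as the ending.

Step 2

The stem of nationally is nat. No transformation matches it, so it is output unchanged.
As another example, take sitting: step 1 yields the stem sitt, and step 2 applies transformation 1, giving the final output sit.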
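
The two steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it carries three endings instead of all 294, three conditions, and only transformation 1, and the names ENDINGS, CONDITIONS, strip_ending, undouble and lovins_sketch are invented for this sketch.

```python
# Tiny subset of the ending table (ending -> condition code).
# The full algorithm uses all 294 endings from Appendix A.
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ing": "N",
}

# Subset of Appendix B, written as predicates on the candidate stem.
CONDITIONS = {
    "A": lambda stem: True,              # no restrictions on stem
    "B": lambda stem: len(stem) >= 3,    # minimum stem length = 3
    # N: minimum stem length = 4 after s**, elsewhere = 3
    "N": lambda stem: len(stem) >= (
        4 if len(stem) >= 3 and stem[-3] == "s" else 3
    ),
}


def strip_ending(word):
    """Step 1: remove the longest ending whose condition accepts the stem."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if CONDITIONS[ENDINGS[ending]](stem):
                return stem
    return word


def undouble(stem):
    """Step 2, transformation 1 only: drop one of a doubled final letter."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in "bdglmnprst":
        return stem[:-1]
    return stem


def lovins_sketch(word):
    return undouble(strip_ending(word))


print(lovins_sketch("nationally"))  # nat
print(lovins_sketch("sitting"))     # sit
```

Sorting the endings by length on every call keeps the sketch short; a fuller implementation would more likely pre-group the endings by length, as the appendix does.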

1.Appendix.A List of endings

.11.
alistically B   arizability A   izationally B

.10.
antialness A    arisations A    arizations A    entialness A

.09.
allically C     antaneous A     antiality A     arisation A
arization A     ationally B     ativeness A     eableness E
entations A     entiality A     entialize A     entiation A
ionalness A     istically A     itousness A     izability A
izational A

.08.
ableness A      arizable A      entation A      entially A
eousness A      ibleness A      icalness A      ionalism A
ionality A      ionalize A      iousness A      izations A
lessness A

.07.
ability A       aically A       alistic B       alities A
ariness E       aristic A       arizing A       ateness A
atingly A       ational B       atively A       ativism A
elihood E       encible A       entally A       entials A
entiate A       entness A       fulness A       ibility A
icalism A       icalist A       icality A       icalize A
ication G       icianry A       ination A       ingness A
ionally A       isation A       ishness A       istical A
iteness A       iveness A       ivistic A       ivities A
ization F       izement A       oidally A       ousness A

.06.
aceous A        acious B        action G        alness A
ancial A        ancies A        ancing B        ariser A
arized A        arizer A        atable A        ations B
atives A        eature Z        efully A        encies A
encing A        ential A        enting C        entist A
eously A        ialist A        iality A        ialize A
ically A        icance A        icians A        icists A
ifully A        ionals A        ionate D        ioning A
ionist A        iously A        istics A        izable E
lessly A        nesses A        oidism A

.05.
acies A         acity A         aging B         aical A
alist A         alism B         ality A         alize A
allic BB        anced B         ances B         antic C
arial A         aries A         arily A         arity B
arize A         aroid A         ately A         ating I
ation B         ative A         ators A         atory A
ature E         early Y         ehood A         eless A
elity A         ement A         enced A         ences A
eness E         ening E         ental A         ented C
ently A         fully A         ially A         icant A
ician A         icide A         icism A         icist A
icity A         idine I         iedly A         ihood A
inate A         iness A         ingly B         inism J
inity CC        ional A         ioned A         ished A
istic A         ities A         itous A         ively A
ivity A         izers F         izing F         oidal A
oides A         otide A         ously A

.04.
able A          ably A          ages B          ally B
ance B          ancy B          ants B          aric A
arly K          ated I          ates A          atic B
ator A          ealy Y          edly E          eful A
eity A          ence A          ency A          ened E
enly E          eous A          hood A          ials A
ians A          ible A          ibly A          ical A
ides L          iers A          iful A          ines M
ings N          ions B          ious A          isms B
ists A          itic H          ized F          izer F
less A          lily A          ness A          ogen A
ward A          wise A          ying B          yish A

.03.
acy A           age B           aic A           als BB
ant B           ars O           ary F           ata A
ate A           eal Y           ear Y           ely E
ene E           ent C           ery E           ese A
ful A           ial A           ian A           ics A
ide L           ied A           ier A           ies P
ily A           ine M           ing N           ion Q
ish C           ism B           ist A           ite AA
ity A           ium A           ive A           ize F
oid A           one R           ous A

.02.
ae A            al BB           ar X            as B
ed E            en F            es E            ia A
ic A            is A            ly B            on S
or T            um U            us V            yl R
s' A            's A

.01.
a A             e A             i A             o A
s W             y B

1.Appendix.B List of conditions

A   No restrictions on stem
B   Minimum stem length = 3
C   Minimum stem length = 4
D   Minimum stem length = 5
E   Do not remove ending after e
F   Minimum stem length = 3 and do not remove ending after e
G   Minimum stem length = 3 and remove ending only after f
H   Remove ending only after t or ll
I   Do not remove ending after o or e
J   Do not remove ending after a or e
K   Minimum stem length = 3 and remove ending only after l, i or u*e
L   Do not remove ending after u, x or s, unless s follows o
M   Do not remove ending after a, c, e or m
N   Minimum stem length = 4 after s**, elsewhere = 3
O   Remove ending only after l or i
P   Do not remove ending after c
Q   Minimum stem length = 3 and do not remove ending after l or n
R   Remove ending only after n or r
S   Remove ending only after dr or t, unless t follows t
T   Remove ending only after s or t, unless t follows o
U   Remove ending only after l, m, n or r
V   Remove ending only after c
W   Do not remove ending after s or u
X   Remove ending only after l, i or u*e
Y   Remove ending only after in
Z   Do not remove ending after f
AA  Remove ending only after d, f, ph, th, l, er, or, es or t
BB  Minimum stem length = 3 and do not remove ending after met or ryst
CC  Remove ending only after l

1.Appendix.C List of transformations

1   remove one of double b, d, g, l, m, n, p, r, s, t
2   iev   ->   ief
3   uct   ->   uc
4   umpt  ->   um
5   rpt   ->   rb
6   urs   ->   ur
7   istr  ->   ister
7a  metr  ->   meter
8   olv   ->   olut
9   ul    ->   l except following a, o, i
10  bex   ->   bic
11  dex   ->   dic
12  pex   ->   pic
13  tex   ->   tic
14  ax    ->   ac
15  ex    ->   ec
16  ix    ->   ic
17  lux   ->   luc
18  uad   ->   uas
19  vad   ->   vas
20  cid   ->   cis
21  lid   ->   lis
22  erid  ->   eris
23  pand  ->   pans
24  end   ->   ens except following s
25  ond   ->   ons
26  lud   ->   lus
27  rud   ->   rus
28  her   ->   hes except following p, t
29  mit   ->   mis
30  ent   ->   ens except following m
31  ert   ->   ers
32  et    ->   es except following n
33  yt    ->   ys
34  yz    ->   ys
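
A table-driven sketch of how these rewrites might be applied to a stem. It assumes that rule 1 runs first and that at most one of the remaining rules then fires; that ordering is an assumption of this sketch, not something stated above. Only six of the rules are included, and RULES and transform are illustrative names.

```python
# (stem suffix, replacement, letters that block the rule when they
# directly precede the suffix). Subset of Appendix C.
RULES = [
    ("iev", "ief", ""),
    ("uct", "uc", ""),
    ("rpt", "rb", ""),
    ("olv", "olut", ""),
    ("ul", "l", "aoi"),    # rule 9: except following a, o, i
    ("end", "ens", "s"),   # rule 24: except following s
]


def transform(stem):
    # Rule 1: remove one of double b, d, g, l, m, n, p, r, s, t.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in "bdglmnprst":
        stem = stem[:-1]
    # Then apply at most one suffix rewrite (assumed ordering).
    for suffix, replacement, blocked in RULES:
        if stem.endswith(suffix):
            before = stem[: -len(suffix)]
            if before and before[-1] in blocked:
                return stem  # the exception applies, leave the stem alone
            return before + replacement
    return stem


print(transform("believ"))   # belief
print(transform("resolv"))   # resolut
print(transform("haul"))     # haul (ul follows a, so rule 9 is blocked)
```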

2. Porter

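2.1. Overview

Vowels and consonants

Vowels and consonants are defined slightly differently from the usual convention:

Vowel - A, E, I, O, U, and Y when it follows a consonant
Consonant - any letter other than A, E, I, O, U; Y counts as a consonant when it does not follow a consonant (for example, after a vowel)

Grouping a word

A run of consecutive vowels is treated as a vowel group V and a run of consecutive consonants as a consonant group C, so any word can be written as alternating V and C groups, for example: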

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> VCV

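Putting this together, every word can be written in the form: $$ [C](VC)^m[V] $$
Square brackets mark optional parts. The parameter m, the number of VC pairs, plays a role similar to the stem length in the Lovins conditions and is used in the checks that follow.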
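
A short Python sketch of the grouping and of m, assuming lowercase ASCII input. The helper names is_consonant, cv_pattern and measure are invented for this illustration.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word):
    """Collapse runs of vowels/consonants: 'porter' -> 'CVCVC'."""
    pattern = ""
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != tag:
            pattern += tag
    return pattern


def measure(word):
    """m in [C](VC)^m[V]: the number of VC pairs."""
    return cv_pattern(word).count("VC")


for w in ("segmentfault", "porter", "application", "apple"):
    print(w, cv_pattern(w), measure(w))
# segmentfault CVCVCVC 3
# porter CVCVC 2
# application VCVCVCVC 4
# apple VCV 1
```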

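Rules

The Porter algorithm is rule-driven. A rule has the form:

(condition) S1 -> S2

The condition is evaluated on the stem that remains after removing S1. Besides m, it can use the following notation:

m - the number of VC pairs in the stem
* - a wildcard for any characters, combined with a letter or with v, d, o
uppercase letter - a literal letter or substring, e.g. *S means the stem ends with S
v - a vowel; *v* means the stem contains a vowel
d - a double consonant; *d means the stem ends with two identical consonants
o - the sequence cvc; *o means the stem ends with consonant-vowel-consonant, where the second c is not W, X or Y

S1 is the suffix of the word, and S2 is the suffix that replaces it.

Unlike Lovins, which produces its output in a single pass, Porter passes a word through a chain of rules.
For example, hopping is first rewritten to hopp by the rule (*v*) ING ->,
and then to hop by the rule (*d and not (*L or *S or *Z)) -> single letter.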
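
The hopping example can be traced with a small Python sketch. It implements only the ING rule and the double-consonant clean-up rule, not the whole of Step 1b, and assumes lowercase input; the helper names are invented for this illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with two identical consonants."""
    return (
        len(stem) >= 2
        and stem[-1] == stem[-2]
        and is_consonant(stem, len(stem) - 1)
    )


def strip_ing(word):
    # (*v*) ING ->
    if word.endswith("ing") and contains_vowel(word[:-3]):
        stem = word[:-3]
        # (*d and not (*L or *S or *Z)) -> single letter
        if ends_double_consonant(stem) and stem[-1] not in "lsz":
            stem = stem[:-1]
        return stem
    return word


print(strip_ing("hopping"))  # hop
print(strip_ing("falling"))  # fall (double l is kept)
print(strip_ing("sing"))     # sing (the stem "s" has no vowel)
```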

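Procedure

The algorithm applies the rules from top to bottom. Some rules are special: when they fire, extra rules must be processed.
Because there are many rules, they are grouped into steps. The grouping is mainly a logical one (though an implementation can also use it for optimization), and the steps run in order from start to finish:

do Step_1a
do Step_1b (if the second or third rule of Step 1b fires, some extra rules are applied)
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5a
do Step_5b

The details of each step are given in the appendix.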

2.2. Example

2.Appendix Step 1a

SSES  ->   SS
IES   ->   I
SS    ->   SS
S     ->
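
Step 1a has no conditions, so it reduces to an ordered suffix lookup. A minimal sketch, assuming lowercase input (the name step_1a is illustrative):

```python
def step_1a(word):
    # Rules are tried longest suffix first; the first match wins.
    rules = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The SS -> SS rule looks like a no-op, but it stops the plain S rule from turning caress into cares.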

2.Appendix Step 1b

(m>0) EED     ->   EE
(*v*) ED      ->
(*v*) ING     ->

If the second or third of the rules in Step 1b is successful, the following is done:

      AT      ->   ATE
      BL      ->   BLE
      IZ      ->   IZE
      (*d and not (*L or *S or *Z)) -> single letter
      (m=1 and *o)  ->   E

2.Appendix Step 1c

(*v*) Y       ->   I

2.Appendix Step 2

(m>0) ATIONAL ->   ATE
(m>0) TIONAL  ->   TION
(m>0) ENCI    ->   ENCE
(m>0) ANCI    ->   ANCE
(m>0) IZER    ->   IZE
(m>0) ABLI    ->   ABLE
(m>0) ALLI    ->   AL
(m>0) ENTLI   ->   ENT
(m>0) ELI     ->   E
(m>0) OUSLI   ->   OUS
(m>0) IZATION ->   IZE
(m>0) ATION   ->   ATE
(m>0) ATOR    ->   ATE
(m>0) ALISM   ->   AL
(m>0) IVENESS ->   IVE
(m>0) FULNESS ->   FUL
(m>0) OUSNESS ->   OUS
(m>0) ALITI   ->   AL
(m>0) IVITI   ->   IVE
(m>0) BILITI  ->   BLE

2.Appendix Step 3

(m>0) ICATE   ->   IC
(m>0) ATIVE   ->
(m>0) ALIZE   ->   AL
(m>0) ICITI   ->   IC
(m>0) ICAL    ->   IC
(m>0) FUL     ->
(m>0) NESS    ->

2.Appendix Step 4

(m>1) AL      ->
(m>1) ANCE    ->
(m>1) ENCE    ->
(m>1) ER      ->
(m>1) IC      ->
(m>1) ABLE    ->
(m>1) IBLE    ->
(m>1) ANT     ->
(m>1) EMENT   ->
(m>1) MENT    ->
(m>1) ENT     ->
(m>1 and (*S or *T)) ION   ->
(m>1) OU      ->
(m>1) ISM     ->
(m>1) ATE     ->
(m>1) ITI     ->
(m>1) OUS     ->
(m>1) IVE     ->
(m>1) IZE     ->

2.Appendix Step 5a

(m>1) E   ->
(m=1 and not *o) E   ->

2.Appendix Step 5b

(m > 1 and *d and *L)   ->   single letter
