On page 121 of *Natural Language Processing with Python*, the book demonstrates NLTK's built-in regular-expression tokenizer, `nltk.regexp_tokenize`. The code in the book is:
```python
text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ? _'
pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
  | \w+(-\w+)*              # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | (?:[.,;"'?():-_`])      # these are separate tokens; includes ], [
'''
```
The '8%' and '_' at the end of the `text` variable are my own additions.
The expected output would be:
```
['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8%', '?', '_']
```
But the actual output is:
```
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '-ed', ''), ('', '', '.40'), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
```
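The tuples come from how `re.findall` treats capturing groups, which the tokenizer relies on internally: when the pattern contains capturing groups, `findall` returns the contents of the groups rather than the whole match. A minimal illustration with plain `re` (the example strings are my own):

```python
import re

# With a capturing group, findall returns the group's last repetition,
# or '' when the group did not participate in the match
print(re.findall(r'\w+(-\w+)*', 'poster-print costs'))
# -> ['-print', '']

# With a non-capturing group, findall returns the full matches
print(re.findall(r'\w+(?:-\w+)*', 'poster-print costs'))
# -> ['poster-print', 'costs']
```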
The problem arises because `nltk.internals.compile_regexp_to_noncapturing()` was dropped in NLTK 3.1 (it still worked in earlier versions), so capturing groups in the pattern are no longer rewritten as non-capturing ones and the tokenizer returns group contents instead of full matches. We therefore modify the earlier pattern slightly (reference: http://www.javashuo.com/article/p-mzwaeppw-dz.html):
```python
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  #| \w+(?:-\w+)*
  | \.\.\.                  # ellipsis
  | (?:[.,;"'?():-_`])      # these are separate tokens; includes ], [
'''
```
The actual output now is:
```
['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8', '?', '_']
```
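The stray '8' can be reproduced with plain `re`: regex alternation is ordered, so the first branch that matches at a given position wins. A minimal sketch where branch order is the only difference between the two calls (the two-branch patterns are my own reduction of the book's pattern):

```python
import re

# Word branch first: \w+ grabs '8' and stops, since '%' is not a word char
print(re.findall(r'(?x) \w+ | \$?\d+(?:\.\d+)?%?', '8%'))
# -> ['8']

# Currency/percentage branch first: it consumes '8%' as one token
print(re.findall(r'(?x) \$?\d+(?:\.\d+)?%? | \w+', '8%'))
# -> ['8%']
```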
We find that the '8' should have been '8%'. It turns out that either removing the '*' on the third line or swapping the third and fourth lines makes it tokenize correctly. The modified code:
```python
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  #| \w+(?:-\w+)*           # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \w+(?:-\w+)*
  | \.\.\.                  # ellipsis
  | (?:[.,;"'?():-_`])      # these are separate tokens; includes ], [
'''
```
The output is now correct, so the conclusion is that the word branch was shadowing the '%' in the currency branch below it. The reason is that regex alternation is ordered: Python tries the alternatives left to right at each position and takes the first one that matches. `\w+(?:-\w+)*` matches the '8' by itself, and because '%' is not a word character the match stops there; '%' then matches no branch at all (it is not covered by the punctuation class either) and is silently dropped. Removing the '*' works because it makes the hyphen group mandatory, so a bare '8' no longer matches that branch; moving the currency branch first works because `\$?\d+(?:\.\d+)?%?` then gets to consume '8%' as a single token.
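To make the fix reproducible without NLTK installed, here is a self-contained sketch using only the standard `re` module (my assumption: with the default `gaps=False`, `nltk.regexp_tokenize` behaves like `re.findall` over the pattern):

```python
import re

text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ? _'

# Final pattern: currency/percentage branch ordered before the word branch
pattern = r'''(?x)          # verbose mode
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 8%
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \.\.\.                  # ellipsis
  | (?:[.,;"'?():-_`])      # separate punctuation tokens
'''

print(re.findall(pattern, text))
# -> ['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8%', '?', '_']
```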