Substring extraction and character counting for UTF-8 strings in Lua [repost]

Date: 2019-12-15


Reposted from: GitHub: pangliang/pangliang.github.com

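Requirements

Extract a substring by counting visible characters, not bytes.

function(string, start position, number of characters to extract)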

utf8sub("你好1世界哈哈",2,5)    =    好1世界哈
utf8sub("1你好1世界哈哈",2,5)    =    你好1世界
utf8sub("你好世界1哈哈",1,5)    =    你好世界1
utf8sub("12345678",3,5)    =    34567
utf8sub("øpø你好pix",2,5)    =    pø你好p

Incorrect approaches

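I looked at several algorithms online and none of them is quite right: they either produce garbled output or only handle the case of 3-byte Chinese characters, which is not general enough.

1. The `string.sub(s, 1, length*3)` approach

Many posts online simply use `string.sub(s, 1, length*3)`, where `length` is the number of characters wanted. This is definitely wrong. In a mixed Chinese/English string such as `你好1世界`, the characters take `3,3,1,3,3` bytes. Asking for 4 characters requests 3*4 = 12 bytes = 3+3+1+3+2, so only the first 2 bytes of `界` are taken and the output is garbled.

2. The `if byte > 128 then index = index + 3` approach

This assumes every non-ASCII character is 3 bytes long, so it miscounts 2-byte characters such as `ø` as well as 4-byte characters.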
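The failure of both approaches can be checked directly. The sketch below is an illustration only: `naive_sub` and `naive_len` are hypothetical stand-ins written for this example (they are not from the original post), and the source file is assumed to be saved as UTF-8.

```lua
-- Hypothetical stand-in for approach 1: "3 bytes per character".
local function naive_sub(s, n)
    return string.sub(s, 1, n * 3)
end

-- Hypothetical stand-in for approach 2: every non-ASCII byte starts
-- a character that is assumed to be 3 bytes long.
local function naive_len(s)
    local i, n = 1, 0
    while i <= #s do
        if s:byte(i) > 128 then i = i + 3 else i = i + 1 end
        n = n + 1
    end
    return n
end

local s = "你好1世界"              -- 3 + 3 + 1 + 3 + 3 = 13 bytes
local cut = naive_sub(s, 4)       -- asks for 4 characters, gets 12 bytes

print(#s, #cut)
-- The first 10 bytes are the 4 whole characters that were wanted ...
print(cut:sub(1, 10) == "你好1世")
-- ... and the 2 leftover bytes are an incomplete piece of 界.
print(cut:sub(11) == ("界"):sub(1, 2))
-- "øpø" has 3 characters (2 + 1 + 2 bytes), but the naive counter
-- jumps 3 bytes at each ø and reports fewer.
print(naive_len("øpø"))
```

Both helpers work on bytes with a fixed stride, which is why they break as soon as the string mixes characters of different encoded lengths.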

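The crux of the problem

1. UTF-8 characters are variable-length

2. The character length follows a pattern

UTF-8 encoding pattern

The first byte of each character indicates how many bytes that UTF-8 character occupies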

0xxxxxxx - 1 byte

110yxxxx - first byte >= 192, 2 bytes

1110yyyy - first byte >= 224, 3 bytes

11110zzz - first byte >= 240, 4 bytes
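To make the bit patterns concrete, the sketch below prints the first byte of a few sample characters in binary and derives the length by counting the leading 1 bits. The helpers `to_bits` and `size_from_bits` are hypothetical names introduced for this illustration, and the source file is assumed to be saved as UTF-8. Plain arithmetic is used instead of bit operators so that it should also run on Lua 5.1.

```lua
-- Render a byte value (0..255) as an 8-character binary string.
local function to_bits(b)
    local bits = {}
    for i = 7, 0, -1 do
        bits[#bits + 1] = (math.floor(b / 2 ^ i) % 2 == 1) and "1" or "0"
    end
    return table.concat(bits)
end

-- Length of a UTF-8 character, given its FIRST byte: the number of
-- leading 1 bits, or 1 when the byte starts with 0 (plain ASCII).
-- A byte that starts with exactly "10" is a continuation byte, not a
-- first byte; this sketch does not validate that case.
local function size_from_bits(b)
    local ones = #(to_bits(b):match("^1*"))
    if ones == 0 then return 1 end
    return ones
end

for _, ch in ipairs({ "1", "ø", "你", "😀" }) do
    local b = ch:byte(1)
    -- character, first byte, bit pattern, derived length, actual byte count
    print(ch, b, to_bits(b), size_from_bits(b), #ch)
end
```

Comparing the first byte against 192, 224 and 240 gives the same answer with less work, since those are the smallest values that start with `110`, `1110` and `11110`.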

Correct algorithm

-- lua
-- Determine the byte length of a UTF-8 character from its first byte
-- 0xxxxxxx - 1 byte
-- 110yxxxx - first byte >= 192, 2 bytes
-- 1110yyyy - first byte >= 224, 3 bytes
-- 11110zzz - first byte >= 240, 4 bytes
local function chsize(char)
    if not char then
        print("not char")
        return 0
    elseif char >= 240 then
        return 4
    elseif char >= 224 then
        return 3
    elseif char >= 192 then
        return 2
    else
        return 1
    end
end

-- Count the characters in a UTF-8 string; every kind of character counts as one
-- e.g. utf8len("1你好") => 3
function utf8len(str)
    local len = 0
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        len = len + 1
    end
    return len
end

-- Extract a substring from a UTF-8 string
-- str:          the string to extract from
-- startChar:    index of the first character, starting at 1
-- numChars:     number of characters to extract
function utf8sub(str, startChar, numChars)
    -- Walk forward to the byte offset of the first wanted character
    local startIndex = 1
    while startChar > 1 do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + chsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    -- Advance one whole character at a time until enough are collected
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end

-- Self-test
function test()
    -- test utf8len
    assert(utf8len("你好1世界哈哈") == 7)
    assert(utf8len("你好世界1哈哈 ") == 8)
    assert(utf8len(" 你好世 界1哈哈") == 9)
    assert(utf8len("12345678") == 8)
    assert(utf8len("øpø你好pix") == 8)
    -- 4-byte character whose first byte is exactly 240
    assert(utf8len("a😀b") == 3)

    -- test utf8sub
    assert(utf8sub("你好1世界哈哈",2,5) == "好1世界哈")
    assert(utf8sub("1你好1世界哈哈",2,5) == "你好1世界")
    assert(utf8sub(" 你好1世界 哈哈",2,6) == "你好1世界 ")
    assert(utf8sub("你好世界1哈哈",1,5) == "你好世界1")
    assert(utf8sub("12345678",3,5) == "34567")
    assert(utf8sub("øpø你好pix",2,5) == "pø你好p")
    assert(utf8sub("a😀b",2,1) == "😀")

    print("all test succ")
end

test()
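As an independent cross-check on the character count, one can rely on a different property of well-formed UTF-8: every byte after the first byte of a character has the form 10xxxxxx (values 128 to 191), so counting the bytes outside that range gives the number of characters. The sketch below is not from the original post; `utf8len_by_pattern` is a hypothetical name, and the input is assumed to be valid UTF-8 in a source file saved as UTF-8.

```lua
-- Count characters by counting bytes that are NOT continuation bytes.
-- string.gsub returns the number of matches as its second result;
-- the replaced string itself is discarded.
local function utf8len_by_pattern(str)
    local _, count = str:gsub("[^\128-\191]", "")
    return count
end

print(utf8len_by_pattern("你好1世界哈哈"))
print(utf8len_by_pattern("øpø你好pix"))
print(utf8len_by_pattern("a😀b"))
```

If this count disagrees with a first-byte-based counter on valid input, the first-byte thresholds are the likely culprit. On Lua 5.3 and later, the standard `utf8` library (`utf8.len`, `utf8.offset`) covers the same ground and may be preferable when it is available.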