Extracting substrings from UTF-8 strings with mixed Chinese and English text in Lua

Date: 2019-11-12


Reference blog post: "UTF8字符串在lua的截取和字數統計" (Substring extraction and character counting for UTF-8 strings in Lua) [repost]

Requirement

Extract a substring by counting visible characters rather than bytes.

function(string, start position, number of characters)

utf8sub("你好1世界哈哈",2,5)    =    好1世界哈
utf8sub("1你好1世界哈哈",2,5)    =    你好1世界
utf8sub("你好世界1哈哈",1,5)    =    你好世界1
utf8sub("12345678",3,5)    =    34567
utf8sub("øpø你好pix",2,5)    =    pø你好p

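Incorrect approaches

I looked at several algorithms found online, and none of them is quite right: they either produce garbled output or only handle the case where every Chinese character is assumed to take 4 bytes, which is not general enough.

1. string.sub(s, 1, length*4)

　　Many posts simply call `string.sub(s, 1, length*4)`, which is definitely wrong. In a mixed Chinese and English string such as `你好1世界`, the UTF-8 byte lengths of the characters are `3,3,1,3,3` (13 bytes in total). Asking for 4 characters with 4*4 = 16 bytes returns all five characters, and switching to a multiplier of 3 does not help either: 4*3 = 12 = 3+3+1+3+2, so only the first 2 of the 3 bytes of `界` are kept and the result ends in garbled text.

2. if byte > 128 then index = index + 4

　　This treats every non-ASCII byte as the start of a 4-byte character, but `ø` takes 2 bytes and common Chinese characters take 3, so the index quickly lands in the middle of a character.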
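
To see why fixed-width slicing fails, the byte lengths can be checked directly. The snippet below is only a small sketch, assuming the source file is saved as UTF-8; the variable names are illustrative.

```lua
-- Assumes this file is saved as UTF-8, so string literals hold UTF-8 bytes.
local s = "你好1世界"

-- '#' counts bytes, not characters: 3 + 3 + 1 + 3 + 3
print(#s)                        --> 13

-- Slicing "4 characters" as 4 * 3 bytes stops inside the last character.
local cut = string.sub(s, 1, 4 * 3)
print(#cut)                      --> 12

-- Byte 11 is the lead byte of 界 (231, a three-byte lead),
-- but only two of its three bytes made it into the slice.
local lead = string.byte(cut, 11)
print(lead, #cut - 11 + 1)       --> 231   2
```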

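Key points

1. UTF-8 characters are variable-length

2. The character length follows a pattern

As listed in the article on character encoding, UTF-8 is an encoding scheme for the Unicode character set. Its variable-length encoding works as follows:

1 byte: 0*******

2 bytes: 110*****, 10******

3 bytes: 1110****, 10******, 10******

4 bytes: 11110***, 10******, 10******, 10******

5 bytes: 111110**, 10******, 10******, 10******, 10******

6 bytes: 1111110*, 10******, 10******, 10******, 10******, 10******

The 5- and 6-byte forms belong to the original UTF-8 design; RFC 3629 restricts valid UTF-8 to at most 4 bytes.

So, given a byte string, the byte length of a UTF-8 character can be determined from its first byte alone, following the pattern above: the value of that byte tells how many bytes the character occupies.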
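
In decimal, the lead-byte patterns above begin at 192 (110*****), 224 (1110****), 240 (11110***), 248 (111110**) and 252 (1111110*). As a quick check, the first byte and byte length of a few sample characters can be printed. This is a sketch that assumes a UTF-8 source file; the `samples` table is only for illustration.

```lua
-- Assumes this file is saved as UTF-8.
local samples = { "1", "ø", "你" }

for _, ch in ipairs(samples) do
    -- first byte value, then total byte length of the character
    print(ch, string.byte(ch, 1), #ch)
end
-- 1    49    1   (below 192: single byte)
-- ø    195   2   (192..223: two bytes)
-- 你   228   3   (224..239: three bytes)
```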

The code is as follows:

-- Return the byte length of a UTF-8 character, given its first byte
local function charsize(ch)
    if not ch then return 0
    elseif ch >= 252 then return 6
    elseif ch >= 248 then return 5
    elseif ch >= 240 then return 4
    elseif ch >= 224 then return 3
    elseif ch >= 192 then return 2
    else return 1 -- ASCII; a stray continuation byte (128..191) also advances by 1
    end
end

-- Count the characters in a UTF-8 string; every kind of character counts as one
-- e.g. utf8len("1你好") => 3
-- Returns: total count, single-byte count, multi-byte count
function utf8len(str)
    local len = 0
    local aNum = 0 -- single-byte characters (letters, digits, ...)
    local hNum = 0 -- multi-byte characters (Chinese characters, ...)
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        local cs = charsize(char)
        currentIndex = currentIndex + cs
        len = len + 1
        if cs == 1 then
            aNum = aNum + 1
        elseif cs >= 2 then
            hNum = hNum + 1
        end
    end
    return len, aNum, hNum
end
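
For counting alone, a common Lua idiom avoids the explicit loop: each character contains exactly one byte that is not a continuation byte (10******, that is 128 to 191), so counting those bytes gives the character count. The following is a sketch of that alternative rather than the approach used in this post; `utf8count` is a hypothetical name, and the input is assumed to be valid UTF-8.

```lua
-- Count characters by counting every byte that is NOT a continuation byte.
local function utf8count(str)
    -- string.gsub returns the number of matches as its second result.
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

print(utf8count("1你好"))        --> 3
print(utf8count("øpø你好pix"))   --> 8
```

Unlike the loop version, this variant does not report how many characters were single-byte or multi-byte.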

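-- Extract a substring from a UTF-8 string
-- str:          the string to extract from
-- startChar:    index of the first character to keep, starting at 1
-- numChars:     number of characters to extract
function utf8sub(str, startChar, numChars)
    -- Skip startChar - 1 characters to find the starting byte index
    local startIndex = 1
    while startChar > 1 and startIndex <= #str do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + charsize(char)
        startChar = startChar - 1
    end

    local currentIndex = startIndex

    -- Advance over numChars characters, stopping at the end of the string
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + charsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end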

-- Self-test
function test()
    -- test utf8len
    assert(utf8len("你好1世界哈哈") == 7)
    assert(utf8len("你好世界1哈哈 ") == 8)
    assert(utf8len(" 你好世 界1哈哈") == 9)
    assert(utf8len("12345678") == 8)
    assert(utf8len("øpø你好pix") == 8)

    -- test utf8sub
    assert(utf8sub("你好1世界哈哈",2,5) == "好1世界哈")
    assert(utf8sub("1你好1世界哈哈",2,5) == "你好1世界")
    assert(utf8sub(" 你好1世界 哈哈",2,6) == "你好1世界 ")
    assert(utf8sub("你好世界1哈哈",1,5) == "你好世界1")
    assert(utf8sub("12345678",3,5) == "34567")
    assert(utf8sub("øpø你好pix",2,5) == "pø你好p")

    print("all tests passed")
end

test()
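
On Lua 5.3 or later, the standard `utf8` library can locate character boundaries, which allows a shorter variant. This is a sketch under that assumption (the library is not part of Lua 5.1 or 5.2), and `utf8sub53` is a hypothetical name chosen to avoid clashing with the function above. It assumes valid UTF-8 input, `startChar >= 1` and `numChars >= 0`.

```lua
-- Requires Lua 5.3+ (standard utf8 library).
local function utf8sub53(str, startChar, numChars)
    -- utf8.offset returns the byte index where the n-th character starts,
    -- #str + 1 when n is one past the last character, and nil beyond that.
    local startIndex = utf8.offset(str, startChar)
    if not startIndex then return "" end

    local nextIndex = utf8.offset(str, startChar + numChars)
    if nextIndex then
        return str:sub(startIndex, nextIndex - 1)
    end
    return str:sub(startIndex) -- fewer than numChars characters remain
end

if utf8 then -- skip the demo on interpreters without the utf8 library
    print(utf8sub53("你好1世界哈哈", 2, 5))  --> 好1世界哈
    print(utf8sub53("12345678", 3, 5))       --> 34567
end
```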