Lua: extracting a substring by character, and counting characters (not the byte length of the string)

Date: 2019-11-06


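Requirement

Extract a substring by the number of visible characters, not by bytes

function(string, start position, number of characters to take)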

utf8sub("你好1世界哈哈",2,5)	=	好1世界哈
utf8sub("1你好1世界哈哈",2,5)	=	你好1世界
utf8sub("你好世界1哈哈",1,5)	=	你好世界1
utf8sub("12345678",3,5)		=	34567
utf8sub("øpø你好pix",2,5)	=	pø你好p

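Incorrect approaches

I looked at several snippets online and none of them was quite right: they either produce garbled output or only handle the case where every Chinese character is assumed to take a fixed number of bytes, which is not general enough.

string.sub(s, 1, numChars*4)

Many snippets online call string.sub(s, 1, numChars*4) directly. This is certainly wrong for strings that mix Chinese and ASCII text. In UTF-8 the characters of 你好1世界 occupy 3, 3, 1, 3 and 3 bytes. Taking 4 characters with a multiplier of 4 reads 16 bytes, which is more than the 13 bytes of all five characters, so too many characters come back. With a multiplier of 3 it reads 12 = 3+3+1+3+2 bytes, so only the first 2 bytes of 界 are taken and the result is garbled.

A second pattern advances a fixed 4 bytes for every non-ASCII byte, and fails for the same reason:

if byte>128 then index = index + 4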
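The failure of a fixed multiplier can be reproduced with a minimal sketch. It assumes the source file is saved as UTF-8; the variable names cut3 and cut4 are only illustrative.

```lua
-- "你好1世界哈哈" is 7 characters but 19 bytes in UTF-8 (3+3+1+3+3+3+3).
local s = "你好1世界哈哈"
print(#s)

-- Goal: take the first 4 characters, "你好1世" (10 bytes).
-- Multiplier 3 reads 12 bytes: the 10 wanted bytes plus 2 of the 3 bytes of 界.
local cut3 = string.sub(s, 1, 4 * 3)
-- Multiplier 4 reads 16 bytes: six whole characters instead of four.
local cut4 = string.sub(s, 1, 4 * 4)

print(#cut3, cut3 == "你好1世")  -- 12 bytes, not the wanted string
print(cut4)                      -- well-formed, but too long
```

Neither multiplier returns the wanted four characters, which is why the byte length of each character has to be read from the data itself.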

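The key problem

UTF-8 characters are variable-length
The length of each character follows a pattern

UTF-8 encoding rules

The first byte of each UTF-8 character indicates how many bytes that character occupies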

0xxxxxxx - first byte 0 to 127, 1 byte
110yxxxx - first byte 192 and above, 2 bytes
1110yyyy - first byte 224 and above, 3 bytes
11110zzz - first byte 240 and above, 4 bytes
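To see these ranges on real data, the first byte of a few sample characters can be printed. This is a small sketch that assumes the source file is saved as UTF-8; the sample characters are arbitrary choices.

```lua
-- Each sample is a single character.
-- The last entry is U+1F600 written as decimal byte escapes.
local samples = { "p", "ø", "你", "\240\159\152\128" }
for _, ch in ipairs(samples) do
	-- #ch is the byte length, string.byte(ch, 1) is the first (lead) byte
	print(ch, #ch, string.byte(ch, 1))
end
```

The lead bytes are 112, 195, 228 and 240 for lengths 1, 2, 3 and 4. The last sample shows that a lead byte can sit exactly on a range boundary, so any comparison against 192, 224 or 240 has to be inclusive.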

Correct algorithms

-- Determine the byte length of a UTF-8 character from its first byte
-- 0xxxxxxx - 0 to 127, 1 byte
-- 110yxxxx - 192 and above, 2 bytes
-- 1110yyyy - 224 and above, 3 bytes
-- 11110zzz - 240 and above, 4 bytes
local function chsize(char)
	if not char then
		print("not char")
		return 0
	elseif char >= 240 then
		return 4
	elseif char >= 224 then
		return 3
	elseif char >= 192 then
		return 2
	else
		return 1
	end
end

-- Count the characters in a UTF-8 string; every character counts as one, whatever its byte length
-- e.g. utf8len("1你好") => 3
function utf8len(str)
	local len = 0
	local currentIndex = 1
	while currentIndex <= #str do
		local char = string.byte(str, currentIndex)
		currentIndex = currentIndex + chsize(char)
		len = len +1
	end
	return len
end

-- Extract a substring of a UTF-8 string
-- str:			the string to extract from
-- startChar:	index of the first character, starting at 1
-- numChars:	number of characters to take
function utf8sub(str, startChar, numChars)
	local startIndex = 1
	while startChar > 1 do
		local char = string.byte(str, startIndex)
		startIndex = startIndex + chsize(char)
		startChar = startChar - 1
	end

	local currentIndex = startIndex

	while numChars > 0 and currentIndex <= #str do
		local char = string.byte(str, currentIndex)
		currentIndex = currentIndex + chsize(char)
		numChars = numChars -1
	end
	return str:sub(startIndex, currentIndex - 1)
end

-- Self-test
function test()
	-- test utf8len
	assert(utf8len("你好1世界哈哈") == 7)
	assert(utf8len("你好世界1哈哈 ") == 8)
	assert(utf8len(" 你好世 界1哈哈") == 9)
	assert(utf8len("12345678") == 8)
	assert(utf8len("øpø你好pix") == 8)

	-- test utf8sub
	assert(utf8sub("你好1世界哈哈",2,5) == "好1世界哈")
	assert(utf8sub("1你好1世界哈哈",2,5) == "你好1世界")
	assert(utf8sub(" 你好1世界 哈哈",2,6) == "你好1世界 ")
	assert(utf8sub("你好世界1哈哈",1,5) == "你好世界1")
	assert(utf8sub("12345678",3,5) == "34567")
	assert(utf8sub("øpø你好pix",2,5) == "pø你好p")

	print("all test succ")
end

test()
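On Lua 5.3 or later, the standard utf8 library offers another route to the same result. The sketch below is an alternative, not the method used above; it assumes Lua 5.3+ and valid UTF-8 input, and the name utf8sub53 is hypothetical.

```lua
-- Requires Lua 5.3 or later: utf8.offset does not exist in Lua 5.1 or LuaJIT.
-- utf8sub53 is a hypothetical name chosen for this sketch.
function utf8sub53(str, startChar, numChars)
	-- Byte position where the startChar-th character begins
	local startIndex = utf8.offset(str, startChar)
	if not startIndex then
		return ""
	end
	-- Byte position of the first character after the requested range;
	-- nil when the range runs past the end of the string
	local endIndex = utf8.offset(str, startChar + numChars)
	if endIndex then
		return str:sub(startIndex, endIndex - 1)
	end
	return str:sub(startIndex)
end
```

Under these assumptions utf8sub53("你好1世界哈哈", 2, 5) gives 好1世界哈, and utf8.len(str) returns the character count directly. A hand-written version is still needed on Lua 5.1 and LuaJIT, where the utf8 library is not available.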