Go strings, crystal clear

Date: 2021-02-16



Preface

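The string is one of Go's basic data types and is indispensable in everyday development, so it is worth studying thoroughly until it is crystal clear.

This article assumes you already know how to use slices. If you do not, read an introduction to Go slice basics first.

To follow the later sections more easily, it also helps to read up on the Unicode character set and UTF-8 encoding beforehand.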

What it is

In Go, a string is in effect a read-only slice of bytes.

In other words, a Go string is a read-only view of a sequence of bytes. The runtime represents it with the following structure:

// runtime/string.go
type stringStruct struct {
	str unsafe.Pointer // pointer to the underlying byte array
	len int            // length of the byte array, in bytes
}
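
One way to see this two-field layout from ordinary code is to measure a string value with unsafe.Sizeof. The sketch below assumes the standard gc toolchain; headerWords is a hypothetical helper name introduced only for this illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords reports how many machine words a string value occupies.
// With the layout shown above (one pointer plus one int) this should be 2.
func headerWords() uintptr {
	var s string
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
	short := "hi"
	long := "a much longer string than the first one"

	// Sizeof measures only the header, not the bytes it points to,
	// so the two values have the same size.
	fmt.Println(unsafe.Sizeof(short) == unsafe.Sizeof(long)) // true
	fmt.Println(headerWords())                               // 2
}
```

Because only the header is copied, passing a string to a function or assigning it to another variable costs the same regardless of how long the text is.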

Note: byte is a type alias for uint8.

// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8
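
Since byte is an alias and not a distinct type, a []byte can be passed wherever a []uint8 is expected without any conversion. A minimal sketch; sum is a hypothetical helper, not part of any library.

```go
package main

import "fmt"

// sum adds up a slice of uint8 values.
// Because byte is an alias for uint8, a []byte is accepted as-is.
func sum(vals []uint8) int {
	total := 0
	for _, v := range vals {
		total += int(v)
	}
	return total
}

func main() {
	b := []byte("hi") // bytes 104 and 105
	fmt.Println(sum(b))      // 209
	fmt.Printf("%T\n", b[0]) // uint8
}
```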

How to use it

package main

import "fmt"

func main() {
	// Initialize from a string literal
	var a = "hi,狗"
	fmt.Println(a)

	// Indexing reads a single byte; a string cannot be modified in place
	fmt.Printf("a[0] is %d\n", a[0])
	fmt.Printf("a[0:2] is %s\n", a[0:2])
	// a[0] = 'a' // compile error: cannot assign to a[0]

	// String concatenation
	var b = a + "狗"
	fmt.Printf("b is %s\n", b)

	// The built-in len() returns the length in bytes
	fmt.Printf("a's length is: %d\n", len(a))

	// Iterate byte by byte with for;len
	for i := 0; i < len(a); i++ {
		fmt.Println(i, a[i])
	}

	// Iterate rune by rune with for;range
	for i, v := range a {
		fmt.Println(i, v)
	}
}


/* output
hi,狗

a[0] is 104
a[0:2] is hi

b is hi,狗狗

a's length is: 6

0 104
1 105
2 44
3 231
4 139
5 151

0 104
1 105
2 44
3 29399
*/

If anything in the code above is puzzling, don't worry: the following sections go through it piece by piece.

Read-only

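String literals are placed in a read-only data segment at compile time, so the memory holding their bytes cannot be written to, and identical string literals are not stored twice.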

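package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	// reflect.StringHeader is deprecated in newer Go releases;
	// it is used here only to print the Data pointer and Len of each string.
	var a = "hello"
	fmt.Println(a, &a, (*reflect.StringHeader)(unsafe.Pointer(&a)))
	a = "world"
	fmt.Println(a, &a, (*reflect.StringHeader)(unsafe.Pointer(&a)))
	var b = "hello"
	fmt.Println(b, &b, (*reflect.StringHeader)(unsafe.Pointer(&b)))
}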

/* output
columns: string value, address of the variable, string header {Data Len} (addresses vary between runs)
hello 0xc0000381f0 &{5033779 5}
world 0xc0000381f0 &{5033844 5}
hello 0xc000038220 &{5033779 5}
*/

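As the output shows, both variables that hold hello report the same Data address, so the bytes of hello are stored only once.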
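
Because the bytes themselves cannot be written, the usual way to "change" a string is to copy it into a []byte, edit the copy, and build a new string from it. The following is a minimal sketch of that pattern; replaceFirst is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// replaceFirst returns a copy of s whose first byte is replaced by c.
// The original string is left untouched.
func replaceFirst(s string, c byte) string {
	if len(s) == 0 {
		return s
	}
	b := []byte(s) // copies the bytes into writable memory
	b[0] = c
	return string(b) // builds a new string from the edited copy
}

func main() {
	a := "hello"
	b := replaceFirst(a, 'j')
	fmt.Println(a, b) // hello jello
}
```

Note that this edits a single byte, so it is only safe when the first character is a one-byte (ASCII) character.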

Iterating with for;len

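Go source code is always UTF-8 encoded, and the character 狗 in the example above takes up three bytes, 231 139 151 (see the Unicode Character Table). That is why len(a) is 6 and the for;len loop prints six byte values.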
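
To see where 231 139 151 comes from, the code point U+72D7 can be packed by hand into the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx. This is only a sketch: encode3 is a hypothetical helper that handles just the three-byte range (U+0800 to U+FFFF) and does no validation; utf8.EncodeRune from the standard library is the general-purpose call.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 packs a code point in the range U+0800..U+FFFF into the
// three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
// It performs no range or surrogate checks.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),         // top 4 bits
		0x80 | byte((r>>6)&0x3F),   // middle 6 bits
		0x80 | byte(r&0x3F),        // low 6 bits
	}
}

func main() {
	fmt.Println(encode3('狗')) // [231 139 151]

	// The standard library produces the same bytes.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, '狗')
	fmt.Println(buf[:n]) // [231 139 151]
}
```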

Likewise, a string can be converted to a byte slice:

package main

import "fmt"

func main() {
	var a = "hi,狗"
	b := []byte(a) // copies the bytes of a into a new slice
	fmt.Println(b) // [104 105 44 231 139 151]
}
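
A practical consequence of byte-based indexing is that len counts bytes, not characters, and that a slice expression can cut a multi-byte character in half. A small sketch using utf8.RuneCountInString and utf8.ValidString from the standard library; counts is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the length of s in bytes and in runes (code points).
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	a := "hi,狗"
	nBytes, nRunes := counts(a)
	fmt.Println(nBytes, nRunes) // 6 4

	// a[:4] ends after the first of the three bytes of 狗,
	// so the result is no longer valid UTF-8.
	fmt.Println(utf8.ValidString(a[:4])) // false
	fmt.Println(utf8.ValidString(a[:6])) // true
}
```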

Iterating with for;range

The Unicode standard uses the term "code point" to refer to the item represented by a single value.

Put simply, a code point is the number the Unicode standard assigns to a character. For example, U+72D7 (29399 in decimal) stands for the character 狗.

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune.

So whenever Go documentation or code says rune, it means a code point.

Note: rune is a type alias for int32.

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
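
A short sketch of what the alias means in practice: a rune literal is just an integer, it can be assigned to an int32 without conversion, and fmt can print it as a code point, a number, or a character. The describe helper is hypothetical.

```go
package main

import "fmt"

// describe formats a rune as its code point, decimal value, and character.
func describe(r rune) string {
	return fmt.Sprintf("%U %d %c", r, r, r)
}

func main() {
	var r rune = '狗'
	var i int32 = r // no conversion needed: rune and int32 are the same type

	fmt.Println(describe(r)) // U+72D7 29399 狗
	fmt.Println(i == 0x72D7) // true
	fmt.Println(string(r))   // 狗
}
```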

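A for;range loop over a string decodes it one rune at a time, so the index it yields is the byte offset at which each rune starts and the value is the rune itself. That explains the output of the earlier example: 狗 appears once, at index 3, with the value 29399.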

Likewise, a string can be converted to a rune slice:

package main

import "fmt"

func main() {
	// Initialize from a string literal
	var a = "hi,狗"
	r := []rune(a) // decodes the UTF-8 bytes into code points
	fmt.Println(r) // [104 105 44 29399]
}
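
Converting to []rune is handy whenever an operation should work on characters instead of bytes. As one illustration, here is a minimal sketch that reverses a string by code point; reverse is a hypothetical helper, and it does not treat combining sequences specially.

```go
package main

import "fmt"

// reverse returns s with its code points in reverse order.
// Reversing the raw bytes instead would scramble multi-byte characters.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	fmt.Println(reverse("hi,狗")) // 狗,ih
}
```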

Of course, the same effect as the for;range syntactic sugar can also be achieved by hand with the "unicode/utf8" standard library package:

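package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	var a = "hi,狗"
	// w holds the byte width of the rune decoded in the previous step
	for i, w := 0, 0; i < len(a); i += w {
		runeValue, width := utf8.DecodeRuneInString(a[i:])
		fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
		w = width
	}
}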

/* output
U+0068 'h' starts at byte position 0
U+0069 'i' starts at byte position 1
U+002C ',' starts at byte position 2
U+72D7 '狗' starts at byte position 3
*/
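
One detail worth knowing about this manual decoding: when the bytes are not valid UTF-8, utf8.DecodeRuneInString returns utf8.RuneError (U+FFFD) with a width of 1, so the loop still advances. The sketch below feeds it a string cut in the middle of 狗; decodeAll is a hypothetical helper that collects the decoded runes and their byte offsets.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll decodes s rune by rune and returns each rune together
// with the byte offset at which it starts.
func decodeAll(s string) (runes []rune, offsets []int) {
	for i := 0; i < len(s); {
		r, width := utf8.DecodeRuneInString(s[i:])
		runes = append(runes, r)
		offsets = append(offsets, i)
		i += width
	}
	return runes, offsets
}

func main() {
	a := "hi,狗"
	broken := a[:5] // keeps only the first two of the three bytes of 狗

	runes, offsets := decodeAll(broken)
	// Each stray byte decodes to U+FFFD (65533) with width 1.
	fmt.Println(runes, offsets) // [104 105 44 65533 65533] [0 1 2 3 4]
}
```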