MySQL Charset & Collation: A First Study Summary

Date: 2021-01-30


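Preface: This post is a first summary of MySQL character sets and collations. It grew out of my curiosity about select length('漢字'); while studying. Because I am still learning and short on time, some questions were never properly resolved, and I am not pursuing them further for now. The summary is rough and my understanding is shallow; it is mainly a record of my own learning for later review. Readers should treat it as reference only, and comments are welcome.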

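In MySQL, a character set is a set of symbols together with their encodings (roughly, an encoding scheme), and a collation specifies how the characters are compared. (MySQL includes character set support that enables you to store data using a variety of character sets and perform comparisons according to a variety of collations.) See the overview of Chapter 10 and Section 10.1 of the 8.0 manual. (All manual references in this post are to the official MySQL 8.0 manual.)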

In MySQL 8.0 the default character set and collation are utf8mb4 and utf8mb4_0900_ai_ci. The character set and its collation can be specified separately at the server, database, table, and column level, and for character string literals.
- 1.1 List all character sets and collations that MySQL supports
  - show character set; show collation; both accept a filter: a like or where clause
    # character set can be abbreviated to charset;
    Manual Section 10.3.1 describes the naming conventions for character sets and collations in detail.
  - In MySQL, all charset and collation information is stored in the information_schema database. Besides the statements above, you can query its CHARACTER_SETS and COLLATIONS tables:
    use information_schema;
    select * from character_sets [where clause];
    select * from collations [where clause];
- 1.2 View the character sets and collations currently in effect
  show [session] variables like 'char%';   (or like 'collation%')
  or select * from performance_schema.session_variables where variable_name like 'character_set_%';
Table 1
- 1.4 For the collation naming conventions, see manual Section 10.3.1.
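The naming convention referred to in 1.4 can be illustrated with a small Python sketch. It is not part of MySQL: describe_collation is a hypothetical helper that only inspects the name, and it assumes the character set name itself contains no underscore. The suffix meanings follow the list in manual Section 10.3.1.

```python
# Suffix meanings taken from the collation naming conventions (manual 10.3.1).
SUFFIXES = {
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ks": "kana-sensitive",
    "bin": "binary",
}


def describe_collation(name):
    """Split a collation name into its character set and sensitivity suffixes.

    Hypothetical helper for illustration: it looks only at the name and
    assumes the character set part is everything before the first underscore.
    """
    parts = name.split("_")
    charset = parts[0]
    traits = [SUFFIXES[p] for p in parts[1:] if p in SUFFIXES]
    return charset, traits


for collation in ("utf8mb4_0900_ai_ci", "gbk_chinese_ci", "utf8mb4_bin"):
    print(collation, "->", describe_collation(collation))
```

Read this way, the default utf8mb4_0900_ai_ci belongs to utf8mb4 and is accent-insensitive and case-insensitive, which is why the character set of a column can be read off its collation.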

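1.5 This resolves the following puzzles:
- 1.4.1 select length('張'); mysql> 2
  In this query the Chinese character '張' is a character string literal (see manual Section 10.3.6 and part 2 below). Because the query specifies neither a character set nor a collation, the defaults apply. Table 1 shows character_set_connection | gbk, so the literal is encoded in GBK, and GBK uses two bytes for a Chinese character, hence the result 2.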
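  - 1.4.2 use pra; select length(stu_name) from stuinfo where stuid = 1; mysql> 6

    The string '張三' occupies 6 bytes, so each Chinese character takes 3 bytes. This can be explained as follows:
    - ① When the columns of the table in the pra database were created, no character set was specified, so according to manual Section 10.3.5 each column takes the character set of its table. The command in 3.2 shows that this is the default, utf8mb4.
    - ② Manual Section 10.9.1 describes the utf8mb4 character set in detail and states:
      - For BMP characters, utf8mb4 and utf8/utf8mb3 are essentially equivalent: each character has the same encoding and takes the same number of bytes (1 byte for an English letter, 3 bytes for a common Chinese character). This corrects the imprecise claim repeated on many forums that "utf8mb4 stores a Chinese character in 4 bytes and utf8mb3 in 3 bytes".
      - utf8mb4 uses 4 bytes only for supplementary (SMP) characters. utf8/utf8mb3 cannot store such characters at all, and utf8mb3 is deprecated.
      - When the two types are mixed, the utf8mb3 value is generally converted automatically to utf8mb4.
      - BMP characters can be roughly understood as commonly used characters and SMP characters as rarely used ones, such as emoji.
    - ③ For details of the various character encodings, see the cnblogs (博客園) post 字符串，那些事.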
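The byte counts behind 1.4.1 and 1.4.2 can be checked outside MySQL. The sketch below assumes that Python's gbk and utf-8 codecs produce the same byte sequences as MySQL's gbk and utf8mb4 for these sample characters; byte_length is a hypothetical helper name.

```python
def byte_length(text, encoding):
    """Number of bytes `text` occupies when encoded with a Python codec."""
    return len(text.encode(encoding))


# One ASCII letter, one BMP Chinese character, the two-character name from
# 1.4.2, and one supplementary-plane character (an emoji).
samples = ["a", "張", "張三", "😀"]

for text in samples:
    utf8_len = byte_length(text, "utf-8")
    try:
        gbk_len = byte_length(text, "gbk")
    except UnicodeEncodeError:
        gbk_len = None  # GBK has no encoding for this character
    print(repr(text), "utf-8 bytes:", utf8_len, "gbk bytes:", gbk_len)
```

The Chinese character takes 2 bytes in GBK and 3 in UTF-8, the two-character string takes 6 bytes in UTF-8, and only the emoji needs 4 bytes.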

A character string literal is a string written directly in a query; it stands apart from any table. See manual Section 10.3.6.
- 2.1 Syntax: [_charset_name]'string' [COLLATE collation_name]
- 2.2 Explanation: The _charset_name expression is formally called an introducer. It tells the parser, "the string that follows uses character set charset_name."
- 2.3
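- 2.4 Puzzle, in the system CMD window:
  ① select length('你'); mysql> 2
  ② select length(_utf8mb4 '你'); mysql> 2
  The result is the same. The command in 1.3 (Table 1) shows character_set_connection = gbk, so ① is understandable, but what about ②?
  Does the introducer in front of the string '你' have no effect? What does the sentence in 2.2 (It tells the parser, "the string that follows uses character set charset_name.") actually mean, and what other steps are involved?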
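  - 2.4.1 I switched to the MySQL Client CMD; the connection character set was again gbk and both results were still 2. I then switched to the Navicat command line, where the connection character set became utf8mb4 (the client and results character sets also became utf8mb4), and both select statements returned 3 (with the introducer in the second statement changed to _gbk).
    Conclusion: the introducer has no effect on the string itself, which is still governed by character_set_connection or other variables (this statement is not precise).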
  - 2.4.2 Looked up the material on introducers in manual Section 10.3.8:
```
An introducer does not change the string to the introducer character set like CONVERT() would do. It does not change the string value, although padding may occur. The introducer is just a signal. (I do not fully understand this.)
```
  ① Looked up the convert(expr using transcoding_name) function in Section 12.11, Cast Functions and Operators: it "converts data between different character sets", which appears to be a real conversion.

  ② This differs from the description of the introducer: It does not change the string value, although padding may occur. The introducer is just a signal. (So what does it actually do?)

  ③ Running select length(convert('你' using utf8mb4)); gives mysql> 3, while character_set_connection is still gbk.
  So, given ①, what role does the introducer actually play, and what role does character_set_connection play?
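  - 2.4.3 Read manual Section 10.4, Connection Character Sets and Collations, which covers the encoding conversions that take place between client and server.
    - 1. The exchange between client and server involves roughly three variables: character_set_client, character_set_connection, and character_set_results.
    - 2. The overall process is roughly as follows; for more detail see the Jianshu post by 七把刀.
      ① The server receives statements from the client encoded in character_set_client.
      ② The server converts the received statements from character_set_client to character_set_connection.
      The manual notes here: For string literals that have an introducer such as _utf8mb4 or _latin2, the introducer determines the character set. (How does it "determine" it? Nothing above looked like it was determining anything.)

      It also notes that collation_connection matters for comparisons of literal strings but not for comparisons of strings in table columns, since columns have their own collation.
      ③ The server sends the results back to the client encoded in character_set_results.

      ④ I could not follow the part of the 七把刀 Jianshu post that explains what happens once an introducer is specified.
    - 3. I cannot map this onto the actual processing of the query I am stuck on: how a bare function call is executed once it reaches the server, and how the encoding is converted along the way. Perhaps read the source code?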
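  - 2.4.4 Summary: this has little bearing on my current learning goals and has already taken too long, so I am not spending more time on it. Rough conclusions: ① the introducer did not appear to play much of a role in the whole process;
    ② the effect of the convert function is clearly visible;
    ③ after changing client, connection, and results individually or together, and combining them with the three cases (no introducer, introducer, convert), the results were confusing and exhausting, so I stopped;
    ④ the root problem is that I still do not understand introducers, the individual character sets, and the encoding conversions between client and server. I will come back to this when it becomes relevant and I have more experience.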
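One reading that is consistent with the manual quotations in 2.4.2 and 2.4.3 and with every result recorded in 2.4 can be written down as a small Python model. It has not been checked against the server source, so treat it as an assumption. In this model a literal without an introducer is converted from character_set_client to character_set_connection, a literal with an introducer keeps the bytes the client sent and is only relabeled, and CONVERT() really transcodes. The function names are hypothetical, and the Python codec names gbk and utf-8 stand in for MySQL's gbk and utf8mb4.

```python
def literal_on_server(text, client, connection, introducer=None):
    """Return (bytes, charset label) of a string literal as the server sees it.

    Model only: `client` and `connection` stand in for character_set_client
    and character_set_connection, given as Python codec names.
    """
    sent = text.encode(client)  # bytes the client program actually sends
    if introducer is not None:
        # Assumed: the introducer relabels the bytes without converting them.
        return sent, introducer
    # No introducer: converted from the client to the connection charset.
    return sent.decode(client).encode(connection), connection


def convert_using(value, target):
    """Model of CONVERT(expr USING target): a real transcoding of the bytes."""
    data, charset = value
    return data.decode(charset).encode(target), target


def length(value):
    """Model of LENGTH(): counts bytes, whatever the label says."""
    return len(value[0])


# CMD window: client and connection are both gbk.
print(length(literal_on_server("你", "gbk", "gbk")))
print(length(literal_on_server("你", "gbk", "gbk", introducer="utf-8")))
print(length(convert_using(literal_on_server("你", "gbk", "gbk"), "utf-8")))

# Navicat: client and connection are both utf8mb4 (utf-8 in Python).
print(length(literal_on_server("你", "utf-8", "utf-8")))
print(length(literal_on_server("你", "utf-8", "utf-8", introducer="gbk")))
```

The five calls print 2, 2, 3, 3, 3, matching the results recorded above. In this reading the introducer does "determine the character set", but only as a label: the byte count is fixed by what the client sent, which is why changing the introducer never changed length().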
By category
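- 3.1 Column level
  - View detailed information for every column of a table (including a collation column; the character set follows from the collation naming convention in 10.3.1)
    show full columns from table_name;
  - View information about the tables in the currently selected database (including table_name, Engine, version, create_time, update_time, collation, and so on)
    show table status [from database_name] [where name like '%name%'];
  - Change the charset and collation of a column
    alter table table_name modify field_name field_type character set charset_name [collate collation_name];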
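- 3.2 Table level
  - View the current create table statement (reflecting any later changes), which includes the table's default character set and collation
    show create table table_name;
  - Change the default charset and collation of a table
    alter table table_name charset charset_name;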
- 3.3 Database level
  - View the default character set and collation of the current database
    - show variables where variable_name = 'character_set_database';
    - use database_name;
      select @@character_set_database, @@collation_database;
    - select default_character_set_name, default_collation_name
      from information_schema.schemata
      where schema_name = 'db_name';
      (This method may depend on the current user's privileges; not verified.)
  - View the create database statement to see the database's default character set and collation
    show create database database_name;
  - Change the character set and collation of a database
    alter database database_name charset utf8mb4;
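- 3.4 Server level
  - View the server character set configuration
    show variables where variable_name = 'character_set_server';
    or simply show variables like 'char%';
  - Configure the server's default charset and collation (see Section 10.3.2 for details)
    - Permanent: in my.ini, set character-set-server=utf8mb4 under [mysqld] (or start mysqld with --character-set-server=utf8mb4), then restart the MySQL service. (my.ini is usually in the mysql directory under Program Files or ProgramData on drive C.)
    - Temporary: on the command line, run set character_set_server = utf8mb4;
    - The manual also describes setting the defaults with CMake options when building from source.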
- 3.5 View the connection, client, and results character sets and collations
  For example: select @@character_set_connection, @@collation_connection;
  or: show variables like 'char%';
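The levels in 3.1 to 3.4 form a fallback chain: a column created without an explicit character set takes the table default, a table takes the database default, and a database takes the server default (this is the reasoning used in 1.4.2 ①). A minimal Python sketch of that lookup follows. effective_charset is a hypothetical helper, and the model ignores the fact that MySQL resolves the default once, when the object is created, rather than on every query.

```python
def effective_charset(column=None, table=None, database=None, server="utf8mb4"):
    """Return (charset, level) for the most specific level that sets one.

    Hypothetical model: None means "not specified at this level".
    """
    for level, charset in (("column", column), ("table", table), ("database", database)):
        if charset is not None:
            return charset, level
    return server, "server"


# Nothing specified anywhere: the server default applies.
print(effective_charset())
# Only the table specifies a character set: its columns inherit it.
print(effective_charset(table="gbk"))
# A column-level setting overrides the table default.
print(effective_charset(column="latin1", table="gbk"))
```

This is also why changing the database or table default later does not by itself re-encode columns that already exist.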
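4. Miscellaneous
- For the differences between the functions length(), char_length(), and character_length(), see manual Chapter 12, Functions and Operators.
- Manual Section 13.7, Database Administration Statements, describes the SET NAMES statement in 13.7.6.3.
  - set names {'charset_name' [collate 'collation_name'] | default};
  - This statement sets the three session system variables character_set_client, character_set_connection, and character_set_results to the given charset_name; the collate clause is optional. Afterwards the command in 1.3 shows the effect. (The setting lasts only for the current session.)
  - The default value restores the default mapping, which depends on the server configuration:
    The default mapping can be restored by using a value of DEFAULT. The default depends on the server configuration
- For details and caveats, see the manual. The blog post summarizing SET NAMES statements is thorough and worth consulting.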
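The difference between the three functions in the first bullet is bytes versus characters: length() counts bytes, while char_length() and its synonym character_length() count characters. The Python sketch below models that distinction. The helper names are hypothetical, and the Python codecs utf-8 and gbk stand in for MySQL's utf8mb4 and gbk.

```python
def length(text, charset):
    """Model of LENGTH(): number of bytes in the given encoding."""
    return len(text.encode(charset))


def char_length(text):
    """Model of CHAR_LENGTH() / CHARACTER_LENGTH(): number of characters."""
    return len(text)


name = "張三"
print("bytes in utf-8:", length(name, "utf-8"))
print("bytes in gbk:", length(name, "gbk"))
print("characters:", char_length(name))
```

For '張三' the byte count depends on the character set (6 in UTF-8, 4 in GBK) while the character count is 2 in both, which is why select length('漢字') gave different answers in different clients.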