Processing UTF-8 Text in C Programs

Date: 2019-11-10

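If you do not understand the UTF-8 encoding well, do not try to process UTF-8 text by hand in a C program. If you understand UTF-8 very well, there is even less reason to do so. Find a C library that provides UTF-8 text processing and runs across platforms, and let it do the job!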

GLib is such a library.

Starting from the Problem

The following text is UTF-8 encoded (I am sure of this because I am on a Linux system whose default text encoding is UTF-8):

個人 C81 天天都在口袋裏
                      @

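I need to read this text in a C program. When the program reaches the '@' character, I need to determine whether all the text to the left of '@' on the same line is whitespace.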

For simplicity, I skip the file-reading step and represent the text above as a C string:

gchar *demo_text =
        "個人 C81 天天都在口袋裏\n"
        "                      @";

Note: in GLib, gchar is simply char, i.e. typedef char gchar;

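In what follows, when I say "the demo_text string", I mean the strlen(demo_text) + 1 bytes of memory whose base address is the value of the demo_text pointer. This is basic knowledge about C strings.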

UTF-8 Text Length and Character Positioning

To simulate the moment the program reads the '@' character, I need a char * pointer that locates '@' in the demo_text string.

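'@' is at the end of demo_text. I need an offset, and that offset follows from the length of demo_text counted in UTF-8 characters rather than in bytes. With it I can jump from the base address of demo_text to the base address of '@'.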

GLib provides g_utf8_strlen to compute the length of a UTF-8 string, so I can obtain the character count from which the offset of '@' follows:

glong offset = g_utf8_strlen(demo_text, -1);

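The result is 38, which is exactly the length of demo_text in UTF-8 characters (not counting the terminating null character, '\0'). Character offsets start at 0, so the '@' at the end sits at offset 37.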

The prototype of g_utf8_strlen is:

glong g_utf8_strlen(const gchar *p, gssize max);

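Note: glong is long, and gssize is the signed counterpart of gsize (it corresponds to ssize_t on most platforms, which is commonly signed long on 64-bit Linux).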

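The second parameter of g_utf8_strlen, max, follows these rules:

If it is negative, the string is assumed to be null-terminated (the usual C string convention), and the UTF-8 characters are counted up to the terminator.
If it is 0, the string is not examined at all and the result is 0, so this value is of little practical use.
If it is positive, it is a number of bytes. g_utf8_strlen examines at most that many bytes from the start of the string and counts the UTF-8 characters they contain.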
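To make the three cases concrete, here is a minimal plain-C sketch of this kind of counting. It is not GLib's implementation: sketch_utf8_seq_len and sketch_utf8_strlen are hypothetical names introduced only for illustration, the input is assumed to be valid UTF-8, and a character that does not fit completely within max bytes is assumed not to be counted.

```c
#include <stddef.h>

/* Byte length of a UTF-8 sequence, judged from its lead byte (no validation). */
static size_t sketch_utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;            /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                             /* invalid lead byte: step over it */
}

/* Hypothetical stand-in that mirrors the three cases of the max parameter. */
static long sketch_utf8_strlen(const char *p, long max) {
    long count = 0;
    size_t used = 0;

    if (max == 0) return 0;               /* the string is not examined */
    while (p[used] != '\0') {
        size_t len = sketch_utf8_seq_len((unsigned char)p[used]);
        /* Positive max: stop before a character that would cross the limit. */
        if (max > 0 && used + len > (size_t)max) break;
        used += len;
        count++;
    }
    return count;
}
```

For a string consisting of 'a', one three-byte Chinese character and 'b', the sketch reports 3 characters with a negative max, 2 with max set to 4, and 1 with max set to 3. In real code, g_utf8_strlen remains the function to call.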

With the offset in hand, '@' can be located in demo_text:

gchar *tail = g_utf8_offset_to_pointer(demo_text, offset - 1);

Now the value of tail is the base address of the '@' character.
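The reason a character offset cannot be used as a plain array index is that characters occupy a varying number of bytes. The sketch below shows the idea in plain C under the assumption of valid UTF-8 input. sketch_offset_to_pointer is a hypothetical helper for illustration only; unlike the GLib function it handles only non-negative offsets.

```c
#include <stddef.h>

/* Byte length of a UTF-8 sequence, judged from its lead byte (no validation). */
static size_t sketch_utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1; /* invalid lead byte: step over it */
}

/* Advance `offset` characters from `s`, stopping early at the terminating NUL. */
static const char *sketch_offset_to_pointer(const char *s, long offset) {
    while (offset > 0 && *s != '\0') {
        s += sketch_utf8_seq_len((unsigned char)*s);
        offset--;
    }
    return s;
}
```

With a string made of one three-byte character followed by "a@", character offset 2 lands on '@', which is 4 bytes from the start.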

Walking Through UTF-8 Text

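Now that the position of '@' is known, the next step is to traverse the other characters of demo_text leftwards (that is, in reverse) from that position. GLib provides g_utf8_prev_char and g_utf8_find_prev_char for this:

gchar * g_utf8_prev_char(const gchar *p);
gchar * g_utf8_find_prev_char(const gchar *str, const gchar *p);

g_utf8_prev_char returns the base address of the UTF-8 character before p (p being the base address of the current UTF-8 character). It performs no bounds check, so p must not already be at the start of the string. g_utf8_find_prev_char additionally takes the start of the string, str, and returns NULL when p is the same as str, that is, when no character precedes p.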

For the problem at hand, g_utf8_prev_char makes it possible to traverse, in reverse, all the UTF-8 characters before '@', starting from the position of '@' in demo_text:

glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
/* Stop once viewer reaches the start, so the first character is visited too. */
while (viewer != demo_text) {
        viewer = g_utf8_prev_char(viewer);
        /* do something here */
}

GLib also provides g_utf8_next_char, which returns the base address of the UTF-8 character following the current position.
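Stepping backwards works because UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so the start of the previous character is the nearest earlier byte that does not match that pattern. The following plain-C sketch illustrates the principle. The two functions are hypothetical stand-ins written for this illustration, not GLib code, and they assume valid UTF-8.

```c
#include <stddef.h>

/* Step back to the lead byte of the previous character (no bounds check). */
static const char *sketch_prev_char(const char *p) {
    do {
        --p;
    } while (((unsigned char)*p & 0xC0) == 0x80); /* skip continuation bytes */
    return p;
}

/* Bounded variant: returns NULL when no character precedes p within str. */
static const char *sketch_find_prev_char(const char *str, const char *p) {
    while (p > str) {
        --p;
        if (((unsigned char)*p & 0xC0) != 0x80) return p;
    }
    return NULL;
}
```

The unbounded variant must never be called with a pointer to the start of the string, which is why the loops in this article test viewer against demo_text before stepping back.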

Extracting a UTF-8 Character

g_utf8_prev_char and g_utf8_next_char let a pointer move around in UTF-8 text, but they only position the pointer at the base address of some UTF-8 character. Getting hold of the character itself is not as easy.

For example:

viewer = g_utf8_prev_char(viewer);

Here viewer moves back by the width of one UTF-8 character and arrives at the base address of a new UTF-8 character. But if I want to print that character, the following will not do:

g_print("%s", viewer);

Note: g_print is largely equivalent to printf from the C standard library, except that its output can be "redirected" with g_set_print_handler.

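For g_print to print a single UTF-8 character through viewer, that character would have to be followed by '\0', so that it is printed as an ordinary C string. Without the terminator, g_print prints everything from viewer to the end of the string. And the character cannot be followed by '\0' unless it is the last character of demo_text.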

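One way to solve this is to extract the bytes of the UTF-8 character that viewer points to, place them in a character array or in storage allocated on the heap, and then print that array or heap storage. For example: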

gchar *new_viewer = g_utf8_next_char(viewer);

size_t n = new_viewer - viewer;
gchar *utf8_char = malloc(n + 1);
memcpy(utf8_char, viewer, n);
utf8_char[n] = '\0';
g_print("%s", utf8_char);
free(utf8_char);

This is clearly too tedious, which means we should write a function dedicated to the job. The function can be called get_utf8_char and is defined as follows:

static gchar * get_utf8_char(const gchar *base) {
        gchar *new_base = g_utf8_next_char(base);
        gsize n = new_base - base;
        gchar *utf8_char = g_memdup(base, (n + 1));
        utf8_char[n] = '\0';
        return utf8_char;
}

With this function, all the UTF-8 characters before '@' in demo_text can be printed in reverse order, starting from the position of '@':

glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
/* Walk back to the start of demo_text, printing each character on the way. */
while (viewer != demo_text) {
        viewer = g_utf8_prev_char(viewer);
        gchar *utf8_char = get_utf8_char(viewer);
        g_print("%s", utf8_char);
        g_free(utf8_char);
}
g_print("\n");

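Note: g_memdup is equivalent to malloc + memcpy from the C standard library, and g_free is equivalent to free. GLib 2.68 and later deprecate g_memdup in favor of g_memdup2, which takes a gsize byte count.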
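Since a UTF-8 character occupies at most 4 bytes, the extraction can also be done without heap allocation by copying into a small caller-provided buffer. The sketch below shows that variant in plain C. sketch_copy_utf8_char is a hypothetical helper, not part of GLib, and it judges the character's length from the lead byte alone without validating the sequence.

```c
#include <stddef.h>

/* Byte length of a UTF-8 sequence, judged from its lead byte (no validation). */
static size_t sketch_utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1; /* invalid lead byte: treat it as a single byte */
}

/* Copy the character at `base` into `out` (room for 4 bytes plus the
 * terminator) and return the number of bytes copied. */
static size_t sketch_copy_utf8_char(const char *base, char out[5]) {
    size_t n = sketch_utf8_seq_len((unsigned char)*base);
    size_t i = 0;
    /* Never read past the end of the string, even if the text is truncated. */
    while (i < n && base[i] != '\0') {
        out[i] = base[i];
        i++;
    }
    out[i] = '\0';
    return i;
}
```

The caller declares char buf[5], calls sketch_copy_utf8_char(viewer, buf), and can then print buf with "%s". Nothing needs to be freed afterwards, at the price of the caller owning the buffer.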

Comparing Whitespace Characters

Now, given a UTF-8 character x, how do we tell whether it equals some other UTF-8 character?

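Remember that a UTF-8 character is, in essence, just a stretch of memory referenced by a char * pointer. Once the character has been extracted into a null-terminated string, strcmp from the C standard library is enough to compare UTF-8 characters.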
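As a side note, the comparison can also be made in place, without extracting the character first. In valid UTF-8 no character's encoding is a prefix of another character's encoding, so it is enough to check whether the text at the pointer starts with the bytes of the candidate character. The helper below is a hypothetical sketch of that idea and is not used by the rest of this article.

```c
#include <string.h>

/* Nonzero if the text at `p` begins with the single UTF-8 character `ch`.
 * `ch` is expected to be a non-empty, null-terminated string holding
 * exactly one character. */
static int sketch_utf8_char_equals(const char *p, const char *ch) {
    return strncmp(p, ch, strlen(ch)) == 0;
}
```

strncmp stops at the terminating '\0' of either argument, so the check does not read past the end of the text even when the candidate is longer than what remains.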

Below I define the function is_space, which checks whether a UTF-8 character is a whitespace character.

static gboolean is_space(const gchar *s) {
        gboolean ret = FALSE;
        char *space_chars_set[] = {" ", "\t", "　"};
        size_t n = sizeof(space_chars_set) / sizeof(space_chars_set[0]);
        for (size_t i = 0; i < n; i++) {
                if (!strcmp(s, space_chars_set[i])) {
                        ret = TRUE;
                        break;
                }
        }
        return ret;
}

Note: gboolean is the boolean type defined by GLib; its value is either TRUE or FALSE.

In is_space I check only three kinds of whitespace: the ASCII space, the tab, and the Chinese full-width space.

Carriage return and line feed are whitespace too, but to solve the problem posed at the start of this article I need a separate check for the line feed:

static gboolean is_line_break(const gchar *s) {
        return (!strcmp(s, "\n") ? TRUE : FALSE);
}

Solving the Problem

Everything is now in place, so it is time to solve the problem. If you have forgotten what the problem is by now, please revisit the first section.

The code below looks rather ugly, but it solves the problem.

gboolean is_right_at_sign = TRUE;
glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (viewer != demo_text) {
        viewer = g_utf8_prev_char(viewer);
        gchar *utf8_char = get_utf8_char(viewer);
        if (!is_space(utf8_char)) {
                if (!is_line_break(utf8_char)) {
                        is_right_at_sign = FALSE;
                        g_free(utf8_char);
                        break;
                } else {
                        g_free(utf8_char);
                        break;
                }
        }
        g_free(utf8_char);
}
if (is_right_at_sign) g_print("Right @ !\n");

A little simplification gives:

gboolean is_right_at_sign = TRUE;
glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (viewer != demo_text) {
        viewer = g_utf8_prev_char(viewer);
        gchar *utf8_char = get_utf8_char(viewer);
        if (!is_space(utf8_char)) {
                if (!is_line_break(utf8_char)) is_right_at_sign = FALSE;
                g_free(utf8_char);
                break;
        }
        g_free(utf8_char);
}
if (is_right_at_sign) g_print("Right @ !\n");

In fact, if the extraction of the UTF-8 character and the release of its memory are moved into is_space and is_line_break, that is:

static gboolean is_space(const gchar *c) {
        gboolean ret = FALSE;
        gchar *utf8_char = get_utf8_char(c);
        char *space_chars_set[] = {" ", "\t", "　"};
        size_t n = sizeof(space_chars_set) / sizeof(space_chars_set[0]);
        for (size_t i = 0; i < n; i++) {
                if (!strcmp(utf8_char, space_chars_set[i])) {
                        ret = TRUE;
                        break;
                }
        }
        g_free(utf8_char);
        return ret;
}

static gboolean is_line_break(const gchar *c) {
        gboolean ret = FALSE;
        gchar *utf8_char = get_utf8_char(c);
        if (!strcmp(utf8_char, "\n")) ret = TRUE;
        g_free(utf8_char);
        return ret;
}

the result can be simplified further:

gboolean is_right_at_sign = TRUE;
glong offset = g_utf8_strlen(demo_text, -1);
gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
while (viewer != demo_text) {
        viewer = g_utf8_prev_char(viewer);
        if (!is_space(viewer)) {
                if (!is_line_break(viewer)) is_right_at_sign = FALSE;
                break;
        }
}
if (is_right_at_sign) g_print("Right @ !\n");
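The same logic can be packaged as a function that takes the text and any position in it, instead of being hard-wired to the end of demo_text. The following is only a plain-C sketch of that packaging, with hypothetical helper names. It swaps the GLib calls for hand-written byte checks purely to keep the example self-contained, it assumes valid UTF-8, and the full-width space is written as its byte sequence E3 80 80.

```c
#include <string.h>

/* Step back to the lead byte of the previous character (no bounds check). */
static const char *sketch_prev_char(const char *p) {
    do {
        --p;
    } while (((unsigned char)*p & 0xC0) == 0x80);
    return p;
}

/* Nonzero if the text at `p` begins with the character `ch`. */
static int sketch_starts_with(const char *p, const char *ch) {
    return strncmp(p, ch, strlen(ch)) == 0;
}

/* Nonzero if only space, tab or full-width space lies between `pos`
 * and the start of its line (or the start of `text`). */
static int sketch_only_blank_before(const char *text, const char *pos) {
    while (pos != text) {
        pos = sketch_prev_char(pos);
        if (sketch_starts_with(pos, "\n")) return 1; /* reached the line start */
        if (!sketch_starts_with(pos, " ") &&
            !sketch_starts_with(pos, "\t") &&
            !sketch_starts_with(pos, "\xE3\x80\x80")) {
            return 0;                                /* found a non-blank */
        }
    }
    return 1; /* reached the start of the text */
}
```

Called with a text and a pointer to the '@' in it, the function returns nonzero exactly in the cases where the loop above leaves is_right_at_sign set to TRUE.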

Appendix: Complete Code

#include <string.h>
#include <glib.h>

gchar *demo_text =
        "個人 C81 天天都在口袋裏\n"
        "                      @";

static gchar * get_utf8_char(const gchar *base) {
        gchar *new_base = g_utf8_next_char(base);
        gsize n = new_base - base;
        gchar *utf8_char = g_memdup(base, (n + 1));
        utf8_char[n] = '\0';
        return utf8_char;
}

static gboolean is_space(const gchar *c) {
        gboolean ret = FALSE;
        gchar *utf8_char = get_utf8_char(c);
        char *space_chars_set[] = {" ", "\t", "　"};
        size_t n = sizeof(space_chars_set) / sizeof(space_chars_set[0]);
        for (size_t i = 0; i < n; i++) {
                if (!strcmp(utf8_char, space_chars_set[i])) {
                        ret = TRUE;
                        break;
                }
        }
        g_free(utf8_char);
        return ret;
}

static gboolean is_line_break(const gchar *c) {
        gboolean ret = FALSE;
        gchar *utf8_char = get_utf8_char(c);
        if (!strcmp(utf8_char, "\n")) ret = TRUE;
        g_free(utf8_char);
        return ret;
}

int main(void) {
        gboolean is_right_at_sign = TRUE;
        glong offset = g_utf8_strlen(demo_text, -1);
        gchar *viewer = g_utf8_offset_to_pointer(demo_text, offset - 1);
        while (viewer != demo_text) {
                viewer = g_utf8_prev_char(viewer);
                if (!is_space(viewer)) {
                        if (!is_line_break(viewer)) is_right_at_sign = FALSE;
                        break;
                }
        }
        if (is_right_at_sign) g_print("Right @ !\n");

        return 0;
}

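To compile this code with gcc in Bash, the following command can be used (the pkg-config output is placed after the source file because some linkers drop libraries that appear before the objects that need them):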

$ gcc utf8-demo.c -o utf8-demo $(pkg-config --cflags --libs glib-2.0)
