This article is about reading and writing Unicode to character streams in UTF-8 encoding, and -as a consequence- about an often misunderstood aspect of the C++ STL / Iostream library: locales.
The documentation that comes with the STL itself, although technically accurate, does not help much in understanding the relations between the objects involved in even a simple expression like a_stream >> variable; partly because some of the details are hidden by the underlying logic.
Also, the behavior of the STL and its relation with the operating system are not always evident, making certain operations look "mysterious".
This article goes into some aspects of the Unicode encodings, STL locales, and their relation with Windows.
The code presented here targets VC8 (I used Visual C++ 2005 Express) as well as MinGW (and how this compiler is distributed with respect to Unicode is far from obvious).
The history of character encodings is not as linear as it could be: a number of assumptions made at certain times, and reverted later, created -and still create- some confusion.
In the beginning there were the typewriters (mechanical machines to type characters on paper) and the teletypes, which -in essence- were typewriters with wires between the keyboard and the paper.
To grant interoperability between the two halves of these machines, the ANSI committee defined the ASCII character set.
This set was designed to provide a binary representation for the 26 Latin characters, in both small and capital form, some punctuation and accents, and some "commands" of typical use in teletyping like CR, LF, FF, etc. It was designed to fit into 7 bits, to help hardware manufacturers by leaving the 8th bit available for error checking (parity).
The lack of accented characters was not a big deal, since teletyping can over-impress: an "à" was simply written as " a BS ' " (BS is the backspace).
Supporting different countries was also not necessarily required: ASCII is the "American Standard Code ...". Whoever was not American (or not comfortable with the American standard) just used another encoding scheme.
Interactive computers (with a keyboard and a monitor) complicated this aspect a bit:
A number of attempts to extend the ASCII character set were first made by hardware manufacturers.
IBM -when introducing the first PC- came with an 8-bit char-set of 256 symbols, matching the ASCII ones from 32 to 126 (the "ASCII printable") and adding some accented letters, some mathematical symbols, and some semi-graphics.
All of those "some"s were the result of a compromise that -in fact- didn't match everyone's needs: it just attempted to satisfy 90% of the users of 90% of the countries IBM served at that time. But it fitted 8 bits.
To better solve the problem, the concept of Code-page was introduced.
Essentially, the correspondence between codes and glyphs was made configurable, so that every country could configure the 2nd half of the char-set with the characters it needed most. Interoperability was assured only by the first 128 codes.
Later DOS versions -and early Windows- used the 8-bit ANSI code, with a number of code-pages for a variety of "editions".
The drawback of this method was that it was essentially impossible to hold texts mixing very heterogeneous languages: mixing Arabic and Japanese was practically impossible.
And reading a French text on an Arabic PC was sometimes a pleasure, and even French to Italian led to strange mis-writings, due to the fact that the same accented characters had different codings.
Also, a problem was still present with languages that require more than 128 specific symbols (think of Chinese): for them, multi-byte code-pages were introduced, giving MBCS.
Unicode was introduced mainly to try to clean up all of this mess: assuming that the world cannot fit into 8 bits, it gave a distinct ID to every encoded symbol.
This is known as UCS - Universal Character Set.
In its first definition it contained less than 65536 characters, and this made many software developers confident that 16 bits were enough to represent them all.
This is known as UCS-2.
The current situation sees a UCS defined up to 0x10FFFF (although with many still unassigned elements), thus requiring 21 bits.
UCS-4 (4 bytes), using an unsigned int (often typedef-ined as dchar_t) as character, certainly fits everything, but for many languages it is a waste of space.
Also, many communication channels drive bytes, not shorts or ints, and the way bytes are ordered into shorts and ints depends on the architecture of processors and machines; hence a pure binary dump of dchars or wchars is not practicable for files that are designated for communication or interoperation between different machines or devices.
To address the above problems, a number of encodings, attempting to keep interoperability with legacy environments, have been deployed for 7, 8, 16 and 32 bit environments.
In particular, in Windows environments, where Unicode was originally deployed as UCS-2 (16 bits) and communications still work on bytes, the 8 and 16 bit encodings are particularly useful and comfortable.
These encodings are known as UTF-8 and UTF-16:
The encoding used to represent Unicode as bytes is based on rules that define how to break up the bit-string representing a UCS value into bytes.
As far as the actual UCS space is concerned, no encoding should exist for more than 21 bits, hence the last two rules have no actual application, and -in fact- current Unicode specifications consider them invalid.
It is clearly an encoding that privileges the low codes, resulting in shorter encodings, against the high codes.
Wikipedia has a good article about UTF-8 that shows the trade-offs against UCS-4 and UTF-16.
It must however be taken into account that all the markup used to represent text in pages (think of HTML) or data in messages (think of XML) is ASCII.
This may balance the longer encoding of -for example- Chinese text strings.
Also, endianness is irrelevant, all the codes being "bytes", with no need to define an order.
For those reasons, UTF-8 became popular as a format to store texts across the Internet, since they remain the same independently of who reads/writes them.
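As an illustration of the break-up rules, here is a minimal sketch (not taken from the article's library; the function name utf8_encode is mine) of how a single UCS code point maps onto UTF-8 bytes:

#include <vector>

// Minimal sketch: encode one UCS code point into UTF-8 bytes, following the
// "low codes get shorter sequences" rule described above.
std::vector<unsigned char> utf8_encode(unsigned long ucs)
{
    std::vector<unsigned char> out;
    if (ucs < 0x80)                        // up to 7 bits -> 1 byte: 0xxxxxxx
        out.push_back((unsigned char)ucs);
    else if (ucs < 0x800) {                // up to 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        out.push_back((unsigned char)(0xC0 | (ucs >> 6)));
        out.push_back((unsigned char)(0x80 | (ucs & 0x3F)));
    }
    else if (ucs < 0x10000) {              // up to 16 bits -> 3 bytes
        out.push_back((unsigned char)(0xE0 | (ucs >> 12)));
        out.push_back((unsigned char)(0x80 | ((ucs >> 6) & 0x3F)));
        out.push_back((unsigned char)(0x80 | (ucs & 0x3F)));
    }
    else {                                 // up to 21 bits -> 4 bytes
        out.push_back((unsigned char)(0xF0 | (ucs >> 18)));
        out.push_back((unsigned char)(0x80 | ((ucs >> 12) & 0x3F)));
        out.push_back((unsigned char)(0x80 | ((ucs >> 6) & 0x3F)));
        out.push_back((unsigned char)(0x80 | (ucs & 0x3F)));
    }
    return out;   // e.g. 0x20AC (the Euro sign) yields E2 82 AC
}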
The encoding was introduced after the definition of UCS-2, which was in turn the "as-is" representation of UCS up to 16 bits, essentially after the discovery that 16 bits were not enough to encode everything.
UTF-16, in essence, takes advantage of an unassigned "band" in UCS (from 0xD800 to 0xDFFF) to represent what cannot fit into 16 bits.
Of course, there is a strong suspicion that such an unassigned band was left there after discovering that no space remained to encode what was in the process of being encoded.
In essence, characters are encoded as follows: code points up to 0xFFFF are stored as a single 16-bit unit, while higher code points are stored as a pair of "surrogates" taken from the reserved band.
It is important to note how UTF-8 can be wider than actual Unicode (it can go up to 31 bits), while UTF-16 is stuck at 21 bits as a maximum. It will be interesting to see what inventions will be made if Unicode ever becomes wider than the currently specified 21 bits... (if it ever does).
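A minimal sketch (illustration only, not code from the article's library) of how the surrogate mechanism works:

// Split a UCS code point into UTF-16 units, using the reserved
// 0xD800-0xDFFF band for code points above 0xFFFF.
void to_utf16(unsigned long ucs, wchar_t out[2], int& count)
{
    if (ucs < 0x10000) {                  // fits in a single 16-bit unit
        out[0] = (wchar_t)ucs;
        count = 1;
    }
    else {                                // 0x10000..0x10FFFF: surrogate pair
        unsigned long v = ucs - 0x10000;  // 20 significant bits remain
        out[0] = (wchar_t)(0xD800 + (v >> 10));    // high surrogate
        out[1] = (wchar_t)(0xDC00 + (v & 0x3FF));  // low surrogate
        count = 2;
    }
}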
The Windows operating systems evolved from the original IBM character set (or better, the OEM char-set, since different manufacturers may have differentiated) towards the ANSI char-set and code-pages.
This refers to 8-bit characters and language-dependent encodings, and was used mainly in Win16.
With the coming of Win32, Unicode was adopted, first as pure UCS-2, and then extended to support UTF-16 surrogates.
The API binaries that manipulate characters were doubled and renamed by adding an "A" for "ANSI" and a "W" for "Wide" (for example MessageBoxA and MessageBoxW), the first taking char-based parameters and the second taking wchar_t-based parameters.
A number of preprocessor "magics" are then defined in <tchar.h>, where, depending on the definition of the UNICODE and MBCS preprocessor symbols, the traditional API names are mapped to the corresponding A or W versions.
To take care of the differences various countries and cultures may have in representing numbers, dates, currency, etc., Windows introduced the concept of Locale as a set of information that can be retrieved through APIs and that is user customizable, to help programs adapt to user habits.
Unfortunately this is sometimes misused, causing not only text but also structured data to be represented in localized form on communication and storage media as well, with all the consequent problems of misinterpretation of dates etc. (what date is 11/10 ... or should it be 10/11?).
All this information is stored in the system registry. The OS provides a set of default values for the various countries, but users can override them by providing their own specifications.
For example, Italy uses '.' as thousands separator and ',' as decimal separator.
It is however frequent for Italian users to replace the '.' with an ''' to get numbers like 10'000, less prone to reading errors than 10.000, especially where it is not clear where the number comes from (and hence ... it could be just 10).
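A small sketch (mine, for illustration) of how a program can read those user-customized values through the Win32 locale API instead of assuming the country defaults:

#include <windows.h>
#include <stdio.h>

// Read the (possibly user-overridden) separators from the Windows Locale.
int main()
{
    char thousand[8], decimal[8];
    GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_STHOUSAND, thousand, sizeof(thousand));
    GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SDECIMAL,  decimal,  sizeof(decimal));
    printf("thousand sep = '%s', decimal sep = '%s'\n", thousand, decimal);
    return 0;
}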
Inside Windows, UTF-8 to UTF-16 and UTF-16 to UTF-8 conversions are possible through the WideCharToMultiByte and MultiByteToWideChar functions, by specifying CP_UTF8 as the codepage parameter.
It is anyway a string-to-string conversion, not an encode/decode of a stream.
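For example, a minimal sketch of such a string-to-string conversion (error handling reduced to a minimum; the helper name to_utf8 is mine):

#include <windows.h>
#include <string>

// UTF-16 wstring -> UTF-8 string through the Win32 API, passing CP_UTF8
// as the codepage parameter.
std::string to_utf8(const std::wstring& ws)
{
    if (ws.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), (int)ws.size(),
                                  NULL, 0, NULL, NULL);       // query required size
    std::string s(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), (int)ws.size(),
                        &s[0], len, NULL, NULL);              // actual conversion
    return s;
}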
The C language completely pre-dates both Unicode and Windows and -in fact- does not provide any direct support for Unicode, but a number of library functions have been adapted to take care of internationalization.
In this environment, a number of character-oriented functions like atof gained a corresponding _wtof, and -with the same preprocessor magic as <tchar.h>- a _ttof is defined as one or the other depending on the definition of the UNICODE or MBCS symbols.
What char and wchar_t effectively represent depends on the code-page used, which -together with a "locale"- defines the way numbers are represented and how characters are encoded.
Unfortunately, the way the C library is implemented defines the "locale" characteristics based on a set of static data, selectable with the setlocale function.
Such data has nothing to do with the data provided by the operating system's concept of "Locale", and is not "user customizable" (think of the case of the thousands separator replaced by an '''). There is -however- the possibility to convert a UTF-8 string into UTF-16 with the mbstowcs function, by specifying a locale having a UTF-8 codepage.
That's far easier said than done, since the library documentation is not so generous with this kind of information.
For example, you can discover that an encoding can be specified in fopen, e.g.:
fopen("newfile.txt", "rt+, ccs=<encoding>");
where <encoding> can be "UTF-8", although it's not documented as a standard.
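A hedged sketch of what that looks like with the Microsoft CRT (this is a Microsoft-specific extension, not standard C; the ccs flag puts the stream in Unicode mode, so the wide-character functions must then be used):

#include <stdio.h>

// Microsoft CRT specific: open a text file in Unicode mode with UTF-8 on disk;
// the stream then expects wide-character I/O functions like fwprintf.
int main()
{
    FILE* f = fopen("newfile.txt", "w, ccs=UTF-8");
    if (f) {
        fwprintf(f, L"some Unicode text \x00E8\x00E9\x20AC\n");  /* è é € */
        fclose(f);
    }
    return 0;
}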
But as you move to C++, it is practically impossible to find a similar functionality in fstream-s.
The C++ approach to I/O is based on the "stream" concept. How streams relate to files is not so obvious, since the STL documentation does not provide a plain description of that. You have to read a number of details of a variety of classes before getting a clue about the architecture behind them.
So let's go into the details just enough to understand where the clue is.
These classes have collaborative roles in performing input/output: the imbued locale takes care of the parsing and formatting of numbers (with the num_get and num_put facets) and of the translation of characters from the program's internal representation to and from the external representation (with the codecvt facet). All of that is a set of families of classes that manage different character representations (char or wchar_t) and different kinds of external streams (files or strings).
In their abstract definition, streams are rooted in a virtual ios_base (character type independent), then derived into basic_istream<.> and basic_ostream<.>, and then from those two into basic_iostream<.>, thus giving this hierarchy:
All streams must hold a "buffer" derived from basic_streambuf<.> and a locale, initialized by default as the C++ global locale (which is in turn initialized as the classic "C" locale).
In particular, file streams are nothing more than basic streams initialized with a basic_filebuf<.>, which overrides the basic_streambuf<.> virtual functions to manage file I/O, plus some pass-through functions like open, close, etc.
Similarly, string streams are basic streams initialized with a basic_stringbuf<.>.
The template parameters define the type of "elements" used by the stream internally to the program. In Windows environments they are normally char for the ANSI-oriented character representation and wchar_t for the Unicode (UTF-16) oriented representation.
But something strange happens with file streams: try this:
#include <fstream>
#include <windows.h>
#pragma comment(lib, "kernel32.lib")
#pragma comment(lib, "user32.lib")
#pragma comment(lib, "gdi32.lib")

int main()
{
    std::wofstream fs("testout.txt");
    const wchar_t* txt = L"some Unicode text òàèé€§";
    MessageBoxW(0, txt, L"verify", MB_OK);
    fs << txt << std::flush;
    MessageBoxW(0, fs.good()? L"Good": L"Bad", L"verify", MB_OK);
    return 0;
}
The call to the Unicode MessageBoxW confirms the proper string (it should end with the § symbol and have the Euro glyph as second-to-last).
Here's the dump of txt
from the debugger
0x0041770C  73 00 6f 00 6d 00 65 00 20 00 55 00 6e 00 69 00  s.o.m.e. .U.n.i.
0x0041771C  63 00 6f 00 64 00 65 00 20 00 74 00 65 00 78 00  c.o.d.e. .t.e.x.
0x0041772C  74 00 20 00 f2 00 e0 00 e8 00 e9 00 ac 20 a7 00  t. .ò.à.è.é.¬ §.
That's Unicode represented as UTF-16 in LE form (73-00, in WORD format, is 0x0073, i.e. just the plain 0x73 ('s') of ASCII, while AC-20 is 0x20AC, the Euro symbol €, which cannot be represented as a single byte).
Now look at the output file content with a hex editor. You should get (I used Notepad++ with the Hexedit plug-in):
"000000000 73 6F 6D 65 20 55 6E 69-63 6F 64 65 20 74 65 78 |some Unicode tex|" "000000010 74 20 F2 E0 E8 E9 |t òàèé |"
That's ANSI, with the text apparently truncated at the € symbol (and fs.good() is false).
This -at least- with VC8.
Doing the same test with another compiler (MinGW 3.4.5, I used CodeLite as IDE) with that same source (note that MinGW uses UTF-8 for sources, while VC8 uses ANSI; string literals don't survive, hence a different file is needed), things are even worse:
t1.cpp: In function `int main()':
t1.cpp:9: error: `wofstream' is not a member of `std'
t1.cpp:9: error: expected `;' before "fs"
t1.cpp:10:23: converting to execution character set:
t1.cpp:12: error: `fs' was not declared in this scope
In fact all the wchar_t
related stuff is under a conditional compilation driven by the _GLIBCXX_USE_WCHAR_T
symbol.
Introducing this workaround (essentially defining in the regular std
namespace the missing types) it compiles.
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
    typedef basic_ios<wchar_t>           wios;
    typedef basic_streambuf<wchar_t>     wstreambuf;
    typedef basic_istream<wchar_t>       wistream;
    typedef basic_ostream<wchar_t>       wostream;
    typedef basic_iostream<wchar_t>      wiostream;
    typedef basic_stringbuf<wchar_t>     wstringbuf;
    typedef basic_istringstream<wchar_t> wistringstream;
    typedef basic_ostringstream<wchar_t> wostringstream;
    typedef basic_stringstream<wchar_t>  wstringstream;
    typedef basic_filebuf<wchar_t>       wfilebuf;
    typedef basic_ifstream<wchar_t>      wifstream;
    typedef basic_ofstream<wchar_t>      wofstream;
    typedef basic_fstream<wchar_t>       wfstream;
}
#endif
#endif
But running it still shows fs going bad and no output produced (the file is created, but remains empty).
Debugging shows that the basic_streambuf::xsputn function catches an exception and sets the stream as bad. That exception is produced here:
template<typename _Facet>
inline const _Facet&
__check_facet(const _Facet* __f)
{
    if (!__f)
        __throw_bad_cast();
    return *__f;
}
where the actual type for _Facet is std::codecvt<wchar_t,char,int>. char?! Where does that come from?
Going back to VC, we find this strange note in the documentation of basic_filebuf (the derivation of basic_streambuf for file streams):
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method. To see an example that demonstrates this behavior, see below.
In essence, it seems nobody wants to say clearly that -independently of what we use in a program- output going to a FILE (yes, the old C FILE: that's what basic_filebuf writes to and reads from, there is no magic behind that) is by default always attempted to be converted into char-s, using the current global locale facets.
In VC8 this happens with a std::codecvt<wchar_t,char,mbstate_t> facet that is part of the default global locale, operating on an internal char buffer (as stated in the note); in MinGW no such facet is declared, and hence the locale cannot provide it (hence the exception).
... And that's the key to getting to UTF-8: let's provide to the basic_filebuf the facet it wants.
Not another facet with its own type and id (locales can support any number of "facets"), since that's not what the buffer class is looking for.
We have to derive from the proper std::codecvt<wchar_t,char,mbstate_t> and -where no such type is defined- we have to define it.
If we assume we are working with MinGW with no wchar_t support enabled, we first have to make the base facet exist.
Since all facets are defined as templates, that's easy:
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
    template<>
    class codecvt<wchar_t, char, mbstate_t>
        : public __codecvt_abstract_base<wchar_t, char, mbstate_t>
    {
    protected:
        explicit codecvt(size_t refs = 0)
            : __codecvt_abstract_base<wchar_t, char, mbstate_t>(refs) {}
    public:
        static locale::id id;
    };

    typedef basic_ios<wchar_t>           wios;
    typedef basic_streambuf<wchar_t>     wstreambuf;
    typedef basic_istream<wchar_t>       wistream;
    typedef basic_ostream<wchar_t>       wostream;
    typedef basic_iostream<wchar_t>      wiostream;
    typedef basic_stringbuf<wchar_t>     wstringbuf;
    typedef basic_istringstream<wchar_t> wistringstream;
    typedef basic_ostringstream<wchar_t> wostringstream;
    typedef basic_stringstream<wchar_t>  wstringstream;
    typedef basic_filebuf<wchar_t>       wfilebuf;
    typedef basic_ifstream<wchar_t>      wifstream;
    typedef basic_ofstream<wchar_t>      wofstream;
    typedef basic_fstream<wchar_t>       wfstream;
}
#endif
#endif
We are defining a specialization of codecvt<InnerType,OuterType,StateType> for <wchar_t,char,mbstate_t>, which is exactly what the compiler is searching for.
And we supply a locale::id static object as required by the STL implementation. This requires a cpp file to instantiate the static object (<rant>I hate globals ...</rant>).
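In that cpp file the instantiation is just the definition of the static member (a minimal sketch, mirroring the conditional compilation of the declaration above):

// In one (and only one) .cpp file: define the static id required by the
// locale machinery for the facet declared in the MinGW workaround above.
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
    locale::id codecvt<wchar_t, char, mbstate_t>::id;
}
#endif
#endif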
At this point we can -both for MinGW and for VC8- derive from codecvt<InnerType,OuterType,StateType>, overriding the virtual functions to implement a wchar_t to char conversion where wchar_t is UTF-16 and char is UTF-8.
This is the codecvt derivation, implemented as a translator UTF-16 <-> UCS <-> UTF-8, using the mbstate_t parameter as a carry between function invocations.
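A skeleton of what such a derivation looks like (a simplified sketch of the article's gel::stdx::utf8cvt; the exact virtual signatures may differ slightly between STL implementations, and the conversion bodies are discussed below):

#include <locale>
#include <cwchar>   // std::mbstate_t

// Sketch only: the facet derives from codecvt<wchar_t,char,mbstate_t> and
// overrides the conversion virtuals; mbstate_t carries the partial
// conversion state between calls.
template <bool bStrict>
class utf8cvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit utf8cvt(size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
protected:
    // external (UTF-8) -> internal (UTF-16)
    virtual result do_in(state_type& s,
                         const char* from, const char* from_end, const char*& from_next,
                         wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const;
    // internal (UTF-16) -> external (UTF-8)
    virtual result do_out(state_type& s,
                          const wchar_t* from, const wchar_t* from_end, const wchar_t*& from_next,
                          char* to, char* to_end, char*& to_next) const;
    // flush a pending partial sequence
    virtual result do_unshift(state_type& s,
                              char* to, char* to_end, char*& to_next) const;
    virtual bool do_always_noconv() const throw();
    virtual int  do_max_length() const throw();
    virtual int  do_encoding() const throw();
    virtual int  do_length(state_type& s,
                           const char* from, const char* from_end, size_t max) const;
};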
First of all, we have to decide what to do in case of invalid characters: sequences that may be present in the input but that are not valid UTF, or not even valid Unicode codepoints.
According to the Unicode specifications, invalid characters or sequences must be treated as "errors". But what "treated" means is left open to many interpretations.
If our purpose is to validate the input, we will probably want something that makes us aware of something going wrong; but if we are just reading a text, we are probably better served by something that doesn't stop reading just because of a miswritten char.
That's what the bool template parameter is for: if set to true, the implementation has a strict behavior, and upon every reading or writing of erroneous or illegal sequences or characters it throws a gel::stdx::utf_error exception, derived from std::runtime_error.
The STL implementation of basic_streambuf should catch this and set the owner stream as "bad", thus blocking it. The logic is thus no different from the one normally used in regular stream processing.
If the bool parameter is set to false, the Unicode restrictions are relaxed and invalid sequences are processed coherently with the algorithm.
It is thus possible to support up to 28-bit codepoints (we need 4 bits to manage the conversion steps), and to read overlong UTF-8 sequences as if they were "legal".
Overriding codecvt is trivial for at least three functions:
- do_always_noconv always returns false, since a conversion always needs to be done.
- do_max_length returns 6, since this is the longest UTF-8 possibility. Proper Unicode will never produce more than 4 chars, but an arbitrary wchar_t sequence can go over.
- do_encoding always returns -1, since the conversion is state dependent.
More complex is do_length: one would need to almost decode the sequence to know the length of the converted result, so we found it simpler to return a conservative value (min(_Len2, (size_t)(_Last1-_First1))). The consequence is probably a wider buffer allocation by the filebuf classes but -as can be verified by experiment- it seems that neither the MinGW nor the VC8 buffer implementations call this function. A sketch of these overrides follows.
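A minimal sketch of these trivial overrides (bodies for the utf8cvt skeleton shown earlier; the conservative do_length value follows the description above):

// Sketch of the trivial overrides described above.
template <bool bStrict>
bool utf8cvt<bStrict>::do_always_noconv() const throw()
{ return false; }            // a conversion is always required

template <bool bStrict>
int utf8cvt<bStrict>::do_max_length() const throw()
{ return 6; }                // longest possible UTF-8 sequence

template <bool bStrict>
int utf8cvt<bStrict>::do_encoding() const throw()
{ return -1; }               // state dependent: no fixed extern/intern ratio

template <bool bStrict>
int utf8cvt<bStrict>::do_length(std::mbstate_t&, const char* from,
                                const char* from_end, size_t max) const
{
    // Conservative estimate: min(max, bytes available in the input range).
    size_t avail = (size_t)(from_end - from);
    return (int)(max < avail ? max : avail);
}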
UTF-8 (external) to UTF-16 (internal) conversion is done using _Next1 and _Next2 as iterators walking from _First1 and _First2 to _Last1 and _Last2 (could it be otherwise?), incremented as something is read or written. If a buffer end is reached, the function returns partial if there is residual state, or ok otherwise.
The _State value is used as follows:
- bit 31 is a flag indicating that a UCS to UTF-16 conversion is in progress (the second surrogate still has to be emitted);
- bits 28-30 hold the number of chars still needed to complete the character being read: this is the carry of the UTF-8 to UCS conversion;
- bits 0-27 hold the UCS value accumulated so far.
The first part of the function (under if(!(_State & 0x80000000))) is executed to convert UTF-8 into UCS (the if protects against a UCS to UTF-16 conversion still in progress).
The second part (under if(!(_State & 0x70000000))) is executed when no UTF-8 residual exists (hence _State & 0x0FFFFFFF contains a complete UCS) and saves the UCS either as a single 16-bit unit or as a 21-bit surrogate pair (10 + 10 bits + 0x10000), by saving the first part, setting _State bit 31, and, upon that setting, saving the second part on the next loop.
Note that the concept of "next loop" depends on the policy adopted by the basic_streambuf inside the STL implementation.
It may happen inside the main for loop, or it may happen upon a further invocation of the function. Since we use _State as a carry between loops, and the state is allocated and maintained outside the function (it is passed by the caller), we essentially don't need to care about the buffers' residual length.
UTF-16 (internal) to 8 (external) is done similarly.
_State
bit 31 is used as a carry flag indicating a partial reading / writing.
When a complete UCS value has been read (from UTF-16), the first UTF-8 byte is saved. Subsequent bytes are written with a call to unshift, which is also called by the basic_streambuf itself to complete the output when no more input is present.
We need a std::locale in which the codecvt<wchar_t,char,mbstate_t> facet is replaced with a gel::stdx::utf8cvt, to be imbue-d into the stream buffer.
This very cryptic assertion merely means this:
std::locale utf8_locale(std::locale(), new gel::stdx::utf8cvt<true>);
std::wfstream fs;
fs.imbue(utf8_locale);
fs.open(yourfile, mode);
//whatever I/O to the stream
utf8_locale is -in this case- taken from the global locale, with a utf8cvt given to it.
The stream is just whatever wstream (a wfstream in this example, i.e. a basic_fstream<wchar_t>).
Of course, instead of imbue, we can replace the global locale by calling
std::locale::global(utf8_locale);
just after the utf8_locale
declaration.
We can also use a locale derived from another specific locale where appropriate, by simply creating a locale like
std::locale utf8_it(std::locale("It"), new gel::stdx::utf8cvt);
Thus having a UTF-8 Italian locale.
Because of the frequent use of the UTF-8 neutral locale, I declare a gel::stdx::utf8_locale<true> as an always accessible global object.
The "gel" directory contains both the header ("stdutif.h") and the cpp source ("gel.cpp") needed to properly instantiate the global and static variables.
The parent directory contains a tester program ("tester.cpp") and all the stuff to arrange both a VS8 project and a CodeLite project (to also test a MinGW compilation), and the grandparent directory contains the VS8 solution as well as a CodeLite workspace.
The tester program is a console application accepting up to two filenames on the command line (the first as input file and the second as output).
The default value for the input is testerin.txt and for the output is testerout.txt.
This application performs a number of congruence tests on the act of reading and writing with the UTF-8 facet, logging its activity on the standard output (std::cout) and using the ANSI and Unicode Win32 API MessageBox[A/W] to show how the read file is displayed.
The testing program proceeds as follows:
1. The input file is read with a std::ifstream (thus, char based) into a std::string, and displayed by MessageBoxA.
2. The input file is read with a std::wifstream (thus wchar_t based) into a std::wstring using the normal locale, and displayed by MessageBoxW.
3. The input file is read with a std::wifstream imbue-d with a gel::stdx::utf8_locale, displayed by MessageBoxW, and an output file is written as a std::wofstream, imbue-d with a gel::stdx::utf8_locale, with the string just read.
4. The output file is read back (as a std::ifstream) and compared byte by byte to check there are no differences.
As a sample, the file "testerin.txt" is provided in the tester project directory, with the following content:
Tester text: grade paragraph eacute egrave euro °§èé€
The file has been created with Notepad++ and saved as UTF-8 without BOM.
Running the tester program,
step 1 completes with the following message box:
Note that the bytes are read and displayed as ANSI, representing the 5 Unicode characters as 2,2,2,2,3 bytes.
Step 2 completes in the following way:
VC8 version | MinGW version
Note that MinGW reports a blank display for the reason described in the article (the missing default codecvt<wchar_t,char,mbstate_t>), while VC8 attempts an ANSI to Unicode conversion. It keeps the ASCII part safe, but the non-ASCII translation is codepage (hence system) dependent.
Step 3 completes as follows:
That's the correct display of the UTF-8 file.
Step 4, at this point, checks that the correctly read file has also been correctly written, by comparing the two byte sequences:
The entire execution is logged in the console as such:
(Note: if anything were different between the input and output files, the "=" would be replaced by an "x".
Also, note how the UTF-8 reading appears shorter than the ANSI reading, because of the compacting of the 2- and 3-byte characters.)
Debugging the two versions shows that the do_in function is called with a different pattern by the two STL implementations.
VC8 calls the function repeatedly, passing it one byte per call (thus reflecting the while(fs.get(c)) in the tester loop).
MinGW, in turn, calls the function once, supplying the entire sequence of bytes (and letting the function's inner loop do the job).
Finally some recommendations, based on my opinion and experience:
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)