This article is about reading and writing Unicode to character streams in UTF-8 encoding, and -as a consequence- about an often misunderstood aspect of the C++ STL / Iostream library: locales.
The documentation that comes with the STL itself, although technically accurate, does not help much in understanding the relations between the objects involved in even a simple expression like a_stream >> variable; partly because some of the details are hidden by the underlying logic.
Also, the behavior of the STL and its relation with the operating system are not always evident, making certain operations look "mysterious".
This article goes into some aspects of the Unicode encodings, STL locales, and their relation with Windows.
The code presented here targets VC8 (I used Visual C++ 2005 Express) as well as MinGW (and how this compiler is distributed with respect to Unicode is far from obvious).
The history of character encodings is not as linear as it could be: a number of assumptions made at certain times, and reverted later, created -and still create- some confusion.
In the beginning there were the typewriters (mechanical machines to type characters on paper) and the teletypes, which -in essence- were typewriters with wires between the keyboard and the paper.
To grant interoperability between the two halves of these machines, the ANSI committee defined the ASCII character set.
This set was designed to provide a binary representation for the 26 Latin characters, in both small and capital form, some punctuation and accents, and some "commands" of typical use in teletyping like CR, LF, FF, etc. It was designed to fit into 7 bits, to help hardware manufacturers by leaving the 8th bit available for error checking (parity).
The lack of accented characters was not a big deal, since teletyping can over-impress: an "à" was simply written as " a BS ' " (BS is the backspace).
Supporting different countries was also not necessarily required: ASCII is the "American Standard Code ...". Whoever was not American (or not comfortable with the American standard) just used another encoding scheme.
Interactive computers (with a keyboard and a monitor) complicated this aspect a bit:
A number of attempts to extend the ASCII character set were first made by hardware manufacturers.
IBM -when introducing the first PC- came with an 8-bit char-set of 256 symbols, matching the ASCII ones from 32 to 126 (the "ASCII printable") and adding some accented letters, some mathematical symbols, and some semi-graphics.
All of those "some"s were the result of a compromise that -in fact- didn't match everyone's needs: it just attempted to satisfy 90% of the users of 90% of the countries IBM served at that time. But it fitted 8 bits.
To better solve the problem, the concept of Code-page was introduced.
Essentially, the correspondence between codes and glyphs was made configurable, so that every country could configure the 2nd half of the char-set with the characters it needed most. Interoperability was assured only by the first 128 codes.
Later DOS versions -and early Windows- used the 8-bit ANSI code, with a number of code-pages for a variety of "editions".
The drawback of this method was that it was essentially impossible to hold texts mixing very heterogeneous languages: mixing Arabic and Japanese was practically impossible.
And reading a French text on an Arabic PC was sometimes a pleasure, and even French to Italian led to strange mis-writings, due to the fact that the same accented characters had different codings.
Also, a problem was still present with languages that require more than 128 specific symbols (think of Chinese): for them, multi-byte code-pages were introduced, giving MBCS.
Unicode was introduced mainly to try to clean up all of this mess: assuming that the world cannot fit into 8 bits, it gave a distinct ID to every encoded symbol.
This is known as UCS - Universal Character Set.
In its first definition it contained less than 65536 characters, and this made many software developers confident that 16 bits were enough to represent them all.
This is known as UCS-2.
The current situation sees a UCS defined up to 0x10FFFF (although with many still unassigned elements), thus requiring 21 bits.
UCS-4 (4 bytes), using an unsigned int (often typedef-ined as dchar_t) as character, certainly fits everything, but for many languages it is a waste of space.
Also, many communication channels drive bytes, not shorts or ints, and the way bytes are ordered into shorts and ints depends on the architecture of processors and machines; hence a pure binary dump of dchars or wchars is not practicable for files that are designated for communication or interoperation between different machines or devices.
To address the above problems, a number of encodings, attempting to keep interoperability with legacy environments, have been deployed for 7, 8, 16 and 32 bit environments.
In particular, in Windows environments, where Unicode was originally deployed as UCS-2 (16 bits) and communications still work on bytes, the 8 and 16 bit encodings are particularly useful and comfortable.
These encodings are known as UTF-8 and UTF-16:
The encoding used to represent Unicode as bytes is based on rules that define how to break up the bit-string representing a UCS value into bytes.
As far as the actual UCS space is concerned, no encoding should exist for more than 21 bits, hence the last two rules have no actual application, and -in fact- current Unicode specifications consider them invalid.
It is clearly an encoding that privileges the low codes, resulting in shorter encodings, against the high codes.
Wikipedia has a good article about UTF-8 that shows the trade-offs against UCS-4 and UTF-16.
It must however be taken into account that all the markup used to represent text in pages (think of HTML) or data in messages (think of XML) is ASCII.
This may balance the longer encoding of -for example- Chinese text strings.
Also, endianness is irrelevant, all the codes being "bytes", with no need to define an order.
For those reasons, UTF-8 became popular as a format to store texts across the Internet, since they remain the same independently of who reads/writes them.
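As an illustration of the break-up rules, here is a minimal sketch (not taken from the article's library; the function name utf8_encode is mine) of how a single UCS code point maps onto UTF-8 bytes:

#include <vector>

// Minimal sketch: encode one UCS code point into UTF-8 bytes, following the
// "low codes get shorter sequences" rule described above.
std::vector<unsigned char> utf8_encode(unsigned long ucs)
{
    std::vector<unsigned char> out;
    if (ucs < 0x80)                        // up to 7 bits -> 1 byte: 0xxxxxxx
        out.push_back((unsigned char)ucs);
    else if (ucs < 0x800) {                // up to 11 bits -> 2 bytes: 110xxxxx 10xxxxxx
        out.push_back((unsigned char)(0xC0 | (ucs >> 6)));
        out.push_back((unsigned char)(0x80 | (ucs & 0x3F)));
    }
    else if (ucs < 0x10000) {              // up to 16 bits -> 3 bytes
        out.push_back((unsigned char)(0xE0 | (ucs >> 12)));
        out.push_back((unsigned char)(0x80 | ((ucs >> 6) & 0x3F)));
        out.push_back((unsigned char)(0x80 | (ucs & 0x3F)));
    }
    else {                                 // up to 21 bits -> 4 bytes
        out.push_back((unsigned char)(0xF0 | (ucs >> 18)));
        out.push_back((unsigned char)(0x80 | ((ucs >> 12) & 0x3F)));
        out.push_back((unsigned char)(0x80 | ((ucs >> 6) & 0x3F)));
        out.push_back((unsigned char)(0x80 | (ucs & 0x3F)));
    }
    return out;   // e.g. 0x20AC (the Euro sign) yields E2 82 AC
}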
The encoding was introduced after the definition of UCS-2, which was in turn the "as-is" representation of UCS up to 16 bits, essentially after the discovery that 16 bits were not enough to encode everything.
UTF-16, in essence, takes advantage of an unassigned "band" in UCS (from 0xD800 to 0xDFFF) to represent what cannot fit into 16 bits.
Of course, there is a strong suspicion that such an unassigned band was left there after discovering that no space remained to encode what was in the process of being encoded.
In essence, characters are encoded as follows: code points up to 0xFFFF are stored as a single 16-bit unit, while higher code points are stored as a pair of "surrogates" taken from the reserved band.
It is important to note how UTF-8 can be wider than actual Unicode (it can go up to 31 bits), while UTF-16 is stuck at 21 bits as a maximum. It will be interesting to see what inventions will be made if Unicode ever becomes wider than the currently specified 21 bits... (if it ever does).
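A minimal sketch (illustration only, not code from the article's library) of how the surrogate mechanism works:

// Split a UCS code point into UTF-16 units, using the reserved
// 0xD800-0xDFFF band for code points above 0xFFFF.
void to_utf16(unsigned long ucs, wchar_t out[2], int& count)
{
    if (ucs < 0x10000) {                  // fits in a single 16-bit unit
        out[0] = (wchar_t)ucs;
        count = 1;
    }
    else {                                // 0x10000..0x10FFFF: surrogate pair
        unsigned long v = ucs - 0x10000;  // 20 significant bits remain
        out[0] = (wchar_t)(0xD800 + (v >> 10));    // high surrogate
        out[1] = (wchar_t)(0xDC00 + (v & 0x3FF));  // low surrogate
        count = 2;
    }
}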
The Windows operating systems evolved from the original IBM character set (or better, the OEM char-set, since different manufacturers may have differentiated) towards the ANSI char-set and code-pages.
This refers to 8-bit characters and language-dependent encodings, and was used mainly in Win16.
With the coming of Win32, Unicode was adopted, first as pure UCS-2, and then extended to support UTF-16 surrogates.
The API binaries that manipulate characters were doubled and renamed by adding an "A" for "ANSI" and a "W" for "Wide" (for example MessageBoxA and MessageBoxW), the first taking char-based parameters and the second taking wchar_t-based parameters.
A number of preprocessor "magics" are then defined in <tchar.h>, where, depending on the definition of the UNICODE and MBCS preprocessor symbols, the traditional API names are mapped to the corresponding A or W versions.
To take care of the differences various countries and cultures may have in representing numbers, dates, currency, etc., Windows introduced the concept of Locale as a set of information that can be retrieved through APIs and that is user customizable, to help programs adapt to user habits.
Unfortunately this is sometimes misused, causing not only text but also structured data to be represented in localized form on communication and storage media as well, with all the consequent problems of misinterpretation of dates etc. (what date is 11/10 ... or should it be 10/11?).
All this information is stored in the system registry. The OS provides a set of default values for the various countries, but users can override them by providing their own specifications.
For example, Italy uses '.' as thousands separator and ',' as decimal separator.
It is however frequent for Italian users to replace the '.' with an ''' to get numbers like 10'000, less prone to reading errors than 10.000, especially where it is not clear where the number comes from (and hence ... it could be just 10).
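A small sketch (mine, for illustration) of how a program can read those user-customized values through the Win32 locale API instead of assuming the country defaults:

#include <windows.h>
#include <stdio.h>

// Read the (possibly user-overridden) separators from the Windows Locale.
int main()
{
    char thousand[8], decimal[8];
    GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_STHOUSAND, thousand, sizeof(thousand));
    GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SDECIMAL,  decimal,  sizeof(decimal));
    printf("thousand sep = '%s', decimal sep = '%s'\n", thousand, decimal);
    return 0;
}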
Inside Windows, UTF-8 to UTF-16 and UTF-16 to UTF-8 conversions are possible through the WideCharToMultiByte and MultiByteToWideChar functions, by specifying CP_UTF8 as the codepage parameter.
It is anyway a string-to-string conversion, not an encode/decode of a stream.
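For example, a minimal sketch of such a string-to-string conversion (error handling reduced to a minimum; the helper name to_utf8 is mine):

#include <windows.h>
#include <string>

// UTF-16 wstring -> UTF-8 string through the Win32 API, passing CP_UTF8
// as the codepage parameter.
std::string to_utf8(const std::wstring& ws)
{
    if (ws.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), (int)ws.size(),
                                  NULL, 0, NULL, NULL);       // query required size
    std::string s(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), (int)ws.size(),
                        &s[0], len, NULL, NULL);              // actual conversion
    return s;
}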
The C language completely pre-dates both Unicode and Windows and -in fact- does not provide any direct support for Unicode, but a number of library functions have been adapted to take care of internationalization.
In this environment, a number of character-oriented functions like atof gained a corresponding _wtof, and -with the same preprocessor magic as <tchar.h>- a _ttof is defined as one or the other depending on the definition of the UNICODE or MBCS symbols.
What char and wchar_t effectively represent depends on the code-page used, which -together with a "locale"- defines the way numbers are represented and how characters are encoded.
Unfortunately, the way the C library is implemented defines the "locale" characteristics based on a set of static data, selectable with the setlocale function.
Such data has nothing to do with the data provided by the operating system's concept of "Locale", and is not "user customizable" (think of the case of the thousands separator replaced by an '''). There is -however- the possibility to convert a UTF-8 string into UTF-16 with the mbstowcs function, by specifying a locale having a UTF-8 codepage.
That's far easier said than done, since the library documentation is not so generous with this kind of information.
For example, you can discover that an encoding can be specified in fopen, e.g.:
fopen("newfile.txt", "rt+, ccs=<encoding>");
where <encoding> can be "UTF-8", although it's not documented as a standard.
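A hedged sketch of what that looks like with the Microsoft CRT (this is a Microsoft-specific extension, not standard C; the ccs flag puts the stream in Unicode mode, so the wide-character functions must then be used):

#include <stdio.h>

// Microsoft CRT specific: open a text file in Unicode mode with UTF-8 on disk;
// the stream then expects wide-character I/O functions like fwprintf.
int main()
{
    FILE* f = fopen("newfile.txt", "w, ccs=UTF-8");
    if (f) {
        fwprintf(f, L"some Unicode text \x00E8\x00E9\x20AC\n");  /* è é € */
        fclose(f);
    }
    return 0;
}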
But as you move to C++, it is practically impossible to find a similar functionality in fstream-s.
The C++ approach to I/O is based on the "stream" concept. How streams relate to files is not so obvious, since the STL documentation does not provide a plain description of that. You have to read a number of details of a variety of classes before getting a clue about the architecture behind them.
So let's go into the details just enough to understand where the clue is.
These classes have collaborative roles in performing input/output: the imbued locale takes care of the parsing and formatting of numbers (with the num_get and num_put facets) and of the translation of characters from the program's internal representation to and from the external representation (with the codecvt facet). All of that is a set of families of classes that manage different character representations (char or wchar_t) and different kinds of external streams (files or strings).
In their abstract definition, streams are rooted in a virtual ios_base (character type independent), then derived into basic_istream<.> and basic_ostream<.>, and then from those two into basic_iostream<.>, thus giving this hierarchy:
All streams must hold a "buffer" derived from basic_streambuf<.> and a locale, initialized by default as the C++ global locale (which is in turn initialized as the classic "C" locale).
In particular, file streams are nothing more than basic streams initialized with a basic_filebuf<.>, which overrides the basic_streambuf<.> virtual functions to manage file I/O, plus some pass-through functions like open, close, etc.
Similarly, string streams are basic streams initialized with a basic_stringbuf<.>.
The template parameters define the type of "elements" used by the stream internally to the program. In Windows environments they are normally char for the ANSI-oriented character representation and wchar_t for the Unicode (UTF-16) oriented representation.
But something strange happens with file streams: try this:
#include <fstream>
#include <windows.h>
#pragma comment(lib, "kernel32.lib")
#pragma comment(lib, "user32.lib")
#pragma comment(lib, "gdi32.lib")

int main()
{
    std::wofstream fs("testout.txt");
    const wchar_t* txt = L"some Unicode text òàèé€§";
    MessageBoxW(0, txt, L"verify", MB_OK);
    fs << txt << std::flush;
    MessageBoxW(0, fs.good()? L"Good": L"Bad", L"verify", MB_OK);
    return 0;
}
The call to the Unicode MessageBoxW confirms the proper string (it should end with the § symbol and have the Euro glyph as second-to-last).
Here's the dump of txt
from the debugger
0x0041770C  73 00 6f 00 6d 00 65 00 20 00 55 00 6e 00 69 00  s.o.m.e. .U.n.i.
0x0041771C  63 00 6f 00 64 00 65 00 20 00 74 00 65 00 78 00  c.o.d.e. .t.e.x.
0x0041772C  74 00 20 00 f2 00 e0 00 e8 00 e9 00 ac 20 a7 00  t. .ò.à.è.é.¬ §.
That's Unicode represented as UTF-16 in LE form (73-00, in WORD format, is 0x0073, i.e. just the plain 0x73 ('s') of ASCII, while AC-20 is 0x20AC, the Euro symbol €, which cannot be represented as a single byte).
Now look at the output file content with a hex editor. You should get (I used Notepad++ with the Hexedit plug-in):
"000000000 73 6F 6D 65 20 55 6E 69-63 6F 64 65 20 74 65 78 |some Unicode tex|" "000000010 74 20 F2 E0 E8 E9 |t òàèé |"
That's ANSI, with the text apparently truncated at the € symbol (and fs.good() is false).
This -at least- with VC8.
Doing the same test with another compiler (MinGW 3.4.5, I used CodeLite as IDE) with that same source (note that MinGW uses UTF-8 for sources, while VC8 uses ANSI; string literals don't survive, hence a different file is needed), things are even worse:
t1.cpp: In function `int main()':
t1.cpp:9: error: `wofstream' is not a member of `std'
t1.cpp:9: error: expected `;' before "fs"
t1.cpp:10:23: converting to execution character set:
t1.cpp:12: error: `fs' was not declared in this scope
In fact all the wchar_t
related stuff is under a conditional compilation driven by the _GLIBCXX_USE_WCHAR_T
symbol.
Introducing this workaround (essentially defining in the regular std
namespace the missing types) it compiles.
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
    typedef basic_ios<wchar_t>           wios;
    typedef basic_streambuf<wchar_t>     wstreambuf;
    typedef basic_istream<wchar_t>       wistream;
    typedef basic_ostream<wchar_t>       wostream;
    typedef basic_iostream<wchar_t>      wiostream;
    typedef basic_stringbuf<wchar_t>     wstringbuf;
    typedef basic_istringstream<wchar_t> wistringstream;
    typedef basic_ostringstream<wchar_t> wostringstream;
    typedef basic_stringstream<wchar_t>  wstringstream;
    typedef basic_filebuf<wchar_t>       wfilebuf;
    typedef basic_ifstream<wchar_t>      wifstream;
    typedef basic_ofstream<wchar_t>      wofstream;
    typedef basic_fstream<wchar_t>       wfstream;
}
#endif
#endif
But running it still shows fs going bad and no output produced (the file is created, but remains empty).
Debugging shows that the basic_streambuf::xsputn function catches an exception and sets the stream as bad. That exception is produced here:
template<typename _Facet>
inline const _Facet&
__check_facet(const _Facet* __f)
{
    if (!__f)
        __throw_bad_cast();
    return *__f;
}
where the actual type for _Facet is std::codecvt<wchar_t,char,int>. char?! Where does that come from?
Going back to VC, we find this strange note in the documentation of basic_filebuf (the derivation of basic_streambuf for file streams):
Objects of type basic_filebuf are created with an internal buffer of type char * regardless of the char_type specified by the type parameter Elem. This means that a Unicode string (containing wchar_t characters) will be converted to an ANSI string (containing char characters) before it is written to the internal buffer. To store Unicode strings in the buffer, create a new buffer of type wchar_t and set it using the basic_streambuf::pubsetbuf() method. To see an example that demonstrates this behavior, see below.
In essence, it seems nobody wants to say clearly that -independently of what we use in a program- output going to a FILE (yes, the old C FILE: that's what basic_filebuf writes to and reads from, there is no magic behind that) is by default always attempted to be converted into char-s, using the current global locale facets.
In VC8 this happens with a std::codecvt<wchar_t,char,mbstate_t> facet that is part of the default global locale, operating on an internal char buffer (as stated in the note); in MinGW no such facet is declared, and hence the locale cannot provide it (hence the exception).
... And that's the key to getting to UTF-8: let's provide to the basic_filebuf the facet it wants.
Not another facet with its own type and id (locales can support any number of "facets"), since that's not what the buffer class is looking for.
We have to derive from the proper std::codecvt<wchar_t,char,mbstate_t> and -where no such type is defined- we have to define it.
If we assume we are working with MinGW with no wchar_t support enabled, we first have to make the base facet exist.
Since all facets are defined as templates, that's easy:
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
    template<>
    class codecvt<wchar_t, char, mbstate_t>
        : public __codecvt_abstract_base<wchar_t, char, mbstate_t>
    {
    protected:
        explicit codecvt(size_t refs = 0)
            : __codecvt_abstract_base<wchar_t, char, mbstate_t>(refs) {}
    public:
        static locale::id id;
    };

    typedef basic_ios<wchar_t>           wios;
    typedef basic_streambuf<wchar_t>     wstreambuf;
    typedef basic_istream<wchar_t>       wistream;
    typedef basic_ostream<wchar_t>       wostream;
    typedef basic_iostream<wchar_t>      wiostream;
    typedef basic_stringbuf<wchar_t>     wstringbuf;
    typedef basic_istringstream<wchar_t> wistringstream;
    typedef basic_ostringstream<wchar_t> wostringstream;
    typedef basic_stringstream<wchar_t>  wstringstream;
    typedef basic_filebuf<wchar_t>       wfilebuf;
    typedef basic_ifstream<wchar_t>      wifstream;
    typedef basic_ofstream<wchar_t>      wofstream;
    typedef basic_fstream<wchar_t>       wfstream;
}
#endif
#endif
We are defining a specialization of codecvt<InnerType,OuterType,StateType> for <wchar_t,char,mbstate_t>, which is exactly what the compiler is searching for.
And we supply a locale::id static object as required by the STL implementation. This requires a cpp file to instantiate the static object (<rant>I hate globals ...</rant>).
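In that cpp file the instantiation is just the definition of the static member (a minimal sketch, mirroring the conditional compilation of the declaration above):

// In one (and only one) .cpp file: define the static id required by the
// locale machinery for the facet declared in the MinGW workaround above.
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
    locale::id codecvt<wchar_t, char, mbstate_t>::id;
}
#endif
#endif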
At this point we can -both for MinGW and for VC8- derive from codecvt<InnerType,OuterType,StateType>, overriding the virtual functions to implement a wchar_t to char conversion where wchar_t is UTF-16 and char is UTF-8.
This is the codecvt derivation, implemented as a translator UTF-16 <-> UCS <-> UTF-8, using the mbstate_t parameter as a carry between function invocations.
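A skeleton of what such a derivation looks like (a simplified sketch of the article's gel::stdx::utf8cvt; the exact virtual signatures may differ slightly between STL implementations, and the conversion bodies are discussed below):

#include <locale>
#include <cwchar>   // std::mbstate_t

// Sketch only: the facet derives from codecvt<wchar_t,char,mbstate_t> and
// overrides the conversion virtuals; mbstate_t carries the partial
// conversion state between calls.
template <bool bStrict>
class utf8cvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit utf8cvt(size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}
protected:
    // external (UTF-8) -> internal (UTF-16)
    virtual result do_in(state_type& s,
                         const char* from, const char* from_end, const char*& from_next,
                         wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const;
    // internal (UTF-16) -> external (UTF-8)
    virtual result do_out(state_type& s,
                          const wchar_t* from, const wchar_t* from_end, const wchar_t*& from_next,
                          char* to, char* to_end, char*& to_next) const;
    // flush a pending partial sequence
    virtual result do_unshift(state_type& s,
                              char* to, char* to_end, char*& to_next) const;
    virtual bool do_always_noconv() const throw();
    virtual int  do_max_length() const throw();
    virtual int  do_encoding() const throw();
    virtual int  do_length(state_type& s,
                           const char* from, const char* from_end, size_t max) const;
};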
First of all, we have to decide what to do in case of invalid characters: sequences that may be present in the input but that are not valid UTF, or not even valid Unicode codepoints.
According to the Unicode specifications, invalid characters or sequences must be treated as "errors". But what "treated" means is left open to many interpretations.
If our purpose is to validate the input, we will probably want something that makes us aware of something going wrong; but if we are just reading a text, we are probably better served by something that doesn't stop reading just because of a miswritten char.
That's what the bool template parameter is for: if set to true, the implementation has a strict behavior, and upon every reading or writing of erroneous or illegal sequences or characters it throws a gel::stdx::utf_error exception, derived from std::runtime_error.
The STL implementation of basic_streambuf should catch this and set the owner stream as "bad", thus blocking it. The logic is thus no different from the one normally used in regular stream processing.
If the bool parameter is set to false, the Unicode restrictions are relaxed and invalid sequences are processed coherently with the algorithm.
It is thus possible to support up to 28-bit codepoints (we need 4 bits to manage the conversion steps), and to read overlong UTF-8 sequences as if they were "legal".
Overriding codecvt is trivial for at least three functions:
- do_always_noconv always returns false, since a conversion always needs to be done.
- do_max_length returns 6, since this is the longest UTF-8 possibility. Proper Unicode will never produce more than 4 chars, but an arbitrary wchar_t sequence can go over.
- do_encoding always returns -1, since the conversion is state dependent.
More complex is do_length: one would need to almost decode the sequence to know the length of the converted result, so we found it simpler to return a conservative value (min(_Len2, (size_t)(_Last1-_First1))). The consequence is probably a wider buffer allocation by the filebuf classes but -as can be verified by experiment- it seems that neither the MinGW nor the VC8 buffer implementations call this function. A sketch of these overrides follows.
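A minimal sketch of these trivial overrides (bodies for the utf8cvt skeleton shown earlier; the conservative do_length value follows the description above):

// Sketch of the trivial overrides described above.
template <bool bStrict>
bool utf8cvt<bStrict>::do_always_noconv() const throw()
{ return false; }            // a conversion is always required

template <bool bStrict>
int utf8cvt<bStrict>::do_max_length() const throw()
{ return 6; }                // longest possible UTF-8 sequence

template <bool bStrict>
int utf8cvt<bStrict>::do_encoding() const throw()
{ return -1; }               // state dependent: no fixed extern/intern ratio

template <bool bStrict>
int utf8cvt<bStrict>::do_length(std::mbstate_t&, const char* from,
                                const char* from_end, size_t max) const
{
    // Conservative estimate: min(max, bytes available in the input range).
    size_t avail = (size_t)(from_end - from);
    return (int)(max < avail ? max : avail);
}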
UTF-8 (external) to UTF-16 (internal) conversion is done using _Next1 and _Next2 as iterators walking from _First1 and _First2 to _Last1 and _Last2 (could it be otherwise?), incremented as something is read or written. If a buffer end is reached, the function returns partial if there is residual state, or ok otherwise.
The _State value is used as follows:
- bit 31 is a flag indicating that a UCS to UTF-16 conversion is in progress (the second surrogate still has to be emitted);
- bits 28-30 hold the number of chars still needed to complete the character being read: this is the carry of the UTF-8 to UCS conversion;
- bits 0-27 hold the UCS value accumulated so far.
The first part of the function (under if(!(_State & 0x80000000))) is executed to convert UTF-8 into UCS (the if protects against a UCS to UTF-16 conversion still in progress).
The second part (under if(!(_State & 0x70000000))) is executed when no UTF-8 residual exists (hence _State & 0x0FFFFFFF contains a complete UCS) and saves the UCS either as a single 16-bit unit or as a 21-bit surrogate pair (10 + 10 bits + 0x10000), by saving the first part, setting _State bit 31, and, upon that setting, saving the second part on the next loop.
Note that the concept of "next loop" depends on the policy adopted by the basic_streambuf inside the STL implementation.
It may happen inside the main for loop, or it may happen upon a further invocation of the function. Since we use _State as a carry between loops, and the state is allocated and maintained outside the function (it is passed by the caller), we essentially don't need to care about the buffers' residual length.
UTF-16 (internal) to 8 (external) is done similarly.
_State
bit 31 is used as a carry flag indicating a partial reading / writing.
When a complete UCS value has been read (from UTF-16), the first UTF-8 byte is saved. Subsequent bytes are written with a call to unshift, which is also called by the basic_streambuf itself to complete the output when no more input is present.
We need a std::locale in which the codecvt<wchar_t,char,mbstate_t> facet is replaced with a gel::stdx::utf8cvt, to be imbue-d into the stream buffer.
This very cryptic assertion merely means this:
std::locale utf8_locale(std::locale(), new gel::stdx::utf8cvt<true>);
std::wfstream fs;
fs.imbue(utf8_locale);
fs.open(yourfile, mode);
//whatever I/O to the stream
utf8_locale is -in this case- taken from the global locale, with a utf8cvt given to it.
The stream is just whatever wstream (a wfstream in this example, i.e. a basic_fstream<wchar_t>).
Of course, instead of imbue, we can replace the global locale by calling
std::locale::global(utf8_locale);
just after the utf8_locale
declaration.
We can also use a locale derived from another specific locale where appropriate, by simply creating a locale like
std::locale utf8_it(std::locale("It"), new gel::stdx::utf8cvt);
Thus having a UTF-8 Italian locale.
Because of the frequent use of the UTF-8 neutral locale, I declare a gel::stdx::utf8_locale<true> as an always accessible global object.
The "gel" directory contains both the header ("stdutif.h") and the cpp source ("gel.cpp") needed to properly instantiate the global and static variables.
The parent directory contains a tester program ("tester.cpp") and all the stuff to arrange both a VS8 project and a CodeLite project (to also test a MinGW compilation), and the grandparent directory contains the VS8 solution as well as a CodeLite workspace.
The tester program is a console application accepting up to two filenames on the command line (the first as input file and the second as output).
The default value for the input is testerin.txt and for the output is testerout.txt.
This application performs a number of congruence tests on the act of reading and writing with the UTF-8 facet, logging its activity on the standard output (std::cout) and using the ANSI and Unicode Win32 API MessageBox[A/W] to show how the read file is displayed.
The testing program proceeds as follows:
1. The input file is read with a std::ifstream (thus, char based) into a std::string, and displayed by MessageBoxA.
2. The input file is read with a std::wifstream (thus wchar_t based) into a std::wstring using the normal locale, and displayed by MessageBoxW.
3. The input file is read with a std::wifstream imbue-d with a gel::stdx::utf8_locale, displayed by MessageBoxW, and an output file is written as a std::wofstream, imbue-d with a gel::stdx::utf8_locale, with the string just read.
4. The output file is read back (as a std::ifstream) and compared byte by byte to check there are no differences.
As a sample, the file "testerin.txt" is provided in the tester project directory, with the following content:
Tester text: grade paragraph eacute egrave euro °§èé€
The file has been created with Notepad++ and saved as UTF-8 without BOM.
Running the tester program,
step 1 completes with the following message box:
Note that the bytes are read and displayed as ANSI, representing the 5 Unicode characters as 2,2,2,2,3 bytes.
Step 2 completes in the following way:
VC8 version | MinGW version
Note that MinGW reports a blank display for the reason described in the article (the missing default codecvt<wchar_t,char,mbstate_t>), while VC8 attempts an ANSI to Unicode conversion. It keeps the ASCII part safe, but the non-ASCII translation is codepage (hence system) dependent.
Step 3 completes as follows:
That's the correct display of the UTF-8 file.
Step 4, at this point, checks that the correctly read file has also been correctly written, by comparing the two byte sequences:
The entire execution is logged in the console as such:
(Note: if anything were different between the input and output files, the "=" would be replaced by an "x".
Also, note how the UTF-8 reading appears shorter than the ANSI reading, because of the compacting of the 2- and 3-byte characters.)
Debugging the two versions shows that the do_in function is called with a different pattern by the two STL implementations.
VC8 calls the function repeatedly, passing it one byte per call (thus reflecting the while(fs.get(c)) in the tester loop).
MinGW, in turn, calls the function once, supplying the entire sequence of bytes (and letting the function's inner loop do the job).
Finally some recommendations, based on my opinion and experience:
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)