XML Tricks for C# - Unicode
The Unicode Consortium states: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." I like that, and I think you will like it too once you know more about it by the end of this section.
Looking at all these different character sets, including ASCII, Extended ASCII, ANSI (windows-1252), Shift-JIS, ISO-2022-JP, J-EUC and many others, it became clear that some kind of standardization was needed. The Unicode Standard was created to solve the problems caused by these competing character sets, and the Unicode Consortium released version 2.0 of the standard in 1996.
The Unicode Standard provides a single, very large character set that covers the characters of the world's written languages. With it, we don't have to switch from one character set to another when developing international applications. Note also that Unicode is built into almost all common software applications and is fully supported by Windows NT, the Windows 2000 server family, Windows .NET servers, the Windows 2003 family, and Windows XP.
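To make the idea of "a unique number for every character" concrete, here is a minimal C# sketch that prints the Unicode number (the code point) of a few characters. The class name CodePoints and the helper Describe are made up for this illustration; char.ConvertToUtf32 is the standard .NET call that returns the code point.

```csharp
using System;

public static class CodePoints
{
    // Hypothetical helper: returns the code point of the character that
    // starts at `index`, written in the conventional U+XXXX notation.
    public static string Describe(string text, int index)
    {
        int codePoint = char.ConvertToUtf32(text, index);
        return "U+" + codePoint.ToString("X4");
    }

    public static void Main()
    {
        Console.WriteLine(Describe("A", 0));       // Latin capital letter A
        Console.WriteLine(Describe("\u00E9", 0));  // Latin small letter e with acute
        Console.WriteLine(Describe("\u3042", 0));  // Hiragana letter A
    }
}
```

Each character gets the same number regardless of the platform or the language of the surrounding text, which is exactly what the legacy character sets could not guarantee.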
But if Unicode is such an important character set, why don't all vendors support it? Unicode's main strength is that it is one character set with which your application can represent any language, which is a valuable feature. The objection is efficiency: some would say that it is wasteful to use larger 16- to 24-bit characters (needed for the Asian languages) when all I need to program for are the shorter Latin-based characters (which take only 7 or 8 bits). This was a major debate among the folks in the Unicode Consortium. As a result, although Unicode uses the same character set for all the characters it defines, the Unicode Consortium offers three types of encoding.
Every Unicode character is identified by a number (its code point), and an encoding determines how many bytes are used to store (encode) each character. Here are the three types of encoding:
- UTF-32 which uses a single 32-bit unit to encode each character
- UTF-16 which uses one or two 16-bit units to encode each character
- UTF-8 which uses one to four 8-bit units to encode each character
UTF-32 uses 4 bytes to encode every character, so it is rarely used for storing or exchanging text. UTF-8 and UTF-16, on the other hand, are extensively supported, and every XML parser is required to support both. If your text contains many characters that take 3 bytes in UTF-8 (as most Asian characters do), it becomes more efficient to use UTF-16, because each of those characters fits in a single 16-bit unit that can be read in one step. UTF-8 is most efficient when you are storing mostly basic Latin (ASCII) characters, which take only 1 byte each.
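The size trade-off between the three encodings can be checked with a short C# sketch. The class name EncodingSizes, its helpers, and the two sample strings are assumptions made for this illustration; Encoding.UTF8, Encoding.Unicode (which is UTF-16 in .NET) and Encoding.UTF32 are the standard System.Text encodings.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Hypothetical helper: number of bytes needed to store `text`
    // in the given encoding (no byte order mark is counted).
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    private static void Report(string label, string text)
    {
        Console.WriteLine("{0}: UTF-8={1}, UTF-16={2}, UTF-32={3}",
            label,
            ByteCount(text, Encoding.UTF8),
            ByteCount(text, Encoding.Unicode),  // Encoding.Unicode is UTF-16
            ByteCount(text, Encoding.UTF32));
    }

    public static void Main()
    {
        string latin = "Hello";                               // 5 ASCII characters
        string japanese = "\u3053\u3093\u306B\u3061\u306F";   // 5 hiragana characters

        Report("Latin", latin);
        Report("Japanese", japanese);
    }
}
```

For the five-letter Latin string this reports 5 bytes in UTF-8, 10 in UTF-16 and 20 in UTF-32. For the five hiragana characters it reports 15 bytes in UTF-8, 10 in UTF-16 and 20 in UTF-32, which is the case where UTF-16 comes out ahead.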
Now, after this simple introduction, let's get down to encoding with XML.