XML Tricks for C# - A First Look at Encoding
To cover character encoding adequately, I would need to write a 500+ page book. That's something I can't do right now, so instead I'll teach you the fundamentals, which you can take into account when writing your applications, especially XML documents.
Character encoding is a common problem for developers at every level. Even if you don't develop international applications, you still need to understand how characters are encoded. Because XML has become one of the most widely used formats for describing data, it's a problem you must solve when writing XML documents.
As you know, data is stored in a binary format: 0s and 1s. Characters like A, B and C are stored in that format too, because computers don't understand anything else. A character encoding defines how these binary numbers map to the actual characters that we know. Here is an example:
Consider the Spanish character ñ (pronounced "EH-nyeh"). A computer stores this character as a binary number. The encoding then maps that number to the "ñ" that we can read or write. To make the number easier for people to read, we usually write it in decimal or hexadecimal notation instead of binary. The process of printing (or retrieving) the Spanish character ñ goes like this:
First the computer reads the stored binary value, which we can also write as a decimal number. Then, using the encoding, it looks up the character that this number stands for.
Let's see exactly how the binary and decimal forms line up:
11110001 (data retrieved) ---WRITTEN IN DECIMAL---> 241 ---MAPPED BY THE ENCODING---> ñ
Now, some of you may ask, "Why does the encoding convert the binary to decimal or hexadecimal format?" Actually, it doesn't convert anything: 11110001, 241 and F1 are just different notations for the same stored value. The encoding only decides which character that value stands for (ñ in both ISO-8859-1 and windows-1252).
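The byte-to-character mapping for ñ can be sketched in C# as follows. This is a minimal illustration, not a definitive implementation: the class and method names are hypothetical, and it assumes the value 241 is interpreted as ISO-8859-1, which the text above does not state explicitly.

```csharp
using System;
using System.Text;

static class EnyeDemo
{
    // Parses a string of 0s and 1s into the single byte it represents.
    public static byte ParseBinary(string bits) => Convert.ToByte(bits, 2);

    // Maps one byte to a character using ISO-8859-1 (an assumption made
    // for this sketch; other 8-bit encodings may map 241 differently).
    public static string DecodeLatin1(byte value) =>
        Encoding.GetEncoding("iso-8859-1").GetString(new[] { value });

    static void Main()
    {
        byte stored = ParseBinary("11110001");
        Console.WriteLine(stored);                 // decimal notation: 241
        Console.WriteLine(stored.ToString("X2"));  // hexadecimal notation: F1
        // The decoded string is "ñ"; how it is displayed depends on the
        // console's own output encoding.
        Console.WriteLine(DecodeLatin1(stored));
    }
}
```

The first two lines print the same stored value in two notations; only the last call involves an encoding at all.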
NOTE: On Windows, you can type the Spanish character eñe (ñ) by holding down Alt and typing 0241 on the numeric keypad, or by using the Character Map (Start --> Run --> then type charmap).
The ASCII character set has been very popular for many years. It is commonly found in two forms:
- Standard ASCII uses 7 bits to encode each character. This limits it to 128 characters.
- Extended ASCII uses 8 bits to encode each character. This limits it to 256 characters.
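To see the 7-bit limit in practice, the sketch below encodes the same two-character string with .NET's built-in ASCII encoding and with an 8-bit character set. The class and method names are hypothetical, and ISO-8859-1 is used here only as one example of an 8-bit set.

```csharp
using System;
using System.Text;

static class AsciiLimitDemo
{
    // Encodes text as 7-bit ASCII. By default .NET substitutes '?' (0x3F)
    // for any character that ASCII cannot represent.
    public static byte[] ToAscii(string text) => Encoding.ASCII.GetBytes(text);

    // Encodes text with an 8-bit character set (ISO-8859-1 in this sketch).
    public static byte[] ToLatin1(string text) =>
        Encoding.GetEncoding("iso-8859-1").GetBytes(text);

    static void Main()
    {
        string text = "A\u00F1";  // "A" followed by "ñ"
        Console.WriteLine(BitConverter.ToString(ToAscii(text)));   // 41-3F
        Console.WriteLine(BitConverter.ToString(ToLatin1(text)));  // 41-F1
    }
}
```

The letter A gets the same code (0x41) in both cases, while ñ has no 7-bit code and is silently replaced by a question mark.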
There is also the ANSI character set (also called windows-1252), which is derived from ASCII. ISO-8859-1 is very similar to windows-1252 and to extended ASCII because they all use 8 bits to encode each character; they differ in which characters they assign to the upper 128 codes. For example, if you move text between platforms (from Windows to Macintosh), the extended characters may have different meanings (the letter ç on Windows will appear as Á). Extended character sets are also referred to as code pages.
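The ç versus Á mismatch can be reproduced by decoding one byte with two code pages. This is a sketch under stated assumptions: it targets .NET Core 3.0 or later (or .NET 5+), where legacy code pages must be registered through `CodePagesEncodingProvider`, and it takes code page 10000 as the classic Macintosh (Mac OS Roman) character set. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    // Decodes a single byte using the given Windows code page number.
    public static string Decode(byte value, int codePage)
    {
        // Legacy code pages are opt-in on .NET Core / .NET 5+.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }

    static void Main()
    {
        const byte stored = 0xE7;
        Console.WriteLine(Decode(stored, 1252));   // windows-1252 gives "ç"
        Console.WriteLine(Decode(stored, 10000));  // Mac OS Roman gives "Á"
    }
}
```

The stored byte never changes; only the code page used to interpret it does.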
NOTE: The first 128 character codes (0 to 127) of ISO-8859-1 and of the ANSI character set are always identical to 7-bit ASCII.
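The note above can be checked with a short program that decodes every code from 0 to 127 with each character set and compares the results. It is a rough sketch: the helper name is hypothetical, and loading windows-1252 assumes .NET Core 3.0 or later with `CodePagesEncodingProvider` registered.

```csharp
using System;
using System.Linq;
using System.Text;

static class SharedRangeDemo
{
    // Returns true when both encodings decode every byte in 0..127 identically.
    public static bool LowerHalfMatches(Encoding a, Encoding b)
    {
        byte[] codes = Enumerable.Range(0, 128).Select(i => (byte)i).ToArray();
        return a.GetString(codes) == b.GetString(codes);
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding ansi = Encoding.GetEncoding(1252);

        Console.WriteLine(LowerHalfMatches(Encoding.ASCII, latin1));  // True
        Console.WriteLine(LowerHalfMatches(Encoding.ASCII, ansi));    // True
    }
}
```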
In fact, Windows lets you choose from a number of character sets. If you don't choose one, it falls back to a predefined default. The Notepad application uses ANSI as its default encoding. When you save your XML document without specifying an encoding, you are saving it with the default ANSI character set, which covers Latin-based Western European languages but doesn't include Japanese, Chinese, or other non-Latin scripts. Japanese alone has three major encodings, Shift-JIS, ISO-2022-JP, and EUC-JP, all different from each other. These encodings can use two or three bytes for a single character. If you think about it, you'll agree that developing applications that handle such character sets is challenging. But wait, there is Unicode!
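Before moving on, here is one way to avoid relying on an editor's default when producing XML from C#: state the encoding explicitly so the writer records it in the XML declaration. This matters because an XML parser that finds no encoding declaration (and no byte order mark) assumes UTF-8. The sketch below is illustrative only; the class name, the method name and the `word` element are hypothetical, and ISO-8859-1 is just an example choice.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class XmlEncodingDemo
{
    // Serializes a one-element document to bytes using the requested encoding.
    public static byte[] Write(string text, Encoding encoding)
    {
        var settings = new XmlWriterSettings { Encoding = encoding };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                // "word" is a hypothetical element name used for illustration.
                writer.WriteElementString("word", text);
            }
            return stream.ToArray();
        }
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] bytes = Write("a\u00F1o", latin1);  // the Spanish word "año"

        // Decode with the same encoding to inspect the declaration and content.
        Console.WriteLine(latin1.GetString(bytes));
    }
}
```

With this setup the declaration names the encoding that was actually used to produce the bytes, so a parser reading the file does not have to guess.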
More By Michael Youssef