XML Tricks for C# - Encoding with XML
When you write an XML document in a plain text editor, you will often end up using the ANSI character set, because editors such as older versions of Notepad save documents as ANSI by default.
However, an XML parser needs a way to know how the file is encoded so that it can interpret the characters correctly. One mechanism it relies on is the Byte Order Mark (BOM). When a file is saved, a BOM may be inserted at the beginning of the file to indicate the encoding. On Windows, the default ANSI encoding is typically Windows-1252 (which covers the Western European Latin characters), so when you save a file using the default encoding there will be no BOM. If you save the file as Unicode, a BOM is inserted at the start of the file.
You will not see the BOM in most editors, because they understand Unicode and hide header information that the viewer is not supposed to see. How, then, does an XML parser read these documents and make sure that it interprets the characters correctly? When an XML parser reads an XML file, it follows three rules, defined by the W3C, to decide how the document should be read:
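To make the BOM less abstract, here is a minimal sketch that prints the BOM bytes .NET associates with a few encodings. The class and method names (BomDemo, DescribeBom) are made up for this illustration; GetPreamble is the standard .NET method that returns the bytes an encoding writes at the start of a file.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Returns the BOM (preamble) of an encoding as hex, or "(none)" if it has no BOM.
    public static string DescribeBom(Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        return bom.Length == 0 ? "(none)" : BitConverter.ToString(bom);
    }

    public static void Main()
    {
        // "Unicode" in Notepad's Save dialog corresponds to UTF-16 little-endian.
        Console.WriteLine("UTF-16 LE: " + DescribeBom(Encoding.Unicode));          // FF-FE
        Console.WriteLine("UTF-16 BE: " + DescribeBom(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine("UTF-8:     " + DescribeBom(Encoding.UTF8));             // EF-BB-BF
        // ASCII has no BOM, just like a file saved as ANSI.
        Console.WriteLine("ASCII:     " + DescribeBom(Encoding.ASCII));            // (none)
    }
}
```

These are the bytes a parser looks for at the very start of the file before it reads anything else.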
- If there is a BOM, the BOM defines the file encoding
- If there is no BOM, then the encoding attribute in the XML declaration is definitive
- If there is neither, then assume the XML document is UTF-8 encoded
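The three rules can be sketched in code. This is only an illustration, not how any real parser is implemented: the XmlEncodingSniffer class, the GuessEncodingName helper, and the labels it returns are all hypothetical, and the sketch deliberately ignores cases such as UTF-32 BOMs or UTF-16 files saved without a BOM.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class XmlEncodingSniffer
{
    // Hypothetical helper: returns a label for the encoding the three rules would select.
    public static string GuessEncodingName(byte[] data)
    {
        // Rule 1: if there is a BOM, the BOM decides.
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return "utf-8";
        if (StartsWith(data, 0xFF, 0xFE)) return "utf-16LE";
        if (StartsWith(data, 0xFE, 0xFF)) return "utf-16BE";

        // Rule 2: no BOM, so look for encoding="..." in the XML declaration.
        // The declaration uses only ASCII characters, so reading the first
        // bytes as ASCII is enough to find it in this simplified sketch.
        string head = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 200));
        Match m = Regex.Match(head, @"^<\?xml[^>]*?encoding\s*=\s*[""']([A-Za-z0-9._\-]+)[""']");
        if (m.Success) return m.Groups[1].Value;

        // Rule 3: neither a BOM nor a declared encoding, so assume UTF-8.
        return "utf-8";
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] declared = Encoding.ASCII.GetBytes(
            "<?xml version=\"1.0\" encoding=\"windows-1252\"?><name/>");
        byte[] bare = Encoding.ASCII.GetBytes("<name/>");

        Console.WriteLine(GuessEncodingName(declared)); // windows-1252
        Console.WriteLine(GuessEncodingName(bare));     // utf-8
    }
}
```

In practice you would let the .NET XML classes do this detection for you; the sketch only shows the order in which the checks are made.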
Of course, if the BOM is incorrect, then the XML file is unlikely to be parsed correctly and the parser will throw an error. Equally, if there is no BOM and no declared encoding, so the default of UTF-8 is assumed, but the document is not actually UTF-8 encoded, an error will be thrown. Neither case should really be a surprise; how can the parser decode characters when it has been told the wrong encoding? As I said before, the first 128 characters of Unicode are the same as those of ASCII. So if your file consists only of these characters, you will be fine. However, if you include characters outside the ASCII range (code 128 and above), such as ñ and ç, you will run into difficulties, because ANSI and UTF-8 represent them with different bytes.
I'd now like to address a common problem with the XML documents that we create. When writing an XML document, you can use the encoding attribute of the XML declaration to specify the character encoding of your document. This is very confusing for beginners.
Suppose you open Notepad and type the following simple XML document:
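The following sketch illustrates the problem with a character such as ñ. It builds the same small document twice from raw bytes: once with ñ as the single byte 0xF1 (its Windows-1252 value) and once with ñ as the two bytes 0xC3 0xB1 (its UTF-8 form). Neither document has a BOM or an encoding declaration, so the parser assumes UTF-8. The class and method names are invented for this example, and the expectation that the first document is rejected with an XmlException is based on how the .NET parser normally reacts to bytes that are not valid UTF-8.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

public static class MisdecodedXmlDemo
{
    // Builds the bytes of <name>Espa?a</name>, where ? is replaced by the
    // raw bytes supplied for the letter ñ.
    public static byte[] BuildDocument(byte[] enyeBytes)
    {
        byte[] prefix = Encoding.ASCII.GetBytes("<name>Espa");
        byte[] suffix = Encoding.ASCII.GetBytes("a</name>");
        return prefix.Concat(enyeBytes).Concat(suffix).ToArray();
    }

    // Returns true if the bytes load as XML, false if the parser rejects them.
    public static bool TryParse(byte[] data)
    {
        try
        {
            var doc = new XmlDocument();
            using (var stream = new MemoryStream(data))
            {
                doc.Load(stream);
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] ansiStyle = BuildDocument(new byte[] { 0xF1 });       // ñ as saved in ANSI
        byte[] utf8Style = BuildDocument(new byte[] { 0xC3, 0xB1 }); // ñ as saved in UTF-8

        Console.WriteLine("ANSI bytes parsed:  " + TryParse(ansiStyle));
        Console.WriteLine("UTF-8 bytes parsed: " + TryParse(utf8Style));
    }
}
```

If you remove the ñ from the document, both variants become byte-for-byte identical, which is why plain ASCII text never triggers the problem.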
<?xml version="1.0" encoding="UTF-8"?>
<name>Michael Youssef </name>
Then save it using the File -> Save dialog box.
Note the Encoding drop-down list, from which you can choose one of a few character encodings.
Save the file with the default (ANSI character set). Now how does the XML parser decide which encoding the file was saved in? Look above at rule #2: because there is no BOM, the XML parser will use the encoding named in the encoding attribute. That is UTF-8, and the parser will read the characters without any problem because, as we've by now learned, the first 128 characters are encoded identically in ANSI and UTF-8, and this document uses only those characters.
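You can check this claim with a short sketch. It assumes Encoding.ASCII as a stand-in for Notepad's ANSI encoding, which is a fair substitute only because the document contains nothing but characters below 128; the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

public static class AsciiSubsetDemo
{
    public const string Xml =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<name>Michael Youssef</name>";

    // True when the text produces exactly the same bytes in ASCII and in UTF-8 (no BOM).
    public static bool SameBytesInBothEncodings(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = new UTF8Encoding(false).GetBytes(text);
        return ascii.SequenceEqual(utf8);
    }

    // Loads the document from raw bytes and returns the text of the root element.
    public static string ReadRootText(byte[] data)
    {
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(data))
        {
            doc.Load(stream);
        }
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // The bytes a BOM-less ANSI save would produce for this ASCII-only document.
        byte[] savedAsAnsi = Encoding.ASCII.GetBytes(Xml);

        Console.WriteLine(SameBytesInBothEncodings(Xml)); // True
        Console.WriteLine(ReadRootText(savedAsAnsi));     // Michael Youssef
    }
}
```

The parser trusts the declared UTF-8 and succeeds here only because the saved bytes happen to be valid UTF-8 as well.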