XML Tricks for C#

In this article, gain knowledge about the difference between elements and attributes in XML, as well as differences in character sets.  The author shows the benefits and drawbacks of using XML components and why you should carefully consider your character set when developing your software. 

March 24, 2004

I find that C# programmers who start their careers with .NET, and have little experience with the Win32 platform or Classic ASP, struggle with XML and its related technologies. If you are one of them, I advise you to spend some time with XML; it is the future. Let's talk about elements vs. attributes.

Should I Be Using Elements or Attributes?

This is a question every developer asks when working with XML. This was one of the biggest topics of debate in the world of XML. Attributes or Elements? I can tell you much of the time they are quite interchangeable, but sometimes it is better to choose one over the other. Here are a few topics that you must think about each time you ask yourself the elements vs. attributes question.

Which is Easier to Work With?

When you use Attributes in your XML Documents you will notice that they are easier to manipulate and to work with.  Here's an example:


<university>
  <school>Information Technology</school>
</university>

Here I used the element <school> inside the <university> element. With elements, you must open the element tag, write your data (PCDATA), then close the tag, which is time-consuming. Imagine that your XML document contains 100 elements; there will be a lot of closing tags. Let's rewrite this fragment using attributes:


<university school="Information Technology"></university>


It's now easier because we don't have the school element with its opening and closing tags. You could even make it better like this:


<university school="Information Technology" />

As you know, university is now an empty element (because there is no data between its opening and closing tags), so we can write it in this short form and save even more text, making our XML document look simpler.
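As a rough illustration of how the two forms look from code, here is a minimal sketch using Python's standard xml.etree.ElementTree module. Python is used only for brevity, and the variable names are illustrative rather than part of the original example. It reads the same value once from a child element and once from an attribute.

```python
import xml.etree.ElementTree as ET

# Element form: the value is the text of a child element.
element_form = ET.fromstring(
    "<university><school>Information Technology</school></university>"
)
school_from_element = element_form.find("school").text

# Attribute form: the value is read straight off the element itself.
attribute_form = ET.fromstring('<university school="Information Technology" />')
school_from_attribute = attribute_form.get("school")

print(school_from_element)
print(school_from_attribute)
```

Both forms carry the same information; the attribute form simply needs one lookup fewer.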

Attributes are Strongly Validated by DTDs.

This is an advantage of attributes over elements. A DTD can constrain attributes more tightly than element content: an element is declared as containing either PCDATA (parsed character data, that is, plain text) or sub-elements, so there is little a DTD can check about the text itself. With attributes you can declare a CDATA type (character data), an ID type, an ENTITY, an option from an enumerated list, or other types.

Use Attributes to Describe Elements' behavior.

The best use of attributes is to describe how an element behaves. In other words, you can use attributes to specify metadata about your data (about your elements). Consider the following example:


<person>
  <name>Michael Youssef</name>
</person>

This fragment describes a person.


<person position="XML Consultant">
  <name>Michael Youssef</name>
</person>

Now the attribute position provides us with data (metadata) about our data (here I mean the element <person>). When I was discussing the above fragment with one developer from Microsoft, he said that the <name> sub-element also stores the metadata because it stores the name of that <person> element.

Note that I used two different methods to store the metadata about the person element:

  1. The sub-element <name> to store the name.
  2. An attribute to store the position of the person element.

But we can write the above code using only attributes:


<person name="Michael Youssef" position="XML Consultant"> </person>

And again, this can be written more compactly as follows:


<person name="Michael Youssef" position="XML Consultant" />

Note that it's up to you to decide how to structure your metadata. You can use whatever best suits the solution you are working on. In general, using attributes to store metadata is more efficient than using elements, unless you have complex data to store and retrieve.

Attributes and Document Complexity

When you write your XML document using elements only, the document will be larger and harder to read than one written with attributes. Let's consider the following XML document:


<?xml version="1.0"?>
<Consultants>
  <Consultant>
    <name>Michael Youssef</name>
    <position>XML Consultant</position>
    <age>21</age>
  </Consultant>
  <Consultant>
    <name>Prakhar Deva</name>
    <position>.NET Consultant</position>
    <age>23</age>
  </Consultant>
</Consultants>

We can rewrite this XML document better by using attributes.


<?xml version="1.0"?>
<Consultants>
  <Consultant position="XML Consultant" age="21">Michael Youssef</Consultant>
  <Consultant position=".NET Consultant" age="23">Prakhar Deva</Consultant>
</Consultants>

Now our document is much more readable and efficient (5 lines instead of 13).

Until now, I've told you about why you should use attributes. Now, let me tell you also why you should use elements.

Use elements when you need a complex structure.

When you build a complex structure you have to consider using elements. Consider the following invalid XML document:


<?xml version="1.0"?>
<members>
  <member phoneNumber="0020123658513" phoneNumber="3331684">Michael Youssef</member>
</members>

Here each member has two phone numbers. This is invalid (strictly speaking, not well-formed) because you can't have two attributes with the same name on one element. It makes more sense to rewrite these phone number attributes as elements, so that the document can clearly record that this person has two phone numbers. Let's rewrite the document again.


<?xml version="1.0"?>
<members>
  <member name="Michael Youssef">
    <phoneNumber>0020123658513</phoneNumber>
    <phoneNumber>3331684</phoneNumber>
  </member>
</members>
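To see the difference from a parser's point of view, here is a small sketch in Python using the standard xml.etree.ElementTree module. The variable names are illustrative, and the attribute name phoneNumber is an assumption chosen so that the repeated attribute is the only problem in the first fragment. The parser should reject the repeated attribute, while repeated child elements are collected without trouble.

```python
import xml.etree.ElementTree as ET

# Attribute version: the same attribute name is repeated, so parsing fails.
as_attributes = (
    '<member phoneNumber="0020123658513" phoneNumber="3331684">'
    "Michael Youssef</member>"
)
try:
    ET.fromstring(as_attributes)
    attributes_parsed = True
except ET.ParseError as err:
    attributes_parsed = False
    print("Rejected:", err)

# Element version: repeated child elements are fine and easy to collect.
as_elements = """<members>
  <member name="Michael Youssef">
    <phoneNumber>0020123658513</phoneNumber>
    <phoneNumber>3331684</phoneNumber>
  </member>
</members>"""
member = ET.fromstring(as_elements).find("member")
phone_numbers = [p.text for p in member.findall("phoneNumber")]
print(member.get("name"), phone_numbers)
```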

Whether to use elements or attributes is up to you, your experience, and the situation. I've shown you some advantages of both, so make sure that you understand them. But if you have a complex structure to build, then you definitely need to use elements. You could also learn a lot by extracting some data from a SQL Server 2000 database into XML documents and seeing how the documents are structured.

Now let's go to our second subject, encoding.

A First Look at Encoding

To adequately cover character encoding, I would have to write a 500+ page book. That's something I can't do right now, so instead I'll teach you the fundamentals, which you can take into account when writing your applications, especially XML documents!

The world of encoding and how the characters are encoded represent a common problem for developers on every level. Even if you don't develop international applications, you still need to understand how the characters are encoded. Because XML has become the best method for describing data, it's now a problem that you must solve when writing XML Documents.

As you know, data is stored in a binary format, 0's and 1's. Characters like A, B and C are stored in that format because computers don't understand anything except binary format. So character encoding defines the way that these binary numbers will map to the actual characters that we know. Here is an example:

Consider the Spanish character ñ (pronounced "Ehnyeh"). A computer stores this character as a binary number. The encoding then maps that number to the character "ñ" that we can read and write. To make the number easier for people to read, it is usually shown in decimal or hexadecimal rather than binary. Retrieving the Spanish character ñ goes like this:

First the computer reads the stored binary value, which we can also write as a decimal number. Then, using the encoding, that number is mapped to the character that we know.

Let's see exactly how the binary and decimal forms line up:

11110001 (data retrieved)---THEN ENCODED TO---> 241 --------> ñ

Now, some of you may ask, "Why does encoding convert the binary to decimal or hexadecimal format?" Actually it's not converting it; it's just a representation of the character. No more.
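The mapping in the diagram above can be reproduced in a few lines of Python. This is only a sketch: it assumes the byte is interpreted as ISO-8859-1 (Latin-1), where code 241 is ñ, and the variable names are illustrative.

```python
# The byte 11110001 from the example, written as a binary literal.
stored_byte = 0b11110001

decimal_value = stored_byte              # the same number, shown in base 10
hex_value = format(stored_byte, "02X")   # the same number, shown in base 16

# Interpreting the byte under ISO-8859-1 (Latin-1) yields the character.
character = bytes([stored_byte]).decode("latin-1")

print(decimal_value, hex_value, character)
```

Nothing is converted along the way: binary, decimal, and hexadecimal are three spellings of one number, and the encoding decides which character that number stands for.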

NOTE: you can write the Spanish character Eñe (ñ) by holding Alt and typing 0241 on the numeric keypad, or by using the character map (Start --> Run --> then type charmap).

The ASCII character set has been very popular for many years. There are two versions in common use:

  1. Standard ASCII uses 7 bits to encode each character. This limits us to 128 characters.
  2. Extended ASCII uses 8 bits to encode each character. This limits us to 256 characters.

There is also the ANSI character set (also called windows-1252), which is derived from ASCII. There is also ISO-8859-1. These are very similar to extended ASCII because they all use 8 bits to encode each character. They differ in the characters they assign to the upper 128 codes. For example, if you are moving characters between platforms (from Windows to Macintosh) the extended characters may have different meanings (the letter ç on Windows will appear as Á). Extended character sets are also referred to as code pages.
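The ç to Á example can be checked with a short Python sketch. It assumes the Windows side uses windows-1252 (Python's cp1252 codec) and the Macintosh side uses the classic Mac OS Roman code page (Python's mac_roman codec); the variable names are illustrative.

```python
# Encode ç as a single byte using the Windows code page.
windows_byte = "ç".encode("cp1252")

# Read the very same byte back using the classic Macintosh code page.
seen_on_mac = windows_byte.decode("mac_roman")

print(windows_byte.hex(), seen_on_mac)
```

The byte itself never changes; only the code page used to interpret it does.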

NOTE: The first 128 character codes of any of the ASCII character sets are always identical to those of the ISO 8859 and ANSI character sets.

In fact, Windows lets you choose among a number of character sets. If you don't choose one, it falls back to the predefined default. The Notepad application uses ANSI as its default encoding. When you save your XML document without specifying any encoding, you are saving with the default ANSI character set, which works fine for Latin characters but doesn't include Japanese, Chinese, or other non-Latin languages. Japanese alone has three major standards, Shift-JIS, ISO-2022-JP, and J-EUC, all different from each other. Encodings in these character sets use two or three bytes for each character. If you think about it, you will say it's challenging to develop applications that use such character sets. But wait, there is Unicode!
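Before moving on, the limitation described above can be checked with a short Python sketch. The sample text and variable names are illustrative; cp1252 and shift_jis are the names of Python's codecs for windows-1252 and Shift-JIS.

```python
japanese_text = "日本語"

# Windows-1252 ("ANSI") has no codes for these characters.
try:
    japanese_text.encode("cp1252")
    fits_in_ansi = True
except UnicodeEncodeError:
    fits_in_ansi = False

# Shift-JIS can encode them, using two bytes per character here.
shift_jis_bytes = japanese_text.encode("shift_jis")

print(fits_in_ansi, len(shift_jis_bytes))
```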

Unicode

The Unicode Consortium states "Unicode provides a unique number for every character, no matter what the platform, no matter what the program and no matter what the language."  I like that, and I think you will like it too when you know more about it from the next section.

Looking at all these different character sets (ASCII, Extended ASCII, ANSI (windows-1252), Shift-JIS, ISO-2022-JP, J-EUC and many others), it became clear that some kind of standardization was needed. The problems of these different character sets were solved in 1996 when the Unicode Consortium released version 2.0 of the Unicode Standard.

The Unicode Standard provides us with one single huge character set that covers all the characters of the languages of the world. With this we don't have to switch from one character set to another when developing our international applications. You should also know that Unicode is built into almost all the common software applications and fully supported by Windows NT, the Windows 2000 server family, Windows .NET servers, the Windows 2003 family, and Windows XP.

But if Unicode is such an important character set, why don't all the vendors support it? The short answer is that Unicode is one character set that you can use in your application. Using it you can represent any language, which is a valuable feature. Some would say that efficiency is sacrificed using a character set with larger 16 to 24 bit characters (for the Asian languages), when all I need to program for are the shorter Latin-based characters (which take only 7 or 8 bits). This was a major debate with the folks in the Unicode Consortium. Although Unicode uses the same character set for storing all the known (and unknown) characters, the folks in Unicode Consortium offer three types of encoding. 

Unicode is a multi-byte character set (MBCS) and it uses a number of bytes to store (encode) each character. Here are the 3 types of encoding:

  • UTF-32 which uses a single 32-bit unit to encode each character
  • UTF-16 which uses one or two 16-bit units to encode each character
  • UTF-8 which uses one to four 8-bit units to encode each character

UTF-32 uses 4 bytes to encode every character, so it is rarely used by software applications for storing or exchanging documents. UTF-16 and UTF-8, on the other hand, are extensively supported, and every XML parser is required to support them. If a document contains many characters that need more than 2 bytes in UTF-8, it starts to be more efficient to use UTF-16, because most of those non-Latin characters fit in a single 16-bit unit that can be read in one step. UTF-8 is at its most efficient when you are storing only Latin characters.
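The size trade-off between the three encodings can be measured directly. The following Python sketch encodes four sample characters (chosen here for illustration) and counts the bytes each encoding needs. It uses the explicit little-endian codecs so that no byte order mark is added to the counts.

```python
# A Latin letter, an accented letter, a CJK character,
# and a character outside the 16-bit range (U+1F600).
samples = ["A", "ñ", "日", "\U0001F600"]

# Use the explicit little-endian codecs so no byte order mark is counted.
byte_counts = {
    ch: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    for ch in samples
}

for ch, (utf8, utf16, utf32) in byte_counts.items():
    print(f"U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16: {utf16}  UTF-32: {utf32}")
```

UTF-8 is smallest for the Latin letter, UTF-16 is smallest for the CJK character, and UTF-32 always costs four bytes.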

Now, after this simple introduction, let's get down to encoding with XML.

Encoding with XML

When you write an XML document, by default, you will use the ANSI character set because editors like Notepad save documents using the ANSI character set.

However, when an XML parser parses the document, it has a built-in mechanism for working out the format and how to interpret the characters. One thing it looks for is the Byte Order Mark (BOM). When a file is saved, a BOM may be inserted at the beginning of the file to indicate the encoding. On Windows, the default is Windows-1252 (where all Latin characters are supported), so when you save a file using the default encoding there will be no BOM. If you save the file as Unicode, a BOM is inserted at the start of the file.

You will not actually see these BOM characters in most editors, because editors that understand Unicode strip out header information that the viewer is not supposed to see. How then does an XML parser read these documents and ensure that it parses and outputs the correct character interpretations? When an XML parser reads an XML file, the W3C defines the following three rules to decide how the document should be read:

  1. If there is a BOM, the BOM defines the file encoding
  2. If there is no BOM, then the encoding attribute in the XML declaration is definitive
  3. If there are neither of these, then assume the XML document is UTF-8 encoded

Of course, if the BOM is incorrect, then it is likely that the XML file won't be parsed correctly and the parser will throw an error. Likewise, if there is no BOM and no declared encoding, the default UTF-8 is assumed, and if the document is not actually UTF-8 encoded an error will be thrown. This should really not be a surprise; how can the parser decode characters when its idea of the encoding is completely wrong? As I said before, the first 128 characters of Unicode are the same as those of ASCII. So if your file consists only of these characters you will be fine. However, if you include extended characters beyond the first 128, such as ñ and ç, you will run into difficulties.
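The three rules above can be sketched in Python as follows. This is a simplified illustration, not how any particular XML parser is implemented: guess_xml_encoding is a hypothetical helper, it checks only the UTF-8 and UTF-16 byte order marks, and it reads the declaration with a simple regular expression.

```python
import codecs
import re

# Byte order marks checked by this sketch (UTF-32 marks are left out for brevity).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding attribute inside an XML declaration at the start of the data.
DECLARED = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper that applies the three rules in order."""
    # Rule 1: a BOM, if present, decides.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Rule 2: otherwise use the encoding attribute of the XML declaration.
    match = DECLARED.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Rule 3: otherwise assume UTF-8.
    return "utf-8"


with_bom = codecs.BOM_UTF8 + b"<name>Michael Youssef</name>"
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Michael Youssef</name>'
bare = b"<name>Michael Youssef</name>"

print(guess_xml_encoding(with_bom), guess_xml_encoding(declared), guess_xml_encoding(bare))
```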

I'd like now to address a big problem with XML documents that we create. When writing XML documents, you can use the encoding attribute to specify the encoding character set that you use for your document. This is very confusing for beginners.

First, open Notepad and write the following simple XML document:


<?xml version="1.0" encoding="UTF-8"?>
<name>Michael Youssef</name>

Then save it using the File -> Save dialog box.

Note the Encoding drop-down list, from which you can choose one of a few character sets.

Save the file with the default (the ANSI character set). Now how does the XML parser decide which encoding the file uses? Look above at Rule #2: since there is no BOM, the XML parser uses the encoding named in the encoding attribute. That is UTF-8, and the parser will read these characters without trouble because, as we've learned by now, the first 128 characters are the same in all of these character sets.
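This scenario can be simulated in Python without Notepad by encoding the document text as windows-1252 (the cp1252 codec) and handing the raw bytes to the standard xml.etree.ElementTree parser. It is only a sketch: the second name, containing ñ, is a made-up example added to show where the mismatch starts to matter.

```python
import xml.etree.ElementTree as ET

declaration = '<?xml version="1.0" encoding="UTF-8"?>'

# Only characters from the first 128: the ANSI bytes are also valid UTF-8.
ascii_only = (declaration + "<name>Michael Youssef</name>").encode("cp1252")
parsed_name = ET.fromstring(ascii_only).text

# A made-up name containing ñ: ANSI writes the single byte 0xF1,
# which is not a valid UTF-8 sequence in this position.
with_extended = (declaration + "<name>Señor Youssef</name>").encode("cp1252")
try:
    ET.fromstring(with_extended)
    extended_parsed = True
except ET.ParseError:
    extended_parsed = False

print(parsed_name, extended_parsed)
```

The first document parses because its bytes mean the same thing in both encodings. The second does not, because the declared encoding and the bytes actually saved disagree.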

© 2003-2012 by Developer Shed. All rights reserved.