Working with Regular Expressions in C#

Every now and then during our software development adventures we face situations where, if we are familiar with Regular Expressions, we could save ourselves a few minutes, if not more, of precious time. RegExes offer flexibility to identify, work with, and search for particular patterns or characters in strings. In this tutorial we are going to learn how to take advantage of Regular Expressions in C#.

August 19, 2008

Regular expressions are widely used in many areas of computer science. Due to their popularity and usefulness, regular expression processors are now incorporated almost everywhere, and the .NET Framework is no exception. However, to get the most out of RegExes, you need to understand their formal language, which is largely shared across interpreters.

Interpreters are built into most programming languages and text editors. They are all based on roughly the same formal language, which is very concise and can look complex at times. If you understand the basics and get some practice, however, you won’t have any trouble recognizing the patterns, because the same few building blocks recur again and again. Contrary to our first impressions, they do make sense.

When, how, and why exactly should we work with regular expressions? Well, a regular expression processor (or interpreter) does all of the work for you, as long as you know how to communicate with it, and that communication happens in the formal language mentioned above. The processor works based on patterns: if you can phrase your requirements in this concise and formal syntax, the processor will gladly help you out.

As a result, this article shows you how to invoke the RegEx interpreter from the .NET Framework and how to do this appropriately in C#. After that it will focus mostly on examples, lots of examples. I will also include a table of the syntax that contains the most frequently used regular expression metacharacters. We will start with the easier examples and slowly move towards more complex ones.

Additionally, we won’t neglect the classes that are already included in the namespace we are going to use. Each of them has lots of useful built-in methods that can help with routine coding tasks. In short, we need to learn two things from this article: how to work with RegExes and how to write patterns. Once we are comfortable with both, all that is left to do is practice, and a lot of it. Enjoy!

Working with RegExes

Decades ago people in general were getting familiar with the wildcards in the MS-DOS operating system. It wasn’t unusual to see their use pretty frequently in everyday commands such as: “copy *.exe a:” and so forth. This equaled copying all of the executable files (ending in .exe format) to the a: floppy disk drive. Now, regular expressions are much like these nifty wildcards, but they are a lot more powerful.

You have a lot of flexibility when using regular expressions because you can format practically any sort of pattern if you are familiar with it and know its exact and correct syntax. Surely, you may say that glancing over the table of syntax isn’t that hard; that’s true, but that doesn’t guarantee that you will also learn the little details of their usage. In this article we’re going to use the .NET framework’s RegEx processor in C#.

Nowadays, working with the regular expressions in .NET languages has been made easy because the assembly is merged into the main framework. Therefore, all you need to do is include the necessary System.Text.RegularExpressions namespace with the help of the “using” statement at the beginning of your application.

There are a couple of ways to work with regular expressions. Deciding on which route to follow depends only on what you want to accomplish using RegExes. For example, you can use them to check the existence of a specific pattern in a particular source string, or you may want to search and store the matches in a collection, or who knows, you may want to validate the syntax of an e-mail address (anything@blah.net).

As a result, you not only need to know the syntax of regular expressions to “talk” with the RegEx processor, you also want to know the pre-existing classes in the said namespace. Chances are, you may face situations where you need to code a bit “more” than just matching strings and all that. Surely, memorizing the table of RegExes helps in the case of UNIX system administrators… grep does wonders, you know.

Okay, I was getting ahead of myself. Let’s get down to work. In this section we’ll create some basic applications that work with Regular Expressions. You will find C# code snippets here. And in the next section I’ll present to you the periodic table of the RegEx syntax; it’s not really periodic… but it looks quite cryptic for beginners at first glance.

First things first: there’s the Regex.IsMatch() method. The static overload used here takes three parameters: string input, string pattern, and RegexOptions options. This method indicates whether the specified pattern can be found inside the source string, and it returns a plain boolean (true or false). The last parameter is optional, since a two-parameter overload exists as well, but it is useful in cases where, for example, the case must be ignored; this can be done using the following RegexOptions value:

System.Text.RegularExpressions.RegexOptions.IgnoreCase

Let’s enumerate a few of the RegexOptions:

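  • IgnorePatternWhitespace – ignores un-escaped white space in the pattern and allows comments introduced with #.
  • IgnoreCase – matches without regard to letter case.
  • CultureInvariant – ignores cultural differences in language when comparing characters (mostly relevant together with IgnoreCase).
  • Singleline – changes the mode to single-line (the . then also matches the new line character).
  • RightToLeft – the search moves through the input string from right to left instead of from left to right.
  • Multiline – changes the mode to multi-line (^ and $ also match at the beginning and end of each line).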
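To make IsMatch and the options above more tangible, here is a minimal sketch. The class name IsMatchDemo, the helper Contains, and the sample strings are assumptions made up for illustration; only Regex.IsMatch and RegexOptions come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsMatchDemo
{
    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    public static bool Contains(string input, string pattern, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.IsMatch(input, pattern, options);
    }

    public static void Main()
    {
        string input = "Welcome to Dev Shed";

        Console.WriteLine(Contains(input, "dev", false)); // False: the case differs
        Console.WriteLine(Contains(input, "dev", true));  // True: the case is ignored

        // Without Multiline, ^ matches only at the very start of the input;
        // with it, ^ and $ also match at line breaks.
        string lines = "first\nsecond";
        Console.WriteLine(Regex.IsMatch(lines, "^second$"));                         // False
        Console.WriteLine(Regex.IsMatch(lines, "^second$", RegexOptions.Multiline)); // True
    }
}
```

Several options can be combined with the | operator, for example RegexOptions.IgnoreCase | RegexOptions.Multiline.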

Working with RegExes, continued

Moving on, you have the Regex.Replace() method. It replaces the occurrences of the specified regular expression pattern in the input string. The method has several overloads, so there are various ways to use it. The simplest is the instance overload with two parameters: the first is the input string, while the second is the replacement string (the pattern itself was already passed to the Regex constructor). The static overload takes the pattern as an additional parameter.

Check out the following example that changes all of the @’s to [at]’s in the input string.

// Replace every run of one or more @ characters with "[at]".
string pattern = @"@+";
Regex rgx = new Regex(pattern);
string input = "stevenpriest@corporation.net";
string output = rgx.Replace(input, "[at]");
Console.Write("Modified string: " + output);
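For comparison, the same replacement can be sketched with the static Regex.Replace overload, which takes the pattern as its second argument. The class ReplaceDemo and the helper MaskAtSigns are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Hypothetical helper: replaces each run of @ characters with "[at]".
    public static string MaskAtSigns(string input)
    {
        // Static overload: input, pattern, replacement.
        return Regex.Replace(input, "@+", "[at]");
    }

    public static void Main()
    {
        Console.WriteLine(MaskAtSigns("stevenpriest@corporation.net"));
        // prints: stevenpriest[at]corporation.net

        // Because of the + quantifier, a run of several @ signs becomes a single "[at]".
        Console.WriteLine(MaskAtSigns("a@@b")); // prints: a[at]b
    }
}
```

The static form is handy for one-off calls; creating a Regex instance, as in the snippet above, is the more natural choice when the same pattern is reused many times.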

For the entire list of Regex methods just check out MSDN’s documentation. But keep in mind that Regex is a class and it has more than a dozen methods. Needless to say, the System.Text.RegularExpressions namespace has lots of classes, not only Regex. Each one of them is listed in MSDN’s documentation.

Note: The at (@) sign is used right before the first quotation mark when specifying the regular expression pattern because we don’t want the C# compiler to interpret the backslash (\) as an escape character.
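A small sketch of what the note means in practice: the same pattern written once as a regular string literal and once as a verbatim (@) literal. The class name VerbatimDemo and the sample pattern are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // The same pattern written twice: digits, a literal dot, digits.
    public const string Regular = "\\d+\\.\\d+";  // regular literal: every backslash is doubled
    public const string Verbatim = @"\d+\.\d+";   // verbatim literal: backslashes are written as-is

    public static void Main()
    {
        Console.WriteLine(Regular == Verbatim);                     // True: both are the same string
        Console.WriteLine(Regex.IsMatch("Price: 19.99", Verbatim)); // True
    }
}
```

Both forms produce an identical pattern; the verbatim form is simply easier to read and harder to get wrong.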

The really important classes are Match (along with MatchCollection), Group (along with GroupCollection), Capture (with its CaptureCollection), and of course, Regex. The collection classes are used when sequences of captured strings are required. For example, you may want to store all of the matches of a pattern, and then you’ll use the MatchCollection. Here’s an example:

string pattern = @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b";

MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);

This example searches for all the strings that look like email addresses. The pattern is a practical approximation rather than a full implementation of RFC 2822: the part before the @ may contain the letters A-Z, the digits 0-9, and the special characters . _ % + -, the domain may contain letters, digits, dots, and hyphens, and it must end with a dot followed by an A-Z suffix that is 2, 3, or 4 characters long (such as .tw, .co, .edu, .com, .net, .org, .biz, .mil, .gov, .info, etc.). Thanks to RegexOptions.IgnoreCase, lowercase letters match as well.

Also, it stores the matches inside the MatchCollection called matches. You can iteratively walk through it now and print out or work with the matches.
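Walking through the MatchCollection could look roughly like the following sketch. The class MatchesDemo, the helper FindEmails, and the second sample address are made up for illustration; the pattern is the one discussed above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchesDemo
{
    // The approximate email pattern from the article.
    private const string EmailPattern = @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b";

    // Hypothetical helper: returns every substring that looks like an email address.
    public static List<string> FindEmails(string input)
    {
        var found = new List<string>();
        MatchCollection matches = Regex.Matches(input, EmailPattern, RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            found.Add(m.Value); // Value holds the matched text
        }
        return found;
    }

    public static void Main()
    {
        string input = "Write to stevenpriest@corporation.net or admin@example.org today.";
        foreach (string email in FindEmails(input))
        {
            Console.WriteLine(email);
        }
        // prints:
        // stevenpriest@corporation.net
        // admin@example.org
    }
}
```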

Another way to work with regular expressions is, of course, to create a Regex instance through one of its constructors. We can do this with the following declaration:

Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);

After this point we can simply call Match or any other method on the instance created above (rgx). Here’s an example of doing this:

Match mtch = rgx.Match(input);

Now what if there are more matches and we don’t want to store the results in a MatchCollection? We can iteratively move through the input string like this. Pay attention to the Capture class (herein including its CaptureCollection); that’s how we “capture” the results.

while (mtch.Success)
{
    Console.WriteLine("The match: " + mtch);

    CaptureCollection captcoll = mtch.Captures;
    foreach (Capture capt in captcoll)
    {
        Console.WriteLine("The capture: " + capt);
    }

    mtch = mtch.NextMatch(); // move forward and find the next match
}
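Put together with the constructor call, the loop above can be sketched as a complete program. The class NextMatchDemo, the helper Walk, and the sample input and pattern are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NextMatchDemo
{
    // Hypothetical helper: collects every match by stepping with NextMatch(),
    // without building a MatchCollection first.
    public static List<string> Walk(string input, string pattern)
    {
        var results = new List<string>();
        Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);

        Match mtch = rgx.Match(input); // the first match, if any
        while (mtch.Success)
        {
            results.Add(mtch.Value);
            mtch = mtch.NextMatch();   // continue searching after the current match
        }
        return results;
    }

    public static void Main()
    {
        foreach (string s in Walk("dev1 DEV2 shed dev3", @"dev\d"))
        {
            Console.WriteLine("The match: " + s);
        }
        // prints:
        // The match: dev1
        // The match: DEV2
        // The match: dev3
    }
}
```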

As another example, what if we want to group certain substrings of the matches? Thankfully, there’s the Group class, whose purpose is to represent the results from a single capturing group. A group can capture multiple strings within one match; that’s why a GroupCollection class exists too.

Please be aware that the Group inherits from the Capture; that’s how you can refer back directly to the last string that was captured. Concerning how to use them, let’s continue our previous example. Suppose we have a <firstname> and <lastname> delimited as groups in our pattern, such as in the following example:

string pattern = @"(?<firstname>\w+)\s+(?<lastname>\w+)\s*";

Now with the above pattern, if we extend the previous match and captures example with groups, it would look like this:

Group fname = mtch.Groups["firstname"];
Group lname = mtch.Groups["lastname"];

Console.WriteLine("The first group: " + fname);
foreach (Capture capt in fname.Captures)
{
    Console.WriteLine("The captures of the first group: " + capt);
}

Console.WriteLine("The second group: " + lname);
foreach (Capture capt in lname.Captures)
{
    Console.WriteLine("The captures of the second group: " + capt);
}

Obviously, the continuation from the previous example cannot be left out, since we don’t want to stop at the first match:

mtch = mtch.NextMatch();

I think that the above examples have pretty much presented the way we can work with regular expressions in C#. Now you have the groundwork and the basic knowledge to study and learn from MSDN’s documentation about the rest of the classes and methods. Don’t forget that to deepen your knowledge, you also need to practice. So fire up your Visual Studio and have fun!
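As a starting point for that practice, the named-group walkthrough from this section can be sketched as one small program. The class GroupDemo, the helper Swap, and the sample name are hypothetical; the pattern is the firstname/lastname pattern shown earlier.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Two named groups separated by white space.
    private const string NamePattern = @"(?<firstname>\w+)\s+(?<lastname>\w+)\s*";

    // Hypothetical helper: turns "First Last" into "Last, First".
    // Returns an empty string when the input does not match.
    public static string Swap(string input)
    {
        Match mtch = Regex.Match(input, NamePattern);
        if (!mtch.Success)
        {
            return "";
        }

        Group fname = mtch.Groups["firstname"];
        Group lname = mtch.Groups["lastname"];
        return lname.Value + ", " + fname.Value;
    }

    public static void Main()
    {
        Console.WriteLine(Swap("Steven Priest")); // prints: Priest, Steven
    }
}
```

Groups can also be addressed by number (mtch.Groups[1], mtch.Groups[2]), where index 0 stands for the whole match.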

Lots of Examples

Let’s check out a fairly complete list of the regular expression syntax elements that you can use to build any kind of pattern, from the simplest to the most complex and cryptic-looking.

Element    Description

.          matches any single character, except new line
*          matches the preceding character (or group) zero or many times
+          matches the preceding character (or group) once or many times
?          matches the preceding character (or group) zero times or only once
$          matches the end of the input data string
^          matches the beginning of the input data string
\<         matches the empty string at the beginning of a word (start-of-the-word); this is grep/Vim syntax, in .NET use \b instead
\>         matches the empty string at the end of a word (end-of-the-word); this is grep/Vim syntax, in .NET use \b instead
\          escapes the next character: a metacharacter such as $ is then matched literally (\$), while certain letters become special (\d, \s)
( )        groups and captures the pattern enclosed in the parentheses: (pattern)
[ ]        matches any one of the characters enclosed in the set
[^ ]       matches any one character that is not enclosed in the set
{n}        matches the preceding character (or group) exactly n times
{n,}       matches the preceding character (or group) at least n times
\n         matches a new line character
\r         matches a carriage return character
\f         matches a form-feed character
\t         matches a tab character
\v         matches a vertical tab character
\s         matches any white space character; roughly equivalent to [ \n\r\f\t\v]
\S         matches any non-white space character; roughly equivalent to [^ \n\r\f\t\v]
\w         matches any word character, herein including underscore and digits
\W         matches any non-word character, the negation of the above
\b         matches a word boundary: the position between a word character and a non-word character
\B         matches a non-word boundary, the negation of the above
\d         matches any digit character; equivalent to [0-9] for ASCII input
\D         matches any non-digit character; equivalent to [^0-9] for ASCII input
\e         matches the escape character
\nnn       matches the character with the octal escape value nnn; 1, 2, or 3 digits long
\xhh       matches the character with the hexadecimal escape value hh; the hex value must be 2 digits long
\digit     back reference operator; refers back to the text matched by the digit-th grouping operator; thus, it always follows a grouping operator
-          range operator; it is used when specifying ranges in sets such as [0-9]

I think that a few of the above should be illustrated with examples. Check out the following practical patterns. We’ll just state a pattern and describe what it would match. What we need to understand is how they work and what exactly each one of them does. If we know that, then we can do a myriad of other things using the classes presented earlier.

"d.v" -> matches “dev”, “d5v”…; the dot is a placeholder for any single character.

"dev*" -> matches “de” but also “dev”, “devv”, or “devvvvv”, etc.

"dev+" -> matches “dev”, or “devvvvv”, and so forth, but not “de”.

"de?v" -> matches “dv” and “dev”.

"dev|shed" -> matches either of the two: dev OR shed.

"dev[0-9]" -> matches “dev0”, “dev1”, “dev2”, … “dev8”, “dev9”.

"dev[^0-9]" -> matches “deva”, “devb”, anything but a digit at the end.

"shed\b" -> matches any word ending in “shed”.

"\sshed" -> matches “shed” preceded by a white space character: a tab, a new line, or a space as the first character.

"\da*" -> matches a single digit optionally followed by any number of “a” characters: “5”, “5a”, “5aaa”…

"^5[1-5][0-9]{14}$" -> matches 16-digit numbers in the MasterCard format that begins with 51 through 55. It checks the format only, not whether the card number is actually valid.

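The claims above are easy to check with a few lines of code. The following is only a sketch: the class PatternExamples, the helper Matches, and the test inputs (including the card-like numbers, which are arbitrary digit strings) are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternExamples
{
    // Hypothetical helper: true if the pattern is found anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        Console.WriteLine(Matches(@"d.v", "d5v"));            // True
        Console.WriteLine(Matches(@"dev*", "de"));            // True: zero v's are allowed
        Console.WriteLine(Matches(@"dev+", "de"));            // False: at least one v is needed
        Console.WriteLine(Matches(@"de?v", "dv"));            // True
        Console.WriteLine(Matches(@"dev[0-9]", "dev9"));      // True
        Console.WriteLine(Matches(@"dev[^0-9]", "dev9"));     // False
        Console.WriteLine(Matches(@"shed\b", "woodshed"));    // True
        Console.WriteLine(Matches(@"shed\b", "sheds"));       // False: no boundary after "shed"
        Console.WriteLine(Matches(@"\sshed", "a shed"));      // True
        Console.WriteLine(Matches(@"^5[1-5][0-9]{14}$", "5105105105105100")); // True
        Console.WriteLine(Matches(@"^5[1-5][0-9]{14}$", "4111111111111111")); // False
    }
}
```

Keep in mind that IsMatch looks for the pattern anywhere in the input; only the last pattern is anchored to the whole string with ^ and $.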
Furthermore, I’d like to recommend downloading and trying out the following free application: Rad Software Regular Expression Designer. I am not affiliated with, nor do I have any relation to, the author(s) of this nifty application. All I know is that it is small, efficient, and very practical. And I mean it. It is an amazing utility that helps you practice and deepen your knowledge of regular expressions.

It supports such features as specifying a regular expression, adding text as input string(s), and letting you simulate the pattern matching and/or replacing process. You can also fix your regular expressions more easily, because after every change you can just run the match again. It also has a tree-view that contains all of the language elements, and you can pick them one by one. Check it out!

Final Words

We have just arrived at the end of this article. I hope you have found this journey educational and informative. By now you should be familiar with most of the inner secrets of Regular Expressions. Writing a pattern in its formal and concise language won’t scare you away. Neither will you be surprised when you pick up a piece of code that your co-worker has written and it contains funny-looking patterns.

Surely, regular expressions aren’t that critical because the truth is you could pretty much write pattern matching algorithms right out of memory; you may be very familiar with the well-known and efficient algorithms such as the Knuth-Morris-Pratt and/or the Boyer-Moore variations. However, implementing these, especially in the case of patterns that we can’t know in advance (where meta-chars help) can become quite challenging.

Additionally, anybody who works in software development knows that resource efficiency, in both computer resources and time, should never be sacrificed. Therefore, it makes sense to use something that is specifically designed to accomplish the task and does the job amazingly well; you only need to know how to use it.

Taking advantage of regular expressions can become quite fun. Once you get the hang of it you will be positively impressed by how frequently you’ll be able to use them, starting with your favorite word processor, various utilities, up to scripting and programming languages as well. You can save hundreds of lines of code with one line.

