Regular Expressions in VBScript

At this point in your scripting you’ve certainly come across VBScript’s InStr and Replace functions for searching and replacing within strings. While these string functions can come in pretty handy, they are also intrinsically limited. It’s time to step up into the Big Leagues and learn how to search and replace based on patterns.

Like most other programming languages, VBScript uses regular expressions to perform pattern matching.  A regular expression is a textual expression that contains symbols and literal characters used to match a pattern of characters.  These can be extremely simple or scaled to be very complex, but if you take things one step at a time you’ll pick them up very easily.

So when would you use regular expressions?  Well, for starters you can have much more powerful string replacements.  You might also use regular expressions to validate form entry in an HTML application or ASP page.  You’ll understand this better and realize the flexibility as you see some real-world examples.

VBScript provides the RegExp object for creating and handling regular expressions.  It provides a series of methods and properties for searching and replacing based upon regular expression patterns.

Set objRegExp = New RegExp

The RegExp object is a native VBScript object.  However, unlike the objects you’ve seen so far that are available by default, it is not instantiated automatically.  You must create an instance yourself from the RegExp class with the New keyword, which returns a new object of the specified class.

objRegExp.Global = False
objRegExp.IgnoreCase = True
objRegExp.Multiline = False
objRegExp.Pattern = "some pattern"

The RegExp class provides several properties for controlling its behavior when matching or replacing.  The Global property determines whether the pattern is applied to every occurrence in the string or only to the first match found.  The IgnoreCase property accepts a Boolean value that indicates whether matching should ignore case; set it to True for a case-insensitive match.  The Multiline property indicates whether ^ and $ should also match at the beginning and end of each line within the string, rather than only at the beginning and end of the whole string.  By default, all three of these properties are set to False.  Finally, the Pattern property is the text string that determines what matches are found.  You’ll learn more about constructing patterns in a bit.
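To see what Global and IgnoreCase actually change, here is a minimal sketch.  The sample string is my own invention for illustration; the counts in the comments are what I would expect from the Matches collection’s Count property under each combination of settings.

```vbscript
' Sketch: how Global and IgnoreCase change the number of matches.
' strSample is a made-up string used only for this illustration.
strSample = "Test one, test two, TEST three."

Set objRegExp = New RegExp
objRegExp.Pattern = "test"

' Defaults: Global = False, IgnoreCase = False
WScript.Echo objRegExp.Execute(strSample).Count   ' 1 (only the lowercase "test")

' Case is now ignored, but matching still stops after the first hit
objRegExp.IgnoreCase = True
WScript.Echo objRegExp.Execute(strSample).Count   ' 1 (the leading "Test")

' Global = True keeps searching through the whole string
objRegExp.Global = True
WScript.Echo objRegExp.Execute(strSample).Count   ' 3 (every spelling of "test")
```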

Searching and Replacing

On the surface, regular expressions just match a string of characters.  For instance, the regular expression “word” would match the letters w-o-r-d in both “word” and “sword.”  Replacements are simply string substitutions.  You provide the substitution and VBScript’s RegExp object makes a replacement when it finds matches.

strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Pattern = "test"

If objRegExp.Test(strTest) Then WScript.Echo "A match was found."

The RegExp object’s Test method returns a Boolean value that indicates whether a match was found.  It requires one parameter—a text string to perform the test against.  In this example, Test returns a value of True because the pattern “test” can be found in the string.
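Because Test only answers yes or no, it is easy to wrap in a small reusable function.  The sketch below is one way to do it; the function name ContainsPattern and its parameters are hypothetical names I chose for this example, not part of VBScript.

```vbscript
' Hypothetical helper: returns True when strPattern is found in strText.
Function ContainsPattern(strText, strPattern)
    Dim objRegExp
    Set objRegExp = New RegExp
    objRegExp.IgnoreCase = True          ' match regardless of case
    objRegExp.Pattern = strPattern
    ContainsPattern = objRegExp.Test(strText)
End Function

WScript.Echo ContainsPattern("This is my test string.", "test")    ' True
WScript.Echo ContainsPattern("This is my test string.", "sword")   ' False
```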

strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Pattern = "test"

Set colMatches = objRegExp.Execute(strTest)

For Each objMatch In colMatches
   WScript.Echo "RegExp matched " & objMatch.Length & " characters " & Chr(34) & objMatch.Value & Chr(34) & _
       " beginning at position " & objMatch.FirstIndex
Next

Output:

RegExp matched 4 characters "test" beginning at position 11

If you want to work with the matches, you can use the RegExp object’s Execute method instead.  This method returns a collection of Match objects that represent the matches that were found.  Each Match object contains three properties that you can use to work with the matched patterns.  The Length property returns the number of characters matched by the pattern, the Value property returns the text string that was matched, and the FirstIndex property returns an integer representing the position in the string where the match begins.  Note that FirstIndex is zero-based, so the first character of the string is position 0; that is why “test” is reported at position 11 rather than 12.

WScript.Echo colMatches.Item(0).Value
WScript.Echo colMatches(0).Value

Output:

test
test

As with any collection, you can access individual objects directly.  Here I’m accessing the first match that was found.  In the first line, I use standard notation for accessing the collection.  Take a close look at the second line though.  The Item property is the default property for the Matches collection, so I don’t have to specify it directly.  Either syntax is perfectly acceptable.

strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Pattern = "test"

strNew = objRegExp.Replace(strTest, "new")

WScript.Echo strTest
WScript.Echo strNew

Output:

This is my test string.
This is my new string.

RegExp’s Replace method is used to perform a text replacement based upon pattern matches.  The first parameter indicates the text string to search and the second parameter is a string that represents the replacement text.
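The Global property matters for Replace just as it does for matching.  Here is a short sketch, using a sample string and replacement text of my own, that shows the difference between replacing the first match and replacing all of them.

```vbscript
' Sketch: Replace with and without the Global property.
' The sample string and the replacement text "item" are illustrative only.
strTest = "Test one, test two."

Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Pattern = "test"

' Global is False by default, so only the first match is replaced
WScript.Echo objRegExp.Replace(strTest, "item")   ' item one, test two.

' With Global = True every match is replaced
objRegExp.Global = True
WScript.Echo objRegExp.Replace(strTest, "item")   ' item one, item two.
```

In both cases strTest itself is left untouched; Replace returns a new string.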

Constructing Patterns

At this point regular expressions don’t seem any better than VBScript’s own string functions.  In fact, they only seem to take more code!  But that’s because you haven’t seen the magic of patterns yet.  What if you wanted to know how many words were in a sentence?

strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.Global = True
objRegExp.IgnoreCase = True
objRegExp.Pattern = "\w+"

Set colMatches = objRegExp.Execute(strTest)
WScript.Echo colMatches.Count

Or which words begin with the letter “t”?

strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.Global = True
objRegExp.IgnoreCase = True
objRegExp.Pattern = "\bt[a-z]+\b"

Set colMatches = objRegExp.Execute(strTest)
For Each objMatch In colMatches
   WScript.Echo objMatch.Value
Next

You can quickly see how patterns can make all of the difference.  But what are patterns and how do you make them?  At its simplest, a pattern is a string of literal characters to be matched.  However, there is a series of reserved and escaped characters that can be used to control match positions, occurrences, wild cards, and more.  We’ll begin with match positions as listed in Table 1 below.

Table 1: Position Matching

Symbol    Description
^         Matches the beginning of a string.  “^This” would match the word “This” if it appeared at the beginning of a string.
$         Matches the end of a string.  “\.$” matches the period at the end of a string.
\b        Matches a word boundary.  “\bt” matches the letter t at the beginning of a word.
\B        Matches a non-word boundary.  “\Bx\B” matches any letter x that does not appear at the beginning or end of a word.
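The position symbols are easiest to understand by trying them against the same string.  This is a minimal sketch using the test string from earlier; note that the word-boundary symbols are written with a leading backslash (\b and \B).

```vbscript
' Sketch: position matching against the article's sample string.
strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.IgnoreCase = True

objRegExp.Pattern = "^This"
WScript.Echo objRegExp.Test(strTest)   ' True: "This" starts the string

objRegExp.Pattern = "^test"
WScript.Echo objRegExp.Test(strTest)   ' False: "test" is not at the start

objRegExp.Pattern = "\.$"
WScript.Echo objRegExp.Test(strTest)   ' True: the string ends with a period

' \Bs\B only accepts an "s" with word characters on both sides,
' so the first hit is the "s" inside "test"
objRegExp.Pattern = "\Bs\B"
WScript.Echo objRegExp.Execute(strTest)(0).FirstIndex   ' 13
```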

After positioning, you’ll want to match literal characters.  Alphanumeric characters are treated as literals.  However, some characters have special meanings in a pattern, and some cannot be typed directly.  Those characters must be written with a preceding backslash.

Table 2: Matching Literals

Symbol          Description
Alphanumeric    Matches any alphanumeric character literally.
\n              Matches a new line
\f              Matches a form feed
\r              Matches a carriage return
\t              Matches a horizontal tab
\v              Matches a vertical tab
\?              Matches a ?
\*              Matches a *
\+              Matches a +
\.              Matches a .
\|              Matches a |
\{              Matches a {
\}              Matches a }
\\              Matches a \
\[              Matches a [
\]              Matches a ]
\(              Matches a (
\)              Matches a )
\xxx            Matches the ASCII character expressed by the octal number xxx.  “\50” matches “(” or Chr(40)
\xdd            Matches the ASCII character expressed by the hex number dd.  “\x28” matches “(” or Chr(40)
\uxxxx          Matches the Unicode character expressed by the four-digit hex number xxxx.  “\u00A3” matches “£”

Once you have the ability to match literal characters, you’ll probably find the need to expand a bit.  You may want to match any one character in a range of characters, or perhaps everything except a specified character.  This is done with character classes.

Table 3: Matching Character Classes

Symbol    Description
[xyz]     Matches any character in the character set.  Hyphens denote ranges.  “[a-z]” matches any character a through z
[^xyz]    Matches any character not in the character set.  “[^0-9]” matches any non-digit character
.         Matches any character except \n.
\w        Matches any word character.  Equivalent to [a-zA-Z_0-9]
\W        Matches any non-word character.  Equivalent to [^a-zA-Z_0-9]
\d        Matches any digit.  Equivalent to [0-9]
\D        Matches any non-digit character.  Equivalent to [^0-9]
\s        Matches any whitespace character.  Equivalent to [ \t\r\n\v\f]
\S        Matches any non-whitespace character.  Equivalent to [^ \t\r\n\v\f]
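Since a character class matches one character at a time, counting matches with Global turned on is a quick way to see what each class picks up.  The sketch below reuses the sample string from earlier; the three patterns are my own picks for illustration, and the shorthand classes are written with their leading backslash.

```vbscript
' Sketch: counting single-character matches for a few character classes.
strTest = "This is my test string."

Set objRegExp = New RegExp
objRegExp.Global = True      ' count every match, not just the first

objRegExp.Pattern = "[aeiou]"
WScript.Echo objRegExp.Execute(strTest).Count   ' 4 vowels: i, i, e, i

objRegExp.Pattern = "\s"
WScript.Echo objRegExp.Execute(strTest).Count   ' 4 spaces

' Anything that is neither a letter nor whitespace
objRegExp.Pattern = "[^a-zA-Z\s]"
WScript.Echo objRegExp.Execute(strTest).Count   ' 1, the final period
```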

At this point, your patterns will still be matching one character at a time.  To unleash the power of regular expressions, you need to match repeating characters.

Table 4: Matching Repetition

Symbol    Description
{x}       Matches x occurrences.  “\d{5}” matches 5 digits.
{x,}      Matches x or more occurrences.  “\d{2,}” matches 2 or more consecutive digits.
{x,y}     Matches x to y occurrences.  “\d{2,3}” matches no less than two digits and no more than three.
?         Matches 0 or 1 occurrence.  Equivalent to {0,1}.  “\d?” matches 0 or 1 digit.
*         Matches 0 or more occurrences.  Equivalent to {0,}.  “\d*” matches 0 or more digits.
+         Matches 1 or more occurrences.  Equivalent to {1,}.  “\d+” matches 1 or more digits.
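Here is a small sketch of repetition at work, pulling runs of digits out of a string.  The sample text is made up for the example; note that the digit class is written with its backslash (\d).

```vbscript
' Sketch: repetition operators applied to runs of digits.
' strTest is an invented sample string.
strTest = "Order 66 shipped 3 boxes on 2024-01-15."

Set objRegExp = New RegExp
objRegExp.Global = True

' One or more digits: every run of digits is a separate match
objRegExp.Pattern = "\d+"
Set colMatches = objRegExp.Execute(strTest)
For Each objMatch In colMatches
    WScript.Echo objMatch.Value        ' 66, 3, 2024, 01, 15 (one per line)
Next

' Two or more digits: the lone "3" no longer qualifies
objRegExp.Pattern = "\d{2,}"
WScript.Echo objRegExp.Execute(strTest).Count   ' 4

' Exactly four digits
objRegExp.Pattern = "\d{4}"
WScript.Echo objRegExp.Execute(strTest)(0).Value   ' 2024
```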

Finally, grouping and alternation offer the ability to make extremely complex regular expressions.  Grouping allows you to match clauses.  Alternation allows you to add more than one clause and match any one of them.

Table 5: Grouping and Alternation

Symbol    Description
()        Grouping creates a clause.  Clauses may be nested.  “(ab)?(c)” matches “abc” or “c”.
()|()     Alternation groups clauses into one expression and then matches any one of the clauses.  “(ab)|(cd)|(ef)” matches “ab”, “cd”, or “ef”.
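To check the two examples from Table 5, the sketch below anchors each pattern with ^ and $ so that the whole string has to match.  The anchors and the extra outer group around the alternation are my additions; without the outer group, ^ would apply only to the first clause and $ only to the last.

```vbscript
' Sketch: grouping and alternation, anchored to the whole string.
Set objRegExp = New RegExp

' Optional group "ab" followed by "c"
objRegExp.Pattern = "^(ab)?(c)$"
WScript.Echo objRegExp.Test("abc")   ' True
WScript.Echo objRegExp.Test("c")     ' True
WScript.Echo objRegExp.Test("bc")    ' False: "ab" must appear whole or not at all

' Any one of three clauses; the outer group keeps ^ and $ attached to all of them
objRegExp.Pattern = "^((ab)|(cd)|(ef))$"
WScript.Echo objRegExp.Test("cd")    ' True
WScript.Echo objRegExp.Test("ce")    ' False
```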

Regular expressions also allow a feature called back referencing.  Back referencing allows you to reuse the text matched by an earlier group later in the same expression.  This is done by providing a backslash followed by a digit that identifies the group, counting from 1.  For example, the expression “(\w+)\s+\1” matches any one word that occurs twice in a row.  In other words, the same match must be made twice in a row.
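A practical use for a back reference is spotting accidentally doubled words.  The sketch below is one way to do it, with a sample string I made up.  I have added \b on each side of the expression, which is my own assumption about what you would want: without it, the “is” at the end of “This” could pair up with the following “is”.

```vbscript
' Sketch: finding doubled words with a back reference.
' strTest is an invented sample containing two doubled words.
strTest = "This is is my test test string."

Set objRegExp = New RegExp
objRegExp.Global = True
objRegExp.IgnoreCase = True

' (\w+) captures a word, \s+ skips whitespace, \1 requires the same word again.
' The \b anchors (an addition for this example) force whole-word matches.
objRegExp.Pattern = "\b(\w+)\s+\1\b"

Set colMatches = objRegExp.Execute(strTest)
For Each objMatch In colMatches
    WScript.Echo objMatch.Value        ' "is is", then "test test"
Next
```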

Building Useful Patterns

Now that you have all of the tools, let’s look at how to make them work.  Say we wanted to build a regular expression to match a standard ten-digit phone number in the form of (xxx) xxx-xxxx.

The regular expression could easily begin as “\(\d\d\d\) \d\d\d-\d\d\d\d”.  This spells the format out one character at a time: an escaped opening parenthesis followed by three digits, an escaped closing parenthesis, a space, three more digits, a hyphen, and the last four digits.

If we apply repetition, we can condense this a bit to “\(\d{3}\) \d{3}-\d{4}”.  Both expressions mean the same thing.  Now what if we wanted to make the parentheses optional?  Of course, if we do, the space should become a hyphen.  Enter grouping and alternation.

To begin, we need to specify what it should look like with parentheses.  Thus our expression should begin with “\(\d{3}\) ” as before.  Now we want to add an alternate possibility in case parentheses aren’t used.  The expression then becomes “(\(\d{3}\) )|(\d{3}-)”.  This will match three digits between parentheses followed by a space, OR three digits followed by a hyphen.  We then complete the expression by adding the remaining part of the match.  Because alternation applies to everything on either side of the | symbol, the two alternatives must first be wrapped in one more group so that the remainder applies to both of them.  The final expression looks like “((\(\d{3}\) )|(\d{3}-))\d{3}-\d{4}”.  This expression would match either “(123) 456-7890” or “123-456-7890”.
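Here is a quick sketch that puts the phone number expression to work, written with its backslashes and with the two area-code alternatives wrapped in an outer group so the rest of the number applies to both.  The ^ and $ anchors and the list of sample numbers are my additions for testing whole strings.

```vbscript
' Sketch: validating phone numbers in either accepted format.
Set objRegExp = New RegExp

' ^ and $ (added here) require the whole string to be a phone number.
' The outer group makes the alternation cover only the area code part.
objRegExp.Pattern = "^((\(\d{3}\) )|(\d{3}-))\d{3}-\d{4}$"

' Sample inputs invented for this test
For Each strPhone In Array("(123) 456-7890", "123-456-7890", _
                           "123 456-7890", "(123) 456-789")
    WScript.Echo strPhone & " -> " & objRegExp.Test(strPhone)
Next
' The first two report True; the last two report False.
```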

Another example would be a common U.S. zip code.  A U.S. zip code consists of five numeric digits optionally followed by a hyphen and four more digits.  Just as in the previous example, you can use grouping to accomplish this quite easily.  That expression would look like “\d{5}(-\d{4})?”.  Notice this time that I’m using the ? symbol to match either 0 or 1 of the last group.
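The zip code expression drops into the same kind of test.  As a sketch, the function name IsZipCode is a hypothetical name of my choosing, and the ^ and $ anchors are added so that extra characters before or after the zip code cause the check to fail.

```vbscript
' Hypothetical helper: True when strZip is a 5-digit or ZIP+4 code.
Function IsZipCode(strZip)
    Dim objRegExp
    Set objRegExp = New RegExp
    ' Five digits, then an optional hyphen-plus-four-digits group
    objRegExp.Pattern = "^\d{5}(-\d{4})?$"
    IsZipCode = objRegExp.Test(strZip)
End Function

WScript.Echo IsZipCode("12345")        ' True
WScript.Echo IsZipCode("12345-6789")   ' True
WScript.Echo IsZipCode("1234")         ' False: too few digits
WScript.Echo IsZipCode("12345-678")    ' False: incomplete +4 part
```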

You can see that regular expressions provide very powerful tools for matching and replacing text strings.  Very complex expressions can be built using a simple set of character symbols.  Stay tuned for a future article that demonstrates more advanced uses of VBScript regular expressions.

If you’re interested in testing your regular expressions, I’ve built a Regular Expression Tester HTML application using the regular expression test code found here.  Until next time, keep coding!
