Regular Expression Tutorial

Introduction

This section walks through a real example of the XReplace-32 regular expression mechanism. It explains, step by step, a complex replacement that saved a real XReplace-32 user hours of work. This is a live problem from industry that was impossible to solve without regular expressions.

Solving a Real Problem

There are hundreds of files in the following format:
Software - Package - (Path:C:\WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0
Hardware - Package - (Path:T:\32bit\system\4nt\PRODUCTS.TXT) - Name = Microsoft
Windows NT SP3
Software - Package - (Path:N:\32bit\Xreplace\xrep32.exe) - File Size = 914688
Hardware - Package - (Path:C:\WINNT\SYSTEM32\MFC42.DLL) - File Date = 862437600
No
The task is to remove the paths and the surrounding brackets, keeping only the filenames, to produce the following output:
Software - Package - NTOSKRNL.EXE - Version = 4.0
Hardware - Package - PRODUCTS.TXT - Name = Microsoft
Windows NT SP3
Software - Package - xrep32.exe - File Size = 914688
Hardware - Package - MFC42.DLL - File Date = 862437600
No

Writing Regular Expression Ranges

A regular expression is a series of patterns that are matched to the real data. Once the pattern is matched, it's cut into pieces following the pattern format. Each piece is identified by a number and can be altered, copied or replaced.

A range defines what kind of characters a pattern can contain. For example, all capital letters range from A to Z. The XReplace-32 regexp range for capital letters is written:
[A-Z]
A range for all alphanumeric characters is thus:
[A-Z,a-z,0-9]
To include a literal left parenthesis ( in the range, escape it with a backslash:
[A-Z,a-z,0-9,\(]
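
For readers who want to experiment outside XReplace-32, the same ranges can be approximated with Python's re module. This is only a sketch under assumptions: Python character classes list sub-ranges without commas (a comma there would match a literal comma), and the names UPPER, ALNUM, ALNUM_PAREN and matches_one are invented for illustration.

```python
import re

# Assumed Python translations of the XReplace-32 ranges above.
# Python character classes take no commas between sub-ranges.
UPPER = re.compile(r"[A-Z]")                # capital letters
ALNUM = re.compile(r"[A-Za-z0-9]")          # alphanumeric characters
ALNUM_PAREN = re.compile(r"[A-Za-z0-9\(]")  # alphanumerics plus a literal (


def matches_one(pattern, char):
    """Return True if the single character belongs to the range."""
    return pattern.fullmatch(char) is not None


print(matches_one(UPPER, "Q"))        # True
print(matches_one(UPPER, "q"))        # False
print(matches_one(ALNUM, "7"))        # True
print(matches_one(ALNUM_PAREN, "("))  # True
```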

Writing Regular Expression Patterns

Patterns are separated in XReplace-32 by enclosing them in parentheses. A full pattern of all alphanumeric characters is written
([A-Z,a-z,0-9])
and will match any single character between A and Z, a and z or 0 and 9. To match a sequence of characters, the pattern has to be repeated as many times as possible. This is done by adding a * after the range:
([A-Z,a-z,0-9]*)

Multiple patterns are easy to write. For example, a filename is of the form name.extension and will be matched by
([A-Z,a-z,0-9]*\.[A-Z,a-z,0-9]*)
This reads: a sequence of alphanumeric characters followed by a dot and another sequence of alphanumeric characters.
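
The filename pattern can be tried out with Python's re module. This is a sketch, not XReplace-32 itself: the character classes are written without commas, as Python expects, and FILENAME and find_filenames are hypothetical names.

```python
import re

# Assumed Python translation of ([A-Z,a-z,0-9]*\.[A-Z,a-z,0-9]*):
# alphanumerics, a literal dot, then alphanumerics again.
FILENAME = re.compile(r"([A-Za-z0-9]*\.[A-Za-z0-9]*)")


def find_filenames(text):
    """Return every name.extension substring found in text."""
    return FILENAME.findall(text)


sample = r"(Path:C:\WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0"
print(find_filenames(sample))  # ['NTOSKRNL.EXE', '4.0']
```

Note that the version number 4.0 also fits the pattern. This is one reason to anchor a filename match with the text that surrounds it before using it in a replacement.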

Getting to the Point

We can now attempt to get rid of (Path:?:\ and replace it with a single backslash. For the first line, this gives:
Software - Package - \WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0

The source replacement is:
(\(Path:)([A-Z,a-z]\:\\)
which will match a left parenthesis followed by Path: , a single letter, a colon and a backslash. The target is:
\\
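
This first step can be reproduced in Python as a rough equivalent. The translation of the source pattern is an assumption (no commas inside the character class, no escape before the colon), and PATH_PREFIX and strip_path_prefix are hypothetical names.

```python
import re

# Assumed Python translation of (\(Path:)([A-Z,a-z]\:\\)
PATH_PREFIX = re.compile(r"(\(Path:)([A-Za-z]:\\)")


def strip_path_prefix(line):
    """Replace the '(Path:' prefix, drive letter, colon and backslash
    with one backslash."""
    # A function is used as the replacement so that the backslash
    # is inserted literally, without template escaping.
    return PATH_PREFIX.sub(lambda match: "\\", line)


line = r"Software - Package - (Path:C:\WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0"
print(strip_path_prefix(line))
# Software - Package - \WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0
```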

We must now get rid of all sequences between backslashes:
Software - Package - \WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0
Software - Package - \SYSTEM32\NTOSKRNL.EXE) - Version = 4.0
Software - Package - \NTOSKRNL.EXE) - Version = 4.0

XReplace-32 has no mechanism to repeat a replacement until no more occurrences are found, so the operation has to be run several times. Each run replaces a backslash, followed by a sequence of alphanumeric characters (a directory name) and terminated by another backslash, with a single backslash.
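
Outside XReplace-32, the repetition can be expressed as a loop. The sketch below uses Python's re module; the DIRECTORY pattern is inferred from the description above rather than quoted from XReplace-32, and strip_directories is a hypothetical name.

```python
import re

# Hypothetical pattern for one directory level:
# backslash, alphanumerics, backslash.
DIRECTORY = re.compile(r"\\[A-Za-z0-9]*\\")


def strip_directories(line):
    """Apply the directory replacement until the line stops changing."""
    while True:
        # Every match is at least two characters long and is replaced
        # by one backslash, so each pass shortens the line or ends the loop.
        shorter = DIRECTORY.sub(lambda match: "\\", line)
        if shorter == line:
            return line
        line = shorter


line = r"Software - Package - \WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0"
print(strip_directories(line))
# Software - Package - \NTOSKRNL.EXE) - Version = 4.0
```

The loop stops as soon as a pass leaves the line unchanged, which is the "repeat until no more occurrences are found" behaviour that has to be done by hand in XReplace-32.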

Rewriting Patterns

The replacement is almost finished. It is now easy to replace a filename that starts with a backslash and ends with a right parenthesis. The source pattern sequence is:
(\\)([A-Z,a-z,0-9]*\.[A-Z,a-z,0-9]*)(\))

Since the previous steps left a single backslash in front of the filename, the filename itself has to be rewritten as the result of the replacement. Remember that each pattern can be identified by a number. In this example, the patterns are the following:
1: \\
2: [A-Z,a-z,0-9]*\.[A-Z,a-z,0-9]*
3: \)

The target replacement is \2 , which is the filename only.
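
The same numbered-pattern idea exists in Python's re module, where each parenthesized group gets a number and \2 in the replacement refers to the second group. This is a sketch with an assumed translation of the source pattern; FILENAME_IN_PAREN and keep_filename are hypothetical names.

```python
import re

# Assumed Python translation of (\\)([A-Z,a-z,0-9]*\.[A-Z,a-z,0-9]*)(\))
# Group 1: backslash, group 2: filename, group 3: right parenthesis.
FILENAME_IN_PAREN = re.compile(r"(\\)([A-Za-z0-9]*\.[A-Za-z0-9]*)(\))")


def keep_filename(line):
    """Replace backslash + filename + ')' with group 2, the filename alone."""
    return FILENAME_IN_PAREN.sub(r"\2", line)


line = r"Software - Package - \NTOSKRNL.EXE) - Version = 4.0"
print(keep_filename(line))
# Software - Package - NTOSKRNL.EXE - Version = 4.0
```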

Getting it Together

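The full replacement sequence is finally: first replace the (Path:?:\ prefix by a single backslash, then repeat the directory replacement until only the filename remains between the backslash and the right parenthesis, and last replace that backslash, filename and parenthesis by \2 .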

Imagine making such a replacement in 10'000 files by hand!
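
For comparison, the whole sequence can be sketched in Python on the sample data from the beginning of this tutorial. It illustrates the three steps and is not a substitute for XReplace-32: the patterns are assumed translations (the directory pattern is inferred from the description, not quoted), and clean_line and clean_text are hypothetical names.

```python
import re

# Assumed Python translations of the three steps.
PATH_PREFIX = re.compile(r"\(Path:[A-Za-z]:\\")                       # step 1
DIRECTORY = re.compile(r"\\[A-Za-z0-9]*\\")                           # step 2
FILENAME_IN_PAREN = re.compile(r"\\([A-Za-z0-9]*\.[A-Za-z0-9]*)\)")   # step 3


def backslash(_match):
    """Replacement function that inserts one literal backslash."""
    return "\\"


def clean_line(line):
    """Reduce '(Path:X:\\dir\\...\\name.ext)' to 'name.ext' in one line."""
    line = PATH_PREFIX.sub(backslash, line)
    # Repeat the directory replacement until nothing changes.
    while True:
        shorter = DIRECTORY.sub(backslash, line)
        if shorter == line:
            break
        line = shorter
    # Keep only the filename (group 1 here, since only it is parenthesized).
    return FILENAME_IN_PAREN.sub(r"\1", line)


def clean_text(text):
    """Apply clean_line to every line of a multi-line string."""
    return "\n".join(clean_line(line) for line in text.split("\n"))


SAMPLE = r"""Software - Package - (Path:C:\WINNT\SYSTEM32\NTOSKRNL.EXE) - Version = 4.0
Hardware - Package - (Path:T:\32bit\system\4nt\PRODUCTS.TXT) - Name = Microsoft
Windows NT SP3
Software - Package - (Path:N:\32bit\Xreplace\xrep32.exe) - File Size = 914688
Hardware - Package - (Path:C:\WINNT\SYSTEM32\MFC42.DLL) - File Date = 862437600
No"""

print(clean_text(SAMPLE))
```

Lines that contain no path, such as Windows NT SP3 and No, pass through unchanged because none of the three patterns matches them.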

Note

Use the prompted mode to try replacements before they are actually made. In complex replacements, regular expressions can have many side effects, because the pattern matching is very extensive and a pattern may match more text than intended.