Fun with Regular Expressions

Regular Expressions
So far we have claimed that Perl makes text manipulation easy. Regular expressions are what give Perl a lot of its power. Unfortunately, working with regular expressions is not always easy. Simple word-matching patterns aren't difficult, but regular expressions can get very complex. For this reason, we're going to spend some extra time on them.

Unix regular expressions and Perl regular expressions aren't identical. They are very similar, but a lot of the things that can be done in Perl will not work on the Unix command line.
Quick Review
Let us briefly go over what we already know about regular expressions. We know that matching a simple string is fairly easy. We just include it inside of a pair of forward slashes.

/string/

A group of characters can be matched by using brackets. For example

/[abc]/

will match a, b, or c. We can combine these to come up with something like this:

/str[io]ng/

Let's look carefully at the way Perl would use this regular expression to match a string. Say we're searching for this pattern in the string below:


We start at the first character and continue searching until we match the first thing in the pattern.


After we see an s, we look for t and r. If both of those are found, we try to match the group [io].


The "u" is not an i or an o so we stop matching. We continue through the string looking for the beginning of our pattern again.


Here it is. Does the [io] group match?


It sure does. So now all we have to do is check for ng. If that matches, we are done.


We have found our match! We can stop searching for the pattern now.
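
The walk-through above can be tried out with a short script. This is only a sketch: the sample text "a strung string" is made up for illustration (it is not necessarily the string used in the lesson), and the special variables $& (the matched text) and $-[0] (the offset where the match starts) are standard Perl but have not been covered here.

```perl
use strict;
use warnings;

# Hypothetical sample text: "strung" is a near miss, "string" is the real match.
my $text = "a strung string";

my ($found, $start) = ('', -1);
if ($text =~ /str[io]ng/) {
    $found = $&;       # the text that matched the pattern
    $start = $-[0];    # offset (counting from 0) where the match begins
}

print "matched $found at offset $start\n";   # matched string at offset 9
```

Perl tries "strung" first, gives up at the u, and then carries on until it reaches "string", just as in the steps above.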

We can even match whole ranges of characters by using a dash inside the brackets.

/[A-Za-z]tr[a-z]ng[0-9]/
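
To see what the ranges accept and reject, here is a small sketch. The sample words are invented for illustration, and qr// is simply a standard Perl way to store a pattern in a variable so it can be reused.

```perl
use strict;
use warnings;

# Any letter, then "tr", a lowercase letter, "ng", and a digit.
my $pattern = qr/[A-Za-z]tr[a-z]ng[0-9]/;

# Hypothetical test words.
my @samples = ('String1', 'xtreng7', 'string', '9tring1');

my %result;
for my $s (@samples) {
    $result{$s} = ($s =~ $pattern) ? 1 : 0;
    print "$s: ", ($result{$s} ? "match" : "no match"), "\n";
}
```

"string" fails because there is no digit after ng, and "9tring1" fails because a digit is not in the range A-Z or a-z.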
Multipliers
Multipliers allow us to match part of a pattern more than once. For instance, what if I wanted to match "Ned" and "need" with the same expression? Somehow I have to match multiple e's.

/[Nn]e+d/

The plus sign (+) stands for one or more. This expression will match an upper or lowercase n, followed by one or more e's and a d. But what if we needed to match "Nd" as well?

/[Nn]e*d/

The star (*) works in the same way except that it stands for zero or more. The last simple multiplier we should talk about is the question mark (?). The question mark lets us match zero or one of something. It is kind of like asking, "Is there one of these?"

/ab?c/

This would match abc and ac. Another way to use multipliers is to specify the minimum and maximum values directly using braces. There are three main ways to define a multiplier using braces. First, a single number in braces, such as {4}, means it will match exactly 4 times. If we follow that up with a comma, as in {4,}, it would match 4 or more times. The third and final way would be to specify a maximum value. To match 4 to 6 times we would use {4,6}. Let us see how this method can be used to make equivalents to the simple multipliers we already talked about.

+ can be represented by {1,}
* can be represented by {0,}
? can be represented by {0,1}
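
The multipliers can be compared side by side with a short script. This is a sketch with a made-up word list; grep is a standard Perl function that keeps the list elements for which the pattern matches.

```perl
use strict;
use warnings;

# Hypothetical word list for trying out the multipliers.
my @words = ('Nd', 'Ned', 'need', 'nod', 'ac', 'abc', 'abbc');

my @plus     = grep { /[Nn]e+d/ } @words;   # one or more e's
my @star     = grep { /[Nn]e*d/ } @words;   # zero or more e's
my @question = grep { /ab?c/ }    @words;   # zero or one b

# The brace forms should select exactly the same words.
my @plus_b     = grep { /[Nn]e{1,}d/ } @words;
my @star_b     = grep { /[Nn]e{0,}d/ } @words;
my @question_b = grep { /ab{0,1}c/ }   @words;

print "+ matches: @plus\n";       # + matches: Ned need
print "* matches: @star\n";       # * matches: Nd Ned need
print "? matches: @question\n";   # ? matches: ac abc
```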
Special Characters
If there is one thing Perl has a lot of, it is special characters. The +, *, and ? from above are examples of special characters (characters that have special meanings depending on how they are used).

Two very useful special characters are the caret (^) and the dollar sign ($). When used in a regular expression, ^ matches the beginning of a line and $ matches the end. For example, if we wanted to match all lines that begin with "The" we would use this:

/^The/

Matching the end of a line works much the same way.

/the end$/
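
Here is a small sketch of both anchors at work. The three lines of text are invented for illustration, and each one ends with a newline, as a line read from a file normally would.

```perl
use strict;
use warnings;

# Hypothetical input lines, each still carrying its trailing newline.
my @lines = (
    "The cat sat.\n",
    "Then came the end\n",
    "At the end, The cat left.\n",
);

my @starts = grep { /^The/ }     @lines;   # lines beginning with "The"
my @ends   = grep { /the end$/ } @lines;   # lines finishing with "the end"

print "Lines starting with The: ", scalar(@starts), "\n";   # 2
print "Lines ending with the end: ", scalar(@ends), "\n";   # 1
```

Note that "Then came the end" counts as starting with "The", because the pattern only asks for those three letters at the beginning of the line. The third line contains both phrases but matches neither pattern, since they are not at the beginning or the end.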

$ matches before a trailing newline character, so you do not have to worry about it getting in the way. Let us not forget that these two are special characters. Remember that $ is used when we reference variables as well. The ^ also has another use. If it is included at the beginning of a character group inside the brackets, the group matches any character that is not in the list. Let us look at an example:

/str[^io]ng/

In this case, the group will match any single character except i and o. Another special character is the period (.). We have already seen that it can be used as the concatenation operator for strings. But inside of a regular expression it is a whole different story. A period represents any single character.

/str.ng/

This expression will match "string", "strong", "str9ng", etc. The only thing a period will not match is a newline. Periods are often used in conjunction with multipliers.

/^The.*nice/

Here we are matching any line that starts with "The" and contains any number of characters followed by "nice". That is all fine and good, but you are probably wondering how we would match an actual period in the text.
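
Before we get to that, the negated group, the period, and the .* combination from this section can be tried out with a short script. This is a sketch only; the word list and sentences are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical test words.
my @words = ('string', 'strong', 'strung', 'str9ng', 'strng');

my @not_io   = grep { /str[^io]ng/ } @words;   # anything but i or o in the middle
my @any_char = grep { /str.ng/ }     @words;   # exactly one character in the middle

# A period does not match a newline.
my $dot_hits_newline = ("str\nng" =~ /str.ng/) ? 1 : 0;

# Hypothetical sentences for the .* example.
my @lines = (
    'The weather is nice today',
    'They said it was nice',
    'It is nice. The end.',
);
my @the_nice = grep { /^The.*nice/ } @lines;

print "[^io] matches: @not_io\n";     # [^io] matches: strung str9ng
print ". matches: @any_char\n";       # . matches: string strong strung str9ng
print "dot matched a newline: $dot_hits_newline\n";         # 0
print "^The.*nice matched ", scalar(@the_nice), " lines\n"; # 2
```

"strng" matches neither pattern, because both the negated group and the period still require one character between r and n.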
The Escape Character
The escape character, represented by a backslash (\), indicates to Perl that the character right after it should be treated differently than it normally would. This is used for two main purposes. First, it can make special characters lose their special meaning. Second, it can make ordinary characters take on a new meaning.

Say we wanted to match a period inside of our text. Let us take it a little further and try to match a dollar amount. The decimal point and the dollar sign are both special characters.

/\$[0-9]*\.[0-9]*/

The escape character has been used here to disregard the normal meaning of the dollar sign and decimal point. Can you spot any problems with the expression? Is it really useful? Are there any cases where the expression would not match a dollar amount? Aha! This expression will not match things like $1,000.00 and $5. Here is a better way (the changes are the comma added to the first group, the ? after the decimal point, and {0,2} in place of the final *):

/\$[,0-9]*\.?[0-9]{0,2}/
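
The two versions can be compared on a few sample amounts. This is a sketch: the variable names are made up, the amounts are written in single quotes so that Perl does not treat the $ as the start of a variable, and the parentheses around the pattern are only there to capture the text that matched.

```perl
use strict;
use warnings;

my $first  = qr/\$[0-9]*\.[0-9]*/;          # first attempt from the text
my $better = qr/\$[,0-9]*\.?[0-9]{0,2}/;    # improved version from the text

my (%old, %matched);
for my $price ('$19.99', '$1,000.00', '$5') {
    $old{$price} = ($price =~ $first) ? 'yes' : 'no';
    my ($new) = $price =~ /($better)/;      # capture what the better pattern matched
    $matched{$price} = $new;
    print "$price  first: $old{$price}  better matched: $new\n";
}

# Everything after the dollar sign is optional, so a lone $ matches too.
my ($lone) = 'US$ only' =~ /($better)/;
print "lone match: $lone\n";                # lone match: $
```

The last line shows a remaining weakness: because every part after \$ is optional, the improved expression will also accept a dollar sign with no number after it.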

The escape character also works with quotes, semicolons, at symbols (@), and so on. As mentioned, the escape character not only removes the special meaning from some symbols, it also adds meaning to others. Below is a table of the most commonly used of these:

Character   Matches
\w          alphanumeric, including underscore (_)
\s          whitespace
\d          numeric
\n          a newline
\W          non-alphanumeric
\S          non-whitespace
\D          non-digit
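
A few quick checks show the characters from the table in use. This is a sketch with invented sample strings; each variable ends up as 1 if the pattern matched and 0 if it did not.

```perl
use strict;
use warnings;

# \w: letters, digits, and underscore only
my $all_word     = ('user_42' =~ /^\w+$/) ? 1 : 0;                # 1
# \d and \s: three digits, one whitespace character, four digits
my $phone_ok     = ('555 1234' =~ /^\d{3}\s\d{4}$/) ? 1 : 0;      # 1
# \n: a newline between "one" and "line"
my $has_newline  = ("line one\nline two" =~ /one\nline/) ? 1 : 0; # 1
# \D: at least one character that is not a digit
my $has_nondigit = ('12a4' =~ /\D/) ? 1 : 0;                      # 1
# \W: nothing in user_42 falls outside \w
my $has_nonword  = ('user_42' =~ /\W/) ? 1 : 0;                   # 0
# \S: spaces and newlines are all whitespace
my $has_nonspace = ("  \n" =~ /\S/) ? 1 : 0;                      # 0

print "$all_word $phone_ok $has_newline $has_nondigit $has_nonword $has_nonspace\n";
```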

These special characters are used in regular expressions just like any other character or group of characters. You are doing great! See you at the next lesson.