At the highest level, a regular expression is one or more branches separated by the
vertical bar character (|). This character acts as a logical OR: the expression matches
if any one of its branches matches the string being evaluated. Table 16-1 provides a
few examples.
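As a minimal sketch of branches in practice, the following uses Python's re module, on the assumption that it treats the vertical bar the same way as the expressions described here. The helper name matches_whole is hypothetical, and fullmatch is used so that the whole string must equal one of the branches.

```python
import re

# Three branches separated by |; the expression matches if any one branch does.
pattern = re.compile(r"begin|end|break")

def matches_whole(text):
    """Return True if the entire string equals one of the branches."""
    return pattern.fullmatch(text) is not None

# Try each candidate word against the three branches.
results = {word: matches_whole(word) for word in ["begin", "end", "break", "stop"]}
print(results)
```

Only "stop" fails, because it is not one of the three branches.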
Each branch contains one or more atoms. An atom may be followed by a character that
modifies the number of times it may be matched in succession. An asterisk (*) means
the atom can match any number of times, including zero. A plus sign (+) means the atom
must match at least once. A question mark (?) signifies that the atom may match once or
not at all.
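The three repetition characters can be compared side by side. This is a sketch in Python's re module, assuming its *, + and ? behave as described above; the helper whole is a hypothetical name and requires the entire string to match.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

samples = ["a", "ab", "abbb"]

# b* : zero or more b's after the a
star = [whole(r"ab*", s) for s in samples]
# b+ : at least one b after the a
plus = [whole(r"ab+", s) for s in samples]
# b? : at most one b after the a
opt = [whole(r"ab?", s) for s in samples]

print(star, plus, opt)
```

The bare "a" is rejected only by the plus sign, and "abbb" is rejected only by the question mark.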
Alternatively, the atom may be bound, which means it is followed by curly braces, { and
}, that contain integers. If the curly braces contain a single number, then the atom must
be matched exactly that number of times. If the curly braces contain a number followed
by a comma, the atom must be matched that number of times or more. If the curly braces
contain two numbers separated by a comma, the atom must match at least the first
number of times, but not more than the second number. See Table 16-2 for some
examples of repetition.
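The three forms of a bound can be checked against a short list of candidate strings. This sketch again uses Python's re module, assuming its curly-brace syntax matches the description above; accepted and candidates are hypothetical names.

```python
import re

def accepted(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Candidates: an a followed by zero to five b's.
candidates = ["a" + "b" * n for n in range(6)]

exactly_three = accepted(r"ab{3}", candidates)    # single number
two_or_more = accepted(r"ab{2,}", candidates)     # number followed by a comma
two_to_four = accepted(r"ab{2,4}", candidates)    # two numbers

print(exactly_three, two_or_more, two_to_four)
```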
An atom is a series of characters, some having special meaning, others simply standing
for a character that must be matched. A period (.) matches any single character. A caret
(^) matches the beginning of the string. A dollar sign ($) matches the end of the string. If
you need to match one of the special characters (^ . [ ] $ ( ) | * + ? { } \), put a
backslash in front of it. In fact, any character preceded by a backslash will be treated
literally, even if it has no special meaning. Any character with no special meaning will be
considered just a character to be matched, backslash or not. You may also group atoms
with parentheses so that the group is treated as a single atom.
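The special characters, escaping, and grouping can be illustrated with a few one-line checks. This is a sketch in Python's re module, assuming it agrees with the rules above for these cases. It escapes only punctuation, because Python gives some backslash-letter pairs a meaning of their own. All variable names are hypothetical.

```python
import re

# A period matches any single character.
dot = re.search(r"b.t", "a bat") is not None

# ^ and $ tie the pattern to the start and end of the string.
anchored = re.search(r"^apple$", "apple pie") is not None
unanchored = re.search(r"apple", "apple pie") is not None

# A backslash makes the period literal.
literal_dot = re.search(r"3\.14", "3x14") is not None
any_dot = re.search(r"3.14", "3x14") is not None

# Parentheses turn "ab" into one atom, so + repeats the pair.
grouped = re.fullmatch(r"(ab)+", "ababab") is not None
ungrouped = re.fullmatch(r"ab+", "ababab") is not None

print(dot, anchored, unanchored, literal_dot, any_dot, grouped, ungrouped)
```

Without the parentheses, the plus sign applies only to the b, so "ababab" is no longer matched in full.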
Table 16-1. Branches in a Regular Expression
Sample            Description
apple             Matches the word apple.
apple|ball        Matches either apple or ball.
begin|end|break   Matches either begin, end, or break.
Table 16-2. Allowing Repetition of Patterns in Regular Expressions
Sample      Description
a(b*)       Matches a, ab, abb, and so on: an a followed by any number of b's.
a(b+)       Matches ab, abb, abbb, and so on: an a followed by one or more b's.
a(b?)       Matches either a or ab: an a possibly followed by a b.
a(b{3})     Matches only abbb.
a(b{2,})    Matches abb, abbb, abbbb, and so on: an a followed by two or more b's.
a(b{2,4})   Matches abb, abbb, abbbb: an a followed by two to four b's.