590 APPENDIX A Languages and Regular Sets
Definition 6 allows identifiers to be of any finite length. Although a compiler will
normally have a maximum allowable length, this part of the theory works out much more
nicely if no maximum is assumed. Actually, a few strings of letters are not legal identifiers
because of their restricted use in a language. These words are called reserved words, and
they may only be used as intended, not as identifiers. The definition above can be modified
to show that the actual set of identifiers is a regular set.
Example 4.
(a) An integer constant consists of a sequence of one or more digits, possibly preceded by
a + or - sign, such as 3 or +9876543210. The set of integer constants is also regular.
Let
UnsignedInteger = (Digit) +
which is regular, since Digit is regular. Then,
Integer = {±, =, A } (UnsignedInteger+)
Since {+, _, A} is finite, and, thus, regular, Integer is regular.
(b) One way to write a real number constant uses an optional sign (± or -), followed by
one or more digits, followed by a decimal point, followed by one or more digits, such as
in -3.1415926535. The set of strings of one or more digits was called UnsignedInteger
above, and the set of real number constants of this form is
SimpleReal = Integer._Unsignedlnteger
which is regular.
(c) There are two other ways to write a real constant, both of which use scientific notation.
The number 6.02 x 1023 would be expressed as 6.02E23, with the E preceding the
power of 10. It could also be written as 6.02E+23. This form has what was called a
SimpleReal followed by E, followed by an integer. So, the set of all constants of this
form is SimpleRealEInteger, which is regular. The set of strings of the third form is
IntegerEInteger, which is also regular. Finally, the set of reals is
(SimpleReal) U (SimpleRealEInteger) U (IntegerEInteger)
which is also regular.
How is all this used in compiling a computer program? The language designer for
each type of token in the language must define what its (regular) set of possible values are,
often using something called regular set notation. The first phase of a compiler is called a
scanner or lexical analyzer. It scans the input program for tokens. At each step, it takes the
longest string of symbols that is in any of those regular sets. Suppose it has come to a line
in the program Jones <= Smith: J, Jo, Jon, and Jones are all legal variable names, and no
token may include a blank. So, the scanner identifies Jones as a variable name-actually, it
identifies Jones as an identifier. Another part of the compiler decides whether this identifier
names a variable, a subprogram, or whatever else. The symbols <= and < are both legal
comparison operations, so it chooses <=, the longer, as its next token. Finally, Smith is a
legal token, but Smith, is not. So, it chooses Smith and ý as its final tokens.