Discrete Mathematics for Computer Science

590 APPENDIX A Languages and Regular Sets

Definition 6 allows identifiers to be of any finite length. Although a compiler will normally have a maximum allowable length, this part of the theory works out much more nicely if no maximum is assumed. Actually, a few strings of letters are not legal identifiers because of their restricted use in a language. These words are called reserved words, and they may only be used as intended, not as identifiers. The definition above can be modified to show that the actual set of identifiers is a regular set. Example 4. (a) An integer constant consists of a sequence of one or more digits, possibly preceded by a + or - sign, such as 3 or +9876543210. The set of integer constants is also regular. Let UnsignedInteger = (Digit) + which is regular, since Digit is regular. Then, Integer = {±, =, A } (UnsignedInteger+) Since {+, _, A} is finite, and, thus, regular, Integer is regular. (b) One way to write a real number constant uses an optional sign (± or -), followed by one or more digits, followed by a decimal point, followed by one or more digits, such as in -3.1415926535. The set of strings of one or more digits was called UnsignedInteger above, and the set of real number constants of this form is

SimpleReal = Integer._Unsignedlnteger

which is regular. (c) There are two other ways to write a real constant, both of which use scientific notation. The number 6.02 x 1023 would be expressed as 6.02E23, with the E preceding the power of 10. It could also be written as 6.02E+23. This form has what was called a SimpleReal followed by E, followed by an integer. So, the set of all constants of this form is SimpleRealEInteger, which is regular. The set of strings of the third form is IntegerEInteger, which is also regular. Finally, the set of reals is

(SimpleReal) U (SimpleRealEInteger) U (IntegerEInteger) which is also regular. How is all this used in compiling a computer program? The language designer for each type of token in the language must define what its (regular) set of possible values are, often using something called regular set notation. The first phase of a compiler is called a scanner or lexical analyzer. It scans the input program for tokens. At each step, it takes the longest string of symbols that is in any of those regular sets. Suppose it has come to a line

in the program Jones <= Smith: J, Jo, Jon, and Jones are all legal variable names, and no

token may include a blank. So, the scanner identifies Jones as a variable name-actually, it identifies Jones as an identifier. Another part of the compiler decides whether this identifier

names a variable, a subprogram, or whatever else. The symbols <= and < are both legal

comparison operations, so it chooses <=, the longer, as its next token. Finally, Smith is a

legal token, but Smith, is not. So, it chooses Smith and ý as its final tokens.

Discrete Mathematics for Computer Science

SimpleReal = Integer._Unsignedlnteger

in the program Jones <= Smith: J, Jo, Jon, and Jones are all legal variable names, and no

names a variable, a subprogram, or whatever else. The symbols <= and < are both legal

comparison operations, so it chooses <=, the longer, as its next token. Finally, Smith is a

Get our desktop app

Company

Features

Documentation

Resources