588 APPENDIX A Languages and Regular Sets
the set of words to be the set of all legal tokens of the language where a token in a computer
language corresponds roughly to a word or a punctuation mark of a human language. For
example, the program
program Programmer (input, output);
begin
write (3 <= 4)
end.
contains the tokens
program Programmer ( ut output )
beg write t 3 <= 4 1 end
The traditional compiler has a procedure called a scanner that divides the program
into tokens. Here, as noted above, the alphabet E is the set of all characters that can be
typed onto a computer screen except for space, which only divides tokens.
Another very different way to look at a language is to define it to be the set of gram-
matically correct sentences. Now, the alphabet is the set of words of the language plus
the set of punctuation marks, and the strings are called sentences. Words are separated by
spaces. Punctuation marks also may need to be separated by spaces, depending on context.
So, the following string contains five tokens:
Wherefore art thou Romeo?
with the last token being the question mark. Grammatical correctness is determined by the
grammar, or syntax, of the language.
Similarly, a computer language can be defined to be the set of syntactically correct
programs in the language. The alphabet is the set of (grammatical categories of) tokens.
The program Programmer is a string of 10 (categories of) tokens. The traditional compiler
has a procedure called a parser that does most of the grammatical checking.
We have described two very different ways of looking at a language, neither of which
has mentioned meaning! English could also be defined to be the set of grammatically cor-
rect and meaningful sentences of English. For example, Chomsky's example Colorless
green ideas sleep furiously is in the set of grammatically correct sentences but is not in the
set of meaningful sentences.
As an illustration of a mathematical formalism, we return to the question of how you
describe the words in a language. Identifying an approximation to the words of English
is relatively easy: Just specify English as the set of all words listed in the Oxford English
Dictionary. For a computer language, this identification becomes harder. A computer pro-
gram called a compiler must read a program in one language and translate it into another
(a machine language). A part of the compiler divides a computer program into tokens. The
rules for what is a token must be clear. In theory, the set of possible tokens in most com-
puter languages is infinite. Although the number of actual tokens allowed by almost any
actual compiler is finite (for example, variable names of 5 million characters usually are
not acceptable), having a dictionary of all possible tokens and looking up candidate to-
kens in the dictionary would be prohibitively expensive. Fortunately, there is a theoretical
concept-that of a regular set-that elegantly describes the tokens of many languages.
Definition 4. Let x be a string over a finite alphabet E. For, n G N, define x^0 = A and
Xn+i = X .Xn.