Discrete Mathematics for Computer Science

588 APPENDIX A Languages and Regular Sets

the set of words to be the set of all legal tokens of the language where a token in a computer language corresponds roughly to a word or a punctuation mark of a human language. For example, the program program Programmer (input, output); begin write (3 <= 4) end.

contains the tokens program Programmer ( ut output ) beg write t 3 <= 4 1 end The traditional compiler has a procedure called a scanner that divides the program into tokens. Here, as noted above, the alphabet E is the set of all characters that can be typed onto a computer screen except for space, which only divides tokens. Another very different way to look at a language is to define it to be the set of grammatically correct sentences. Now, the alphabet is the set of words of the language plus the set of punctuation marks, and the strings are called sentences. Words are separated by spaces. Punctuation marks also may need to be separated by spaces, depending on context. So, the following string contains five tokens:

Wherefore art thou Romeo?

with the last token being the question mark. Grammatical correctness is determined by the grammar, or syntax, of the language. Similarly, a computer language can be defined to be the set of syntactically correct programs in the language. The alphabet is the set of (grammatical categories of) tokens. The program Programmer is a string of 10 (categories of) tokens. The traditional compiler has a procedure called a parser that does most of the grammatical checking. We have described two very different ways of looking at a language, neither of which has mentioned meaning! English could also be defined to be the set of grammatically correct and meaningful sentences of English. For example, Chomsky's example Colorless green ideas sleep furiously is in the set of grammatically correct sentences but is not in the set of meaningful sentences. As an illustration of a mathematical formalism, we return to the question of how you describe the words in a language. Identifying an approximation to the words of English is relatively easy: Just specify English as the set of all words listed in the Oxford English Dictionary. For a computer language, this identification becomes harder. A computer program called a compiler must read a program in one language and translate it into another (a machine language). A part of the compiler divides a computer program into tokens. The rules for what is a token must be clear. In theory, the set of possible tokens in most computer languages is infinite. Although the number of actual tokens allowed by almost any actual compiler is finite (for example, variable names of 5 million characters usually are not acceptable), having a dictionary of all possible tokens and looking up candidate tokens in the dictionary would be prohibitively expensive. Fortunately, there is a theoretical

concept-that of a regular set-that elegantly describes the tokens of many languages.

Definition 4. Let x be a string over a finite alphabet E. For, n G N, define x^0 = A and

Xn+i = X .Xn.

Discrete Mathematics for Computer Science

concept-that of a regular set-that elegantly describes the tokens of many languages.

Definition 4. Let x be a string over a finite alphabet E. For, n G N, define x^0 = A and

Get our desktop app

Company

Features

Documentation

Resources