7.1. Lexical Elements
One of the first phases of compilation is the scanning of the lexical elements into tokens. This phase ignores
whitespace and comments that appear in the text, so the language must define what form whitespace and
comments take. The remaining sequence of characters must then be parsed into tokens.
7.1.1. Character Set
Most programmers are familiar with source code that is prepared using one of two major families of character
representations: ASCII and its variants (including Latin-1) and EBCDIC. Both character sets contain
characters used in English and several other Western European languages.
The Java programming language, on the other hand, is written in a 16-bit encoding of Unicode. The Unicode
standard originally supported a 16-bit character set, but has expanded to allow for up to 21-bit characters with
a maximum value of 0x10ffff. The characters above the value 0x00ffff are termed the supplementary
characters. Any particular 21-bit value is termed a code point. To allow all characters to be represented by
16-bit values, Unicode defines an encoding format called UTF-16, and this is how the Java programming
language represents text. In UTF-16 all the values between 0x0000 and 0xffff map directly to Unicode
characters. The supplementary characters are encoded by a pair of 16-bit values: The first value in the pair
comes from the high-surrogates range, and the second comes from the low-surrogates range. Methods that
want to work with individual code point values can either accept a UTF-16 encoded char[] of length two, or
a single int that holds the code point directly. An individual char in a UTF-16 sequence is termed a code
unit.
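As a minimal sketch (assuming a Java 5 or later environment; the class name and the choice of U+1F600 as a sample supplementary character are purely illustrative), the following contrasts a code point held in an int with the pair of code units that encodes it in UTF-16:

public class CodePointDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F600;                     // an example supplementary character
        char[] units = Character.toChars(codePoint); // its UTF-16 encoding: a surrogate pair

        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        System.out.println(units.length);                                  // 2
        System.out.println(Character.isHighSurrogate(units[0]));           // true
        System.out.println(Character.isLowSurrogate(units[1]));            // true

        String s = new String(units);
        System.out.println(s.length());                       // 2 -- counts code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 -- counts code points
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600
    }
}

Note how String.length counts char code units, while the codePoint methods work in terms of whole characters.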
The first 256 characters of Unicode are the Latin-1 character set, and most of the first 128 characters of
Latin-1 are equivalent to the 7-bit ASCII character set. Current environments read ASCII or Latin-1 files,
converting them to Unicode on the fly.[1]
[1] The Java programming language tracks the Unicode standard. See "Further Reading" on
page 755 for reference information. The currently supported Unicode version is listed in the
documentation of the Character class.
Few existing text editors support Unicode characters, so you can use the escape sequence \uxxxx to encode
Unicode characters, where each x is a hexadecimal digit (0-9, and a-f or A-F to represent the decimal values 10-15).
This sequence can appear anywhere in code, not only in character and string constants but also in identifiers.
More than one u may appear at the beginning; thus, the character இ can be written as \u0b87 or
\uuu0b87.[2] Also note that if your editor does support Unicode characters (or a subset), you may need to
tell your compiler if your source code contains any character that is not part of the default character encoding
for your system, such as through a command-line option that names the source character set.
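As a small, purely illustrative sketch (the class and field names are invented for the example), the escape \u0b87 can appear both inside a string literal and inside an identifier, because escapes are translated before the source is tokenized:

public class EscapeDemo {
    static int \u0b87Count = 3;            // an identifier containing the character இ

    public static void main(String[] args) {
        String s = "\u0b87";               // a string holding the single character இ
        System.out.println(s.length());    // prints 1
        System.out.println(\u0b87Count);   // prints 3
    }
}

If you save such a file with the character இ itself rather than the escape, you would name the file's character set when compiling, for example with javac's -encoding option.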
[2] There is a good reason to allow multiple u's. When translating a Unicode file into an
ASCII file, you must translate Unicode characters that are outside the ASCII range into an
escape sequence. Thus, you would translate இ into \u0b87. When translating back, you
make the reverse substitution. But what if the original Unicode source had not contained
இ but had used \u0b87 instead? Then the reverse translation would not result in the original
source (to the parser, it would be equivalent, but possibly not to the reader of the code). The
solution is to have the translator add an extra u when it encounters an existing \uxxxx, and
have the reverse translator remove a u and, if there aren't any left, replace the escape
sequence with its equivalent Unicode character.
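The following sketch illustrates that rule (it is not the behavior of any particular tool, and the method names are invented): the forward translation escapes non-ASCII characters and adds one u to any escape already present; the reverse translation drops one u, decoding the escape only when a single u remains.

static String toAscii(String source) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        char c = source.charAt(i);
        if (c == '\\' && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
            out.append("\\u");   // existing escape: emit the backslash plus one extra u;
                                 // the original u's and hex digits are copied by later iterations
        } else if (c > '\u007f') {
            out.append(String.format("\\u%04x", (int) c));  // escape a non-ASCII character
        } else {
            out.append(c);       // plain ASCII passes through unchanged
        }
    }
    return out.toString();
}

static String fromAscii(String ascii) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < ascii.length(); i++) {
        char c = ascii.charAt(i);
        if (c == '\\' && i + 1 < ascii.length() && ascii.charAt(i + 1) == 'u') {
            int j = i + 1;
            while (j < ascii.length() && ascii.charAt(j) == 'u') j++;  // count the u's
            String hex = ascii.substring(j, j + 4);
            if (j - i == 2) {                                 // exactly one u: decode it
                out.append((char) Integer.parseInt(hex, 16));
            } else {                                          // more than one u: drop one
                out.append(ascii, i, j - 1).append(hex);
            }
            i = j + 3;                                        // skip the four hex digits
        } else {
            out.append(c);
        }
    }
    return out.toString();
}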
Exercise 7.1: Just for fun, write a "Hello, World" program entirely using Unicode escape sequences.