Character Sets
Web Design in a Nutshell, eMatter Edition
the language. In addition, a single glyph may correspond to different characters:
a comma, for example, serves both as the punctuation mark for a pause in a
sentence and as a decimal indicator in some languages.
The number of characters available in a character set is limited by the bit depth of
its encoding. For example, 8 bits can describe 256 (2^8) unique characters,
which is enough for most Western languages.
HTML 2.0 and 3.2 are based on the 8-bit character set for Western languages called
Latin-1 (or ISO 8859-1). There are a number of other 8-bit encodings as well,
including ISO 8859 variants for Cyrillic, Arabic, Greek, and Hebrew.
16-Bit Encoded Character Sets
Sixteen bits of information can represent 65,536 (2^16) different characters,
enough to contain a large number of alphabets and ideographs. In 1991,
the Unicode Consortium created a 16-bit encoded “super” character set called
Unicode (practically identical to another standard called ISO 10646-1), which
includes nearly every character from the world’s writing systems. Each character is
assigned a unique two-octet code (two groups of 8 bits, making 16 bits total). The
first 256 slots are given to the ISO 8859-1 character set, so Unicode is backward
compatible with Latin-1.
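The arithmetic and the Latin-1 compatibility described above can be checked with a short Python sketch. It uses only Python's built-in codecs. The names code_space, latin1_matches_unicode, and two_octets are illustrative, and the "utf-16-be" codec stands in here for the plain two-octet code the text describes; that substitution is an assumption made for the example.

```python
# Number of distinct values an n-bit code can represent.
def code_space(bits: int) -> int:
    return 2 ** bits

eight_bit = code_space(8)      # size of an 8-bit character set
sixteen_bit = code_space(16)   # size of a 16-bit character set

# Every Latin-1 (ISO 8859-1) byte value decodes to the Unicode character
# with the same number, so the first 256 slots of the two sets coincide.
latin1_matches_unicode = all(
    ord(bytes([value]).decode("latin-1")) == value
    for value in range(256)
)

# Two-octet, big-endian code for the copyright sign (U+00A9).
# Assumption: "utf-16-be" is used as a stand-in for a 16-bit code.
two_octets = "\u00a9".encode("utf-16-be")

print(eight_bit, sixteen_bit, latin1_matches_unicode, two_octets)
```

The high octet of a Latin-1 character is zero and the low octet is the original Latin-1 byte, which is what makes the first 256 slots backward compatible.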
The HTML 4.0 Specification officially adopts Unicode as its document character
set. So regardless of the character encoding used when a document was created, the
browser converts it to the document character set. The browser then interprets
characters with special meaning in HTML (such as < and >) and expands character
entities (such as &copy; for ©). In cases where a character entity points
outside of the Latin-1 character set (e.g., &pi; for π), HTML 4.0 browsers
use the Unicode character set to display the correct character.
This is the first step toward making the Web truly multilingual.
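As a rough illustration of entity expansion, the sketch below uses Python's standard html.unescape function to turn entities into Unicode characters, then checks whether each result falls inside the 256-slot Latin-1 range. The variable names are illustrative, and this shows only the mapping, not how any particular browser implements it.

```python
import html

# Expand named and numeric character entities into Unicode characters.
copyright_sign = html.unescape("&copy;")   # U+00A9
pi_letter = html.unescape("&pi;")          # U+03C0
numeric_pi = html.unescape("&#960;")       # decimal reference, same character

# Code points below 256 are inside Latin-1; anything higher needs Unicode.
copyright_in_latin1 = ord(copyright_sign) < 256
pi_in_latin1 = ord(pi_letter) < 256

print(f"U+{ord(copyright_sign):04X}", copyright_in_latin1)
print(f"U+{ord(pi_letter):04X}", pi_in_latin1)
```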
Incidentally, Bitstream has created a TrueType font called “Cyberbit” that contains
a large percentage of the Unicode character set. For more information about
Cyberbit, see Bitstream’s site, http://www.bitstream.com/news/press/1997/pr-mar10.html.
Specifying Character Encoding
The external character encoding for a document is communicated between
browser and server within the HTTP header of the document, as follows:
Content-type: text/html; charset=ISO-8859-8
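A client receiving a header like the one above has to pull out the charset parameter before it can decode the body. The following is a minimal sketch of that step using Python's standard email.message.Message class to parse the header value. The raw_body bytes are a made-up sample (the Hebrew word "shalom" in ISO 8859-8), not data from the original text.

```python
from email.message import Message

# Header value as it would arrive from the server.
header_value = "text/html; charset=ISO-8859-8"

msg = Message()
msg["Content-Type"] = header_value

media_type = msg.get_content_type()    # media type without parameters
charset = msg.get_content_charset()    # charset parameter, lowercased

# Hypothetical response body: four bytes encoded in ISO 8859-8.
raw_body = b"\xf9\xec\xe5\xed"
text = raw_body.decode(charset)

print(media_type, charset, [f"U+{ord(ch):04X}" for ch in text])
```

Decoding the same bytes with the wrong charset would not raise an error for most 8-bit encodings; it would silently yield different characters, which is why the header matters.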
To deliberately set the character-encoding information in a document header, use
the <meta> tag with its http-equiv attribute (which adds its values into the
Encoding      Language
ISO 8859-5    Cyrillic
ISO 8859-6    Arabic
ISO 8859-7    Greek
ISO 8859-8    Hebrew
SHIFT_JIS     Japanese
EUC-JP        Japanese
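The ISO 8859 encodings listed above reuse the same upper byte values for different scripts. The sketch below decodes one byte (0xE0, chosen arbitrarily) under several of them using Python's built-in codecs and reports the resulting Unicode code points. The codec names are Python's spellings of the ISO names; the variable names are illustrative.

```python
# One byte value, interpreted under several 8-bit encodings.
byte = bytes([0xE0])

encodings = ["iso8859_1", "iso8859_5", "iso8859_6", "iso8859_7", "iso8859_8"]

# Decode the byte under each encoding and record the Unicode code point.
code_points = {
    name: f"U+{ord(byte.decode(name)):04X}"
    for name in encodings
}

for name, point in code_points.items():
    print(name, point)
```

The same byte lands on a Latin, Cyrillic, Arabic, Greek, or Hebrew character depending on which encoding is assumed.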