A Complete Guide to Web Design


Character Sets

Web Design in a Nutshell, eMatter Edition

the language. In addition, a single glyph may correspond to different characters,
such as a comma serving both as the punctuation symbol for a pause in a
sentence and as a decimal indicator in some languages.


The number of characters available in a character set is limited by the bit depth of
its encoding. For example, 8 bits can describe 256 unique characters,
which is enough for most Western languages.
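
The arithmetic behind that limit can be sketched in a few lines of Python. The
helper name charset_capacity is hypothetical and chosen only for this
illustration; the example characters are likewise arbitrary.

```python
def charset_capacity(bits: int) -> int:
    """Number of distinct codes available with `bits` bits per character."""
    return 2 ** bits


for bits in (7, 8, 16):
    print(f"{bits:>2} bits -> {charset_capacity(bits):,} possible characters")

# An 8-bit Western set such as Latin-1 has a slot for "é" but none for "日".
print("é".encode("latin-1"))  # fits in a single byte
try:
    "日".encode("latin-1")
except UnicodeEncodeError:
    print("'日' has no slot in Latin-1")
```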


HTML 2.0 and 3.2 are based on the 8-bit character set for Western languages called
Latin-1 (or ISO 8859-1). There are a number of other 8-bit encodings,
including:


16-Bit Encoded Character Sets


Sixteen bits of information are capable of representing 65,536 (2^16) different
characters, enough to contain a large number of alphabets and ideographs. In 1991,
the Unicode Consortium created a 16-bit encoded “super” character set called
Unicode (practically identical to another standard called ISO 10646-1), which
includes nearly every character from the world’s writing systems. Each character is
assigned a unique two-octet code (two groups of 8 bits making 16 bits total). The
first 256 slots are given to the ISO 8859-1 character set, so it is backwards
compatible.
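
As a rough check of the backwards-compatibility claim, the following Python
sketch decodes every 8-bit value as ISO 8859-1 and compares it with the Unicode
code point of the same number. The function two_octet_code is a hypothetical
helper written for this illustration, not part of any standard library; its
range guard reflects that later versions of Unicode also define characters
beyond 16 bits.

```python
# Each byte value 0-255, decoded as ISO 8859-1, should yield the Unicode
# code point with the same number.
latin1_matches = all(
    ord(bytes([value]).decode("iso-8859-1")) == value for value in range(256)
)
print(latin1_matches)


def two_octet_code(char: str) -> bytes:
    """Return the 16-bit, big-endian code for a single character."""
    code = ord(char)
    if code > 0xFFFF:
        raise ValueError("character does not fit in 16 bits")
    return code.to_bytes(2, "big")


print(two_octet_code("A").hex())  # 0041
print(two_octet_code("π").hex())  # 03c0
```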


The HTML 4.0 Specification officially adopts Unicode as its document character
set. Regardless of the character encoding used when a document was created, the
browser converts it to the document character set, interprets characters with
special meaning in HTML (such as < and >), and converts character entities
(such as &copy; for ©). In cases where a character entity points outside of
the Latin-1 character set (e.g., &#982; for ϖ), HTML 4.0 browsers will
use the Unicode character set to display the correct character.
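
The entity conversion step can be imitated with the unescape function from
Python's standard html module. This is only an approximation of what a browser
does, and resolve_entities is a hypothetical wrapper name used for the example.

```python
import html


def resolve_entities(text: str) -> str:
    """Replace named and numeric character entities with Unicode characters."""
    return html.unescape(text)


# A named entity that falls inside Latin-1 (code point 169).
print(resolve_entities("&copy; 1999"))

# A numeric entity that points outside Latin-1 (code point 982).
symbol = resolve_entities("&#982;")
print(symbol, ord(symbol))
```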


This is the first step toward making the Web truly multilingual.


Incidentally, Bitstream has created a TrueType font called “Cyberbit” that contains
a large percentage of the Unicode character set. For more information about
Cyberbit, see Bitstream’s site, http://www.bitstream.com/news/press/1997/pr-mar10.html.


Specifying Character Encoding


The external character encoding for a document is communicated between
browser and server within the HTTP header of the document, as follows:


Content-type: text/html; charset=ISO-8859-8
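
On the receiving side, the charset parameter has to be pulled out of that
header before the body can be decoded. A minimal Python sketch follows, using
the standard library's email.message.Message to parse the header value. The
function name charset_from_content_type and the ISO 8859-1 fallback are
assumptions made for this example, and the Hebrew sample text is arbitrary.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or None.
    return msg.get_content_charset() or default


charset = charset_from_content_type("text/html; charset=ISO-8859-8")
print(charset)

# Encode a short Hebrew string with the declared charset, then decode it
# again the way a receiver that honors the header would.
body = "\u05e9\u05dc\u05d5\u05dd".encode(charset)
print(len(body), body.decode(charset))
```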

To deliberately set the character-encoding information in a document header, use
the <meta> tag with its http-equiv attribute (which adds its values into the


ISO 8859-5    Cyrillic
ISO 8859-6    Arabic
ISO 8859-7    Greek
ISO 8859-8    Hebrew
SHIFT_JIS     Japanese
EUC-JP        Japanese
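
To see why the declared encoding matters, the Python sketch below decodes one
and the same byte under several of the encodings listed above. The byte value
0xE0 is an arbitrary choice for illustration, and the codec names are the ones
Python's standard library uses for these encodings.

```python
RAW = bytes([0xE0])  # one byte, interpreted five different ways
ENCODINGS = ["iso-8859-1", "iso-8859-5", "iso-8859-6", "iso-8859-7", "iso-8859-8"]

decoded = {name: RAW.decode(name) for name in ENCODINGS}
for name, char in decoded.items():
    print(f"{name}: {char} (U+{ord(char):04X})")

# The two Japanese encodings spend more than one byte on most characters.
for name in ("shift_jis", "euc_jp"):
    print(name, len("日本".encode(name)), "bytes for 2 characters")
```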