Advanced Rails - Building Industrial-Strength Web Apps in Record Time


These differences are encapsulated in the concept of locale. A locale is usually
defined as a language plus a country or region. It includes not only language but also
regional and local preferences and possibly a character encoding. A POSIX-style
locale identifier looks like en_US.UTF-8 (English, United States, UTF-8 character
encoding).
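
To make the structure of such an identifier concrete, here is a minimal Ruby sketch that splits it into language, territory, and encoding. The method name parse_locale and the constant LOCALE_PATTERN are hypothetical, introduced only for illustration; the pattern is a simplification that ignores POSIX modifiers (such as @euro) and special locales like C.

```ruby
# Hypothetical helper, not part of Ruby or Rails: a simplified pattern for
# identifiers of the form language[_TERRITORY][.encoding].
LOCALE_PATTERN = /\A(?<language>[a-z]{2,3})(?:_(?<territory>[A-Z]{2}))?(?:\.(?<encoding>[A-Za-z0-9_-]+))?\z/

# Returns a hash of the parts, or nil if the identifier does not match.
# Parts that are absent from the identifier come back as nil.
def parse_locale(identifier)
  match = LOCALE_PATTERN.match(identifier)
  return nil unless match

  {
    language: match[:language],
    territory: match[:territory],
    encoding: match[:encoding]
  }
end

locale = parse_locale("en_US.UTF-8")
puts locale[:language]   # en
puts locale[:territory]  # US
puts locale[:encoding]   # UTF-8
```

Only the language is required by this sketch, so parse_locale("fr") yields a language with no territory or encoding.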


Character Encodings


One of the most fundamental topics in i18n is the concept of a character encoding or
character set.* Computers work with numbers; people work with characters. A
character encoding maps one to the other. This is simple enough. The difficulty comes,
as it usually does, because of history.
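
The mapping between numbers and characters can be seen directly from Ruby. The following is a small sketch using the core methods String#ord, Integer#chr, and String#bytes; the helpers to_codes and from_codes are hypothetical names chosen for this example, and the sketch assumes ASCII-only input.

```ruby
# String#ord gives the number behind a character;
# Integer#chr goes the other way.
puts "A".ord   # 65
puts 65.chr    # A

# Hypothetical helpers for this example (ASCII-only text assumed).
def to_codes(text)
  text.bytes
end

def from_codes(codes)
  codes.map(&:chr).join
end

puts to_codes("Rails").inspect               # [82, 97, 105, 108, 115]
puts from_codes([82, 97, 105, 108, 115])     # Rails
```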


At the time of this writing, ASCII is nearing its 45th birthday; yet we still see its legacy
today. This should not surprise anyone; data is usually the most long-lived part of a
computing system. As networking protocols and storage formats are built on top of
a character encoding, it should not be a surprise that the character encoding would
be among the most deeply entrenched and hardest to change parts of a protocol stack.


ASCII


ASCII, the American Standard Code for Information Interchange, was one of the first
character encodings to gain widespread use; it was introduced in 1963 and first
standardized in 1967. Most encodings in use today descend from ASCII.


The ASCII standard (ANSI X3.4-1986) defines 128 characters. The first 32
characters (with hex values 0 through 1F) and the last character (7F) are nonprinting
control characters. The remainder (20 through 7E) are printable. The control characters
have largely lost their original meaning, but the printable characters are nearly
always the same. The standard ASCII table is as follows.



  * A character set is a collection of characters (such as Unicode), while a character encoding is a mapping of a
    character set to a stream of bytes. For the older character sets such as ASCII, the two terms can generally be
    conflated.


    x0  x1  x2  x3  x4  x5  x6  x7  x8  x9  xA  xB  xC  xD  xE  xF
0x  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1x  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2x  SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3x  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4x  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5x  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6x  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7x  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
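
The ranges described above, and the way the table is indexed, can be checked with a short Ruby sketch. The method names ascii_control?, ascii_printable?, and ascii_at are hypothetical, introduced only for this illustration; the row label supplies the high hex digit of a code and the column label the low one.

```ruby
# Control characters: hex 00 through 1F, plus 7F (DEL).
def ascii_control?(code)
  (0x00..0x1F).cover?(code) || code == 0x7F
end

# Printable characters: hex 20 (space) through 7E (~).
def ascii_printable?(code)
  (0x20..0x7E).cover?(code)
end

# Look up one table cell: row is the high hex digit, column the low one.
def ascii_at(row, column)
  ((row << 4) | column).chr
end

puts ascii_at(0x4, 0x1)          # A
puts ascii_control?(0x0A)        # true (LF)
puts ascii_printable?("~".ord)   # true (7E)

control_count = (0x00..0x7F).count { |code| ascii_control?(code) }
printable_count = (0x00..0x7F).count { |code| ascii_printable?(code) }
puts control_count               # 33
puts printable_count             # 95
```

The two counts add up to the 128 characters that the standard defines.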