Advanced Rails - Building Industrial-Strength Web Apps in Record Time


238 | Chapter 8: i18n and L10n


Extended ASCII


Although ASCII defines 128 characters and a 7-bit encoding, most computers process
data in 8-bit bytes. This leaves room for 128 more characters. Of course, computer
vendors each chose their own way to deal with this situation. This led to the
development of numerous extended-ASCII character sets, each of which used a
different interpretation for the upper octets (80 through FF).
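
To make the "different interpretation" concrete, the following sketch reads one upper-range byte under three ISO 8859 character sets. It assumes a modern Ruby (1.9 or later), where strings carry an encoding; `interpret` is a hypothetical helper introduced only for this illustration.

```ruby
# A single byte from the upper range (hex 80 through FF).
raw = "\xE9".b

# Hypothetical helper: tag the raw bytes with a character set, then
# transcode to UTF-8 so the three readings can be compared.
def interpret(raw, charset)
  raw.dup.force_encoding(charset).encode("UTF-8")
end

latin1   = interpret(raw, "ISO-8859-1")  # Latin small letter e with acute
cyrillic = interpret(raw, "ISO-8859-5")  # Cyrillic small letter shcha
greek    = interpret(raw, "ISO-8859-7")  # Greek small letter iota

# Print the code point each reading produced.
[latin1, cyrillic, greek].each do |char|
  puts format("U+%04X", char.ord)
end
# prints U+00E9, U+0449, U+03B9
```

The byte never changes; only the label attached to it does, which is why the label matters so much.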


The most widely adopted extended-ASCII standard is ISO 8859. This standard
adopts the ASCII values for the first 128 characters, and provides 15 different “parts”
that each provide a definition for the last 128 characters. In effect, ISO 8859 defines
15 separate character sets.


The most used of these character sets is ISO-8859-1 (Latin-1). This provides nearly
complete coverage for most Western European languages. In fact, the 256 characters
defined by ISO-8859-1 correspond to the first 256 code points of Unicode.
ISO-8859-1 is still in widespread use among languages that use the Latin alphabet.
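
The correspondence with Unicode can be checked directly. This is a minimal sketch, assuming Ruby 1.9 or later and its built-in ISO-8859-1 transcoder; the variable names are illustrative.

```ruby
# Every possible byte value, 0 through 255, tagged as ISO-8859-1.
latin1_bytes = (0..255).to_a.pack("C*").force_encoding("ISO-8859-1")

# Transcode to UTF-8 and list the resulting Unicode code points.
code_points = latin1_bytes.encode("UTF-8").codepoints

# Each byte value maps to the code point with the same number.
puts code_points == (0..255).to_a
```

In other words, converting Latin-1 text to Unicode code points requires no lookup table at all.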


Problems with ASCII


Though the extended ASCII character encodings were widely successful for years,
they only provided a temporary fix. With so many encodings floating around, it is
difficult for people to communicate. It is not always possible to look at a sequence of
bytes and determine their character encoding; that information must be carried
out-of-band. The more potential character sets in use, the worse this problem becomes.
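
The sketch below illustrates why the bytes alone are not enough. It assumes Ruby 1.9 or later; the three-byte sample and the list of candidate character sets are chosen only for illustration.

```ruby
# Three bytes received with no label saying which character set they use.
raw = "\xE0\xE1\xE2".b
candidates = %w[ISO-8859-1 ISO-8859-5 ISO-8859-7]

# A validity check cannot narrow the field: every candidate accepts the bytes.
plausible = candidates.select do |charset|
  raw.dup.force_encoding(charset).valid_encoding?
end

# Yet each candidate yields different text (Latin, Cyrillic, and Greek letters).
readings = plausible.map do |charset|
  raw.dup.force_encoding(charset).encode("UTF-8")
end

puts plausible.length     # prints 3
puts readings.uniq.length # prints 3
```

All three readings are well formed, so only out-of-band information, such as a declared charset, can say which one the sender meant.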


Another problem with the use of ASCII or extended ASCII is that it has no support
for bidirectional, or bidi, text. Some written languages, such as Hebrew and Arabic,
are written primarily right-to-left (RTL). This causes problems in rendering systems
that were designed with left-to-right (LTR) text in mind. Bidirectional text, which
combines LTR and RTL within a page or paragraph, is usually impossible with
ASCII or extended ASCII.


The worst limitation of the extended-ASCII model is that it still only provides
support for a maximum of 256 characters. This is not nearly enough for East Asian
languages (the so-called CJK or CJKV languages, for Chinese, Japanese, Korean, and
Vietnamese), which are ideographic and can require tens of thousands of characters
for adequate coverage. There are several encodings that cover the CJKV languages
specifically, but they do not solve the general problem of having too many encodings.
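
The 256-character ceiling is easy to run into in practice. The following sketch assumes Ruby 1.9 or later, where string literals may contain `\u` escapes; the sample strings are illustrative.

```ruby
# Three ideographs (the word for "Japanese language"), written as escapes.
japanese = "\u65E5\u672C\u8A9E"

# Attempt to store the text in ISO-8859-1, which has only 256 slots.
begin
  japanese.encode("ISO-8859-1")
  fits = true
rescue Encoding::UndefinedConversionError
  # None of the ideographs has a slot in Latin-1.
  fits = false
end

puts fits # prints false

# A Western European word fits, at one byte per character.
cafe = "caf\u00E9".encode("ISO-8859-1")
puts cafe.bytesize # prints 4
```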


Unicode


The extended-ASCII model was successful for many years, and the ISO-8859
encodings provided a good way to support different world scripts. However, the
limitations became increasingly bothersome; multiple languages could not be supported
