Advanced Rails - Building Industrial-Strength Web Apps in Record Time


within one document, and the CJKV languages had their own independently
developed character sets and encodings. In addition, the Internet began to develop in the
1990s, connecting people and allowing them to exchange digital information with a
far greater reach than before.


So, in 1991, the Unicode Consortium published the first Unicode standard. Unicode
sought to be the “one true character set” in which all text would eventually be
represented. In large part, that goal is well on the way to being accomplished. Unicode is
a widely known, well-supported standard that is used extensively on the Internet and
in other forms of data exchange today.


Unicode supports all of the world’s writing systems currently in use and many
archaic ones, with very few exceptions. There is no “code page” switching as there
was under the old character-set systems. All of the scripts can be used
interchangeably within a document, and the encodings are universal; they can be exchanged
over the Internet without worrying too much about differing encodings.


Unicode deals with the world in Platonic ideals. Rather than representing glyphs (the
rendering of a character), each Unicode code point represents a grapheme (the
character abstracted from its representation).* This is consistent with the purpose of a
character encoding: to encode text without specifying presentation. For example, the
following two characters are the same grapheme and would be represented by the
same Unicode code point (U+0061, LATIN SMALL LETTER A), even though they
are different glyphs (see Figure 8-1).
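
As a rough illustration of the point above, the following Ruby sketch looks up the code point behind the letter "a" and then goes the other way, from code point back to text. The helper name code_point_label is hypothetical and exists only for this example; String#unpack and Array#pack with the "U" directive are standard Ruby. Nothing in the code knows or cares which font or glyph will eventually be used to draw the letter.

```ruby
# Hypothetical helper (not part of Ruby or Rails): format the first
# character's code point in the conventional U+XXXX notation.
def code_point_label(char)
  # unpack("U*") decodes the UTF-8 bytes into an array of code points.
  format("U+%04X", char.unpack("U*").first)
end

puts code_point_label("a")   # U+0061

# The reverse direction: build the character from its code point.
puts [0x0061].pack("U")      # a
```

Whatever typeface renders the result, the stored text is the same single code point.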


Though the distinction between graphemes and glyphs is relatively easy to make for
English, it can be very difficult and occasionally political for Han characters (the
ideographs common to CJKV languages).†


Unicode Transformation Formats


One of the key factors driving the adoption of Unicode is UTF-8 (8-bit Unicode
Transformation Format). UTF-8 has several clever features (some would call them
compromises) that make it attractive to those who are used to working with ASCII or
Latin-1 text:



* In this chapter, I use grapheme and character synonymously.


Figure 8-1. Alternative glyphs representing the “a” grapheme


† See http://en.wikipedia.org/wiki/Han_unification for one aspect of this situation.
