Advanced Rails - Building Industrial-Strength Web Apps in Record Time

Rails and Unicode

Unicode Normalization


As with any increasingly complicated encoding, normalization and canonicalization
are important issues with Unicode. A single representation on paper (or screen)
may correspond to several different sequences of code points. In some cases it is
desirable to treat those sequences identically; in others we may need to treat
them differently.


One complicating issue is character composition. Unicode provides multiple versions
of some characters, for various reasons. For example, the ö in the German word schön
can be encoded as either ö (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS)
or as the combination of o (U+006F LATIN SMALL LETTER O) and ̈ (U+0308
COMBINING DIAERESIS). The two representations use different byte sequences,
and therefore a byte-oriented comparison would not treat them as equal.
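
The difference is easy to see in a short Ruby sketch. The constant and method
names here are hypothetical, chosen only for illustration; the code assumes a
Ruby version whose string literals default to UTF-8.

```ruby
# Two ways to encode the same visible word, "schön".
SCHOEN_PRECOMPOSED = "sch\u00F6n"   # ö as the single code point U+00F6
SCHOEN_DECOMPOSED  = "scho\u0308n"  # o (U+006F) followed by U+0308

# Hypothetical helper: list a string's code points in U+XXXX notation.
def code_points(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

p code_points(SCHOEN_PRECOMPOSED)  # five code points, including U+00F6
p code_points(SCHOEN_DECOMPOSED)   # six code points, including U+006F U+0308

p SCHOEN_PRECOMPOSED.bytesize      # 6 bytes in UTF-8
p SCHOEN_DECOMPOSED.bytesize       # 7 bytes in UTF-8

# String#== compares the underlying bytes, so the two spellings differ.
p SCHOEN_PRECOMPOSED == SCHOEN_DECOMPOSED  # false
```

Both strings render identically with a capable font, which is why this kind of
mismatch is hard to spot by eye.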


Another example is compatibility characters, or characters that were introduced into
Unicode for compatibility with older encodings. One area where this occurs is
typographical ligatures (see Figure 8-2).


The text on the left does not use a ligature. For typographical reasons, the style on
the right is usually used for the combination of f and i. The original intent of
Unicode was that a smart rendering system would replace the consecutive code points f
and i with the appropriate ligature. However, many systems turned out not to be
capable of this advanced rendering (Mac OS X being a notable exception). Therefore,
common ligatures were given their own code points, so that they could be
embedded in a body of text and rendered (with a suitable font including those
ligatures) with a dumb client. In this case, the ligature “fi” is U+FB01 LATIN SMALL
LIGATURE FI.
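
As a small illustration of why this matters to application code, the following
Ruby sketch compares a word spelled with the ligature code point against the
same word spelled with plain letters. The constant names are hypothetical.

```ruby
WORD_WITH_LIGATURE = "\uFB01le"  # U+FB01 LATIN SMALL LIGATURE FI, then "le"
WORD_PLAIN         = "file"      # f and i as separate code points

p WORD_WITH_LIGATURE.length          # 3 characters
p WORD_PLAIN.length                  # 4 characters

# The ligature is a single code point, so a search for "fi" misses it.
p WORD_WITH_LIGATURE.include?("fi")  # false
p WORD_WITH_LIGATURE == WORD_PLAIN   # false
```

A naive text search or uniqueness check would therefore treat the two
spellings as unrelated words, even though they look the same on screen.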


To support character composition on platforms with less complex rendering
systems, Unicode includes precomposed characters, such as the ö shown earlier (U+00F6
LATIN SMALL LETTER O WITH DIAERESIS). Compatibility characters such as
the typographical ligatures are often precomposed. In order to properly compare and
collate strings that may include both combining characters and precomposed
characters, the strings must be canonicalized, or reduced to a well-known form such that
two strings that are “the same” (by some definition) will always map to the same
sequence of code points.
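
One way to sketch this idea is with String#unicode_normalize from Ruby's
standard library (available since Ruby 2.2), which is an assumption on our
part and not necessarily the mechanism this chapter relies on. The helper name
equivalent? is hypothetical. The normalization form passed in plays the role
of "some definition" of sameness: :nfc considers only canonical equivalence,
while :nfkc also folds compatibility characters such as ligatures.

```ruby
# Hypothetical helper: compare two strings after reducing both to the
# same Unicode normalization form (:nfc, :nfd, :nfkc, or :nfkd).
def equivalent?(a, b, form: :nfc)
  a.unicode_normalize(form) == b.unicode_normalize(form)
end

precomposed = "sch\u00F6n"   # ö as U+00F6
decomposed  = "scho\u0308n"  # o followed by U+0308 COMBINING DIAERESIS

p precomposed == decomposed               # false: different code points
p equivalent?(precomposed, decomposed)    # true: same text after NFC

ligature = "\uFB01"                       # LATIN SMALL LIGATURE FI
p equivalent?(ligature, "fi")             # false: NFC keeps the ligature
p equivalent?(ligature, "fi", form: :nfkc) # true: NFKC expands it to f + i
```

Which form is appropriate depends on the application: folding ligatures is
usually helpful for search and comparison, but it discards a distinction that
some texts may want to preserve.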


Figure 8-2. The “fi” sequence shown without a ligature and with a ligature
