Advanced Rails - Building Industrial-Strength Web Apps in Record Time


244 | Chapter 8: i18n and L10n


To canonicalize sequences of code points, we must first determine what our notion
of equivalence is. Unicode defines two types of equivalence: the narrow canonical
equivalence and the broader compatibility equivalence. Canonical equivalence is
limited to characters that are equal in both form and function. The standard example
is the decomposed ö (the two code points o and the combining diaeresis, U+0308)
versus the precomposed character ö (one code point, U+00F6). Two sequences of code
points that are canonically equivalent, such as those, are identical in appearance
and usage, and can in nearly all cases be substituted for each other.
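
As a rough illustration of canonical equivalence, the following sketch compares the two spellings of ö. It assumes a current Ruby (2.2 or later), where String#unicode_normalize is part of the standard library; that is not the ActiveSupport handler this chapter uses. The helper canonically_equivalent? is a hypothetical name introduced only for this example.

```ruby
# Assumption: Ruby 2.2+, which ships String#unicode_normalize.
PRECOMPOSED = "\u00F6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
DECOMPOSED  = "o\u0308"  # U+006F followed by U+0308 COMBINING DIAERESIS

# Hypothetical helper: two strings are treated as canonically equivalent
# when their canonical decompositions (NFD) are identical.
def canonically_equivalent?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

PRECOMPOSED == DECOMPOSED                        # => false (different code points)
canonically_equivalent?(PRECOMPOSED, DECOMPOSED) # => true
```

A plain == comparison sees two different code point sequences, which is why strings are usually normalized to one form before they are compared or stored.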


Compatibility equivalence is a broader concept. It includes all canonically
equivalent characters, plus characters that may have different semantics but are
rendered similarly. Examples include the characters f and i versus the ﬁ ligature
(U+FB01), or the superscript ² (U+00B2) versus the ordinary numeral 2.
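
The same kind of check can be sketched for compatibility equivalence. This again assumes Ruby 2.2 or later and its standard String#unicode_normalize; compatibility_equivalent? is a hypothetical helper name, not an ActiveSupport method.

```ruby
# Assumption: Ruby 2.2+, which ships String#unicode_normalize.
FI_LIGATURE     = "\uFB01"  # U+FB01 LATIN SMALL LIGATURE FI
SUPERSCRIPT_TWO = "\u00B2"  # U+00B2 SUPERSCRIPT TWO

# Hypothetical helper: compare compatibility decompositions (NFKD).
def compatibility_equivalent?(a, b)
  a.unicode_normalize(:nfkd) == b.unicode_normalize(:nfkd)
end

compatibility_equivalent?(FI_LIGATURE, "fi")     # => true
compatibility_equivalent?(SUPERSCRIPT_TWO, "2")  # => true

# Canonical decomposition leaves the ligature alone, so the ligature and
# the letters "fi" are not canonically equivalent.
FI_LIGATURE.unicode_normalize(:nfd) == "fi"      # => false
```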


There are four methods of Unicode normalization: D, C, KD, and KC. (They are also
referred to as NFD, NFC, NFKD, and NFKC, with NF standing for Normalization
Form.) The D forms leave the string in a decomposed form, while the C forms leave
the string canonically composed (by first decomposing, and then recomposing by
canonical equivalence). The K forms decompose by compatibility equivalence, while
those without a K decompose by canonical equivalence. (All composition is done
under canonical equivalence to ensure a consistent composition.)


ActiveSupport provides methods on the UTF-8 handler for Unicode normalization,
supporting all four forms. The following code shows the differences between the four
forms as applied to the string ﬁnal piñata. The first word includes the ﬁ ligature,
which is compatibility equivalent (but not canonically equivalent) to the separated
characters f and i. The second word includes the character ñ, which is both
compatibility equivalent and canonically equivalent to the code points n and the
combining tilde (U+0303).


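$KCODE = 'u'

# The first character is the ﬁ ligature (U+FB01), not the letters f and i.
str = "ﬁnal piñata".chars

# Canonical forms keep the ligature; compatibility (K) forms split it into f + i.
# Decomposed (D) forms split ñ into n + U+0303; composed (C) forms keep one ñ.
str.normalize(:d).to_s  # => "ﬁnal piñata"
str.normalize(:c).to_s  # => "ﬁnal piñata"
str.normalize(:kd).to_s # => "final piñata"
str.normalize(:kc).to_s # => "final piñata"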
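
Because the four results look almost identical in print, counting code points can make the differences easier to see. The sketch below assumes Ruby 2.2 or later and uses the standard library's String#unicode_normalize rather than the ActiveSupport chars proxy shown above; SAMPLE and normal_form_lengths are hypothetical names for this illustration.

```ruby
# Assumption: Ruby 2.2+, which ships String#unicode_normalize.
# Escapes are used so the ligature and the precomposed ñ are unambiguous.
SAMPLE = "\uFB01nal pi\u00F1ata"  # 11 code points as written

# Hypothetical helper: number of code points in each normalization form.
def normal_form_lengths(str)
  [:nfd, :nfc, :nfkd, :nfkc].map { |form|
    [form, str.unicode_normalize(form).length]
  }.to_h
end

normal_form_lengths(SAMPLE)
# => {:nfd=>12, :nfc=>11, :nfkd=>13, :nfkc=>12}
# NFD splits only ñ; NFKD also splits the ligature; NFKC splits the
# ligature but recomposes ñ, since composition is always canonical.
```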

Filtering UTF-8 Input


Although you may be UTF-8 clean through your entire system (UTF-8 text can be
entered anywhere and is displayed identically upon output), you are still at risk of
problems if you just accept user-provided strings as UTF-8. Users can provide invalid
UTF-8 text (not all byte sequences correspond to valid sequences of UTF-8 code
points). Users will even provide maliciously malformed UTF-8 text in an attempt to
crash or exploit your string-processing functions.
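
A minimal sketch of such a filter, assuming Ruby 2.1 or later, where String#valid_encoding? and String#scrub are built in. The helpers valid_utf8? and scrub_utf8 are hypothetical names introduced here, and whether to reject or repair bad input is a policy choice the application has to make.

```ruby
# Assumption: Ruby 2.1+ (String#scrub); helper names are illustrative only.

# True when the given bytes form well-formed UTF-8.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Returns a copy in which every invalid byte sequence is replaced,
# by default with U+FFFD REPLACEMENT CHARACTER.
def scrub_utf8(bytes, replacement = "\uFFFD")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

valid_utf8?("pi\u00F1ata")   # => true
valid_utf8?("\xC3\x28")      # => false (lead byte without a continuation byte)
valid_utf8?("\xC0\xAF")      # => false (overlong encoding of "/")
scrub_utf8("abc\xFF")        # => "abc\uFFFD"
```

Rejecting invalid input outright is the stricter option; scrubbing keeps the request alive but changes what the user sent, so it is worth applying at a single, well-defined input boundary.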
