Advanced Rails - Building Industrial-Strength Web Apps in Record Time


240 | Chapter 8: i18n and L10n



  • In UTF-8, text that only uses standard ASCII characters is byte-for-byte identical
    to its ASCII encoding. UTF-8 ensures that every byte in the encoding of a code
    point above U+007F has its most significant bit set to 1, so it can never be
    mistaken for an ASCII character.

  • Because of this, a UTF-8 encoded string will never contain the null byte (0x00),
    except as the encoding of the code point U+0000.

  • UTF-8 is somewhat self-synchronizing, which makes it resilient to error. Each
    type of byte in UTF-8 (single-byte character, first byte of a multibyte character,
    and subsequent bytes of a multibyte character) can be distinguished by its pre-
    fix. Therefore, you can start at any byte point in a string and find the next char-
    acter without working backward. Similarly, you can find the previous character
    by only working backward.

  • Because of these unique prefixes, no encoding of a character is a substring of
    another character’s encoding. For example, the ASCII character “a” is repre-
    sented by 0x61 in UTF-8. No other character’s encoding will contain the byte
    0x61, so if you see that byte, you know that it represents the character “a.” This
    ingenious design decision means that string searching works with standard, non-
    UTF-8-aware algorithms.
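
The properties above can be checked with a few lines of Ruby. This is a minimal sketch, assuming Ruby 1.9 or later (where strings carry an encoding); the helper names utf8_byte_kind and next_char_boundary are illustrative and are not part of Ruby or Rails.

```ruby
# Classify a byte by its UTF-8 prefix bits (helper name is illustrative).
def utf8_byte_kind(byte)
  if (byte & 0b1000_0000).zero?
    :single          # 0xxxxxxx: an ASCII character
  elsif (byte & 0b1100_0000) == 0b1000_0000
    :continuation    # 10xxxxxx: second or later byte of a character
  else
    :lead            # 11xxxxxx: first byte of a multibyte character
  end
end

# Scan forward from any byte offset to the start of the next character.
# No backward scan is needed: continuation bytes are simply skipped.
def next_char_boundary(str, offset)
  offset += 1 while offset < str.bytesize &&
                    utf8_byte_kind(str.getbyte(offset)) == :continuation
  offset
end

text  = "na\u00EFve caf\u00E9"   # "naïve café", 10 characters in 12 bytes
plain = "hello"

# ASCII-only text has the same bytes in both encodings.
same_bytes = plain.encode("UTF-8").bytes.to_a == plain.encode("US-ASCII").bytes.to_a

# No null byte appears unless U+0000 itself is in the string.
has_null = text.bytes.include?(0)

# A byte-level search for 0x61 only ever lands on a real "a".
a_offsets = (0...text.bytesize).select { |i| text.getbyte(i) == 0x61 }

# Offset 3 falls in the middle of the two-byte character at offsets 2-3.
resynced = next_char_boundary(text, 3)

puts same_bytes, has_null, a_offsets.inspect, resynced
```

Here a_offsets is [1, 8], the two genuine occurrences of "a", and resynced is 4, the offset of the "v" that follows the two-byte character.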


However, UTF-8’s similarity to previous encodings can lead to confusion. When
working with UTF-8 text, there are more things to think about:



  • The number of code points in a string cannot be determined from the number of
    bytes. The entire string must be read and processed to count them.

  • Even when the number of code points is known, features such as ligatures, com-
    bining characters, bidi text, and control characters make it impossible to deter-
    mine how much space is needed to display a string without parsing every byte.

  • UTF-8 strings cannot be cut at byte boundaries; they must be cut on character
    boundaries. Due to the design of UTF-8, it is easy to find character boundaries
    with simple bit operations, but this must still be taken into account.


UTF-8 has largely won out over other encodings, especially on the Internet. Later in
this chapter, we will examine the problems encountered when working with UTF-8
text in Rails, and we will look at the solutions we have available.


The Unicode Basic Multilingual Plane (BMP), which contains most of the scripts in
common use today, covers code points U+0000 through U+FFFF. In UTF-8, code
points in the BMP can be expressed in three or fewer bytes. Though Unicode supports
up to 17 planes of characters (with 65,536 code points each), only about 10%
of the available space has been assigned thus far.
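
The byte counts and the size of the code space can be verified directly. This is a small sketch, assuming Ruby 1.9 or later; utf8_length is an illustrative helper that relies on Array#pack with the "U" directive to produce the UTF-8 encoding of a code point.

```ruby
# Number of bytes UTF-8 needs for a single code point (illustrative helper).
def utf8_length(code_point)
  [code_point].pack("U").bytesize
end

lengths = {
  0x007F   => utf8_length(0x007F),    # last one-byte code point (ASCII)
  0x0080   => utf8_length(0x0080),    # first two-byte code point
  0x07FF   => utf8_length(0x07FF),    # last two-byte code point
  0x0800   => utf8_length(0x0800),    # first three-byte code point
  0xFFFF   => utf8_length(0xFFFF),    # last code point in the BMP
  0x10000  => utf8_length(0x10000),   # first code point beyond the BMP
  0x10FFFF => utf8_length(0x10FFFF)   # last code point in plane 16
}

lengths.each { |cp, n| puts format("U+%04X: %d byte(s)", cp, n) }

# 17 planes of 65,536 code points each span U+0000 through U+10FFFF.
total_code_points = 17 * 65_536
puts total_code_points
```

Every code point up to U+FFFF fits in three bytes or fewer; only code points in the supplementary planes need a fourth byte.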
