Advanced Rails - Building Industrial-Strength Web Apps in Record Time

240 | Chapter 8: i18n and L10n

In UTF-8, text that uses only standard ASCII characters is byte-for-byte identical
to its ASCII encoding. UTF-8 ensures that the encoding of every code point above
U+007F consists entirely of high-ASCII bytes (bytes with a most significant bit of 1).

Because of this, a UTF-8 encoded string will never contain the null byte (0x00),
except as the encoding of the code point U+0000.
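These two properties are easy to check from an interactive session. The following is a minimal sketch, assuming Ruby 1.9 or later (where each String carries an encoding); the helper names same_bytes_as_ascii? and only_high_bytes? are made up for this illustration and are not part of Ruby or Rails.

```ruby
# True when the text yields the same bytes as UTF-8 and as US-ASCII.
# (Only meaningful for pure-ASCII input; other input cannot be
# converted to US-ASCII at all.)
def same_bytes_as_ascii?(text)
  text.encode("UTF-8").bytes.to_a == text.encode("US-ASCII").bytes.to_a
end

# True when every byte in the string has its most significant bit set.
def only_high_bytes?(text)
  text.bytes.all? { |b| (b & 0x80) != 0 }
end

puts same_bytes_as_ascii?("Rails")        # true
puts only_high_bytes?("\u00e9\u20ac")     # true (U+00E9 and U+20AC)
puts "\u00e9\u20ac".bytes.include?(0)     # false: no null byte
puts "\u0000".bytes.to_a.inspect          # [0]: only U+0000 encodes to 0x00
```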

UTF-8 is somewhat self-synchronizing, which makes it resilient to error. Each
type of byte in UTF-8 (single-byte character, first byte of a multibyte character,
and subsequent bytes of a multibyte character) can be distinguished by its prefix.
Therefore, you can start at any byte point in a string and find the next character
without working backward. Similarly, you can find the previous character by only
working backward.
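To make the prefix rule concrete, here is a rough sketch of how the three kinds of byte can be told apart and how a scan can resynchronize in either direction. It assumes well-formed UTF-8 held as an array of byte values; utf8_byte_kind, next_char_start, and prev_char_start are hypothetical names chosen for this example.

```ruby
# Classify one byte by its high-order bits.
def utf8_byte_kind(byte)
  if (byte & 0b1000_0000) == 0
    :single         # 0xxxxxxx: a complete one-byte character
  elsif (byte & 0b1100_0000) == 0b1000_0000
    :continuation   # 10xxxxxx: a subsequent byte of a multibyte character
  else
    :lead           # 11xxxxxx: the first byte of a multibyte character
  end
end

# Index of the first character start at or after index, scanning forward only.
def next_char_start(bytes, index)
  index += 1 while index < bytes.length &&
                   utf8_byte_kind(bytes[index]) == :continuation
  index
end

# Index of the character start at or before index, scanning backward only.
def prev_char_start(bytes, index)
  index -= 1 while index > 0 &&
                   utf8_byte_kind(bytes[index]) == :continuation
  index
end

bytes = "a\u00e9\u20acz".bytes.to_a   # 61 | C3 A9 | E2 82 AC | 7A
puts next_char_start(bytes, 4)        # 6: skips the rest of U+20AC
puts prev_char_start(bytes, 5)        # 3: the lead byte of U+20AC
```

Neither helper ever needs to look in the opposite direction, which is the property the paragraph above describes.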

Because of these unique prefixes, no encoding of a character is a substring of
another character’s encoding. For example, the ASCII character “a” is represented
by 0x61 in UTF-8. No other character’s encoding will contain the byte 0x61, so if
you see that byte, you know that it represents the character “a.” This ingenious
design decision means that string searching works with standard,
non-UTF-8-aware algorithms.
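As an illustration, the sketch below searches the raw bytes of a string with no knowledge of UTF-8 and still lands on a character boundary. It assumes Ruby 2.0 or later (for String#b, which returns a binary copy), well-formed UTF-8 input, and a needle made of whole characters; byte_index is a hypothetical helper name.

```ruby
# Byte offset of needle within haystack, found by comparing raw bytes only.
def byte_index(haystack, needle)
  haystack.b.index(needle.b)
end

text   = "caf\u00e9 na\u00efve"     # "café naïve"
needle = "na\u00efve"               # "naïve"

offset = byte_index(text, needle)
puts offset                          # 6 (byte offset)
puts text.index(needle)              # 5 (character offset)

# The bytes at that offset are exactly the needle, so the match
# did not start in the middle of a character.
puts text.byteslice(offset, needle.bytesize) == needle   # true

# The byte 0x61 occurs exactly as often as the character "a".
puts text.bytes.count(0x61) == text.count("a")           # true
```

Note that the byte offset and the character offset differ; a byte-oriented search reports positions in bytes.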

However, UTF-8’s similarity to previous encodings can lead to confusion. When
working with UTF-8 text, there are a few additional things to keep in mind:

The number of code points in a string cannot be determined from the number of
bytes. The entire string must be read and processed to determine the number
of characters.

Even when the number of code points is known, features such as ligatures,
combining characters, bidi text, and control characters make it impossible to
determine how much space is needed to display a string without parsing every byte.

UTF-8 strings cannot be cut at byte boundaries; they must be cut on character
boundaries. Due to the design of UTF-8, it is easy to find character boundaries
with simple bit operations, but this must still be taken into account.
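The first and last of these points can be sketched with the same bit test: a byte is a continuation byte when its top two bits are 10. The example below is one possible approach, assuming Ruby 1.9.3 or later (for String#byteslice) and well-formed UTF-8; count_code_points and truncate_utf8 are hypothetical names, not Rails or Ruby methods.

```ruby
# Count code points by counting every byte that is not a continuation byte.
# The whole string still has to be scanned.
def count_code_points(string)
  string.bytes.count { |b| (b & 0xC0) != 0x80 }
end

# Return at most max_bytes bytes of string without splitting a character.
def truncate_utf8(string, max_bytes)
  return string if string.bytesize <= max_bytes
  cut = max_bytes
  # If the first excluded byte is a continuation byte, the cut would fall
  # inside a character, so back up to that character's first byte.
  cut -= 1 while cut > 0 && (string.getbyte(cut) & 0xC0) == 0x80
  string.byteslice(0, cut)
end

word = "na\u00efve"                          # "naïve"
puts word.bytesize                           # 6
puts count_code_points(word)                 # 5
puts word.byteslice(0, 3).valid_encoding?    # false: cut mid-character
puts truncate_utf8(word, 3)                  # na
```

Cutting on a code point boundary is still not the same as cutting on a displayed character; a combining mark can be separated from its base character this way.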

UTF-8 has largely won out over other encodings, especially on the Internet. Later in
this chapter, we will examine the problems encountered when working with UTF-8
text in Rails, and we will look at the solutions we have available.

The Unicode Basic Multilingual Plane (BMP), which contains most of the scripts in
common use today, covers code points U+0000 through U+FFFF. In UTF-8, code
points in the BMP can be expressed in three or fewer bytes. Though Unicode supports
up to 17 planes of characters (with 65,536 code points each), only about 10%
of the available space has been assigned thus far.
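The byte counts are easy to confirm. This short sketch uses Array#pack with the "U" directive, which encodes an integer code point as UTF-8; utf8_length is a hypothetical helper introduced only for this example.

```ruby
# Number of bytes UTF-8 needs for a given code point.
def utf8_length(code_point)
  [code_point].pack("U").bytesize
end

puts utf8_length(0x007F)    # 1: last ASCII code point
puts utf8_length(0x0080)    # 2
puts utf8_length(0x0800)    # 3
puts utf8_length(0xFFFF)    # 3: last code point in the BMP
puts utf8_length(0x10000)   # 4: first code point beyond the BMP

# 17 planes of 65,536 code points each span U+0000 through U+10FFFF.
puts 17 * 65_536 == 0x10FFFF + 1   # true
```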
