Advanced Rails - Building Industrial-Strength Web Apps in Record Time


Rails and Unicode


Compared with contemporaries such as Java and the .NET languages, Ruby 1.8 has
less-than-ideal Unicode support. To Ruby, strings are just sequences of 8-bit
bytes, while the character and string types of the Java runtime and .NET CLR are
based on Unicode code points. While Ruby’s approach simplifies the language, most
developers now need Unicode support. Luckily, Ruby is flexible enough that we can
add Unicode support to the language in a relatively friendly way.
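
The difference between bytes and code points can be seen with a minimal sketch. It assumes the string holds UTF-8 data; the sample word and the variable names are illustrative only. String#unpack is used because its "C*" and "U*" directives do not depend on any encoding support in the interpreter. Under Ruby 1.8, String#length reports the first of the two counts, since it counts bytes.

```ruby
# A sketch: the same UTF-8 data viewed as raw bytes and as Unicode code points.
# The sample word is an assumption chosen for illustration.
word = "r\xC3\xA9sum\xC3\xA9"       # "résumé" spelled out as raw UTF-8 bytes

bytes       = word.unpack("C*")     # one integer per 8-bit byte
code_points = word.unpack("U*")     # one integer per decoded UTF-8 character

puts "bytes:       #{bytes.length}"
puts "code points: #{code_points.length}"
```

The two counts differ because each accented letter occupies two bytes in UTF-8 but is a single code point.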


It is not surprising that Ruby’s Unicode support is lacking. During the time of Ruby’s
genesis in Japan (the mid-1990s), Unicode was first being developed. In Unicode’s
early stages, its supporters were mainly American and European, with less East Asian
involvement.


Many Japanese people opposed the process of Han unification, or collapsing most of
the Han characters common to CJKV languages into a single set of code points. The
unified Han characters tended to appeal more to Chinese speakers than Japanese
speakers. The people involved in Han unification (primarily Westerners) tended to
collapse characters that were similar, but not identical, across Asian languages. In the
early days of Unicode, rendering software would get confused and display similar,
but incorrect, glyphs for the Han-unified characters. This was at best disconcerting;
at worst, offensive.


There are technical solutions to all of these problems today, but Unicode was a slow
starter in Japan. Other character sets such as Shift_JIS gained more currency in Japan
at the time, which actually may have contributed somewhat to the problem; having
more extant character sets leads to more conversion issues.*


Multilingualization in Ruby 1.9


Ruby 1.9 will support multilingualization (m17n). Rather than a built-in Unicode
assumption, Ruby 1.9 will support interoperability between multiple character sets.
This is more flexible than assuming that all string literals are Unicode, and it is a
more general approach to character set handling. To use UTF-8 for all string and
regex literals, the following pragma can be used:


# coding: utf-8
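
As a rough sketch of how this might look in practice, the following assumes the String#encoding, String#bytesize, and String#encode methods as they appear in released versions of Ruby 1.9; the sample text and variable names are illustrative. The pragma has to be on the first line of the file (or the second, after a shebang line) to take effect.

```ruby
# coding: utf-8
# A sketch assuming the per-string encoding API of released Ruby 1.9.
greeting = "日本語"

puts __ENCODING__          # source encoding selected by the pragma above
puts greeting.encoding     # string literals inherit the source encoding
puts greeting.length       # counts characters
puts greeting.bytesize     # counts bytes

# Each string carries its own encoding, so conversion is explicit.
sjis = greeting.encode("Shift_JIS")
puts sjis.encoding
puts sjis.bytesize
puts sjis.encode("UTF-8") == greeting
```

Because each string is tagged with its encoding, the UTF-8 and Shift_JIS versions can coexist in one program and be converted on demand, rather than everything being forced through Unicode.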


  * Matz expresses this sentiment in an interview available at
    http://blog.grayproductions.net/articles/the_ruby_vm_episode_iv.
