Advanced Rails - Building Industrial-Strength Web Apps in Record Time

Rails and Unicode


  • Forms from third-party sites pointed at your server may not be encoded in UTF-8.
    These forms will post their data in the original character set.

  • When interacting with other systems through web services or messaging, a char-
    acter set and encoding must be agreed upon.

  • When retrieving data from the Web (with net/http or open-uri), you must be
    sure to convert text from its source encoding into your working encoding.


To remedy this situation, you can use the iconv library, which is part of the Ruby
standard library. We have seen this earlier; it was used to strip invalid characters out of
our UTF-8. To convert a string from one encoding to another, create an Iconv object,
providing the destination and source encodings (in that order), and call its iconv
instance method:


require 'iconv'

# Latin-1 (ISO-8859-1) equivalent of "café"
# Latin-1 E9 == "é"
cafe_latin1 = "caf#{"E9".hex.chr}"

ic = Iconv.new("utf-8", "iso-8859-1") # to_encoding, from_encoding
cafe_utf8 = ic.iconv(cafe_latin1)

We can play with the $KCODE variable to change how we see the output. If we set
$KCODE to "U", the string is interpreted as UTF-8 and we see the properly converted
“café.” If $KCODE is "A", the string is interpreted as a series of bytes, and so we see the
unprintable characters escaped:


cafe_latin1 # => "caf\351"

$KCODE = "U"
cafe_utf8 # => "café"

$KCODE = "A"
cafe_utf8 # => "caf\303\251"

As usual, we can see the byte length of each string with String#length:


cafe_latin1.length # => 4
cafe_utf8.length # => 5
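
On Ruby 1.9 and later, strings carry their own encoding, and iconv is no longer bundled with the standard library (it was removed in Ruby 2.0). There the same conversion is usually written with String#encode. What follows is a minimal sketch of that swapped-in approach, not the iconv-based method shown above; the helper name latin1_to_utf8 is hypothetical, and String#b assumes Ruby 2.0 or later.

```ruby
# Hypothetical helper: transcode Latin-1 (ISO-8859-1) bytes to a UTF-8 string.
def latin1_to_utf8(bytes)
  # String#encode(to_encoding, from_encoding) reads the input as Latin-1,
  # mirroring the argument order of Iconv.new.
  bytes.encode("UTF-8", "ISO-8859-1")
end

cafe_latin1 = "caf\xE9".b            # raw bytes; 0xE9 is "é" in Latin-1
cafe_utf8   = latin1_to_utf8(cafe_latin1)

cafe_latin1.bytesize # => 4
cafe_utf8.bytesize   # => 5 ("é" becomes the two bytes C3 A9)
cafe_utf8.length     # => 4 (String#length counts characters on Ruby 1.9+)
```

Note that on these newer Rubies String#length counts characters, so String#bytesize is the method that reports the byte length.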

JavaScript URI encoding and UTF-8


There is one important thing to remember if you use JavaScript to URI-encode text
in a UTF-8 environment: always encode data using encodeURI( ) or
encodeURIComponent( ); do not use escape( ). The encodeURI forms follow RFC 3986,
converting the text to UTF-8 and percent-encoding each byte. This makes things
much easier on the server end.
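
To illustrate the server side, here is a minimal Ruby sketch, assuming Ruby 1.9.2 or later, where URI.decode_www_form_component is available. The helper name decode_param is hypothetical. The input caf%C3%A9 is what encodeURIComponent("café") produces in the browser.

```ruby
require 'uri'

# Hypothetical helper: percent-decode a value sent by encodeURIComponent( ).
def decode_param(encoded)
  # Decodes each %XX byte and tags the result as UTF-8 (the default encoding).
  # Note that this method also turns "+" into a space, as in form data.
  URI.decode_www_form_component(encoded)
end

decoded = decode_param("caf%C3%A9")

decoded                 # => "café"
decoded.encoding        # => #<Encoding:UTF-8>
decoded.valid_encoding? # => true
```

Because every byte on the wire is already UTF-8, no character-set conversion is needed after decoding.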


The escape( ) function, on the other hand, escapes one character at a time, using
nonstandard constructs such as %u1234 (corresponding to the code point U+1234). It
escapes extended-ASCII characters as Latin-1, even on a page served as UTF-8:
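
For instance, escape("café") yields caf%E9, whereas encodeURIComponent("café") yields caf%C3%A9. A server that must accept escape( ) output anyway has to decode it separately. The following is a rough Ruby sketch of one way to do that, assuming Ruby 1.9 or later; the helper name unescape_js is hypothetical, and the code assumes that %uXXXX sequences are UTF-16 code units, that %XX sequences are Latin-1 bytes, and that any unescaped characters are plain ASCII.

```ruby
# Hypothetical helper: decode the output of JavaScript's escape( ) into UTF-8.
def unescape_js(str)
  units = []
  # Match %uXXXX, then %XX, then any single literal character.
  str.scan(/%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})|(.)/m) do |wide, narrow, literal|
    hex = wide || narrow
    # Latin-1 byte values coincide with code points U+0000..U+00FF,
    # so both escape forms can be treated as 16-bit code units.
    units << (hex ? hex.hex : literal.ord)
  end
  # Pack the units as big-endian UTF-16, then transcode to UTF-8.
  # Going through UTF-16 also reassembles surrogate pairs.
  units.pack("n*").force_encoding("UTF-16BE").encode("UTF-8")
end

unescape_js("caf%E9")  # => "café"
unescape_js("%u1234")  # => the single character U+1234
```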
