Advanced Rails - Building Industrial-Strength Web Apps in Record Time

Rails and Unicode

Paul Battley wrote an article addressing the issue of filtering untrusted UTF-8 strings.*
As with most other hard problems in Rails, we cheat. In this case, the iconv library
can clean up UTF-8 strings for us:


require 'iconv'

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]

The Iconv.new line creates a new Iconv object to translate potentially invalid UTF-8
data into UTF-8 data with invalid characters ignored. The next line works around an
Iconv bug: it will not detect an invalid byte at the end of a string. Therefore, we add a
space (a known-valid byte) and chop it off after performing the conversion.
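The iconv library was removed from the standard library in Ruby 2.0, so newer interpreters need a different tool for the same cleanup. The following is a minimal sketch, assuming Ruby 2.1 or later, where String#scrub is available. The helper name clean_utf8 is hypothetical and does not come from Battley's article.

```ruby
# Hypothetical helper: returns a copy of the input, tagged as UTF-8,
# with every invalid byte sequence dropped.
def clean_utf8(untrusted_string)
  # dup so the caller's string is not modified; force_encoding only
  # relabels the bytes, and scrub('') removes the invalid ones.
  untrusted_string.dup.force_encoding(Encoding::UTF_8).scrub('')
end

clean_utf8("abc\xFFdef")  # => "abcdef"
clean_utf8("abc\xE3")     # => "abc" (truncated sequence at the end is removed too)
```

Unlike the Iconv version, this sketch needs no trailing-space workaround, because scrub also catches an invalid byte at the end of the string.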


Ilya Grigorik shows how to use the Oniguruma regular expression engine to filter out
control characters (of the Cx classes).† Note that the Oniguruma engine is standard
in Ruby 1.9, but is also available for Ruby 1.8 (gem install oniguruma).


require 'oniguruma'

# Find all Cx category graphemes. The pattern is single-quoted so that
# Ruby passes the backslash in \p{C} through to the regexp engine.
reg = Oniguruma::ORegexp.new('\p{C}', {:encoding => Oniguruma::ENCODING_UTF8})

# Erase the Cx graphemes from our validated string
filtered_string = reg.gsub(valid_string, '')
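On Ruby 1.9 and later, the built-in Regexp class understands the same \p{C} property, so the filter can be written without the gem. This is a minimal sketch under that assumption. The helper name strip_other_chars is hypothetical, and the input is assumed to be valid UTF-8 already.

```ruby
# Hypothetical helper: removes every character in the Unicode "Other"
# (C) general category, which includes control and format characters.
def strip_other_chars(text)
  text.gsub(/\p{C}/, '')
end

# BEL (U+0007) is a control character; U+200E is a format character.
strip_other_chars("ab\u0007cd\u200Eef")  # => "abcdef"
```

Note that tabs and newlines are control characters too, so this filter removes them as well. Input that must keep line breaks would need a narrower pattern.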

Storing UTF-8


Proper i18n requires that your character set be correctly processed in the application
and correctly stored in the database. For most Rails applications, this means setting up
the database and connection to be UTF-8 clean. Since Rails 1.2, ActiveRecord correctly
processes UTF-8 data and is ready for UTF-8 storage over supported connections. The
specifics differ among database engines, so we’ll examine MySQL and PostgreSQL here.


MySQL


To properly store UTF-8 data in a MySQL database, two things need to be in place.
First, the database and tables need to be configured with the proper encoding.
Second, the client connection between ActiveRecord and MySQL needs to use UTF-8.
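The two requirements can be sketched in Ruby as follows. This is only an illustration: the database name myapp_production is hypothetical, and the connection call is left commented out because it needs ActiveRecord and a running MySQL server. The :encoding key mirrors the encoding entry that Rails reads from config/database.yml.

```ruby
# Hypothetical database name, used only for illustration.
db_name = 'myapp_production'

# 1. Create the database with UTF-8 as its default character set and
#    collation, so that new tables inherit both.
create_sql = "CREATE DATABASE #{db_name} " \
             "DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci"

# 2. Ask the MySQL adapter to use UTF-8 on the client connection.
connection_settings = {
  :adapter  => 'mysql',
  :database => db_name,
  :encoding => 'utf8'
}

# With ActiveRecord loaded and a server available, the settings would be
# passed along like this:
# ActiveRecord::Base.establish_connection(connection_settings)
```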


MySQL ships with Latin1 (ISO-8859-1) as the default character set. Thus, all of the
string operations are by default byte-oriented. You can change the default character
set and collation for the entire database server with the following commands in the
MySQL configuration file (my.cnf):


character-set-server=utf8
collation-server=utf8_unicode_ci
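To see why a byte-oriented default matters, the following sketch uses plain Ruby (1.9 or later) as a stand-in for the database. It shows the same bytes being counted differently depending on whether they are interpreted as UTF-8 or as ISO-8859-1. It does not talk to MySQL at all.

```ruby
word = "caf\u00E9"   # "café", where the last letter takes two bytes in UTF-8

word.length          # => 4 (characters, when read as UTF-8)
word.bytesize        # => 5 (bytes)

# Relabel the same bytes as ISO-8859-1: every byte now counts as a character.
as_latin1 = word.dup.force_encoding('ISO-8859-1')
as_latin1.length     # => 5
```

A Latin1 column treats the data the way the last line does, which is why length checks, truncation, and sorting can go wrong for multibyte text.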

* http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
† http://www.igvita.com/blog/2007/04/11/secure-utf-8-input-in-rails/
