Advanced Rails - Building Industrial-Strength Web Apps in Record Time


you may have UTF-8 data stored in the database as Latin1. If you then convert the
table to UTF-8, the conversion will be performed twice, which will corrupt your
data. The standard procedure in this case is to dump the data as Latin1, piping the
dump through sed to change the output character set to UTF-8:


mysqldump -uusername -p --default-character-set=latin1 mydb \
| sed -e 's/SET NAMES latin1/SET NAMES utf8/g' \
| sed -e 's/CHARSET=latin1/CHARSET=utf8/g' >mydb.sql

Then, load the dump back into MySQL as UTF-8:


mysql -uusername -p --default-character-set=utf8 <mydb.sql
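The double conversion described above can be reproduced outside the database. The following Ruby sketch is only an illustration, assuming Ruby 1.9 or later (for String#encode and String#force_encoding). The method names double_encode and repair are hypothetical, and ISO-8859-1 stands in for MySQL's latin1 character set, which is not identical to it for every byte.

```ruby
# Simulate UTF-8 bytes that were stored under a Latin1 label and then
# converted to UTF-8 a second time.
def double_encode(utf8_string)
  # Relabel the UTF-8 bytes as ISO-8859-1 without changing them, then
  # transcode to UTF-8: each original byte becomes its own character.
  utf8_string.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Reverse the steps to recover the original text.
def repair(mangled)
  mangled.encode("ISO-8859-1").force_encoding("UTF-8")
end

original = "caf\u00E9"            # "café", 5 bytes in UTF-8
mangled  = double_encode(original)

puts mangled.bytesize             # 7
puts repair(mangled) == original  # true
```

The two-byte sequence for the accented letter is read as two separate Latin1 characters, each of which is then expanded to two bytes of its own. Dumping as Latin1 avoids this because the server hands back the stored bytes unchanged.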

The last step in this process is to set up the client connection to support UTF-8. Even if
all of the data is properly configured and using UTF-8, if MySQL thinks the client wants
Latin1 data, that is what it will send. The SQL command to set the client encoding in
MySQL is the following:


SET NAMES utf8;

The Rails MySQL connection adapter has an encoding option that sets the client
encoding as well; in lieu of sending the preceding command, just add the following
to your database.yml:


production:
  adapter: mysql
  (...)
  encoding: utf8
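A quick way to confirm that every environment carries this setting is to parse the file and look for the key. The sketch below is a minimal example, assuming a plain YAML file with no ERB tags or YAML aliases. The helper name environments_missing_utf8 and the sample configuration are hypothetical.

```ruby
require "yaml"

# Hypothetical helper: return the names of the environments in a
# database.yml document whose encoding is not set to utf8.
def environments_missing_utf8(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  config.select do |_env, settings|
    settings.is_a?(Hash) && settings["encoding"] != "utf8"
  end.keys
end

# Sample configuration, invented for illustration.
sample = <<~CONFIG
  development:
    adapter: mysql
    database: mydb_development
  production:
    adapter: mysql
    database: mydb
    encoding: utf8
CONFIG

p environments_missing_utf8(sample)  # ["development"]
```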

At this time, MySQL does not support 4-byte UTF-8 characters. This is generally not
a problem, as characters in the Basic Multilingual Plane can always be encoded in
three or fewer bytes.
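The relationship between code points and UTF-8 byte lengths is easy to check directly. The following Ruby sketch assumes Ruby 1.9 or later; the helper outside_bmp is a hypothetical name for a method that picks out the characters needing four bytes.

```ruby
# Characters in the Basic Multilingual Plane (U+0000 through U+FFFF)
# need at most three bytes in UTF-8; anything above needs four.
def outside_bmp(string)
  string.each_char.select { |char| char.ord > 0xFFFF }
end

puts "A".bytesize          # 1 (U+0041)
puts "\u00E9".bytesize     # 2 (U+00E9)
puts "\u20AC".bytesize     # 3 (U+20AC)
puts "\u{1F600}".bytesize  # 4 (U+1F600, outside the BMP)

risky = outside_bmp("price: \u20AC5 \u{1F600}")
p risky.map { |char| format("U+%04X", char.ord) }  # ["U+1F600"]
```

A check like this could be run over input before it is saved, to flag text that a three-byte-only column would not be able to store.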


PostgreSQL


PostgreSQL is in a similar situation; both the database encoding and client encoding
must be specified. The default encoding is SQL_ASCII. This is a special byte-oriented
compatibility encoding; the low-ASCII bytes (0x00 through 0x7F) are treated as ASCII
characters, and the rest (0x80 through 0xFF) are left alone. Because of the design of
UTF-8, the SQL_ASCII encoding is safe to use with UTF-8. However, it is not optimal,
as the database server will not validate any input data.
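The property of UTF-8 that makes this safe is that every byte of a multibyte sequence has its high bit set, so none of them can be mistaken for an ASCII character. The Ruby sketch below illustrates this, along with the kind of validation that a SQL_ASCII database skips. It assumes Ruby 1.9 or later, and the method name multibyte_bytes_high? is hypothetical.

```ruby
# Return true if every byte belonging to a multibyte UTF-8 character is
# in the range 0x80 through 0xFF, the range the text says is left alone.
def multibyte_bytes_high?(string)
  string.each_char.all? do |char|
    char.bytesize == 1 || char.bytes.all? { |byte| byte >= 0x80 }
  end
end

text = "na\u00EFve \u20AC"
puts multibyte_bytes_high?(text)  # true

# A lone Latin1 byte (0xE9) is not valid UTF-8. A database that
# validates its input would reject it; Ruby can detect it as follows.
broken = "caf\xE9".dup.force_encoding("UTF-8")
puts text.valid_encoding?         # true
puts broken.valid_encoding?       # false
```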


A new database can be created with UTF-8 encoding, using either the -E option to
createdb or the SQL WITH ENCODING clause:


$ createdb -E UTF-8 new_database

-or-

=> CREATE DATABASE new_database WITH ENCODING 'UTF-8';

Existing databases that were created with another encoding can be dumped and
reloaded to convert them, as with MySQL.
