Game Engine Architecture


248 5. Engine Support Systems


5.4.4.1. Unicode

The problem for most English-speaking software developers is that they are
trained from birth (or thereabouts!) to think of strings as arrays of 8-bit
character codes: either 7-bit ASCII or one of the 8-bit “ANSI” code pages that
extend it. ANSI strings work great for a language with a simple alphabet, like
English. But they just don’t cut it for languages with complex alphabets
containing a great many more characters, sometimes totally different glyphs
than English’s 26 letters. To address the limitations of the ANSI encodings,
the Unicode character set system was devised.
Please set down this book right now and read the article entitled, “The
Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky. You
can find it here: http://www.joelonsoftware.com/articles/Unicode.html. (Once
you’ve done that, please pick up the book again!)
As Joel describes in his article, Unicode is not a single encoding but a
character set with a family of related encodings. You will need to select the
specific encoding that best suits your needs. The two most common choices I’ve
seen used in game engines are UTF-8 and UTF-16.

UTF-8
In UTF-8, the code units are 8 bits each, and each character occupies between
one and four of them. Hence the number of bytes occupied by a UTF-8 character
string is not necessarily the length of the string in characters. This is known
as a multibyte character set (MBCS), because each character may take one or more
bytes of storage.
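
To make the difference between byte count and character count concrete, here is a minimal sketch of a character counter for UTF-8 strings. The function name `Utf8CodePointCount` is hypothetical, and the sketch assumes the input is already well-formed UTF-8 (it does no validation). It relies on the fact that the second and subsequent bytes of a multibyte sequence always have the bit pattern 10xxxxxx, so counting every byte that does not match that pattern counts the characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumes well-formed input; malformed sequences are not detected.
std::size_t Utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        // Continuation bytes look like 10xxxxxx. Every other byte
        // starts a new character, so only those are counted.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a string such as "h\xC3\xA9llo" (the word "héllo" written out as raw UTF-8 bytes), `size()` reports six bytes while this function reports five characters.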
One of the big benefits of the UTF-8 encoding is that it is backwards-compatible
with 7-bit ASCII. This works because every byte of a multibyte character
sequence has its most significant bit set (i.e., lies between 128 and 255,
inclusive). Since the standard ASCII character codes are all less than 128, a
plain old ASCII string is a valid and unambiguous UTF-8 string as well.
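
The compatibility argument boils down to a one-bit test, which the following sketch illustrates. The function name `IsPlainAscii` is hypothetical. If no byte in the string has its most significant bit set, the string contains no multibyte sequences, so its bytes mean exactly the same thing whether they are read as 7-bit ASCII or as UTF-8.

```cpp
#include <string>

// Hypothetical helper: returns true if every byte is below 128, i.e.,
// the string is plain 7-bit ASCII and therefore also valid UTF-8.
bool IsPlainAscii(const std::string& s)
{
    for (unsigned char byte : s)
    {
        // A set high bit means the byte belongs to a multibyte
        // UTF-8 sequence (or to some non-ASCII 8-bit code page).
        if (byte & 0x80)
            return false;
    }
    return true;
}
```

An engine might use a check like this as a fast path, falling back to full UTF-8 decoding only for strings that fail it.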

UTF-16
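The UTF-16 standard employs a somewhat simpler, albeit often more expensive,
approach. Each character in Unicode’s Basic Multilingual Plane takes up exactly
16 bits (whether it needs all of those bits or not), while the less common
characters outside that range take up two 16-bit units, known as a surrogate
pair. As a result, dividing the number of bytes occupied by the string by two
yields the number of 16-bit code units, which equals the number of characters
only when the string contains no surrogate pairs. (The older UCS-2 encoding is
strictly fixed at 16 bits per character and cannot represent the characters
that require surrogates.) This is known as a wide character set (WCS), because
each code unit is 16 bits wide instead of the 8 bits used by “regular” ANSI
chars.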
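
One caveat when computing lengths: UTF-16 encodes characters outside the Basic Multilingual Plane as two 16-bit units (a surrogate pair), so the byte count divided by two gives the number of code units, not necessarily the number of characters. Here is a minimal sketch of a counter that accounts for this. The function name `Utf16CodePointCount` is hypothetical, and the sketch assumes well-formed input in which every trailing surrogate follows a leading one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// Assumes well-formed input; unpaired surrogates are not detected.
std::size_t Utf16CodePointCount(const std::u16string& s)
{
    std::size_t count = 0;
    for (char16_t unit : s)
    {
        // Units in 0xDC00..0xDFFF are trailing surrogates: they finish
        // a character begun by the previous unit, so skip them.
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

For text confined to the Basic Multilingual Plane, this returns the same value as `s.size()`. For a string holding the single character U+1F600, `s.size()` is 2 while the function returns 1.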