Game Engine Architecture

5. Engine Support Systems

5.4.4.1. Unicode

The problem for most English-speaking software developers is that they are trained from birth (or thereabouts!) to think of strings as arrays of 8-bit ASCII character codes (i.e., characters following the ANSI standard). ANSI strings work great for a language with a simple alphabet, like English. But they just don’t cut it for languages with complex alphabets containing a great many more characters, sometimes with totally different glyphs than English’s 26 letters. To address the limitations of the ANSI standard, the Unicode character set system was devised.

Please set down this book right now and read the article entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky. You can find it here: http://www.joelonsoftware.com/articles/Unicode.html. (Once you’ve done that, please pick up the book again!)

As Joel describes in his article, Unicode is not a single encoding: it defines one character set that can be stored using a family of related encoding standards. You will need to select the specific standard that best suits your needs. The two most common choices I’ve seen used in game engines are UTF-8 and UTF-16.

UTF-8

In UTF-8, the character codes are built from 8-bit units, but certain characters occupy more than one byte (up to four). Hence the number of bytes occupied by a UTF-8 character string is not necessarily the length of the string in characters. This is known as a multibyte character set (MBCS), because each character may take one or more bytes of storage.

One of the big benefits of the UTF-8 encoding is that it is backwards-compatible with the ANSI encoding. This works because every byte of a multibyte character sequence has its most significant bit set (i.e., lies between 128 and 255, inclusive). Since the standard ANSI character codes (the 7-bit ASCII range) are all less than 128, a plain old ANSI string is a valid and unambiguous UTF-8 string as well.
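The difference between byte count and character count in UTF-8 can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not an engine-quality implementation: the names IsUtf8ContinuationByte and Utf8CodePointCount are hypothetical, the input is assumed to be well-formed UTF-8, and "character" here means a Unicode code point. It relies on the fact that the second and later bytes of a multibyte sequence always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the byte is a UTF-8 continuation byte,
// i.e., its top two bits are 10 (values 0x80 through 0xBF).
inline bool IsUtf8ContinuationByte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that starts a sequence (any byte that is NOT a continuation
// byte). Assumes the input is well-formed; no validation is performed.
inline std::size_t Utf8CodePointCount(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if (!IsUtf8ContinuationByte(byte))
        {
            ++count;
        }
    }
    return count;
}
```

For a pure ASCII string the result equals the string's size in bytes, which is the backwards compatibility described above. For a string containing multibyte sequences, the result is smaller than the byte count.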

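UTF-16

The UTF-16 standard employs a simpler, albeit more expensive, approach. Most characters in common use (those in Unicode’s Basic Multilingual Plane) take up exactly 16 bits, whether they need all of those bits or not. The remaining characters are encoded as two 16-bit units, known as a surrogate pair. As a result, dividing the number of bytes occupied by the string by two yields the number of 16-bit units, which equals the number of characters only when the string contains no surrogate pairs. (The older UCS-2 encoding is the one in which every character is exactly 16 bits.) This is known as a wide character set (WCS), because each unit is 16 bits wide instead of the 8 bits used by “regular” ANSI chars.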
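The relationship between 16-bit units and characters in UTF-16 can be sketched the same way. Again this is only an illustrative sketch: the names IsHighSurrogate, IsLowSurrogate, and Utf16CodePointCount are hypothetical, "character" means a Unicode code point, and the code assumes a C++11 compiler (for char16_t and std::u16string). An unpaired surrogate is simply counted as one character rather than reported as an error.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: high (leading) surrogates occupy 0xD800..0xDBFF.
inline bool IsHighSurrogate(char16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Hypothetical helper: low (trailing) surrogates occupy 0xDC00..0xDFFF.
inline bool IsLowSurrogate(char16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Hypothetical helper: counts code points in a UTF-16 string. Each 16-bit
// unit counts as one code point, except that a high surrogate followed by
// a low surrogate is counted once for the pair.
inline std::size_t Utf16CodePointCount(const std::u16string& utf16)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i)
    {
        ++count;
        if (IsHighSurrogate(utf16[i]) &&
            i + 1 < utf16.size() &&
            IsLowSurrogate(utf16[i + 1]))
        {
            ++i; // skip the second half of the surrogate pair
        }
    }
    return count;
}
```

When a string contains no surrogate pairs, the result equals utf16.size(), which is the byte count divided by two. That shortcut is why many engines treat UTF-16 text as fixed-width, but it gives the wrong answer as soon as a character outside the Basic Multilingual Plane appears.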
