Foundations of Python Network Programming


Chapter 5 ■ Network Data and Network Errors




The same is true of the many Windows code pages that you will see listed in the Standard Library. A few
single-byte encodings, however, share nothing in common with ASCII because they are based on alternative
standards from the old days of big IBM mainframes.





>>> b'\x67\x68\x69\xe7\xe8\xe9'.decode('EBCDIC-CP-BE')
'ÅÇÑXYZ'
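To make the contrast with ASCII concrete, here is a minimal sketch that encodes the same three letters both ways. It assumes that `cp500` is the codec name Python registers for the EBCDIC-CP-BE alias used above; the variable names are only illustrative.

```python
import codecs

# Assumption: 'EBCDIC-CP-BE' is an alias that resolves to the cp500 codec.
print(codecs.lookup('EBCDIC-CP-BE').name)

text = 'ghi'
ascii_bytes = text.encode('ascii')
ebcdic_bytes = text.encode('cp500')

print(ascii_bytes.hex())    # 676869
print(ebcdic_bytes.hex())   # 878889

# The two encodings of this string have no byte value in common.
shared = set(ascii_bytes) & set(ebcdic_bytes)
print(shared)               # set()
```

The bytes 0x67 through 0x69, which mean g, h, and i in ASCII, are exactly the ones that decoded to accented letters in the session above.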





The multibyte encodings that you are most likely to encounter are the old UTF-16 scheme (which had a brief
heyday back when Unicode was much smaller and could fit into 16 bits), the modern UTF-32 scheme, and the
universally popular variable-width UTF-8 that looks like ASCII unless you start including characters with codes
greater than 127. Here is what a Unicode string looks like using all three:





>>> len('Namárië!')
8
>>> 'Namárië!'.encode('UTF-16')
b'\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'
>>> len(_)
18
>>> 'Namárië!'.encode('UTF-32')
b'\xff\xfe\x00\x00N\x00\x00\x00a\x00\x00\x00m\x00\x00\x00\xe1\x00\x00\x00r\x00\x00\x00i\x00\x00\x00\xeb\x00\x00\x00!\x00\x00\x00'
>>> len(_)
36
>>> 'Namárië!'.encode('UTF-8')
b'Nam\xc3\xa1ri\xc3\xab!'
>>> len(_)
10





If you peer hard into each encoding, you should be able to find the bare ASCII letters N, a, m, r, and i scattered
among the byte values representing the non-ASCII characters.
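One way to see why the UTF-8 result is only 10 bytes long is to encode the string one character at a time. The following is a small sketch rather than anything from the Standard Library; the names `text` and `widths` are illustrative, and the two accented letters are written as escapes so that the example does not depend on how the source file itself is saved.

```python
text = 'Nam\u00e1ri\u00eb!'   # the same 8-character string, 'Namárië!'

# Encode one character at a time to see how many bytes each one needs.
for ch in text:
    encoded = ch.encode('utf-8')
    kind = 'ASCII' if ord(ch) < 128 else 'non-ASCII'
    print(repr(ch), ord(ch), encoded, kind)

widths = [len(ch.encode('utf-8')) for ch in text]
print(widths)        # [1, 1, 1, 2, 1, 1, 2, 1]
print(sum(widths))   # 10
```

Each character with a code below 128 is stored as the single byte that ASCII would use, while the two accented letters take two bytes apiece.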
Note that the UTF-16 and UTF-32 encodings each include an extra character, bringing the UTF-16 encoding to a full
(8 × 2) + 2 bytes and UTF-32 to (8 × 4) + 4 bytes. This special character, \ufeff, is the byte order mark (BOM),
which lets readers autodetect whether the several bytes of each Unicode character are stored with the most
significant or least significant byte first. (See the next section for more about byte order.)
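The arithmetic can be checked with the codec variants that name their byte order explicitly. This is a minimal sketch that assumes only the Standard Library `codecs` constants `BOM_UTF16_LE` and `BOM_UTF16_BE`; the variable names are illustrative.

```python
import codecs

text = 'Nam\u00e1ri\u00eb!'   # 'Namárië!', 8 characters

# Codecs that name their byte order emit no BOM, so the lengths are
# exactly 8 * 2 and 8 * 4 bytes.
le16 = text.encode('utf-16-le')
be16 = text.encode('utf-16-be')
le32 = text.encode('utf-32-le')
print(len(le16), len(be16), len(le32))   # 16 16 32

# The generic codec prepends a BOM, written in the machine's native order.
with_bom = text.encode('utf-16')
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))   # True

# When decoding, the generic codec reads the BOM to choose the byte
# order and then discards it.
print((codecs.BOM_UTF16_LE + le16).decode('utf-16') == text)   # True
print((codecs.BOM_UTF16_BE + be16).decode('utf-16') == text)   # True
```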
There are two characteristic errors that you will encounter when working with encoded text: attempting to
decode a byte string that does not in fact follow the rules of the encoding you are trying to interpret it with, and
attempting to encode characters that cannot actually be represented in the encoding you are requesting.





>>> b'\x80'.decode('ascii')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
>>> 'ghiηθι'.encode('latin-1')
Traceback (most recent call last):
...
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 3-5: ordinal not in range(256)
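In a program, as opposed to an interactive session, both failures arrive as exceptions that can be caught. The sketch below is one possible way to report them; the helper names `safe_decode` and `safe_encode` are hypothetical, and the only library behavior relied upon is the `encoding`, `object`, `start`, `end`, and `reason` attributes that Python's Unicode exceptions carry.

```python
def safe_decode(data, encoding):
    """Return the decoded text, or None if the bytes break the encoding's rules."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as e:
        bad = e.object[e.start:e.end]   # the offending bytes
        print('cannot decode', bad, 'as', e.encoding, '-', e.reason)
        return None

def safe_encode(text, encoding):
    """Return the encoded bytes, or None if some character has no representation."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as e:
        bad = e.object[e.start:e.end]   # the offending characters
        print('cannot encode', repr(bad), 'as', e.encoding, '-', e.reason)
        return None

print(safe_decode(b'\x80', 'ascii'))      # reports the problem, then None
print(safe_encode('ghiηθι', 'latin-1'))   # reports the problem, then None
print(safe_encode('ghi', 'latin-1'))      # b'ghi'
```

Returning None is only one design choice; a real application might instead re-raise the exception with more context or substitute a placeholder character.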



