Foundations of Python Network Programming

Chapter 5 ■ Network Data and Network Errors


The character in the upper-left corner is the space, by the way, at character code 32. (The invisible character at
the lower-right corner is, oddly enough, one last control character: Delete at position 127.) Note two clever tricks in
this 1960s-era encoding. First, the digits are ordered so that you can compute any digit’s mathematical value by subtracting
the code for the digit zero. Plus, by flipping the 32’s bit, you can switch between the uppercase and lowercase letters or
can force letters to one case or the other by setting or clearing the 32’s bit on a whole string of letters.
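Both tricks can be sketched in a few lines of Python. The helper names below (digit_value, flip_case, force_upper, force_lower) are made up for this illustration, and the bit operations are only meaningful for the ASCII letters A-Z and a-z.

```python
# A minimal sketch of the two ASCII tricks; helper names are hypothetical.

CASE_BIT = 0x20  # the "32's bit" that separates 'A' (65) from 'a' (97)

def digit_value(ch):
    """Return the numeric value of an ASCII digit by subtracting the code for '0'."""
    return ord(ch) - ord('0')

def flip_case(ch):
    """Toggle the 32's bit to swap the case of one ASCII letter."""
    return chr(ord(ch) ^ CASE_BIT)

def force_upper(text):
    """Clear the 32's bit on every character (assumes ASCII letters only)."""
    return ''.join(chr(ord(c) & ~CASE_BIT) for c in text)

def force_lower(text):
    """Set the 32's bit on every character (assumes ASCII letters only)."""
    return ''.join(chr(ord(c) | CASE_BIT) for c in text)

print(digit_value('7'))                 # 7
print(flip_case('a'), flip_case('Q'))   # A q
print(force_upper('Hello'))             # HELLO
print(force_lower('Hello'))             # hello
```

In real code, str.upper() and str.lower() are the right tools, since they also handle non-ASCII letters and leave digits and punctuation alone.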
But Python 3 goes far beyond ASCII in the character codes its strings can include. Thanks to a more recent
standard named Unicode, we now have character code assignments for numbers reaching beyond the 128 ASCII
codes and up into the thousands and even millions. Python considers strings to be made of a sequence of Unicode
characters and, as is usual for Python data structures, the actual representation of Python strings in RAM is carefully
concealed from you while you are working with the language. But when dealing with data in files or on the network,
you will have to think about external representation and about two terms that help you keep straight the meaning of
your information versus how it is transmitted or stored:


Encoding characters means turning a string of real Unicode characters into bytes that can
be sent out into the real world outside your Python program.

Decoding byte data means converting a byte string into real characters.

It might help you remember to which conversions these words refer if you think of the outside world as
consisting of bytes that are stored in a secret code that has to be interpreted or cracked if your Python program is
going to process them correctly. To move data outside your Python program, it must become code; to move back in,
it must be decoded.
There are many possible encodings in use in the world today. They fall into two general categories.
The simplest encodings are single-byte encodings that can represent at most 256 separate characters but that
guarantee every character fits into a single byte. These are easy to work with when writing network code. You know
ahead of time that reading n bytes from a socket will generate n characters, for example, and you also know when a
stream gets split into pieces that each byte is a stand-alone character that can safely be interpreted without knowing
what byte will follow it. Also, you can seek immediately to character n in your input by looking at the nth byte.
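These three properties can be checked directly, here using Latin-1 as a representative single-byte encoding. The sample string and the split position are arbitrary choices for this sketch.

```python
# Single-byte encodings: one byte per character, always.
text = 'na\xefve caf\xe9'          # "naïve café", 10 characters
data = text.encode('latin1')

# 1. n bytes always yield n characters.
print(len(data) == len(text))      # True

# 2. A stream can be split anywhere and each piece decoded on its own.
first, second = data[:3], data[3:]
print(first.decode('latin1') + second.decode('latin1') == text)  # True

# 3. Character n is simply byte n.
n = 2
print(data[n:n + 1].decode('latin1') == text[n])  # True
```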
Multibyte encodings are more complicated and lose each of these benefits. Some, like UTF-32, use a fixed number
of bytes per character, which is wasteful when data consists mostly of ASCII characters but carries the benefit that
each character is always the same length. Others, like UTF-8, vary how many bytes each character occupies and
therefore require a great deal of caution; if the data stream is delivered in pieces, then there is no way ahead of time to
know whether a character has been split across the boundary or not, and you cannot find character n without starting
at the beginning and reading until you have read that many characters.
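The split-character problem is easy to reproduce. The sketch below cuts a UTF-8 byte string in the middle of a two-byte character, shows that decoding the first piece on its own fails, and then uses an incremental decoder from the Standard Library codecs module, which is one possible way to buffer the incomplete bytes until the rest arrives. The sample text and split point are assumptions made for the demonstration.

```python
import codecs

text = 'caf\xe9'                       # é occupies two bytes in UTF-8
data = text.encode('utf-8')            # b'caf\xc3\xa9'
first, second = data[:4], data[4:]     # the split lands inside é

# Decoding each piece independently fails on the truncated character.
try:
    first.decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
print(naive_ok)                        # False

# An incremental decoder holds back the partial character until it is complete.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(first), decoder.decode(second, final=True)]
print(pieces)                          # ['caf', 'é']

# UTF-32, by contrast, always spends four bytes per character.
wide = text.encode('utf-32-le')        # the -le variant omits the byte order mark
print(len(wide))                       # 16
```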
You can find a list of all the encodings that Python supports by looking up the Standard Library documentation
for the codecs module.
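If you only need to check whether a particular encoding name is available, codecs.lookup() can be used from code. The sketch below resolves a few aliases to their canonical codec names; the describe() helper and the deliberately bogus name 'no-such-codec' are made up for this example.

```python
import codecs

def describe(name):
    """Return the canonical codec name for an alias, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

for alias in ('latin1', 'latin2', 'greek', 'hebrew', 'no-such-codec'):
    print(alias, '->', describe(alias))
```

Aliases that refer to the same codec resolve to the same canonical name, so describe('latin1') and describe('iso-8859-1') return equal strings.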
Most of the single-byte encodings built into Python are extensions of ASCII that use the remaining 128 values for
region-specific letters or symbols:





>>> b'\x67\x68\x69\xe7\xe8\xe9'.decode('latin1')
'ghiçèé'
>>> b'\x67\x68\x69\xe7\xe8\xe9'.decode('latin2')
'ghiçčé'
>>> b'\x67\x68\x69\xe7\xe8\xe9'.decode('greek')
'ghiηθι'
>>> b'\x67\x68\x69\xe7\xe8\xe9'.decode('hebrew')
'ghiחטי'



