You will generally want to fix such errors either by determining that you are using the wrong encoding or by
working out why your data is not conforming to the encoding that you expect of it. If neither fix works, however, and
you find that your code must routinely survive mismatches between declared encodings and actual strings and data,
then you will want to learn about the optional errors argument that decode() and encode() accept, which selects
a less drastic policy than raising an exception:
>>> b'ab\x80def'.decode('ascii', 'replace')
'ab�def'
>>> b'ab\x80def'.decode('ascii', 'ignore')
'abdef'
>>> 'ghiηθι'.encode('latin-1', 'replace')
b'ghi???'
>>> 'ghiηθι'.encode('latin-1', 'ignore')
b'ghi'
These are described in the Standard Library documentation for the codecs module, and you can find more
examples in Doug Hellmann’s Python Module of the Week entry on codecs as well.
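Beyond 'replace' and 'ignore', for example, the 'backslashreplace' error handler (which has worked for
decoding as well as encoding since Python 3.5) keeps the offending bytes visible as escape sequences:

>>> b'ab\x80def'.decode('ascii', 'backslashreplace')
'ab\\x80def'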
Note again that it is dangerous to decode a partially received message if you are using an encoding that encodes
some characters using multiple bytes, since one of those characters might have been split between the part of the
message that you have already received and the packets that have not yet arrived. See the “Framing and Quoting”
section later in this chapter for some approaches to this issue.
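If your code must decode data as it streams in, one defensive approach is the incremental decoder interface
that the codecs module provides, which holds back a trailing incomplete character until the bytes that finish
it arrive. Here is a minimal sketch, assuming a UTF-8 stream in which a three-byte character happens to be
split across two packets:

>>> import codecs
>>> decoder = codecs.getincrementaldecoder('utf-8')()
>>> decoder.decode(b'abc\xe2\x98')   # packet ends mid-character
'abc'
>>> decoder.decode(b'\x83def')       # next packet completes it
'☃def'

Only a call with final=True will make the decoder complain about a truncated character left dangling at the
end of the stream.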
Binary Numbers and Network Byte Order
If all you ever want to send across the network is text, then encoding and framing (which you will tackle in the next
section) will be your only worries.
However, sometimes you might want to represent your data in a more compact format than text makes possible.
Or you might be writing Python code to interface with a service that has already made the choice to use raw binary
data. In either case, you will probably have to start worrying about a new issue: network byte order.
To understand the issue of byte order, consider the process of sending an integer over the network. To be specific,
think about the integer 4253.
Of course, many protocols will simply transmit this integer as the string '4253'—that is, as four distinct
characters. The four digits will require at least four bytes to transmit, at least in any of the usual text encodings. Using
decimal digits will also involve some computational expense: since numbers are not stored inside computers in
base 10, it will take repeated division—with inspection of the remainder—for the program transmitting the value to
determine that this number is in fact made of 4 thousands, plus 2 hundreds, plus 5 tens, plus 3 left over. And when the
four-digit string '4253' is received, repeated addition and multiplication by powers of ten will be necessary to put the
text back together into a number.
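The difference is easy to see at the Python prompt. The Standard Library struct module converts between
Python integers and their binary representations; in its format strings, '!' requests network byte order and
'H' an unsigned 16-bit integer:

>>> str(4253).encode('ascii')   # text: one byte per decimal digit
b'4253'
>>> import struct
>>> struct.pack('!H', 4253)     # binary: two bytes, most significant first
b'\x10\x9d'
>>> struct.unpack('!H', b'\x10\x9d')
(4253,)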
Despite its verbosity, the technique of using plain text for numbers may actually be the most popular on the
Internet today. Every time you fetch a web page, for example, the HTTP protocol expresses the Content-Length of the
result using a string of decimal digits just like '4253'. Both the web server and the client do the decimal conversion
without a second thought, despite a bit of expense. Much of the story of the past 20 years in networking, in fact, has
been the replacement of dense binary formats with protocols that are simple, obvious, and human-readable—even if
computationally expensive compared to their predecessors.
Of course, multiplication and division are also cheaper on modern processors than back when binary formats
were more common—not only because processors have experienced a vast increase in speed but because their
designers have become much more clever about implementing integer math so that the same operation requires far
fewer cycles today than on the processors of, say, the early 1980s.