Short of writing our own email parser, or pursuing other similarly complex approaches,
the best bet today for fetched messages seems to be decoding per user preferences and
defaults, and that’s how we’ll proceed in this edition. The PyMailGUI client of
Chapter 14, for instance, will allow Unicode encodings for full mail text to be set
on a per-session basis.
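As a rough illustration of decoding per user preferences and defaults, the following sketch tries a user-supplied encoding first and then a short list of fallbacks before handing the text to the email parser. The function name decode_full_text, its parameters, and the fallback order are assumptions made for this example; they are not PyMailGUI's actual code.

```python
from email import message_from_string

def decode_full_text(raw_bytes, preferred=None,
                     fallbacks=('ascii', 'utf-8', 'latin-1')):
    """Decode fetched mail bytes to str; return (text, encoding_used).

    Hypothetical helper: tries the user's preferred encoding first,
    then each fallback in order.
    """
    candidates = ([preferred] if preferred else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw_bytes.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next one
    # Last resort: Latin-1 maps every byte value, so this cannot fail
    return raw_bytes.decode('latin-1'), 'latin-1'

# Sample message whose body holds one Latin-1 byte (0xE9)
raw = b'Subject: test\n\ncaf\xe9\n'
text, used = decode_full_text(raw, preferred='utf-8')
msg = message_from_string(text)   # the parser receives a str
print(used, msg['Subject'])
```

Here the preferred UTF-8 decoding fails on the 0xE9 byte, so the helper falls through to Latin-1. A real client would probably also let the user see or override which encoding was finally applied.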
The real issue, of course, is that email in general is inherently complicated by the
presence of arbitrary text encodings. Besides full mail text, we also must consider Unicode
encoding issues for the text components of a message once it’s parsed—both its text
parts and its message headers. To see why, let’s move on.
Related Issue for CGI scripts: I should also note that the full text decoding
issue may not be as large a factor for email as it is for some other
email package clients. Because the original email standards call for
ASCII text and require binary data to be MIME encoded, most emails
are likely to decode properly according to a 7- or 8-bit encoding such as
Latin-1.
As we’ll see in Chapter 15, though, a related and more intractable
issue looms for server-side scripts that support CGI file uploads on the
Web. Python’s cgi module also uses the email package to parse
multipart form data, this package requires data to be decoded to str
for parsing, and such data might mix text and binary content
(including raw binary data that is not MIME-encoded, text of any
encoding, and even arbitrary combinations of these). As a result, these
uploads fail in Python 3.1 if any binary or incompatible text files are
included. The cgi module triggers Unicode decoding or type errors
internally, before the Python script has a chance to intervene.
CGI uploads worked in Python 2.X, because the str type represented
both possibly encoded text and binary data. Saving this type’s content
to a binary mode file as a string of bytes in 2.X sufficed for both arbitrary
text and binary data such as images. Email parsing worked in 2.X for
the same reason. For better or worse, the 3.X str/bytes dichotomy
makes this generality impossible.
In other words, although we can generally work around the email
parser’s str requirement for fetched emails by decoding per an 8-bit
encoding, the same requirement is far more damaging for web
scripting today. Watch for
more details on this in Chapter 15, and stay tuned for a future fix, which
may have materialized by the time you read these words.
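To make the 8-bit workaround concrete, the short sketch below shows why Latin-1 decoding never fails and what it costs when the data was really encoded some other way. The variable names are illustrative only.

```python
# Latin-1 assigns a character to every byte value 0..255,
# so decoding arbitrary bytes with it cannot raise an error.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')
round_trip = text.encode('latin-1')      # recovers the original bytes

# ASCII, by contrast, rejects any byte above 127.
try:
    all_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The cost: text that was really UTF-8 decodes without error,
# but to the wrong characters.
utf8_data = 'caf\u00e9'.encode('utf-8')  # b'caf\xc3\xa9'
wrong = utf8_data.decode('latin-1')      # 'caf\u00c3\u00a9', not 'caf\u00e9'
print(round_trip == all_bytes, ascii_ok, ascii(wrong))
```

Because the round trip is lossless, a program can still recover the original bytes later by re-encoding as Latin-1, which is why this encoding is a common fallback.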
Text payload encodings: Handling mixed type results
Our next email Unicode issue seems to fly in the face of Python’s generic programming
model: the data types of message payload objects may differ, depending on how they
are fetched. Especially for programs that walk and process payloads of mail parts
generically, this complicates code.
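The following minimal sketch shows the kind of type difference at issue, assuming a simple Base64-encoded text part. The helper payload_text and its default encoding are hypothetical names introduced for this example, not part of the email package.

```python
from email import message_from_string

raw = (
    'Content-Type: text/plain; charset="utf-8"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'c3BhbQ==\n'                          # Base64 for b'spam'
)
msg = message_from_string(raw)

as_str = msg.get_payload()                # str: still Base64 text
as_bytes = msg.get_payload(decode=True)   # bytes: b'spam'
print(type(as_str).__name__, type(as_bytes).__name__)

def payload_text(part, default='latin-1'):
    """Return a part's payload as str, whatever type was fetched.

    Hypothetical helper: returns None for multipart containers.
    """
    data = part.get_payload(decode=True)
    if data is None:                      # multipart: no single payload
        return None
    charset = part.get_content_charset() or default
    try:
        return data.decode(charset, errors='replace')
    except LookupError:                   # unknown charset name in header
        return data.decode(default, errors='replace')

print(payload_text(msg))                  # spam
```

Code that walks parts generically has to normalize in some way like this, because the same part yields str from one call and bytes from the other.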