This might be especially useful on systems that use simple encodings such as ASCII or
Latin-1, but may contain files with arbitrarily encoded names from cross-machine cop-
ies, the Web, and so on. Depending upon context, exception handlers may be used to
suppress some types of encoding errors as well.
We’ll see an example of how this can matter in the first section of Chapter 6, where an
undecodable directory name generates an error if printed during a full disk scan (al-
though that specific error seems more related to printing than to decoding in general).
Note that the basic open built-in function allows the name of the file being opened to
be passed as either Unicode str or raw bytes, too, though this is used only to name the
file initially; the additional mode argument determines whether the file’s content is
handled in text or binary modes. Passing a byte string filename allows you to name files
with arbitrarily encoded names.
Unicode policies: File content versus file names
In fact, it’s important to keep in mind that there are two different Unicode concepts
related to files: the encoding of file content and the encoding of file name. Python pro-
vides your platform’s defaults for these settings in two different attributes; on
Windows 7:
>>> import sys
>>> sys.getdefaultencoding() # file content encoding, platform default
'utf-8'
>>> sys.getfilesystemencoding() # file name encoding, platform scheme
'mbcs'
These settings allow you to be explicit when needed—the content encoding is used
when data is read and written to the file, and the name encoding is used when dealing
with names prior to transferring data. In addition, using bytes for file name tools may
work around incompatibilities with the underlying file system’s scheme, and opening
files in binary mode can suppress Unicode decoding errors for content.
As we’ve seen, though, opening text files in binary mode may also mean that the raw
and still-encoded text will not match search strings as expected: search strings must
also be byte strings encoded per a specific and possibly incompatible encoding scheme.
In fact, this approach essentially mimics the behavior of text files in Python 2.X, and
underscores why elevating Unicode in 3.X is generally desirable—such text files some-
times may appear to work even though they probably shouldn’t. On the other hand,
opening text in binary mode to suppress Unicode content decoding and avoid decoding
errors might still be useful if you do not wish to skip undecodable files and content is
largely irrelevant.
174 | Chapter 4: File and Directory Tools