print a directory name of a saved web page. Adding the exception handler skips the
error entirely.
This demonstrates a subtle but pragmatically important issue: Python 3.X’s Unicode
orientation extends to filenames, even if they are just printed. As we learned in
Chapter 4, because filenames may contain arbitrary text, os.listdir returns filenames in two
different ways: we get back decoded Unicode strings when we pass in a normal str
argument, and still-encoded byte strings when we pass in a bytes argument:
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']
>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']
Both os.walk (used in the Example 6-4 script) and glob.glob inherit this behavior for
the directory and file names they return, because they work by calling os.listdir
internally at each directory level. For all these calls, passing in a byte string argument
suppresses Unicode decoding of file and directory names. Passing a normal string
assumes that filenames are decodable per the file system’s Unicode scheme.
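The rule that the argument type determines the result type can be checked with a short sketch like the following. It is an illustration under stated assumptions rather than one of the book's examples: the temporary directory, the file name spam.txt, and the helper name result_types are all made up for the demonstration.

```python
import glob
import os
import tempfile

def result_types(root):
    """Collect the types of names returned for root (hypothetical helper)."""
    star = '*' if isinstance(root, str) else b'*'    # pattern must match root's type
    listed = {type(name) for name in os.listdir(root)}
    walked = {type(name)
              for (dirpath, subs, files) in os.walk(root)
              for name in [dirpath] + subs + files}
    globbed = {type(name) for name in glob.glob(os.path.join(root, star))}
    return listed, walked, globbed

with tempfile.TemporaryDirectory() as tmp:
    # One empty file so that each call has something to return
    open(os.path.join(tmp, 'spam.txt'), 'w').close()
    str_types = result_types(tmp)                    # str in, str out
    bytes_types = result_types(os.fsencode(tmp))     # bytes in, bytes out

print(str_types)
print(bytes_types)
```

Each of the three calls should report only str names for the str root and only bytes names for the bytes root.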
The reason this potentially mattered to this section’s example is that running the tree
search version over an entire hard drive eventually reached an undecodable filename
(an old saved web page with an odd name), which generated an exception when the
print function tried to display it. Here’s a simplified recreation of the error, run in a
shell window (Command Prompt) on Windows:
>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 45: character maps to <undefined>
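The exception is raised by the encoding step that print performs for the console, not by os.walk itself. The following minimal sketch assumes a made-up directory name containing the '\u2019' character and the cp437 code page named in the traceback. Encoding the string explicitly fails in the same way, and the backslashreplace error handler is shown only as one illustrative way to avoid the exception.

```python
name = 'Guido\u2019s page_files'      # hypothetical name containing U+2019

try:
    name.encode('cp437')              # the step print performs for a cp437 console
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print('cannot encode:', exc.reason)

# Substitute an escape for each unencodable character instead of raising
safe = name.encode('cp437', errors='backslashreplace').decode('cp437')
print(safe)
```

The escaped form is printable on any console, at the cost of showing the escape sequence rather than the original character.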
One way out of this dilemma is to use a bytes string for the directory root name. This
suppresses filename decoding in the os.listdir calls run by os.walk, and effectively
limits the scope of later printing to raw bytes. Since printing does not have to deal with
encodings, it works without error. Manually encoding to bytes prior to printing works
too, but the results are slightly different:
>>> root.encode()
b'C:\\py3000'
>>> for (dir, subs, files) in os.walk(root.encode()): print(dir)
...