>>> open('data.txt', 'rb').read()
b'\xa2\x97\x81\x94\r%\x88\x81\x94\r%'
If all your text is ASCII you generally can ignore encoding altogether; data in files maps
directly to characters in strings, because ASCII is a subset of most platforms’ default
encodings. If you must process files created with other encodings, and possibly on
different platforms (obtained from the Web, for instance), binary mode may be required
if encoding type is unknown. Keep in mind, however, that text in still-encoded binary
form might not work as you expect: because it is encoded per a given encoding scheme,
it might not accurately compare or combine with text encoded in other schemes.
Again, see other resources for more on the Unicode story. We’ll revisit the Unicode
story at various points in this book, especially in Chapter 9, to see how it relates to the
tkinter Text widget, and in Part IV, covering Internet programming, to learn what it
means for data shipped over networks by protocols such as FTP, email, and the Web
at large. Text files have another feature, though, which is similarly a nonfeature for
binary data: line-end translations, the topic of the next section.
End-of-line translations for text files
For historical reasons, the end of a line of text in a file is represented by different char-
acters on different platforms. It’s a single \n character on Unix-like platforms, but the
two-character sequence \r\n on Windows. That’s why files moved between Linux and
Windows may look odd in your text editor after transfer—they may still be stored using
the original platform’s end-of-line convention.
For example, most Windows editors handle text in Unix format, but Notepad has been
a notable exception—text files copied from Unix or Linux may look like one long line
when viewed in Notepad, with strange characters inside (\n). Similarly, transferring a
file from Windows to Unix in binary mode retains the \r characters (which often appear
as ^M in text editors).
Python scripts that process text files don’t normally have to care, because the files object
automatically maps the DOS \r\n sequence to a single \n. It works like this by default—
when scripts are run on Windows:
- For files opened in text mode, \r\n is translated to \n when input.
- For files opened in text mode, \n is translated to \r\n when output.
- For files opened in binary mode, no translation occurs on input or output.
On Unix-like platforms, no translations occur, because \n is used in files. You should
keep in mind two important consequences of these rules. First, the end-of-line character
for text-mode files is almost always represented as a single \n within Python scripts,
regardless of how it is stored in external files on the underlying platform. By mapping
to and from \n on input and output, Python hides the platform-specific difference.
The second consequence of the mapping is subtler: when processing binary files, binary
open modes (e.g, rb, wb) effectively turn off line-end translations. If they did not, the
File Tools | 149