exceptions or try alternative schemes; this is especially true on platforms where ASCII
may be the default platform encoding.
The problem with treating text as bytes
The prior sections’ rules may seem complex, but they boil down to the following:
- Unless strings always use the platform default, we need to know encoding types
to read or write in text mode and to manually decode or encode for binary mode.
- We can use almost any encoding to write new files as long as it can handle the
string’s characters, but must provide one that is compatible with the existing data’s
binary format on reads.
- We don’t need to know the encoding mode to read text as bytes in binary mode
for display, but the str content returned by the Text widget still requires us to
encode to write on saves.
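These rules can be tried out in a short, GUI-free sketch. The scratch file name and the sample string are made up for illustration; any text with a non-ASCII character would serve as well.

```python
import os
import tempfile

text = 'sp\xc4m\n'        # sample str containing one non-ASCII character
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')    # hypothetical scratch file

# Writes: any encoding that can represent the characters will do...
with open(path, 'w', encoding='utf-16', newline='') as f:
    f.write(text)

# ...but not every encoding can: ASCII has no code for '\xc4'
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Reads: text mode must name an encoding compatible with the stored bytes
with open(path, 'r', encoding='utf-16', newline='') as f:
    same = f.read()

# Binary mode needs no encoding name, but returns bytes rather than str
with open(path, 'rb') as f:
    raw = f.read()

# Converting those bytes to str is then a manual decode step
decoded = raw.decode('utf-16')
```

Here `same` and `decoded` both equal the original `text`, while `raw` holds the undecoded UTF-16 bytes that were stored in the file.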
So why not always load text files in binary mode to display them in a tkinter Text widget?
While binary mode input files seem to side-step encoding issues for display, passing
text to tkinter as bytes instead of str really just delegates the encoding issue to the Tk
library, which imposes constraints of its own.
More specifically, opening input files in binary mode to read bytes may seem to support
viewing arbitrary types of text, but it has two potential downsides:
- It shifts the burden of deciding encoding type from our script to the Tk GUI library.
The library must still determine how to render those bytes and may not support
all encodings possible.
- It allows opening and viewing data that is not text in nature, thereby defeating
some of the purpose of the validity checks performed by text decoding.
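The second downside can be illustrated without a GUI. In this sketch, the byte string is an arbitrary stand-in for nontext data (the signature bytes that begin a PNG image file), and `is_text` is a hypothetical helper name, not part of any library.

```python
data = b'\x89PNG\r\n\x1a\n'      # PNG file signature: binary data, not text

def is_text(raw, encoding='utf-8'):
    """Return True if raw decodes cleanly under the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

looks_like_utf8 = is_text(data)                 # decoding rejects these bytes
looks_like_latin1 = is_text(data, 'latin-1')    # Latin-1 maps every byte value
```

A binary mode read would hand `data` to the GUI without complaint, whereas a UTF-8 decode fails on its first byte. The check is only partial, though: Latin-1 assigns a character to every possible byte value, so decoding with it never raises an error.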
The first point is probably the most crucial here. In experiments I’ve run on Windows,
Tk seems to correctly handle raw bytes strings encoded in ASCII, UTF-8 and Latin-1
format, but not UTF-16 or others such as CP500. By contrast, these all render correctly
if decoded in Python to str before being passed on to Tk. In programs intended for the
world at large, this wider support is crucial today. If you’re able to know or ask for
encodings, you’re better off using str both for display and saves.
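As a minimal sketch of this approach, the following decodes bytes to str in Python before anything is handed to Tk. The function names are hypothetical, and the sample bytes are built in memory rather than read from a file; `show` needs a display to run, so it is defined here but not called.

```python
def decode_for_display(raw, encoding):
    """Decode file bytes to str in Python, so Tk is never passed encoded bytes."""
    return raw.decode(encoding)

def show(text):
    """Display an already decoded str in a tkinter Text widget."""
    from tkinter import Tk, Text, END    # imported here: requires a GUI display
    root = Tk()
    widget = Text(root)
    widget.pack()
    widget.insert(END, text)
    root.mainloop()

utf16_bytes = 'spam'.encode('utf-16')    # same text, two very different
cp500_bytes = 'spam'.encode('cp500')     # byte-level representations

from_utf16 = decode_for_display(utf16_bytes, 'utf-16')
from_cp500 = decode_for_display(cp500_bytes, 'cp500')
```

Both decoded results are the same str, so a call such as `show(from_cp500)` passes Tk ordinary Unicode text regardless of how the bytes were originally encoded.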
To some degree, regardless of whether you pass in str or bytes, tkinter GUIs are subject
to the constraints imposed by the underlying Tk library and the Tcl language it uses
internally, as well as any imposed by the techniques Python’s tkinter uses to interface
with Tk. For example:
- Tcl, the internal implementation language of the Tk library, stores strings internally
in UTF-8 format, and decrees that strings passed in to and returned from its C API
be in this format.