Unicode text in files
The same rules apply to text files, because Unicode strings are stored in files as
encoded bytes. When writing, we can use any encoding that can represent all of the
string’s characters. When reading, though, we generally must know which encoding
was used, or name one that maps the file’s bytes to the same characters:
>>> open('ldata', 'w', encoding='latin-1').write(s) # store in latin-1 format
5
>>> open('udata', 'w', encoding='utf-8').write(s) # store in utf-8 format
5
>>> open('ldata', 'r', encoding='latin-1').read() # OK if correct name given
'AÄBäC'
>>> open('udata', 'r', encoding='utf-8').read()
'AÄBäC'
>>> open('ldata', 'r').read() # else, platform default is used: may not work
'AÄBäC'
>>> open('udata', 'r').read()
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-3: cha...
>>> open('ldata', 'r', encoding='utf-8').read()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> open('udata', 'r', encoding='latin-1').read()
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
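The three outcomes in this session (a correct read, a silently garbled read, and a failed decode) can be reproduced in a short script. This is a minimal sketch, not code from the text: the helper name roundtrip is hypothetical, the value of s is assumed from the outputs shown above, and a temporary directory stands in for the ldata and udata files. Note that a mismatched single-byte encoding such as Latin-1 raises no error on reading; the garbled result may only fail later, for instance when a console cannot display it.

```python
import os
import tempfile

s = 'A\xc4B\xe4C'  # assumed value of s ('AÄBäC'), inferred from the outputs above

def roundtrip(text, write_enc, read_enc):
    """Write text with one encoding, read it back with another.

    Returns the decoded string, or the UnicodeDecodeError if decoding fails.
    (Hypothetical helper, for illustration only.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'data.txt')
        with open(path, 'w', encoding=write_enc) as f:
            f.write(text)
        try:
            with open(path, 'r', encoding=read_enc) as f:
                return f.read()
        except UnicodeDecodeError as exc:
            return exc

same = roundtrip(s, 'utf-8', 'utf-8')      # matching names: original text comes back
wrong = roundtrip(s, 'utf-8', 'latin-1')   # no exception, but the text is garbled
failed = roundtrip(s, 'latin-1', 'utf-8')  # bytes are not valid UTF-8: decoding fails

print(repr(same), repr(wrong), type(failed).__name__)
```

Passing encoding explicitly on both the write and the read, as here, avoids any dependence on the platform default.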
By contrast, binary mode files don’t attempt to decode into a Unicode string; they
happily read whatever is present, whether the data was written to the file in text mode
with automatically encoded str strings (as in the preceding interaction) or in binary
mode with manually encoded bytes strings:
>>> open('ldata', 'rb').read()
b'A\xc4B\xe4C'
>>> open('udata', 'rb').read()
b'A\xc3\x84B\xc3\xa4C'
>>> open('sdata', 'wb').write( s.encode('utf-16') ) # return value: 12
>>> open('sdata', 'rb').read()
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00'
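Because binary mode returns raw bytes, any decoding is left to the caller. The sketch below illustrates this under the same assumption about the value of s; the helper name bytes_on_disk is hypothetical, and a temporary file replaces the ldata, udata, and sdata files used above. It records how many bytes each encoding produces and decodes the bytes manually with the matching encoding name.

```python
import os
import tempfile

s = 'A\xc4B\xe4C'  # assumed value of s ('AÄBäC'), inferred from the outputs above

def bytes_on_disk(text, encoding):
    """Encode text manually, write it in binary mode, and read the raw bytes back.

    (Hypothetical helper, for illustration only.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'data.bin')
        with open(path, 'wb') as f:
            f.write(text.encode(encoding))
        with open(path, 'rb') as f:
            return f.read()  # no decoding happens here

encodings = ('latin-1', 'utf-8', 'utf-16')
raw = {enc: bytes_on_disk(s, enc) for enc in encodings}

# Size in bytes differs per encoding, even though the text is the same
sizes = {enc: len(data) for enc, data in raw.items()}

# Decoding with the matching name recovers the original string in every case
decoded = {enc: data.decode(enc) for enc, data in raw.items()}

print(sizes)
```

The UTF-16 result is two bytes longer than the ten bytes of character data because the utf-16 codec writes a byte order mark first, which matches the return value of 12 noted above.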
Unicode and the Text widget
The application of all this to tkinter Text displays is straightforward: if we open in binary
mode to read bytes, we don’t need to be concerned about encodings in our own code;
tkinter interprets the data as expected, at least for the Latin-1 and UTF-8 encodings used here:
>>> from tkinter import Text
>>> t = Text()
>>> t.insert('1.0', open('ldata', 'rb').read())
>>> t.pack() # string appears in GUI OK
>>> t.get('1.0', 'end')
'AÄBäC\n'
>>>
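An alternative to handing raw bytes to the widget is to decode them in our own code first and insert the resulting str. The following is only a sketch of that idea, not the approach taken in the text: the helper name decode_for_display and its default list of candidate encodings are assumptions made for illustration. It tries each encoding in order and reports which one succeeded.

```python
def decode_for_display(raw, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding_name) for the first candidate that decodes raw.

    Hypothetical helper: the candidate list and its order are assumptions.
    UTF-8 is tried first because it rejects most non-UTF-8 data, whereas
    Latin-1 accepts any byte sequence and so works only as a last resort.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the candidate encodings could decode the data')

# Bytes as stored in the Latin-1 and UTF-8 files shown earlier
text_l, enc_l = decode_for_display(b'A\xc4B\xe4C')
text_u, enc_u = decode_for_display(b'A\xc3\x84B\xc3\xa4C')

print(enc_l, enc_u)
```

The decoded text can then be passed to the widget as an ordinary string, for example with t.insert('1.0', text_l). Because Latin-1 never fails, a successful result here does not prove the guess was right; it only guarantees that some str is produced.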