Programming Python, 4th Edition, by Mark Lutz

>>> x
'AÄBäC'

>>> x = s.encode('utf-8').decode('latin-1') # decoding works, result is garbage
>>> x # display error from this console, not a decoding failure
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...

>>> len(s), len(x) # no longer the same string
(5, 7)

>>> s.encode('utf-8') # no longer same code points
b'A\xc3\x84B\xc3\xa4C'
>>> x.encode('utf-8')
b'A\xc3\x83\xc2\x84B\xc3\x83\xc2\xa4C'

>>> s.encode('latin-1')
b'A\xc4B\xe4C'
>>> x.encode('latin-1')
b'A\xc3\x84B\xc3\xa4C'
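
The error on echoing x in the session above appears to come from the console's own encoding rather than from the decode step; that is an assumption about the environment the session was run in. As a minimal sketch, the built-in ascii() and a list of code points let us inspect the garbled string without depending on what the console can display:

```python
s = 'A\xc4B\xe4C'                          # same text as 'AÄBäC', written with escapes
x = s.encode('utf-8').decode('latin-1')    # mismatched decode: succeeds, but garbles

# ascii() escapes every non-ASCII character, so the result prints on any console
shown = ascii(x)
print(shown)

# Listing code points makes the difference between s and x explicit
s_points = [hex(ord(c)) for c in s]
x_points = [hex(ord(c)) for c in x]
print(s_points)    # 5 code points
print(x_points)    # 7 code points: each accented letter became two characters
```

Each 2-byte UTF-8 sequence was read by Latin-1 as two separate 1-byte characters, which is why the lengths differ.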

Curiously, the original string may still be recoverable after a mismatch like this: if we
encode the scrambled string back to Latin-1 bytes (one byte per character) and then
decode those bytes properly as UTF-8, we might restore the original. In some contexts
this can serve as a sort of second chance if data is decoded wrong initially:


>>> s
'AÄBäC'
>>> s.encode('utf-8').decode('latin-1') # display error from this console, not a decoding failure
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
>>> s.encode('utf-8').decode('latin-1').encode('latin-1')
b'A\xc3\x84B\xc3\xa4C'
>>> s.encode('utf-8').decode('latin-1').encode('latin-1').decode('utf-8')
'AÄBäC'
>>> s.encode('utf-8').decode('latin-1').encode('latin-1').decode('utf-8') == s
True
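
The round trip in the session above can be wrapped in a small function. This is only a sketch: repair_mojibake and its parameter names are hypothetical, not part of Python or of the book's examples, and it assumes we already know which encoding was wrongly used and which was correct:

```python
def repair_mojibake(text, wrong='latin-1', right='utf-8'):
    """Hypothetical helper: undo a decode that used the wrong encoding.

    Returns the repaired string, or None if the round trip is not possible.
    """
    try:
        # Re-encode with the wrong codec to get the raw bytes back,
        # then decode those bytes with the codec that should have been used
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None

s = 'A\xc4B\xe4C'                                # same text as 'AÄBäC'
garbled = s.encode('utf-8').decode('latin-1')    # the mismatch shown earlier
restored = repair_mojibake(garbled)              # equal to s again

# A correctly decoded string is not valid UTF-8 once re-encoded, so we get None
untouched = repair_mojibake(s)

# The euro sign has no Latin-1 byte, so re-encoding fails and we get None
lost = repair_mojibake('A\u20acB')
```

The recovery works here because Latin-1 maps every byte value to exactly one code point, so no information is lost in the wrong decode. Other encodings may reject some bytes, or map them ambiguously, and then the original is not recoverable this way.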

On the other hand, we can use a different encoding name to decode, as long as it’s
compatible with the format of the data; ASCII, UTF-8, and Latin-1, for instance, all
encode characters in the ASCII range as the same single bytes:


>>> 'spam'.encode('utf8').decode('latin1')
'spam'
>>> 'spam'.encode('latin1').decode('ascii')
'spam'
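
This compatibility holds only for characters in the ASCII range. As a quick sketch (the sample strings here are our own, chosen for illustration), the three encodings agree on 'spam' but part ways as soon as one accented character appears:

```python
ascii_text = 'spam'
# A set keeps one copy of each distinct bytes value: all three encodings agree
same = {ascii_text.encode(name) for name in ('ascii', 'utf-8', 'latin-1')}

accented = 'sp\xe4m'                        # 'späm', written with an escape
as_utf8 = accented.encode('utf-8')          # the accented letter takes 2 bytes
as_latin1 = accented.encode('latin-1')      # the accented letter takes 1 byte

try:
    accented.encode('ascii')                # ASCII has no byte for this character
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Because the bytes differ for non-ASCII characters, swapping encoding names between encode and decode is safe only when the text is known to be pure ASCII.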

It’s important to remember that a string’s decoded value doesn’t depend on the
encoding it came from: once decoded, a string has no notion of encoding and is simply
a sequence of Unicode characters (“code points”). Hence, we really only need to care
about encodings at the point of transfer to and from files:


>>> s
'AÄBäC'
>>> s.encode('utf-16').decode('utf-16') == s.encode('latin-1').decode('latin-1')
True
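
The same point can be made with real files. In this sketch, the temporary folder and the file names are arbitrary choices of ours; we write one string under three encodings, then compare what lands on disk with what we get back:

```python
import os
import tempfile

s = 'A\xc4B\xe4C'                       # same text as 'AÄBäC'
results = []

with tempfile.TemporaryDirectory() as folder:
    for name in ('utf-8', 'utf-16', 'latin-1'):
        path = os.path.join(folder, 'data-' + name + '.txt')

        # Text mode encodes on the way out, using the name we pass
        with open(path, 'w', encoding=name) as f:
            f.write(s)

        # Binary mode shows the raw bytes: these differ per encoding
        with open(path, 'rb') as f:
            raw = f.read()

        # Text mode with the same name decodes back to the same string
        with open(path, 'r', encoding=name) as f:
            text = f.read()

        results.append((name, len(raw), text == s))

for row in results:
    print(row)
```

The byte counts on disk differ from file to file, but every file decodes to the same 5-character string, so the encoding matters only at the boundary where text is written or read.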

