>>> s = b.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s = b.decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s = b.decode('latin1')
>>> s
'AÄBäC'
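The session above continues from earlier context; as its latin1 decode reveals, the variable b must hold the Latin-1 encoded bytes b'A\xc4B\xe4C'. A self-contained recreation, under that assumption:

```python
# Assumed from context: b holds the Latin-1 encoding of 'AÄBäC'
b = b'A\xc4B\xe4C'

try:
    b.decode('utf8')           # \xc4 starts a UTF-8 sequence, but 'B' can't continue it
except UnicodeDecodeError as exc:
    print('utf8 failed:', exc)

print(b.decode('latin1'))      # 'AÄBäC': every byte is a valid Latin-1 character
```

The UTF-8 decode fails because 0xC4 announces a two-byte sequence whose next byte must be a continuation byte, and ASCII 'B' is not; Latin-1 succeeds because it maps every possible byte to a character.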
Once you’ve decoded to a Unicode string, you can “convert” it to a variety of different
encoding schemes. Really, this simply translates the string to alternative binary encoding
formats, from which you can decode again later; a decoded Unicode string has no
encoding type per se, but encoded binary data does:
>>> s.encode('latin-1')
b'A\xc4B\xe4C'
>>> s.encode('utf-8')
b'A\xc3\x84B\xc3\xa4C'
>>> s.encode('utf-16')
b'\xff\xfeA\x00\xc4\x00B\x00\xe4\x00C\x00'
>>> s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: o...
Notice the last test here: the string you encode must be representable in the scheme
you choose, or you’ll get an exception; here, ASCII is too narrow to represent characters
decoded from Latin-1 bytes. And even though you can convert a string to the bytes of
any compatible encoding, you must generally know which encoding was used in order
to decode the bytes back to a string:
>>> s.encode('utf-16').decode('utf-16')
'AÄBäC'
>>> s.encode('latin-1').decode('latin-1')
'AÄBäC'
>>> s.encode('latin-1').decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid dat...
>>> s.encode('utf-8').decode('latin-1')
UnicodeEncodeError: 'charmap' codec can't encode character '\xc3' in position 2:...
Note the last test here again. Technically, encoding Unicode code points (characters)
to UTF-8 bytes and then decoding them again per the Latin-1 format does not raise an
error, but trying to print the result does (the UnicodeEncodeError above comes from
the interactive shell’s attempt to display the mismatched result): it’s scrambled garbage.
To maintain fidelity, you must generally know what format encoded bytes are in:
>>> s
'AÄBäC'
>>> x = s.encode('utf-8').decode('utf-8') # OK if encoding matches data
>>> x
'AÄBäC'
>>> x = s.encode('latin-1').decode('latin-1') # any compatible encoding works
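Because the “scrambled garbage” effect is subtle, a self-contained sketch may help (using the same sample string s as above): a mismatched decode succeeds silently but loses fidelity, and matching codecs undo the damage. The final lines also show the standard errors argument of encode, one common way to cope when a target scheme such as ASCII is too narrow:

```python
s = 'AÄBäC'                                    # same sample string as above

# Mismatched decode: raises no exception, but yields scrambled text
garbled = s.encode('utf-8').decode('latin-1')
print(garbled == s)                            # False: 'AÃ\x84BÃ¤C', not 'AÄBäC'

# Fidelity requires matching codecs on both ends
print(garbled.encode('latin-1').decode('utf-8') == s)   # True: undoes the mix-up

# When a scheme is too narrow, error handlers trade fidelity for safety
print(s.encode('ascii', errors='replace'))              # b'A?B?C'
print(s.encode('ascii', errors='xmlcharrefreplace'))    # b'A&#196;B&#228;C'
```

The replace handler substitutes ? for unencodable characters, while xmlcharrefreplace emits XML character references, which is handy when the output will be embedded in HTML.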