Writing UTF-8 strings in Python

When processing strings in Python you may have to deal with special characters. You test your code by printing to standard output and everything works. However, when you try to write the string to a file, you get a complaint like the following:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 255: ordinal not in range(128)

What is the problem?

The problem comes from the fact that, when writing to the file, Python uses the [[!wikipedia ASCII]] codec, which encodes characters on 7 bits and therefore knows only 128 different characters, while your string is encoded with a larger character set. On most current Unix systems the default character encoding is [[!wikipedia UTF-8]], which encodes each character with a variable number of bytes (one to four). So it may happen that some of the characters you want to write fall outside the [[!wikipedia ASCII]] set. There is no problem when printing to standard output, because your console is almost certainly using [[!wikipedia UTF-8]] as well.
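
The failure can be reproduced in isolation. The following is a minimal sketch, not taken from the original program: the byte string `data` and the variable `error_message` are illustrative names, and the `b"..."` literal is used only so that the snippet behaves the same on Python 2.6+ and Python 3. It decodes UTF-8 bytes with the ASCII codec and captures the resulting error.

```python
# UTF-8 encoded bytes: the degree sign is stored as the two bytes 0xC2 0xB0.
data = b"characters \xc2\xb0"

try:
    # The ASCII codec only accepts byte values below 128, so this fails.
    data.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed should have the same shape as the one shown at the top of this page, with the byte value and position of the first non-ASCII byte.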

How to solve it?

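In Python 2 there are two sorts of structures that handle sequences of characters: strings (str) and unicode. A str holds raw bytes, which Python treats as [[!wikipedia ASCII]] whenever it has to convert them implicitly, while unicode objects hold Unicode characters, as their name suggests.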

To be able to write a string encoded in [[!wikipedia UTF-8]], you first need to convert it to unicode.

>>> my_utf8_string = "A string with some UTF-8 characters °"
>>> my_utf8_string
'A string with some UTF-8 characters \xc2\xb0'
>>> converted_string = unicode(my_utf8_string, "utf-8")
>>> converted_string
u'A string with some UTF-8 characters \xb0'
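
The same conversion can also be written with the `decode` method of the byte string. This is a small sketch under the assumption that the input really is UTF-8; the names `raw` and `text` are illustrative, and the `b"..."` and `u"..."` literals are used so the snippet runs on both Python 2.6+ and Python 3.3+.

```python
# Raw UTF-8 bytes, as they would come from a file or a terminal.
raw = b"A string with some UTF-8 characters \xc2\xb0"

# Decoding turns the two bytes 0xC2 0xB0 into the single character U+00B0.
text = raw.decode("utf-8")

# The byte string is one element longer than the decoded text.
print(len(raw), len(text))
```

The difference in length shows why the two types must not be mixed: one counts bytes, the other counts characters.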

Then you need to open the file as [[!wikipedia UTF-8]] with Python's codecs module and write your unicode string as you would with normal file handling.

import codecs

# codecs.open encodes unicode objects to UTF-8 as they are written.
with codecs.open("/tmp/test", "w", encoding="utf-8") as outFile:
    outFile.write(converted_string)
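
To check that the file really contains UTF-8, you can read it back. The sketch below is an illustration rather than the method described above: it swaps in `io.open` instead of `codecs.open` (an assumption made so that the same code runs on Python 2.6+ and Python 3), and it writes to a temporary file instead of `/tmp/test`. The names `text`, `path`, `raw` and `read_back` are hypothetical.

```python
import io
import os
import tempfile

text = u"A string with some UTF-8 characters \xb0"

# Create a scratch file so the example does not overwrite anything.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Write the unicode text; io.open encodes it to UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Read the raw bytes to see what was actually stored on disk.
with io.open(path, "rb") as in_file:
    raw = in_file.read()

# Read the file again as text, decoding from UTF-8.
with io.open(path, "r", encoding="utf-8") as in_file:
    read_back = in_file.read()

os.remove(path)
```

If the round trip worked, `raw` ends with the two bytes `\xc2\xb0` and `read_back` is equal to `text`.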

That's all.
