Python Unicode

Unicode scripts

In Python 3, all strings are Unicode. In Python 2.x, you can use Unicode characters in comments and strings inside a .py script if you declare the file's encoding. The first line of the script file (or the second, if the first is a #! line) should be a special comment:

# -*- coding: utf-8 -*- 


Unicode strings

In Python 2.x, strings are byte strings. A byte string stores each character as 1 byte, which allows 256 possible values, of which only the first 128 are ASCII. Unfortunately, that leaves no room for most special characters (e.g., diacritics, Chinese, Hebrew). You can instead create a Unicode string with the u- prefix:

>>> string1 =  'unicode'
>>> string2 = u'ünîcødé'
>>> print string1.__class__ 
>>> print string2.__class__ 

<type 'str'> 
<type 'unicode'>   

Unicode strings can be encoded, that is, converted to a byte string in which each special character is represented by a sequence of one or more bytes. For example, in UTF-8: é → \xc3\xa9. Such a byte string can be decoded into Unicode later on.

>>> string = string.encode('utf-8') # Unicode => byte string
>>> string = string.decode('utf-8') # byte string => Unicode
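
As a minimal sketch of this round trip, the snippet below encodes a one-character Unicode string to UTF-8 and decodes it again. The variable names are illustrative, and the code is written so that it should also run on Python 3:

```python
# Round trip: Unicode string -> UTF-8 byte string -> Unicode string.
text = u'\xe9'                   # the character é as a Unicode string
data = text.encode('utf-8')      # byte string holding the two bytes C3 A9
restored = data.decode('utf-8')  # back to the original Unicode string

print(len(text))         # number of characters
print(len(data))         # number of bytes
print(restored == text)  # the round trip is lossless
```

Note that the encoded form is longer than the original: one character became two bytes.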

Brute force encoding

If the conversion is not possible, a UnicodeEncodeError or UnicodeDecodeError will be raised. To brute-force encode() or decode() so that no error is raised, an additional 'ignore' parameter can be given, which silently drops whatever cannot be converted. Sometimes, not crashing is more important than special characters (e.g., in a web crawler).
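
To illustrate the difference, here is a small sketch that encodes the same string to ASCII in the default strict mode and with two of the standard error handlers ('ignore' and 'replace'). The sample string and variable names are assumptions made for the example:

```python
text = u'caf\xe9'  # "café"; é has no ASCII equivalent

# The default (strict) mode raises an error.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode('ascii', 'ignore')    # é is silently dropped
replaced = text.encode('ascii', 'replace')  # é becomes a question mark

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Both handlers lose information, so they are best kept for cases where the lost characters do not matter.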

The following function uses a brute-force approach to convert a string to Unicode:

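def decode_utf8(string):
    # Byte string: try the most likely encodings in turn.
    if isinstance(string, str):
        for encoding in (('utf-8',), ('windows-1252',), ('utf-8', 'ignore')):
            try:
                return string.decode(*encoding)
            except UnicodeDecodeError:
                pass
        return string # Don't know how to handle it...
    return unicode(string) # Already Unicode (or convertible to it).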
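
For comparison, here is a sketch of the same fallback chain adapted to Python 3, where byte strings have the type bytes. The name decode_bytes and the sample inputs are illustrative assumptions, not part of the function above:

```python
def decode_bytes(data):
    """Return data as text: try UTF-8, then Windows-1252, then lossy UTF-8."""
    if isinstance(data, bytes):
        for args in (('utf-8',), ('windows-1252',), ('utf-8', 'ignore')):
            try:
                return data.decode(*args)
            except UnicodeDecodeError:
                pass  # try the next, more permissive option
    return data  # already text

a = decode_bytes(u'\xe9'.encode('utf-8'))  # valid UTF-8
b = decode_bytes(b'\xe9')                  # invalid UTF-8, valid Windows-1252
c = decode_bytes(b'\x81abc')               # invalid in both: byte is dropped

print(a == b)
print(c)
```

The order matters: Windows-1252 accepts almost any byte sequence, so it has to come after the stricter UTF-8 attempt.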


Reading UTF-8 files

When you read a Unicode file as a byte string, special characters (such as diacritics) become garbled or an error is raised. Reading an ASCII file as a Unicode string is harmless. There is no magic function to detect the encoding of a file, so you need to know how it was stored in order to read it correctly.

To read a Unicode file as a Unicode string:

>>> from codecs import open
>>> string = open(path, encoding='utf-8').read()

If the file is in Latin-1 (ASCII + a few special characters), use encoding='latin-1' to read it.
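
The sketch below shows why the encoding has to match. It writes a UTF-8 file and reads it back twice, once correctly and once as Latin-1. It uses io.open and a temporary file, both of which are choices made for this example rather than something the text above prescribes:

```python
import io
import os
import tempfile

text = u'\xfcn\xee\x63\xf8d\xe9'  # "ünîcødé"
path = os.path.join(tempfile.mkdtemp(), 'example.txt')  # hypothetical file

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with io.open(path, encoding='utf-8') as f:
    right = f.read()  # decoded with the encoding the file was written in

with io.open(path, encoding='latin-1') as f:
    wrong = f.read()  # no error, but every 2-byte character becomes 2 wrong ones

print(right == text)
print(len(right))
print(len(wrong))
```

Latin-1 maps every possible byte to a character, so reading with the wrong encoding raises no error here; the text is just silently garbled.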


Writing UTF-8 files

To write a Unicode string as a Unicode file:

>>> from codecs import open 
>>> open(path, 'w', encoding='utf-8').write(u'ünîcødé')

That said, some applications such as Mac OS X TextEdit may not recognize the UTF-8 content. In this case you can encode the string manually and include a byte order marker at the start of the file:

>>> from codecs import BOM_UTF8
>>> s = u'ünîcødé'
>>> s = s.encode('utf-8')
>>> open(path, 'wb').write(BOM_UTF8 + s) # binary mode: write the bytes as-is

Remember to strip the byte order marker when you open the file:

>>> from codecs import BOM_UTF8
>>> s = open(path, 'rb').read()
>>> if s.startswith(BOM_UTF8): # lstrip() would strip any of the 3 bytes, not the sequence
...     s = s[len(BOM_UTF8):]
...
>>> s = s.decode('utf-8')
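
As a possible shortcut, the standard 'utf-8-sig' codec writes the byte order marker for you and skips it when reading. The sketch below assumes io.open and a temporary file for illustration:

```python
import codecs
import io
import os
import tempfile

text = u'\xfcn\xeec\xf8d\xe9'  # "ünîcødé"
path = os.path.join(tempfile.mkdtemp(), 'bom.txt')  # hypothetical file

# On writing, 'utf-8-sig' emits the marker before the first character.
with io.open(path, 'w', encoding='utf-8-sig') as f:
    f.write(text)

# Inspect the raw bytes to confirm that the marker is there.
with io.open(path, 'rb') as f:
    raw = f.read()

# On reading, 'utf-8-sig' skips the marker if it is present.
with io.open(path, encoding='utf-8-sig') as f:
    restored = f.read()

print(raw.startswith(codecs.BOM_UTF8))
print(restored == text)
```

Reading with 'utf-8-sig' also works for files that have no marker, so it is a reasonable default when the origin of a UTF-8 file is uncertain.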


Exporting XML

An XML file is stored as bytes in the encoding declared in its header. So to store Unicode in XML you need to use encode() together with the matching XML header. Furthermore, markup characters like <, > and & must be represented as entities (the XML would of course break otherwise):

def encode_xml(string, encoding='utf-8'):
    string = string.encode(encoding)
    string = string.replace( '&', '&amp;')
    string = string.replace( '<', '&lt;')
    string = string.replace( '>', '&gt;')
    string = string.replace( '"', '&quot;')
    return string

>>> persons = [u'Max Planck', u'Erwin Schrödinger']
>>> xml = ['<?xml version="1.0" encoding="UTF-8"?>']
>>> xml.append('<persons>')
>>> for s in persons:
...     xml.append('\t<person>%s</person>' % encode_xml(s))
...
>>> xml.append('</persons>')
>>> xml = '\n'.join(xml)
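
As an alternative to replacing the entities by hand, the standard library offers xml.sax.saxutils.escape(). A minimal sketch follows, in which the sample string is an assumption made for illustration:

```python
from xml.sax.saxutils import escape

raw = u'Schr\xf6dinger & "cat" <alive>'

# escape() always handles &, < and >; the extra mapping adds double quotes.
escaped = escape(raw, {'"': '&quot;'})

# Encode afterwards to get the bytes that go into a UTF-8 XML file.
data = escaped.encode('utf-8')

print(repr(data))
```

The ampersand is replaced before anything else, so the & in entities such as &lt; is not escaped a second time.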


Writing the XML file

We can simply write the encoded byte string to a file:

>>> open('test.xml', 'w').write(xml)


Reading the XML file

We can read it as a byte string. The minidom parser will decode it correctly, using the encoding declared in the XML header:

>>> from xml.dom import minidom
>>> xml = open("test.xml").read()
>>> xml = minidom.parseString(xml)
>>> n = xml.childNodes[0]                   # <persons>...</persons>
>>> n = n.getElementsByTagName('person')[1] # <person>...</person>
>>> v = n.childNodes[0].nodeValue           # Erwin Schrödinger
>>> print v
>>> print v.__class__

Erwin Schrödinger
<type 'unicode'>
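
The same round trip can be sketched without touching the disk, in a form that should also run on Python 3. The document is built in memory here, standing in for the contents of test.xml:

```python
from xml.dom import minidom

xml = u'\n'.join([
    u'<?xml version="1.0" encoding="UTF-8"?>',
    u'<persons>',
    u'\t<person>Max Planck</person>',
    u'\t<person>Erwin Schr\xf6dinger</person>',
    u'</persons>'])

data = xml.encode('utf-8')       # the bytes that would be stored in the file
doc = minidom.parseString(data)  # decoded using the encoding in the header
node = doc.getElementsByTagName('person')[1]
name = node.childNodes[0].nodeValue

print(name == u'Erwin Schr\xf6dinger')
```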