Unicode is a standard for the universal representation of text in computing. It covers almost all characters across languages and assigns each character a unique number, called a code point. The Unicode standard defines the UTF-8, UTF-16 and UTF-32 encodings.
This article will discuss how to deal with Unicode characters in Python.
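As a quick illustration in Python 3, the ord() and chr() built-ins map a character to its code point and back, and str.encode() shows how the same code point is represented differently per encoding:

```python
# Each character maps to a unique code point.
print(ord('A'))        # 65
print(hex(ord('🐔')))  # 0x1f414
print(chr(0x1F414))    # 🐔

# The same code point has a different byte representation per encoding.
print('🐔'.encode('utf-8'))  # b'\xf0\x9f\x90\x94'
print('🐔'.encode('utf-16'))
```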
Python Source Encoding (Python 2)
By default, Python 2 source files are interpreted as ASCII. When writing Unicode characters in source files, we must follow the standard defined in PEP 263 and declare the encoding on the first or second line of the file.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Unicode objects (Python 2)
In Python 2, the unicode string type represents Unicode characters and is defined by prefixing the string literal with u.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = u"Chicken attack! 🐔"
print(a)
Unicode in Python 3
Python 3 introduced a breaking change for Unicode: strings are now Unicode by default, as opposed to Python 2's byte strings. All source files are also interpreted as UTF-8 by default, so declaring the file encoding is no longer needed.
#!/usr/bin/env python

a = "Chicken attack! 🐔"
print(a)
Note that in Python 2, plain strings are byte strings, while in Python 3, byte strings must be defined explicitly by prefixing the literal with b. Byte string literals cannot contain Unicode characters.
#!/usr/bin/env python

a = b'hello'
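To move between the two types in Python 3, encode a str to bytes and decode bytes back to str; a minimal sketch:

```python
text = 'Chicken attack! 🐔'

data = text.encode('utf-8')  # str -> bytes
assert isinstance(data, bytes)

roundtrip = data.decode('utf-8')  # bytes -> str
assert roundtrip == text
```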
Reading and writing files
To read and write Unicode files, we can use the codecs module, which works on both Python 2 and 3.
#!/usr/bin/env python
import codecs

with codecs.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
In Python 3, it is recommended to use the io module instead, because the
codecs module is planned to be deprecated.
#!/usr/bin/env python
import io

with io.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
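In Python 3 specifically, io.open is documented as an alias for the built-in open, so plain open with an encoding argument works just as well; a sketch assuming a myfile.txt in the working directory:

```python
#!/usr/bin/env python
# In Python 3, io.open is an alias for the built-in open().
with open('myfile.txt', 'w', encoding='utf-8') as f:
    f.write('Chicken attack! 🐔')

with open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
```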
Applying this to CSV files, for example, we simply set the encoding when opening the file.
#!/usr/bin/env python
import csv
import io

with io.open('chicken.csv', 'w', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Chicken attack!', '🐔'])
The json standard library escapes non-ASCII characters by default when writing. To disable this, pass ensure_ascii=False.
#!/usr/bin/env python
import json
import io

with io.open('chicken.jl', 'w', encoding='utf-8') as f:
    json.dump('Chicken attack! 🐔', f, ensure_ascii=False)
Printing to console
When printing Unicode characters to a console whose encoding cannot represent them, a
UnicodeEncodeError is raised.
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f414' in position 16: character maps to <undefined>
To properly print Unicode characters to the console, set the environment variable
PYTHONIOENCODING=UTF-8. We can also simply prepend it when executing a Python script.
$ PYTHONIOENCODING=UTF-8 python unicode.py
Chicken attack! 🐔
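On Python 3.7+, the output stream can also be reconfigured from inside the script instead of via the environment, using TextIOWrapper.reconfigure; a minimal sketch:

```python
#!/usr/bin/env python
import sys

# Python 3.7+: force UTF-8 output regardless of the console's default.
# The hasattr guard skips streams (e.g. redirected ones) that are not
# TextIOWrapper instances and so lack reconfigure().
if hasattr(sys.stdout, 'reconfigure'):
    sys.stdout.reconfigure(encoding='utf-8')

print('Chicken attack! 🐔')
```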
The Python logging module does not use UTF-8 by default either. To support Unicode characters, set the encoding when creating a FileHandler.
#!/usr/bin/env python
import logging

logger = logging.getLogger(__name__)
handler = logging.FileHandler('unicode.log', encoding='utf-8')
logger.addHandler(handler)

if __name__ == '__main__':
    logger.error("Chicken attack! 🐔")
The redis-py module handles data as bytes. In Python 3, to work safely with Unicode characters without manually converting types, we can set the encoding (and ask for decoded responses) when creating the StrictRedis instance.
#!/usr/bin/env python
import os

from redis import StrictRedis

redis_client = StrictRedis(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    db=os.getenv('REDIS_DB'),
    password=os.getenv('REDIS_PASSWORD'),
    encoding='utf-8',
    decode_responses=True,  # return str instead of bytes
)
Even though Unicode is supported by most libraries and languages, we should take note that it is not always the default encoding. We must explicitly define the encoding when writing applications.
What are other unicode-related tips and tricks that I missed? Leave your comments below.
Edit: Added examples for CSV, JSON and Redis.