Working with Unicodes

Tips and tricks for working with Unicode characters in Python

Posted 2019-05-26 02:36:53 by Ronie Martinez

Unicode is a standard for the universal representation of characters in computer code. It covers almost every character in the world's writing systems and assigns each one a unique number, called a code point. The standard also defines the UTF-8, UTF-16 and UTF-32 encodings, which turn code points into bytes.
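
To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch for Python 3 that uses only built-ins. The three sample characters are chosen just for illustration.

```python
# A character is identified by its code point; UTF-8, UTF-16 and UTF-32
# are different ways of serializing that code point as bytes.
samples = ["A", "\u00e9", "\U0001F414"]  # 'A', e-acute, chicken emoji

# Use the explicit little-endian variants so no byte order mark is added.
encodings = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {}
for ch in samples:
    # Number of bytes each encoding needs for this single character.
    sizes[ch] = tuple(len(ch.encode(enc)) for enc in encodings)
    print("U+%04X" % ord(ch), sizes[ch])
```

UTF-8 and UTF-16 are variable-width, so the byte count grows with the code point, while UTF-32 always uses four bytes per character.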

This article will discuss how to deal with unicode characters in Python.

Python Source Encoding (Python 2)

By default, Python 2 source files are interpreted as ASCII. To use Unicode characters in a source file, we must follow PEP 263 and declare the encoding on the first or second line.

# coding=utf-8

or

#!/usr/bin/env python
# -*- coding: utf-8 -*-
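
The declaration is still honored by Python 3, which makes its effect easy to observe. The sketch below assumes Python 3 and compiles a tiny source "file" held in a bytes object; the source text, the `<pep263-demo>` file name and the `namespace` dict are made up for the demonstration.

```python
# Source code as raw bytes, the way it would sit on disk.
# The byte 0xE9 is "e-acute" in Latin-1 but is not valid UTF-8 on its own,
# so the coding declaration on the first line decides how it is read.
source = b"# -*- coding: latin-1 -*-\ns = '\xe9'\n"

namespace = {}
exec(compile(source, "<pep263-demo>", "exec"), namespace)

# The byte was decoded as Latin-1, giving U+00E9.
print(repr(namespace["s"]))
```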

Unicode objects (Python 2)

In Python 2, the unicode type represents Unicode text. A unicode string is defined by prefixing the string literal with u.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

a = u"Chicken attack! 🐔"
print(a)

Unicode in Python 3

Python 3 introduced a breaking change: strings are now Unicode strings, as opposed to Python 2's byte strings. By default, all source files are interpreted as UTF-8, so in Python 3 declaring the file encoding is no longer needed for UTF-8 sources.

#!/usr/bin/env python

a = "Chicken attack! 🐔"
print(a)

Note that in Python 2, plain strings are byte strings. In Python 3, byte strings are a separate type (bytes), defined by prefixing the literal with b. A bytes literal can contain only ASCII characters; any other byte value must be written as an escape such as \xf0.

#!/usr/bin/env python

a = b'hello'
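
Moving between the two types is done with encode and decode. The following is a minimal sketch for Python 3; the sample text is only for illustration.

```python
text = "Chicken attack! \U0001F414"

# str -> bytes: pick the encoding explicitly.
data = text.encode("utf-8")
print(type(data).__name__, len(text), len(data))

# bytes -> str: decode with the same encoding.
round_tripped = data.decode("utf-8")

# Decoding with a codec that cannot represent the bytes raises an error.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The string has 17 characters but needs 20 bytes in UTF-8, because the emoji takes four bytes.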

Reading and writing files

To read and write unicode files, we can use the codecs library. This library works on both Python 2 and 3 versions.

#!/usr/bin/env python
import codecs

with codecs.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())

In Python 3, it is recommended to use the io library instead, because codecs.open() is planned to be deprecated. In Python 3, io.open is the same function as the built-in open.

#!/usr/bin/env python
import io

with io.open('myfile.txt', encoding='utf-8') as f:
    print(f.read())
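
To see what the encoding argument actually does, the sketch below writes a string, reads it back as text, and then reads the same file as raw bytes. It assumes Python 3 and uses a temporary directory so that it does not touch any real file.

```python
import io
import os
import tempfile

text = "Chicken attack! \U0001F414"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "myfile.txt")

    # Text mode: characters are encoded to UTF-8 on write...
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and decoded from UTF-8 on read.
    with io.open(path, encoding="utf-8") as f:
        read_back = f.read()

    # Binary mode shows the bytes that are really stored on disk.
    with io.open(path, "rb") as f:
        raw = f.read()

print(read_back == text, len(raw))
```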

For example, to write a CSV file, we simply set the encoding when opening it.

#!/usr/bin/env python
import csv
import io


# newline='' lets the csv module control line endings itself (Python 3)
with io.open('chicken.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Chicken attack!', '🐔'])
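
Reading the file back works the same way. The sketch below assumes Python 3, where the csv documentation recommends opening files with newline=''. It writes one row to a temporary file and reads it again.

```python
import csv
import io
import os
import tempfile

row = ["Chicken attack!", "\U0001F414"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chicken.csv")

    # Write with an explicit encoding; newline="" leaves line endings to csv.
    with io.open(path, "w", encoding="utf-8", newline="") as csvfile:
        csv.writer(csvfile).writerow(row)

    # Read with the same encoding to get the original strings back.
    with io.open(path, encoding="utf-8", newline="") as csvfile:
        rows = list(csv.reader(csvfile))

print(rows == [row])
```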

Writing JSON

By default, the json standard library escapes all non-ASCII characters when writing. To write the characters as they are, pass ensure_ascii=False.

#!/usr/bin/env python
import json
import io


with io.open('chicken.jl', 'w', encoding='utf-8') as f:
    json.dump('Chicken attack! 🐔', f, ensure_ascii=False)
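
The difference is easiest to see with json.dumps, which returns the output as a string instead of writing it to a file. This is a minimal sketch for Python 3.

```python
import json

value = "Chicken attack! \U0001F414"

# Default: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(value)

# ensure_ascii=False: characters are written as they are.
literal = json.dumps(value, ensure_ascii=False)

print(escaped)
print(len(escaped), len(literal))

# Both forms are valid JSON and decode to the same string.
same_value = json.loads(escaped) == json.loads(literal) == value
```

The escaped form is safe for ASCII-only channels, while the literal form is shorter and easier to read in a UTF-8 file.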

Printing to console

When printing Unicode characters to a console whose encoding cannot represent them, a UnicodeEncodeError is raised.

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f414' in position 16: character maps to <undefined>

To properly print Unicode characters to the console, set the environment variable PYTHONIOENCODING=UTF-8. We can also set it for a single run by placing it before the command that executes the script.

$ PYTHONIOENCODING=UTF-8 python unicode.py
Chicken attack! 🐔
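
When we cannot control the environment, another option is to degrade gracefully instead of crashing. The sketch below assumes Python 3; render_for_console is a hypothetical helper written for this article, not part of the standard library. It replaces characters the console encoding cannot represent with backslash escapes.

```python
import sys


def render_for_console(text, encoding):
    """Return text with unencodable characters replaced by backslash escapes."""
    # backslashreplace never raises; it substitutes \xNN, \uNNNN or \UNNNNNNNN.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


message = "Chicken attack! \U0001F414"

# Fall back to ASCII when the stream does not report an encoding.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
print(render_for_console(message, console_encoding))
```

On a UTF-8 console the message is printed unchanged. On a console that cannot encode the emoji, the escape sequence is printed in its place.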

Logging

A logging.FileHandler opens its file with the platform's default encoding, which is not always UTF-8. To support Unicode characters, pass the encoding when creating the FileHandler.

#!/usr/bin/env python
import logging

logger = logging.getLogger(__name__)

handler = logging.FileHandler('unicode.log', encoding='utf-8')
logger.addHandler(handler)

if __name__ == '__main__':
    logger.error("Chicken attack! 🐔")
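
To check that the log file really contains UTF-8, we can log to a temporary file and read it back. This is a minimal sketch for Python 3; the logger name unicode_demo is made up for the example.

```python
import io
import logging
import os
import tempfile

logger = logging.getLogger("unicode_demo")  # hypothetical logger name
logger.propagate = False  # keep the record away from any root handlers

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "unicode.log")

    handler = logging.FileHandler(path, encoding="utf-8")
    logger.addHandler(handler)
    logger.error("Chicken attack! \U0001F414")

    # Close the handler so the file is flushed and released before cleanup.
    logger.removeHandler(handler)
    handler.close()

    with io.open(path, encoding="utf-8") as f:
        logged = f.read()

print(logged.strip() == "Chicken attack! \U0001F414")
```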

Redis

The redis-py module sends and receives data as bytes. In Python 3, to work safely with Unicode strings without converting types manually, we can set the encoding and pass decode_responses=True when creating a StrictRedis instance, so that replies are decoded to str.

#!/usr/bin/env python
import os

from redis import StrictRedis


redis_client = StrictRedis(
    host=os.getenv('REDIS_HOST'),
    port=os.getenv('REDIS_PORT'),
    db=os.getenv('REDIS_DB'),
    password=os.getenv('REDIS_PASSWORD'),
    encoding='utf-8',
    decode_responses=True  # return str instead of bytes
)

Conclusion

Even though Unicode is supported by most libraries and languages, we should take note that UTF-8 is not yet the default encoding everywhere. We must explicitly define the encoding when writing applications.

What are other unicode-related tips and tricks that I missed? Leave your comments below.

 

Edit: Added examples for CSV, JSON and Redis
