Normalizing Unicode strings in Python

Wed, Aug 5, 2020

Python

XKCD #1953

Summary #

Recently, I worked on a project that involved scraping usernames from Twitter. I found that roughly 24% of users have emojis or Unicode characters standing in for letters in their names. I wanted to analyze the linguistic characteristics of those names, so I needed a way to convert Unicode text to its closest ASCII equivalent.

ASCII vs Unicode #

ASCII was one of the first widespread encoding schemes for text. It is a 7-bit code, usually stored one character per byte, so it defines 2^7 = 128 characters. That is enough for the English alphabet plus digits and some punctuation, but not much more.

Unicode is a standard for text that aims to cover the world’s writing systems. As of writing, it includes 143,859 characters across 154 scripts. The most popular Unicode encoding is UTF-8, which uses 1 byte for ASCII characters and up to 4 bytes for everything else.
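To make the variable-width point concrete, here is a small sketch using only built-in string methods. The sample characters and the `is_ascii` helper are my own picks for illustration, not part of any library.

```python
# How many bytes UTF-8 needs for a few sample characters.
samples = ["a", "Ɓ", "€", "𝓽"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)  # {'a': 1, 'Ɓ': 2, '€': 3, '𝓽': 4}


def is_ascii(text: str) -> bool:
    """Hypothetical helper: True if every character fits in ASCII (0..127)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("hello"))  # True
print(is_ascii("Ɓ"))      # False
```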

For example, the Unicode character Ɓ has the code point U+0181.
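You can check this from Python itself. The snippet below is a minimal sketch using the built-in `ord` and the standard-library `unicodedata` module; the variable names are only for illustration.

```python
import unicodedata

ch = "Ɓ"

# ord() gives the integer code point; format it in the usual U+XXXX style.
code_point = f"U+{ord(ch):04X}"
print(code_point)  # U+0181

# The official Unicode name of the character.
name = unicodedata.name(ch)
print(name)  # LATIN CAPITAL LETTER B WITH HOOK

# In UTF-8 this single code point is stored as two bytes.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)  # b'\xc6\x81'
```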

Converting #

We can use the popular Python library Unidecode for this.

>>> from unidecode import unidecode
>>> unidecode('Ɓ')
'B'
>>> unidecode('𝓽𝓱𝓲𝓼 𝓲𝓼 𝓼𝓸𝓶𝓮 𝔀𝓮𝓲𝓻𝓭 𝓽𝓮𝔁𝓽')
'this is some weird text'
>>> unidecode('𝕔𝕠𝕟𝕧𝕖𝕣𝕥𝕚𝕟𝕘 𝕥𝕠 𝕣𝕖𝕘𝕦𝕝𝕒𝕣 𝕥𝕖𝕩𝕥...')
'converting to regular text...'

That’s it! Check out Unidecode on PyPI.
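If adding a dependency is not an option, the standard library can get you part of the way. The sketch below is a rough approximation, not a replacement for Unidecode: it applies NFKD normalization via `unicodedata.normalize` and then drops whatever is still non-ASCII. The `ascii_fallback` name is hypothetical. Note the trade-off: characters without a compatibility decomposition, such as Ɓ, are silently removed instead of being mapped to a look-alike.

```python
import unicodedata


def ascii_fallback(text: str) -> str:
    """Hypothetical helper: best-effort ASCII using only the stdlib.

    NFKD splits characters into base letters plus combining marks and
    replaces compatibility variants (such as math script letters) with
    their plain forms. Anything still outside ASCII is then dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fallback("𝓽𝓱𝓲𝓼 𝓲𝓼 𝓼𝓸𝓶𝓮 𝔀𝓮𝓲𝓻𝓭 𝓽𝓮𝔁𝓽"))  # this is some weird text
print(ascii_fallback("café"))                     # cafe
print(repr(ascii_fallback("Ɓ")))                  # '' (no decomposition, so it is dropped)
```

For username analysis, that last case is why Unidecode's hand-built transliteration tables are usually the better fit.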