2008-07-03

Trying to make sense of Unicode terms

[warning: I am no Unicode expert; just someone who has listened to it discussed a ton of times on python-dev and read up on it somewhat]

There is a thread on python-dev about UCS-2/UCS-4 builds of Python and what the default should be (answer: UCS-2, which is really UTF-16 for Python, as it always has been). But what the thread has done for me is make it clear how bloody complicated Unicode is, partially thanks to the number of new terms the standard brings into the world.

The first thing to realize is that Unicode is a bunch of code points. A code point is a single entry in Unicode: anything that is represented as a single character (whether it is a symbol representing a sound, like a letter, or a word, like an ideograph) gets its own code point.
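
To make that concrete, here is a rough Python sketch (the specific characters are just examples I picked); ord() gives you the code point for a character and chr() goes the other way:

    # Every character maps to a single integer code point in the range
    # 0 through 0x10FFFF.
    print(hex(ord('a')))    # 0x61   -- a Latin letter
    print(hex(ord('中')))   # 0x4e2d -- a CJK ideograph
    print(chr(0x4E2D))      # 中     -- and back from code point to character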

Things get complicated, though, when you try to come up with a scheme to represent all 1,114,112 code points while caring about space. Obviously all the code points do not fit in 1 byte, let alone 2 bytes. You could use 4 bytes for every code point, but that is far more space than most code points actually need.
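
Quick back-of-the-envelope math on that (just a Python sketch, nothing official):

    # 1,114,112 code points is 0x110000 in hex.
    print(0x110000)          # 1114112
    print(2 ** 8, 2 ** 16)   # 256 65536 -- 1 and 2 bytes fall well short
    print(2 ** 32)           # 4294967296 -- 4 bytes is far more than enough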

And this is why you have UTF-8, UCS-2/UTF-16, and UCS-4/UTF-32 to choose from for encoding Unicode. Each of those encodings has a different code unit: the minimum amount of space used to represent a code point. The problem is that it can take multiple code units to represent a single code point. Think of UTF-8, whose code unit is 1 byte: while a single 1-byte code unit is enough to represent anything in ASCII, you can't represent every one of the 1,114,112 code points in Unicode in a single byte. So UTF-8 supports variable-length code points by using multiple code units.
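
A quick Python 3 illustration of that (the characters are just ones I picked to span the ranges):

    # One code point each, but a different number of UTF-8 code units (bytes).
    for ch in ('a', 'é', '中', '\U0001D11E'):   # ASCII, Latin-1, CJK, outside the BMP
        print(hex(ord(ch)), len(ch.encode('utf-8')), 'byte(s)')
    # 0x61 1 byte(s)
    # 0xe9 2 byte(s)
    # 0x4e2d 3 byte(s)
    # 0x1d11e 4 byte(s)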

Why do you care if a code point takes up multiple code units, you ask? Well, think about trying to calculate the number of characters/code points in a string. If you have variable-length code points you can't just return the number of code units used to make the string; you have to go through and figure out how many of those code units are really just part of making up a code point. An even more problematic thing is slicing: you can't slice in constant time based off of memory offsets and the code unit size if a code unit might only make up part of a code point. That means slicing goes from being O(1) to being O(n). That sucks.
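
Here is the counting problem in miniature, sketched with Python 3's bytes/str split (the exact string doesn't matter):

    data = 'naïve 中文'.encode('utf-8')

    # Counting code units is just the byte length...
    print(len(data))                   # 13 code units (bytes)

    # ...but counting code points means walking over every code unit.
    print(len(data.decode('utf-8')))   # 8 code points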

So that is why UTF-32/UCS-4 exists. Every Unicode character can be represented in 4 bytes. Everything is constant. That's nice. But now you are using 4 bytes for EVERYTHING, even stuff that would fit in ASCII. And the majority of Unicode that people typically care about fits in two bytes. This core part of Unicode is called the Basic Multilingual Plane, or BMP. There are other planes that represent other characters and symbols, but they are not commonly used.
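
For example (Python 3, using the utf-32-le codec only because it skips the byte order mark):

    text = 'a中\U0001D11E'   # 3 code points of very different "sizes"

    # With UTF-32 every code point takes exactly 4 bytes, no exceptions.
    print(len(text.encode('utf-32-le')))   # 12 == 3 code points * 4 bytes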

So there is UTF-16 (which made UCS-2 obsolete). Since anything from the BMP fits in a single UTF-16 code unit, for most things the number of code units needed to make a code point is a constant 1. Of course there are rare occasions where you need more than a single code unit to make a code point in UTF-16, which is when you have what is called a surrogate pair: two code units that together create one code point. This makes UTF-16 a balance between memory usage and the amount of work needed to calculate things.
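
You can see a surrogate pair by encoding by hand (a present-day Python 3 sketch; little-endian just to keep the byte dump readable):

    # A BMP character fits in one 2-byte UTF-16 code unit...
    print(len('中'.encode('utf-16-le')))    # 2 bytes -> 1 code unit

    # ...but a code point outside the BMP takes a surrogate pair:
    # two code units (4 bytes) that together make one code point.
    clef = '\U0001D11E'                     # MUSICAL SYMBOL G CLEF
    encoded = clef.encode('utf-16-le')
    print(len(encoded))                     # 4 bytes -> 2 code units
    print(encoded.hex())                    # 34d81edd -- high surrogate, low surrogate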

In terms of Python, you can build it for either UTF-16 or UTF-32 (although the configure options are named UCS-2 and UCS-4; I believe the underlying support is technically the UTF versions). Windows and OS X use UCS-2 builds, while most Linux distros use UCS-4 builds. The key thing to realize, though, is that no matter what build you are on, indexing, slicing, etc. work on code units, not code points! This is a conscious decision to keep these operations at O(1) complexity. Assuming you are using Python 3.0, since it is all Unicode, you should use the results from str.index(), etc. when indexing and slicing into strings so that you don't have to worry about what kind of build you are working on.
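
A sketch of what that looks like in practice (this assumes a narrow/wide build from around that era; sys.maxunicode is the real way to tell which build you have):

    import sys

    # 0xFFFF on a "UCS-2" (really UTF-16) build, 0x10FFFF on a UCS-4 build.
    print(hex(sys.maxunicode))

    clef = '\U0001D11E'     # one code point outside the BMP
    print(len(clef))        # 1 on a UCS-4 build, 2 on a UCS-2 build (two code units)

    # Deriving indexes from the string itself works on either build.
    s = 'ab' + clef + 'cd'
    i = s.index('c')
    print(s[i:])            # 'cd' no matter what kind of build you are on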