Unicode notes from an ASCII programmer


Every C programmer knows that any program, no matter how small, is another opportunity to debug a segmentation fault.

The program was a relatively simple one. It monitored some processes in the background, periodically making system calls to collect data about them. Analysis of one of the crash logs suggested that some of the process names were null, and so when the program tried to dereference the name, it would crash.

So obviously someone had forgotten to check for an error from the system call, the data was garbage, it was dereferenced, crash. Alternatively, someone had been just clever enough to first check whether the process was still valid before making the system call to collect its data. That's a classic "time of check to time of use" error - if the process terminates between the check and the data collection, the system call can fail.

But when we looked at the source code, the error checking seemed perfectly fine. Additional logging showed that the other values returned as part of the system call looked reasonable.

With further debugging, we noticed that the name came back from the system call with no issues, but somewhere along the way it was being turned into a null pointer. Once we were able to capture a crash, we noticed the name contained some non-English characters.

This was the clue we needed. By crafting a process name that was long enough to get truncated by this system call, and by strategically placing an Emoji at the end of the string, we were able to cause the kernel to truncate in the middle of the Emoji. That meant we no longer had a valid Unicode string, and when one of our libraries tried to process it, it returned a null. We hadn't expected to need to check for that - after all, how could a byte sequence not be convertible into a string?

We ended up putting together a Unicode checklist for review.

Fundamentals

Before Unicode came ASCII, which is a 7 bit encoding, really only enough to encode English - various code page tweaks and changes were needed to support other Latin based languages. Each ASCII character has a 7 bit number, or code, corresponding to it, and an ASCII string is just a sequence of ASCII codes, one per byte. The C programming language stored strings as a sequence of ASCII bytes terminated by a single zero byte, or an "ASCIIZ" string, and so many low level systems, being implemented in C, also used ASCIIZ as the native string format.

For multilingual support, 128 codes aren't nearly enough to cover other languages, so Unicode was developed. Unicode has a similar coding concept, with the basic unit being the code point, a number that ranges from 0 to 0x10FFFF, enough for over one million code points. A code point can stand for a letter, a digit, a punctuation mark, a symbol, a combining mark, an Emoji, a control code, and so on; many code points are still unassigned.

This is roughly analogous to the way ASCII maps each 7 bit number to a character. In fact, for compatibility reasons, the first 128 code points map to the same characters as the ASCII codes with the equivalent values.

But one Unicode code point doesn't map to a single printable element. Different code points can be combined to form grapheme clusters, which are - loosely speaking - what we colloquially refer to as a "character" when displayed on the screen. The code points are defined so that specific rules can split a sequence of code points into grapheme clusters.
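For instance, here's a quick JavaScript sketch (assuming a runtime that provides Intl.Segmenter, like a recent browser console) of a flag Emoji that is two code points but a single grapheme cluster:

const flag = "🇺🇸"; // two regional indicator code points
[...flag].length // 2 -- code points
[...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(flag)].length // 1 -- grapheme cluster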

In ASCII, each code fits into one byte, so storing an ASCII string is straightforward - we simply store the sequence of codes, one per byte. However, a Unicode code point doesn't fit in one byte, so we need to figure out how to store this.

The most direct way is to do something similar to ASCII and use a fixed width for each code point, storing them one after another. This requires 21 bits per code point. Unfortunately, that is an awkward number of bits for modern machines to work with, so it's rounded up to a full 32 bits. This method of encoding is called UTF-32. It's easy to understand and code, but has the big disadvantage of needing four times more memory than an equivalent ASCII string.

Another encoding is UTF-16. We won't speak too much about it here, except to clarify that it is NOT a fixed length encoding - that is a common myth. Early versions of Unicode had a 16 bit code space, which meant this encoding had the advantage of each code point taking exactly 2 bytes. But modern Unicode has far more than a 16 bit code space. UTF-16 handles this by using surrogate pairs, but that turns it into a variable length encoding.
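You can see the surrogate pair from a JavaScript console, since JavaScript strings are exposed as UTF-16 code units (just an illustrative sketch):

"🙂".length // 2 -- a code point outside the 16 bit range becomes a surrogate pair
[..."🙂"].length // 1 -- spreading a string iterates by code point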

Finally, there is UTF-8. It takes each code point and spreads it across one to four bytes, using the high bits of each byte to signal how the parts are divided. UTF-8 has several very useful properties:

  It is backward compatible with ASCII: every ASCII string is already a valid UTF-8 string, and ASCII characters encode to the same single byte.
  A zero byte never appears inside a multi byte sequence, so UTF-8 text can still be carried around as an ASCIIZ style null terminated string.
  It is self synchronizing: the high bits of a byte tell you whether it starts a code point, so a decoder can find the next code point boundary from anywhere in the stream.
  Sorting UTF-8 strings byte by byte gives the same order as sorting them by code point.
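To make the variable width concrete, here's a small JavaScript sketch using TextEncoder, which always produces UTF-8 (the byte values are standard UTF-8, nothing library specific):

const enc = new TextEncoder();
enc.encode("A") // Uint8Array [0x41] -- the same single byte as ASCII
enc.encode("é") // Uint8Array [0xC3, 0xA9] -- two bytes
enc.encode("🙂") // Uint8Array [0xF0, 0x9F, 0x99, 0x82] -- four bytes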

To work with a Unicode string, we must know its encoding. This is different from ASCII strings, where there was only ever one encoding to worry about. Again, a sequence of bytes cannot be interpreted as a string unless we know its encoding!

With those definitions, we can now discuss some fundamental string operations.

String length

With Unicode, there are multiple notions of the length of a string.

One question might be, how many bytes does this string take in memory? This is the kind of length we want when we need to allocate storage. For example, if the string were written to a file, this would be how big the file is. Similarly, to get the correct content length to send this string across the Internet, this would be the value to use.

Another notion is, how many Unicode code points are in this string? Besides Unicode converters and other programs of that sort, there aren't too many uses for this. It's not possible to go from the number of code points to the byte size - because of the encoding - and it's not possible to go from the number of code points to the number of display elements - because of combining code points.

Finally we might ask, how many grapheme clusters are in this string? In other words, loosely speaking, the number of visual elements that will be seen. Computing this requires following a bunch of rules in the Unicode standard, and so effectively means at least a primitive interpretation of code points. In some cases, even this isn't straightforward since not all grapheme clusters are the same size, and there can be quirks when dealing with some edge cases.
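Here's a sketch of the three notions in JavaScript (assuming TextEncoder and Intl.Segmenter are available); note that the built in .length reports UTF-16 code units, which is yet another count:

const s = "naïve 🙂";
new TextEncoder().encode(s).length // bytes in UTF-8 -- for allocation, files, and content length
[...s].length // code points
s.length // UTF-16 code units -- what JavaScript's .length actually counts
[...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(s)].length // grapheme clusters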

We always need to know what length we're looking for. For example, if we're trying to make a copy of the string and we allocate based on the code point length instead of the byte length, we've created a buffer overflow for a hacker to exploit.

String comparison

Unicode strings can't be compared using simple byte by byte comparisons.

First, and most obvious, two strings with different encodings can have the same sequence of Unicode code points, but a completely different byte sequence.

Even if they do have the same encoding (or alternatively, if comparing by code point instead of by byte), the same grapheme cluster can have different Unicode representations. For example, an e with an accent can be either U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or it can be U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). You can try this yourself - copy and paste the following statements into your JavaScript console and see what they return:

"é".length // Encoded as one code point
"é".length // Encoded as two code points
"é" == "é" // Looks true... but isn't

So we need to do some standardization on the string before we compare. This is similar in spirit to doing a case insensitive comparison by first forcing all the characters to lowercase before doing the comparison. Unicode has a procedure for this called normalization - actually it provides four different normalization forms - but that doesn't make things four times as easy.

One dimension is whether the Unicode code points are composed (combined) or decomposed. For example, in the example above, the one code point form is composed and the two code point form is decomposed.

The second dimension is whether we want "canonical" equivalence or "compatibility" equivalence. Canonical equivalence means that two Unicode sequences represent the same character. It's easiest to understand this by contrasting it with compatibility equivalence, where two sequences might mean semantically the same thing, but have different character representations. An example would be a ligature, such as "fi" U+FB01 (LATIN SMALL LIGATURE FI), which is compatibility equivalent to "fi" U+0066 (LATIN SMALL LETTER F) U+0069 (LATIN SMALL LETTER I), but not canonically equivalent.

By putting together these two pairs we end up with the four normalization forms:

  NFC - canonical equivalence, composed
  NFD - canonical equivalence, decomposed
  NFKC - compatibility equivalence, composed
  NFKD - compatibility equivalence, decomposed
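JavaScript exposes all four forms through String.prototype.normalize, so you can check both kinds of equivalence in a console (a small sketch, written with escape sequences so an editor can't silently normalize them):

const composed = "\u00E9"; // é as one code point
const decomposed = "\u0065\u0301"; // e followed by a combining acute accent
composed === decomposed // false -- different code point sequences
composed.normalize("NFC") === decomposed.normalize("NFC") // true -- canonically equivalent
"\uFB01".normalize("NFC") === "fi" // false -- the ligature is not canonically equivalent to "fi"
"\uFB01".normalize("NFKC") === "fi" // true -- but it is compatibility equivalent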

Not all systems use the same normalization form, or even normalize consistently. So a string could get added as a key to a dictionary, and if it gets normalized along the way, iterating the keys might yield a different byte sequence than the one that was inserted, even though it's canonically the same string. Whether to normalize or not depends on the use case. For fun, try copying and pasting the two different forms of the acute E above, and see if your text editor can find both forms. Results will vary.

I'm not an expert in this area, and I'm not trying to write a Unicode library, so I settled on these rules for handling comparisons:

I'm sure there are more nuances than this, but so far this has kept me out of problems.

String resize

ASCIIZ strings can be truncated by just shoving in a NULL terminator at the desired location. This doesn't work for Unicode strings.

Depending on the encoding, this may cut off part of a multibyte code point, or it may even corrupt the encoding entirely, making the string no longer decodable.

Even if we take care to align on a code point boundary, we may be truncating off a combining character. Instead of truncating off the entire last character of "sé", we might accidentally truncate off just the combining accent and end up with "se".
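Here's a sketch of a byte level truncation in JavaScript that at least avoids cutting a code point in half (it assumes the input is valid UTF-8, and it can still strip a trailing combining mark as described above):

function truncateUtf8(bytes, maxBytes) {
    let end = Math.min(maxBytes, bytes.length);
    // UTF-8 continuation bytes look like 0b10xxxxxx; back up until we're no
    // longer pointing into the middle of a multi byte sequence.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) end--;
    return bytes.slice(0, end);
}
const bytes = new TextEncoder().encode("sé🙂");
new TextDecoder().decode(truncateUtf8(bytes, 5)) // "sé" -- the Emoji didn't fit, so it's dropped whole

Aligning on grapheme clusters on top of this requires the segmentation rules, which is best left to a library.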

String index

As with most things involving count in Unicode, we need to know what we're indexing - bytes, code points, or characters (grapheme clusters)?

Since many encodings of a Unicode string are not fixed width, indexing code points or characters is not a constant time operation the same way it is in ASCII. The following innocuous looking code can actually be quadratic:

String s = get_some_string();
// If get(i) has to scan from the start of the string to locate code point i,
// each call is O(n), and the whole loop becomes O(n^2).
for (int i=0; i<s.length; i++) {
    process(s.get(i));
}

In most cases, it's best to avoid doing a pass to get an index, only to follow up with another pass to do the actual operation. For example, to see if a character is in the string, don't index through the string, just use the library's search function. To split a string by spaces, don't first find the spaces and then split on those indexes, just call split directly.
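In JavaScript, for example, the built in helpers already do the single pass for us (s and needle are just placeholders here):

s.includes(needle) // membership test in one pass, no index bookkeeping
s.split(" ") // split on spaces directly instead of finding the indexes first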

To iterate through a string, use designated string iterator functions or types if they exist. Good libraries will avoid having to start from the beginning each time.
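Here's a sketch of linear iteration in JavaScript, reusing the hypothetical get_some_string and process names from the snippet above:

const s = get_some_string();
// Iterate by code point: a single linear pass, no index arithmetic.
for (const cp of s) {
    process(cp);
}
// Or iterate by grapheme cluster, if "character" means what the user sees.
for (const { segment } of new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(s)) {
    process(segment);
}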

Interfacing with other components

Interfacing with other subsystems can lead to all sorts of interesting gotchas. Two of the most common are the file system and the database.

Some file systems will normalize the file name when creating a new file. Other file systems won't try to interpret the file name and instead just treat it as an opaque blob. If you want to be cross platform, you need to be compatible with both.

Both approaches have their quirks. When a file system treats file names as opaque blobs, you can have multiple file names that are canonically the same string but have different code points. Depending on how you enter the file name, and depending on whether the higher level application code decides to normalize, a file name may look exactly the same on screen while a different file is being edited underneath. On the other hand, when the file system normalizes, the file name you give to the OS may not be the byte sequence that ends up on disk. In that case, iterating and searching for a file may require a Unicode aware string comparison.
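For example, here's a hedged Node.js sketch of a Unicode aware lookup - findFile is a made up helper that treats names as equal if they are canonically equivalent after NFC normalization:

const fs = require("fs");
const path = require("path");

function findFile(dir, wantedName) {
    const wanted = wantedName.normalize("NFC");
    for (const entry of fs.readdirSync(dir)) {
        if (entry.normalize("NFC") === wanted) {
            return path.join(dir, entry); // found, in whatever form is on disk
        }
    }
    return null; // not present under any normalization of the name
}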

That might cause some occasional confusion, but these aren't the major problems. No, the really major problems come when you try to do cross platform interoperability or cross platform development. Here is an example of how a problem can occur:

  1. Development is being done on a normalizing file system. Somewhere in this program is a catalog of assets used by the program - maybe some graphics images. The file names have Unicode characters in them.
  2. The list of assets is stored in a text format in the program, but not stored in a normalized form.
  3. Any tools that create or manipulate these assets use the string that's in the asset catalog. However, the file name is being normalized by the filesystem. Now the file name in the asset catalog and the file name on the file system have different byte representations.
  4. During deployment, the assets and the program are copied to and executed on a file system which treats the file names as opaque. The program attempts to open a non normalized file name, and since the file name has a different byte sequence, the file is no longer found.

There are many variants of this kind of bug. See, for example, this bug where the SVN source control system had issues with filenames with accents. And in case you GIT users are feeling smug because SVN is some ancient system, see this other bug where GIT ran into a similar issue (the bug starts with case sensitivity issues, but later describes how it can also manifest with Unicode characters).

A database may need to have Unicode support enabled to properly support multilingual string storage, sorting, and searching. Any strings that are used in a database query have to be converted to the encoding expected by the database. (Some databases may allow the client to set the encoding of the strings on a per connection basis).

Alternatively, depending on the use cases, a database can just store all its strings as blobs, and the application layer can handle any searching, indexing, etc as necessary. In this case, the type should be a non string type so the database doesn't try to do any type conversion. The strings should all be converted to the same encoding before being stored as a blob.
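A minimal sketch of that convention, assuming UTF-8 is the agreed upon encoding for the blobs:

const toBlob = (s) => new TextEncoder().encode(s); // always store UTF-8 bytes
const fromBlob = (bytes) => new TextDecoder("utf-8").decode(bytes); // always read them back as UTF-8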

Avoiding problems

Make sure you're using the right mental model for string manipulation. A string is not a sequence of bytes; a string is a sequence of Unicode code points, along with an encoding. A string cannot be manipulated unless you know its encoding. And a sequence of code points does not map one to one to the graphical elements you see on screen. For C programmers, this is analogous to treating pointers as pointers and integers as integers. Even though pointers are ultimately "just integers," thinking of them as plain numbers and casting carelessly between different pointer types eventually leads to crashes.

If your language or library has a string type that supports Unicode, use it. Convert all your inputs to the string type when reading them in. Do all internal program manipulation on the string type. Only convert to other representations as necessary when interfacing with other components, e.g., with the OS or with other clients over the Internet.

If you must write your own type, I suggest using UTF-8 as the standard, and then follow the suggestions above - in other words, convert all strings to UTF-8 on input, only do manipulations on UTF-8, and then convert to other encodings as necessary only when interfacing with outside libraries. For more on this philosophy see the UTF-8 everywhere manifesto.
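As a sketch of "convert on input" that also guards against the kind of invalid byte sequence that caused the crash in the introduction - bytesToString is a made up helper, and TextDecoder's fatal option makes malformed UTF-8 throw instead of being silently replaced:

function bytesToString(bytes) {
    try {
        return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    } catch (err) {
        return null; // invalid UTF-8 -- the caller has to decide what to do
    }
}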

If your program stores references to files - for example, if the program uses some kind of asset catalog, or the user can add and create references to external files - the stored filename may not be byte for byte equivalent to the reference in your program. This can vary by platform and can cause different behavior on different platforms.

Test your code with a variety of characters and on a variety of platforms. For example, try throwing in accented characters or Emoji into your test data and make sure that your code handles (or fails) properly. Various test sets can be found on the web.

Other useful references

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
An overview of the basics of Unicode
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Another overview of the basics of Unicode, a bit more technical detail than the previous one
Awesome Unicode
From the page description: A curated list of delightful Unicode tidbits, packages and resources.
Let’s Stop Ascribing Meaning to Code Points
A deeper discussion of code points versus grapheme clusters
UTF-8 and Unicode FAQ for Unix/Linux

by Yosen
I'm a software engineer who works on both mobile and web programming. I think STEM education is important. I also follow pop culture and dabble in music and the digital arts.