Zoë Blade's notebook

UTF-8

UTF-8 is a very popular character encoding for Unicode. It's a superset of ASCII, increasing the number of supported characters from 128 to over a million. It's essentially ASCII's successor.

The backwards compatibility with ASCII works because an ASCII character is 7-bit, while a byte is 8-bit, meaning that every byte representing an ASCII character has its most significant bit set to 0.

All UTF-8 characters are either single-byte ASCII characters with this bit set to 0, or multi-byte characters with this bit set to 1 for each and every byte. So within the context of known UTF-8 text, if any given byte's most significant bit is set to 1, it can be inferred that it's part of a multi-byte character. Anything that can't cope with non-ASCII characters can therefore simply ignore all bytes that have their most significant bit set, leaving them untouched.
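As a minimal sketch of that last point, here's an ASCII-only routine that uppercases the letters a to z and copies every byte with its most significant bit set straight through. The name ascii_upper is made up for this example; it isn't from any library.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase ASCII letters a-z; copy every other byte unchanged."""
    out = bytearray()
    for byte in data:
        if byte & 0x80:
            # Most significant bit set: part of a multi-byte character,
            # so leave it untouched.
            out.append(byte)
        elif 0x61 <= byte <= 0x7A:
            # ASCII lowercase letter: shift down to its uppercase form.
            out.append(byte - 0x20)
        else:
            # Any other ASCII byte (digits, spaces, punctuation...).
            out.append(byte)
    return bytes(out)


text = "zo\u00eb caf\u00e9"  # "zoë café"
result = ascii_upper(text.encode("utf-8")).decode("utf-8")
print(result)  # ZOë CAFé
```

The routine knows nothing about Unicode, yet its output is still valid UTF-8, because it never alters a byte belonging to a multi-byte character.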

Each character can take one to four bytes to store. Basic Latin characters take up one byte each, as they're still ASCII; extended Latin characters, and most other current alphabets, take two; almost all other alphabets and syllabaries still currently in use take three; and everything else is encompassed in four bytes.
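To see those four sizes in practice, here's a short sketch that encodes one sample character from each group and shows its bytes in binary. The four sample characters are my own picks for illustration.

```python
samples = [
    "A",           # Basic Latin
    "\u00e9",      # é, extended Latin
    "\u3042",      # あ, hiragana
    "\U0001F600",  # 😀, outside the first 65,536 code points
]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # Show each byte as eight binary digits.
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {bits}")

print(lengths)  # [1, 2, 3, 4]
```

In the binary column, the single ASCII byte starts with 0, while every byte of the longer characters starts with 1.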

Contrast with UTF-32, which uses four bytes for every single character. As this is more consistent, it's much simpler to process, and less English-centric. The tradeoff is that it's even more wasteful to store.
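A quick sketch of that tradeoff, using an arbitrary ASCII-only sample string: the UTF-32 version is four times the size, but the nth character always sits at byte offset 4 × n.

```python
text = "Hello, world"

utf8 = text.encode("utf-8")
# Name the byte order explicitly so Python doesn't prepend a byte order mark.
utf32 = text.encode("utf-32-le")

print(len(utf8))   # 12
print(len(utf32))  # 48

# Fixed width makes indexing trivial: character n lives at bytes 4n to 4n + 3.
n = 7
nth = utf32[4 * n : 4 * (n + 1)].decode("utf-32-le")
print(nth)  # w
```

Doing the same lookup in UTF-8 generally means scanning from the start, as earlier characters may each take anywhere from one to four bytes.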

Compared to all the old code pages, UTF-8 finally allows using multiple different alphabets and syllabaries within the same plain text file or stream. Even without that important advantage, it automates working out which one is being used, at the expense of having larger files. Given how far storage space has come since code pages were popular, the larger files are well worth it in order to be able to give those files to someone in a different country and expect them to see all the characters correctly.
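As an illustration of the guesswork code pages involve, this sketch decodes the same single byte under two Windows code pages, then round-trips a line of mixed scripts through UTF-8. The sample text is made up for the example.

```python
raw = bytes([0xE9])

# The same byte means different things depending on the assumed code page.
western = raw.decode("cp1252")   # é (Western European)
cyrillic = raw.decode("cp1251")  # й (Cyrillic)
print(western, cyrillic)

# UTF-8 can hold several scripts at once: Latin, Cyrillic, and katakana.
mixed = "caf\u00e9 \u043a\u0430\u0444\u0435 \u30ab\u30d5\u30a7"
round_trip = mixed.encode("utf-8").decode("utf-8")
print(round_trip == mixed)  # True

# A single code page can't represent the same text.
try:
    mixed.encode("cp1252")
    fits_in_cp1252 = True
except UnicodeEncodeError:
    fits_in_cp1252 = False
print(fits_in_cp1252)  # False
```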

To put the inflation into perspective, remember that data storage media have grown in size far more than text files. Converting an old 32 KB JIS file from the early 1990s to a UTF-8 version of up to 96 KB, fit for the twenty-first century, isn't so much of a problem. A DOS-formatted 3.5″ floppy disk from the 1990s could only hold 1.44 MB of data, whereas a more modern SDHC card can hold 32 GB. In other words, modern files that are up to 3 times the size of their 1990s equivalents aren't a big deal on storage media over 20,000 times the size of their 1990s equivalents. Now more than ever, text files are a solved problem.
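The arithmetic can be checked with a rough sketch. It assumes the floppy holds 1,440 × 1,024 bytes and that the card's 32 GB means 32 × 10⁹ bytes; the sample Japanese strings are my own. How much a file grows depends on which characters it contains, so the sketch measures two cases.

```python
FLOPPY_BYTES = 1440 * 1024  # "1.44 MB" floppy: 1,474,560 bytes (assumed)
SDHC_BYTES = 32 * 10**9     # 32 GB card, decimal gigabytes (assumed)

capacity_ratio = SDHC_BYTES / FLOPPY_BYTES
print(round(capacity_ratio))  # 21701


def growth(text: str) -> float:
    """How many times larger the text is in UTF-8 than in Shift JIS."""
    return len(text.encode("utf-8")) / len(text.encode("shift_jis"))


kanji = "\u65e5\u672c\u8a9e"            # 日本語: two bytes each in Shift JIS
halfwidth_kana = "\uff76\uff85"         # ｶﾅ: one byte each in Shift JIS

print(growth(kanji))           # 1.5
print(growth(halfwidth_kana))  # 3.0
```

Either way, the growth in file size is tiny next to the growth in storage capacity.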
