The Real UTF-8

Back to the real UTF-8 standard.

I want to make the point that it is "self-synchronizing": if something goes terribly wrong and you lose some bytes, you can still recover and find the next character boundary without any trouble.

Losing sync used to be a terrible problem with character sets like Shift-JIS in the mid-90s: a C pointer error could shave off one byte and leave you with an entire screen of "mojibake", aka "transformed characters" (or "garbage" to Merkin Boobs like myself).
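UTF-8 avoids this because every continuation byte has the bit pattern 10xxxxxx, and no lead byte or plain ASCII byte does. Here is a minimal sketch of the recovery in Python; the name next_boundary is my own invention for illustration, not part of any library:

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that starts a character."""
    # Continuation bytes are always 10xxxxxx (0x80-0xBF). Lead bytes and
    # ASCII never match that pattern, so skipping past them resynchronizes.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


text = "日本語 text".encode("utf-8")
damaged = text[1:]  # simulate a pointer error that shaves off the first byte
start = next_boundary(damaged, 0)
recovered = damaged[start:].decode("utf-8")
print(recovered)  # prints: 本語 text
```

Only the one damaged character is lost; everything after it decodes normally.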

So, self-synchronization is a great feature.

It also messes everything up--

doom@kzsu.stanford.edu