16bits Basic Multilingual Plane

(The Latin-1 codepoints have become a small dark stripe at the top.)

In the original scheme, this was everything, it was just 16 bits-- it annoys me slightly that some complain about the "persistant myth that unicode is a double-byte character set": it's a "myth" because it was true for years.

Can you get every character from every language into a 64k space? You can get close... if you make compromises (1) they only cared about current usage, not historical, (2) the ideograms derived from Chinese would have to share codepoints, and the Chinese, Japanese and Koreans would all need to get along. Note: if this were to scale, the CJKV block would be 42% of the total.

Note the peculiar wart of the Surrogate Codes: 2k of the BMP used just to get the UTF-16 encoding to work.

(Diagram borrowed from the Unicode documentation.)

Top

doom@kzsu.stanford.edu