Sankalp Bose

Versatile programmer with experience spanning the Linux kernel, Java web services, embedded systems and performance engineering.

Unicode condensed

10 Mar 2017 » general

Unicode

  • A user-perceived Unicode character is a sequence of one or more code points. A code point is an index into a code space of 1,114,112 values.
  • Code points are written in hexadecimal with a U+ prefix and range from U+0000 to U+10FFFF.
  • 128,237 code points are actually assigned (as of Unicode 9.0). A further 137,468 code points are reserved for private use.
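As a minimal sketch of the points above, Python's built-in `ord` and `chr` convert between a one-code-point string and its index in the code space. The helper name `code_point_label` and the constant `CODE_SPACE_SIZE` are made up for this illustration.

```python
# Size of the Unicode code space: U+0000 through U+10FFFF inclusive.
CODE_SPACE_SIZE = 0x10FFFF + 1


def code_point_label(ch: str) -> str:
    """Format a single code point in the conventional U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(CODE_SPACE_SIZE)                   # 1114112
print(code_point_label("A"))             # U+0041
print(code_point_label("\U0001F600"))    # U+1F600
print(chr(0x41))                         # A
```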

UTF-32

  • Fixed-width encoding: each code point is stored directly as a 32-bit index into the Unicode code space.
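A small sketch of what this means in practice, assuming big-endian byte order: every code point becomes exactly four bytes. The function name `utf32_be` is hypothetical; the result is checked against Python's built-in `utf-32-be` codec.

```python
import struct


def utf32_be(text: str) -> bytes:
    """Encode each code point as one big-endian 32-bit unsigned integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


sample = "A\u20AC\U0001F600"   # "A", euro sign, grinning face
encoded = utf32_be(sample)

# Three code points always take 3 * 4 = 12 bytes, whatever their values.
print(len(encoded))                           # 12
print(encoded == sample.encode("utf-32-be"))  # True
```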

UTF-8

  • Variable-width encoding in which each code point takes 1 to 4 octets. A code point can be encoded or decoded using the table below.
UTF-8 (binary)                       Code point (binary)    Range
0xxxxxxx                             xxxxxxx                U+0000–U+007F
110xxxxx 10yyyyyy                    xxxxxyyyyyy            U+0080–U+07FF
1110xxxx 10yyyyyy 10zzzzzz           xxxxyyyyyyzzzzzz       U+0800–U+FFFF
11110xxx 10yyyyyy 10zzzzzz 10wwwwww  xxxyyyyyyzzzzzzwwwwww  U+10000–U+10FFFF
  • Code points below 128 (ASCII characters) are encoded as single bytes.
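The bit layout in the table above can be sketched as a hand-written encoder. This is an illustration, not a replacement for `str.encode`. The function name `utf8_encode` is hypothetical, and rejecting the surrogate range U+D800–U+DFFF is an added assumption that matches how Python's own codec behaves.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 following the bit-layout table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample from each row of the table, checked against the built-in codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")

print(utf8_encode(0x20AC).hex())   # e282ac
```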

UTF-16

  • Variable-width encoding in which each code point takes one or two 16-bit words, decoded as per the table below.
UTF-16 (binary)                    Code point (binary)             Range
xxxxxxxxxxxxxxxx                   xxxxxxxxxxxxxxxx                U+0000–U+FFFF
110110xxxxxxxxxx 110111yyyyyyyyyy  xxxxxxxxxxyyyyyyyyyy + 0x10000  U+10000–U+10FFFF
  • UTF-16’s 16-bit words can be stored either little-endian or big-endian. By convention, U+FEFF (zero width no-break space) is placed at the start of a UTF-16 file as a byte-order mark (BOM) to disambiguate the endianness.
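The surrogate-pair arithmetic in the table and the byte-order mark convention can be sketched as follows. The function name `utf16_units` is hypothetical and the surrogate-range check is an added assumption. The BOM part relies on Python's `utf-16` codec, which reads a leading BOM to pick the byte order when decoding.

```python
import codecs
import struct
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Return the one or two 16-bit words that encode a code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0xFFFF:
        return [cp]
    offset = cp - 0x10000                 # 20 bits: xxxxxxxxxxyyyyyyyyyy
    return [0xD800 | (offset >> 10),      # 110110xxxxxxxxxx
            0xDC00 | (offset & 0x3FF)]    # 110111yyyyyyyyyy


units = utf16_units(0x1F600)
print([f"{u:04X}" for u in units])        # ['D83D', 'DE00']

# Serialising the words big-endian matches the built-in codec.
packed = struct.pack(f">{len(units)}H", *units)
assert packed == "\U0001F600".encode("utf-16-be")

# The same text in both byte orders, each prefixed with its BOM.
big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")      # FE FF 00 41
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")   # FF FE 41 00
assert big.decode("utf-16") == little.decode("utf-16") == "A"
```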

Dynamic composition

  • “Á” can be expressed as a string of two code points: U+0041 “A” (latin capital letter a) followed by U+0301 “◌́” (combining acute accent).
  • In this example, “A” is the base code point and the accent “◌́” is a combining mark.
  • Canonical equivalence: there may be two or more ways to dynamically compose the same user-perceived character, e.g. “ệ” can be written as U+1EB9 “ẹ” + U+0302 “◌̂” or as U+00EA “ê” + U+0323 “◌̣”.
  • Several normalization forms exist to convert strings into a canonical form so that they can be compared code point by code point: NFD, NFC, NFKD and NFKC.
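A short sketch of canonical equivalence using Python's standard `unicodedata` module: the two spellings of “ệ” from the example above differ as raw strings but compare equal once both are normalized to the same form. The variable names are illustrative only.

```python
import unicodedata

spelling_a = "\u1EB9\u0302"   # "ẹ" + combining circumflex accent
spelling_b = "\u00EA\u0323"   # "ê" + combining dot below

# Raw comparison is code point by code point, so the spellings differ.
print(spelling_a == spelling_b)    # False

# After normalization both spellings are the same code point sequence.
nfc_a = unicodedata.normalize("NFC", spelling_a)
nfc_b = unicodedata.normalize("NFC", spelling_b)
print(nfc_a == nfc_b)              # True
print(len(nfc_a))                  # 1 (the precomposed U+1EC7)

# NFD goes the other way and fully decomposes: base letter plus two marks.
nfd_a = unicodedata.normalize("NFD", spelling_a)
print([f"U+{ord(c):04X}" for c in nfd_a])   # ['U+0065', 'U+0323', 'U+0302']

# "A" + combining acute accent composes to the single code point U+00C1.
print(unicodedata.normalize("NFC", "A\u0301") == "\u00C1")   # True
```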

Further reading

http://reedbeta.com/blog/programmers-intro-to-unicode/