Unicode
- What a user perceives as a single character may be a sequence of one or more code points. A code point is an index into a code space of 1,114,112 values.
- Code points are written in hexadecimal with a U+ prefix and range from U+0000 to U+10FFFF.
- As of Unicode 9.0, 128,237 code points are actually assigned. An additional 137,468 code points are reserved for private use.
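The size of the code space and the U+ notation can be checked with Python's built-in `ord` and `chr`. This is only a small sketch; `MAX_CODE_POINT`, `CODE_SPACE_SIZE` and `u_plus` are names made up for illustration.

```python
# The code space runs from U+0000 through U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
CODE_SPACE_SIZE = MAX_CODE_POINT + 1  # 1,114,112 possible code points


def u_plus(ch: str) -> str:
    """Format a one-code-point string in U+XXXX notation (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


print(CODE_SPACE_SIZE)           # 1114112
print(u_plus("A"))               # U+0041
print(u_plus(chr(0x10FFFF)))     # U+10FFFF
```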
UTF-32
- Each code point is stored directly as a single 32-bit code unit, so every code point takes four bytes.
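As a rough illustration, UTF-32 can be produced by packing each code point into a 32-bit integer. The sketch below assumes big-endian byte order and no byte-order mark; `utf32_be` is a hypothetical helper, compared against Python's built-in `utf-32-be` codec.

```python
import struct


def utf32_be(text: str) -> bytes:
    """Encode each code point as one big-endian 32-bit unsigned integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


sample = "A\u00e9\U0001F600"  # three code points of very different magnitude
encoded = utf32_be(sample)

print(len(encoded))                           # 12: four bytes per code point
print(encoded == sample.encode("utf-32-be"))  # True
```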
UTF-8
- Variable-length encoding in which each code point takes 1 to 4 octets. A code point can be decoded using the table below.
UTF-8 (binary) | Code point (binary) | Range |
---|---|---|
0xxxxxxx | xxxxxxx | U+0000–U+007F |
110xxxxx 10yyyyyy | xxxxxyyyyyy | U+0080–U+07FF |
1110xxxx 10yyyyyy 10zzzzzz | xxxxyyyyyyzzzzzz | U+0800–U+FFFF |
11110xxx 10yyyyyy 10zzzzzz 10wwwwww | xxxyyyyyyzzzzzzwwwwww | U+10000–U+10FFFF |
- Code points below 128 (ASCII characters) are encoded as single bytes.
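The bit layout in the table above can be turned into an encoder almost line by line. The following is a minimal sketch, not a replacement for `str.encode`; `utf8_encode` is a hypothetical name, and it rejects the surrogate range U+D800 to U+DFFF, which UTF-8 does not allow.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the four rows of the UTF-8 table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x20AC).hex())  # e282ac
```

Comparing the result with `chr(cp).encode("utf-8")` for a few code points from each range is an easy way to sanity-check the sketch.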
UTF-16
- Variable-length encoding in which each code point takes one or two 16-bit words (code units), decoded as per the table below.
UTF-16 (binary) | Code point (binary) | Range |
---|---|---|
xxxxxxxxxxxxxxxx | xxxxxxxxxxxxxxxx | U+0000–U+FFFF |
110110xxxxxxxxxx 110111yyyyyyyyyy | xxxxxxxxxxyyyyyyyyyy + 0x10000 | U+10000–U+10FFFF |
- UTF-16’s words can be stored either little-endian or big-endian. By convention, U+FEFF zero width no-break space is placed at the start of a UTF-16 file as a byte-order mark (BOM) to disambiguate the endianness.
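The surrogate-pair row of the table and the byte-order mark check can be sketched as follows. Both `utf16_units` and `detect_byte_order` are hypothetical helpers written for illustration, and the BOM check assumes the data is already known to be UTF-16.

```python
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Return the one or two 16-bit code units that encode a code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000             # 20-bit value: xxxxxxxxxxyyyyyyyyyy
    high = 0xD800 | (offset >> 10)    # 110110xxxxxxxxxx
    low = 0xDC00 | (offset & 0x3FF)   # 110111yyyyyyyyyy
    return [high, low]


def detect_byte_order(data: bytes) -> str:
    """Guess endianness from a leading U+FEFF byte-order mark, if present."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return "unknown"


print([hex(u) for u in utf16_units(0x1F600)])              # ['0xd83d', '0xde00']
print(detect_byte_order("\ufeffhi".encode("utf-16-le")))   # little
```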
Dynamic composition
- “Á” can be expressed as a string of two code points: U+0041 “A” latin capital letter a plus U+0301 “◌́” combining acute accent.
- In this example, “A” is the base code point and the accent “◌́” is the combining mark.
- Canonical equivalence: there may be two or more ways to dynamically compose the same user-perceived character. For example, “ệ” can be written as U+1EB9 “ẹ” + U+0302 “◌̂” or as U+00EA “ê” + U+0323 “◌̣”.
- Several normalization forms (NFD, NFC, NFKD and NFKC) convert strings into a canonical form so that they can be compared code point by code point.
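The “ệ” example can be tried with Python's standard `unicodedata` module. This is a small sketch; `canonically_equal` is a hypothetical helper, and using NFC for the comparison is just one reasonable choice (NFD would work equally well).

```python
import unicodedata

# Two different code point sequences for the same perceived character.
seq_a = "\u1eb9\u0302"  # U+1EB9 + U+0302 combining circumflex
seq_b = "\u00ea\u0323"  # U+00EA + U+0323 combining dot below


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(seq_a == seq_b)                   # False: raw code points differ
print(canonically_equal(seq_a, seq_b))  # True

# NFC composes to one code point, NFD decomposes to base + marks.
print(len(unicodedata.normalize("NFC", seq_a)))  # 1
print(len(unicodedata.normalize("NFD", seq_a)))  # 3
```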
Further reading
http://reedbeta.com/blog/programmers-intro-to-unicode/