Unicode

A user-perceived character is a sequence of one or more code points. A code point is an index into a code space of 1,114,112 values.
Code points are written in hexadecimal, from U+0000 to U+10FFFF.
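128,237 code points are actually assigned; the exact count grows with each Unicode version. A further 137,468 code points are set aside for private use, and most of the remainder is reserved for future assignment.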
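
As a quick sanity check on these figures, the following Python sketch derives the size of the code space from its bounds and shows that 137,468 is exactly the combined size of the three private-use ranges. The names CODE_SPACE_SIZE, PRIVATE_USE_RANGES and format_code_point are illustrative only and do not come from any library.

```python
# Illustrative names; none of these come from a library.
CODE_SPACE_SIZE = 0x10FFFF + 1  # code points U+0000 through U+10FFFF

# The three private-use ranges: one in the BMP, plus most of planes 15 and 16.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]
private_use_count = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)


def format_code_point(cp: int) -> str:
    """Render a code point in U+XXXX notation (at least four hex digits)."""
    if not 0 <= cp < CODE_SPACE_SIZE:
        raise ValueError("outside the Unicode code space")
    return f"U+{cp:04X}"


print(CODE_SPACE_SIZE)              # 1114112
print(private_use_count)            # 137468
print(format_code_point(ord("A")))  # U+0041
print(format_code_point(0x1F600))   # U+1F600
```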

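UTF-32

Fixed-width encoding of the code space: every code point is stored as a single 32-bit word (4 octets).

UTF-8

Encoding of the code space such that each code point is 1 to 4 octets long. A code point can be decoded using the table below.

UTF-8 (binary)	Code point (binary)	Range
0xxxxxxx	xxxxxxx	U+0000–U+007F
110xxxxx 10yyyyyy	xxxxxyyyyyy	U+0080–U+07FF
1110xxxx 10yyyyyy 10zzzzzz	xxxxyyyyyyzzzzzz	U+0800–U+FFFF
11110www 10xxxxxx 10yyyyyy 10zzzzzz	wwwxxxxxxyyyyyyzzzzzz	U+10000–U+10FFFF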
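
Decoding a 1-to-4-octet UTF-8 sequence can be sketched in Python as follows. In the standard UTF-8 bit layout the lead octet announces the sequence length and each continuation octet (10xxxxxx) contributes six bits. The function name decode_utf8_char is hypothetical, and this sketch deliberately skips the checks a strict decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
from typing import Tuple


def decode_utf8_char(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one code point starting at data[pos].

    Returns (code_point, octets_used). Hypothetical helper: it does not
    reject overlong encodings or surrogate code points.
    """
    lead = data[pos]
    if lead < 0x80:                 # 0xxxxxxx: single octet
        return lead, 1
    if lead >> 5 == 0b110:          # 110xxxxx: two octets
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:       # 1110xxxx: three octets
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:      # 11110xxx: four octets
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead octet")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for octet in data[pos + 1:pos + length]:
        if octet >> 6 != 0b10:      # continuation octets look like 10xxxxxx
            raise ValueError("invalid continuation octet")
        cp = (cp << 6) | (octet & 0x3F)
    return cp, length


print(decode_utf8_char("€".encode("utf-8")))  # (8364, 3), i.e. U+20AC
```

In real code the built-in bytes.decode("utf-8") is the right tool; the sketch only makes the bit layout explicit.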

UTF-16

Encoding of the code space such that each code point is 1 or 2 16-bit words long, decoded as per the table below.

UTF-16 (binary)	Code point (binary)	Range
xxxxxxxxxxxxxxxx	xxxxxxxxxxxxxxxx	U+0000–U+FFFF
110110xxxxxxxxxx 110111yyyyyyyyyy	xxxxxxxxxxyyyyyyyyyy + 0x10000	U+10000–U+10FFFF
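
The surrogate-pair arithmetic in the table can be sketched in Python as below. The function names encode_utf16 and decode_utf16 are hypothetical; they work on lists of 16-bit integers rather than bytes, so byte order does not come into play here.

```python
from typing import List


def encode_utf16(cp: int) -> List[int]:
    """Return the one or two 16-bit words that encode a code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                       # stored as-is in one word
    offset = cp - 0x10000                 # 20-bit value
    return [0xD800 | (offset >> 10),      # 110110xxxxxxxxxx (high surrogate)
            0xDC00 | (offset & 0x3FF)]    # 110111yyyyyyyyyy (low surrogate)


def decode_utf16(words: List[int]) -> List[int]:
    """Turn a sequence of 16-bit words back into code points."""
    code_points = []
    i = 0
    while i < len(words):
        w = words[i]
        if 0xD800 <= w <= 0xDBFF:         # high surrogate: needs a partner
            if i + 1 >= len(words) or not 0xDC00 <= words[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = words[i + 1]
            code_points.append((((w & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000)
            i += 2
        elif 0xDC00 <= w <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(w)
            i += 1
    return code_points


print([hex(w) for w in encode_utf16(0x1F600)])  # ['0xd83d', '0xde00']
```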

UTF-16’s words can be stored either little-endian or big-endian. By convention, U+FEFF zero width no-break space is placed at the start of a UTF-16 file as a byte-order mark (BOM) to disambiguate the endianness: read with the wrong byte order it appears as U+FFFE, which is a noncharacter.
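
A minimal Python sketch of byte-order detection follows. The function name utf16_codec_from_bom is hypothetical; the BOM constants come from the standard codecs module. One caveat: a UTF-32 little-endian BOM also begins with the octets FF FE, so this check is only reliable when the data is already known to be UTF-16.

```python
import codecs
from typing import Optional


def utf16_codec_from_bom(data: bytes) -> Optional[str]:
    """Return the Python codec name implied by a leading UTF-16 BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # octets FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # octets FF FE
        return "utf-16-le"
    return None                                # no BOM: byte order unknown


data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
codec = utf16_codec_from_bom(data)
print(codec)                   # utf-16-le
print(data[2:].decode(codec))  # hi
```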

“Á” can be expressed as a string of two code points: U+0041 “A” LATIN CAPITAL LETTER A followed by U+0301 “◌́” COMBINING ACUTE ACCENT.
In this example, “A” is the base code point and the accent “◌́” is the combining mark.
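
The base-plus-combining-mark structure can be inspected with Python's standard unicodedata module, as in this sketch. The helper name describe is hypothetical. Using a non-zero canonical combining class as the test is a rough heuristic: some marks have class 0, so a stricter check would look at the general category instead.

```python
import unicodedata
from typing import List, Tuple


def describe(text: str) -> List[Tuple[str, str, str]]:
    """List (U+XXXX, name, role) for every code point in text."""
    rows = []
    for ch in text:
        # A non-zero combining class means the code point attaches to a base.
        role = "combining mark" if unicodedata.combining(ch) else "base"
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch), role))
    return rows


s = "A\u0301"   # renders as one character, stored as two code points
print(len(s))   # 2
for row in describe(s):
    print(*row)
# U+0041 LATIN CAPITAL LETTER A base
# U+0301 COMBINING ACUTE ACCENT combining mark
```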
Canonical equivalence: there may be two or more ways to compose the same user-perceived character, e.g. “ệ” can be written as U+1EB9 “ẹ” + U+0302 “◌̂” or as U+00EA “ê” + U+0323 “◌̣”.
Several normalization forms exist to convert strings into a canonical form so that they can be compared code point by code point: NFD and NFC (canonical), and NFKD and NFKC (compatibility).
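
The effect of normalization on the “ệ” example can be sketched with unicodedata.normalize from the Python standard library. The variable names are illustrative. The two spellings differ as raw strings but become identical after normalizing both to the same form.

```python
import unicodedata

a = "\u1EB9\u0302"  # ẹ followed by combining circumflex
b = "\u00EA\u0323"  # ê followed by combining dot below

print(a == b)  # False: the code point sequences differ

# NFC composes as far as possible; both spellings collapse to U+1EC7.
nfc_a = unicodedata.normalize("NFC", a)
nfc_b = unicodedata.normalize("NFC", b)
print(nfc_a == nfc_b, [f"U+{ord(c):04X}" for c in nfc_a])
# True ['U+1EC7']

# NFD decomposes fully and puts the combining marks in a fixed order.
nfd_a = unicodedata.normalize("NFD", a)
nfd_b = unicodedata.normalize("NFD", b)
print(nfd_a == nfd_b, [f"U+{ord(c):04X}" for c in nfd_a])
# True ['U+0065', 'U+0323', 'U+0302']

# The K forms additionally fold compatibility characters, e.g. the fi ligature.
print(unicodedata.normalize("NFKC", "\uFB01"))  # fi
```

Which form to use depends on the application; the important point is to normalize both sides of a comparison to the same form.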