Unicode Encoding Forms

In addition to defining the identity of each character and its numeric value (also known as its code point), character-encoding standards also define the internal representation of each character, that is, how its value is represented in bits (also known as its encoding form). The Unicode standard defines the following three encoding forms for each character.
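To make the distinction between code point and encoding form concrete, here is a minimal Python sketch (the choice of Python, and of the euro sign as the example character, is ours; the big-endian codec names utf-16-be and utf-32-be are used so that no byte-order mark is added). It prints one character's code point and then its size under each of the three encoding forms described below.

    ch = "\u20AC"                        # the euro sign, code point U+20AC
    print(hex(ord(ch)))                  # 0x20ac: the code point itself
    print(len(ch.encode("utf-8")))       # 3 bytes in UTF-8
    print(len(ch.encode("utf-16-be")))   # 2 bytes (one 16-bit word) in UTF-16
    print(len(ch.encode("utf-32-be")))   # 4 bytes (one 32-bit unit) in UTF-32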

UTF-8 (Unicode Transformation Format-8). This is a byte-oriented format in which every Unicode character is represented as a variable-length encoding of one, two, three, or four bytes (remember, 1 byte = 8 bits). This form is useful for dealing with environments designed entirely around ASCII, because the Unicode characters that correspond to the familiar ASCII character set have the same byte values as ASCII. This form is also popular for HTML and similar protocols.
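The variable-length behaviour can be seen in the following Python sketch (the sample characters are arbitrary choices covering each length; bytes.hex with a separator assumes Python 3.8 or later). The ASCII letter keeps its one-byte ASCII value, while the other characters take two, three, and four bytes.

    for ch in ("A", "é", "€", "𝄞"):      # 1-, 2-, 3-, and 4-byte examples
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

Running this prints, for instance, U+0041 -> 1 byte(s): 41 for "A", confirming that the familiar ASCII byte value is preserved.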

UTF-16 (Unicode Transformation Format-16). This is a word-oriented format in which every Unicode character is represented as a variable-length encoding of one or two words (remember, 1 word = 16 bits). This form is useful for environments that need to balance efficient access to characters with economical use of storage. This is because all the most heavily used characters can be represented by, and accessed via, a single word (a 16-bit code unit), while all other characters are represented by, and accessed via, a pair of words. Hence, this encoding form is reasonably compact and efficient, yet provides support for a larger number of characters.
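A similar sketch (again with arbitrarily chosen sample characters) shows the one-or-two-word behaviour: characters in the Basic Multilingual Plane occupy a single 16-bit word, while characters beyond it are encoded as a pair of words (a surrogate pair).

    for ch in ("A", "€", "𝄞"):           # two one-word and one two-word example
        encoded = ch.encode("utf-16-be")
        print(f"U+{ord(ch):04X} -> {len(encoded) // 2} word(s): {encoded.hex(' ')}")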

UTF-32 (Unicode Transformation Format-32). This is a double-word-oriented format in which every Unicode character is represented as a fixed-length encoding of two words (remember, 1 word = 16 bits). That is, a double word (a 32-bit code unit) encodes each character. This form is useful for environments where memory space is not a concern but fixed-width (single code unit) access to characters is desired.
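The fixed width can be seen in a final sketch along the same lines: every character, whatever its length in UTF-8 or UTF-16, occupies exactly four bytes, so the n-th character of a string always begins at byte offset 4n.

    for ch in ("A", "€", "𝄞"):           # each occupies one 32-bit code unit
        encoded = ch.encode("utf-32-be")
        print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex(' ')}")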