tail head cat sleep
QR code linking to this page

Manual Pages  — UTF8


utf8 – UTF-8, a transformation format of ISO 10646





The UTF-8 encoding represents UCS-4 characters as a sequence of octets, using between 1 and 6 for each character. It is backwards compatible with ASCII, so 0x00-0x7f refer to the ASCII character set. The multibyte encoding of non- ASCII characters consist entirely of bytes whose high order bit is set. The actual encoding is represented by the following table:
[0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] ->
        1110bbbb, 10bbbbbb, 10bbbbbb
[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] ->
        11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
        111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
        1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb

If more than a single representation of a value exists (for example, 0x00; 0xC0 0x80; 0xE0 0x80 0x80) the shortest representation is always used. Longer ones are detected as an error as they pose a potential security risk, and destroy the 1:1 character:octet sequence mapping.



Rob Pike, Ken Thompson, USENIX Association, Proceedings of the Winter 1993 USENIX Technical Conference, Hello World, January 1993.

F. Yergeau, RFC 2279, UTF-8, a transformation format of ISO 10646, January 1998.

The Unicode Consortium, as amended by the Unicode Standard Annex #27: Unicode 3.1 and by the Unicode Standard Annex #28: Unicode 3.2, The Unicode Standard, Version 3.0, 2000.


The utf8 encoding is compatible with RFC 2279 and Unicode 3.2.

UTF8 (5) April 7, 2004

tail head cat sleep
QR code linking to this page

Please direct any comments about this manual page service to Ben Bullock. Privacy policy.

If you have a problem and you think awk(1) is the solution, then you have two problems.
— David Tilbrook