UTF-8
(character) The Unicode character set occupies a 16-bit code space. The most obvious Unicode encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings can contain bytes like '\0' or '/', which have a special meaning in filenames and other C library function parameters. In addition, the majority of Unix tools expect ASCII files and can't read 16-bit words as characters without major modifications. For these reasons, UCS-2 is not a suitable external encoding of Unicode in filenames, text files, environment variables, etc.
The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies a 31-bit code space, and the obvious UCS-4 encoding for it (a sequence of 32-bit words) has the same problems.
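The problem with fixed-width encodings can be seen in a minimal Python sketch. It assumes that Python's "utf-16-be" and "utf-32-be" codecs stand in for big-endian UCS-2 and UCS-4 (they produce the same bytes for the characters used here); the helper name special_bytes is hypothetical and chosen only for illustration.

```python
def special_bytes(text, codec):
    """Return the set of byte values in the encoded text that are NUL (0x00) or '/' (0x2F)."""
    encoded = text.encode(codec)
    return {b for b in encoded if b in (0x00, 0x2F)}

# Assumption: "utf-16-be" yields the same bytes as big-endian UCS-2
# for characters that fit in 16 bits.
ucs2_a = "A".encode("utf-16-be")              # the letter A gains a NUL byte
ucs2_i_ogonek = "\u012f".encode("utf-16-be")  # U+012F contains the byte 0x2F, i.e. '/'

# Assumption: "utf-32-be" stands in for big-endian UCS-4.
ucs4_a = "A".encode("utf-32-be")              # three NUL bytes precede the A

print(ucs2_a, ucs2_i_ogonek, ucs4_a)
print(special_bytes("A", "utf-16-be"), special_bytes("\u012f", "utf-16-be"))
```

A C function that treats '\0' as a string terminator would stop at the first byte of the UCS-2 string for "A", and a filename parser would see a path separator inside the single character U+012F.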
The UTF-8 encoding of Unicode and UCS avoids the problems of fixed-length Unicode encodings because an ASCII file encoded in UTF-8 is exactly the same as the original ASCII file, and every byte of a non-ASCII character is guaranteed to have the most significant bit set (bit 0x80). This means that normal tools for text searching etc. work as expected.
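Both properties can be checked with a short Python sketch using the built-in "utf-8" and "ascii" codecs. The function names and the sample path are hypothetical, introduced only for this illustration.

```python
def is_ascii_transparent(text):
    """True if ASCII-only text encodes to identical bytes in ASCII and UTF-8."""
    return text.encode("utf-8") == text.encode("ascii")

def non_ascii_bytes_have_high_bit(text):
    """True if every byte that encodes a non-ASCII character has bit 0x80 set."""
    return all(
        byte & 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )

# Hypothetical sample path containing two non-ASCII characters.
path = "caf\u00e9/\u012f.txt"
encoded = path.encode("utf-8")

print(is_ascii_transparent("hello/world.txt"))
print(non_ascii_bytes_have_high_bit(path))
# The only '/' byte in the encoded path is the real separator,
# and no NUL byte appears anywhere.
print(encoded.count(b"/"), b"\x00" in encoded)
```

Because no byte of a multi-byte sequence falls in the ASCII range, a byte-oriented search for '/' or '\0' in UTF-8 text can only match the real ASCII character.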
UTF-8 is defined in RFC 2279 (since obsoleted by RFC 3629).
["File System Safe UCS Transformation Format (FSS_UTF)", X/Open Preliminary Specification, X/Open Company Ltd., Document Number: P316. This information also appears in ISO/IEC 10646, Annex P].
Plan 9 UTF manual entry.