Chapter 1. Modified UTF8 Encoding of UNICODE characters

Allowing UNICODE characters in all user-visible strings introduces compatibility problems if the protocol must remain backward compatible. UTF8 encoding is used to convert wide UNICODE characters into a byte string compatible with existing IRC servers and clients.

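To permit the full range of UNICODE characters, we must introduce an additional post-processing step on the result of a UTF8 translation. The step is needed because characters such as space, comma, CR and LF act as delimiters in the IRC protocol and therefore cannot appear literally inside a string.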

Any string beginning with a '%' character (i.e. "reason" strings within a REDIRECT command) will be interpreted as a UTF8-encoded UNICODE string.
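
The prefix check described above can be sketched as follows. This is only an illustration: the function name interpret_string is hypothetical, and two behaviours are assumptions not stated in this chapter, namely that the '%' marker is stripped before decoding and that strings without the marker are treated as Latin-1. The backslash post-processing step is deliberately left out.

```python
from typing import Tuple


def interpret_string(raw: bytes) -> Tuple[bool, str]:
    """Return (is_unicode, text) for a raw protocol string.

    Assumption: the leading '%' is a marker and not part of the text.
    Assumption: strings without the marker are decoded as Latin-1.
    """
    if raw.startswith(b"%"):
        # Marked string: the remainder is UTF8-encoded UNICODE.
        return True, raw[1:].decode("utf-8")
    # Unmarked string: legacy single-byte text.
    return False, raw.decode("latin-1")
```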

UNICODE characters encoded in UTF8 may occupy more than one byte each, whereas an ASCII character always occupies exactly one. To ensure compatibility, UNICODE strings such as nicknames and channel names must fit within the standard length limits, measured in bytes rather than in characters.
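
A minimal sketch of the byte-based length check follows. The function name fits_byte_limit and the max_bytes parameter are hypothetical; the actual limit for a nickname or channel name is whatever the server enforces and is not defined in this chapter.

```python
def fits_byte_limit(name: str, max_bytes: int) -> bool:
    """Check a name against a limit counted in UTF8 bytes, not characters."""
    encoded = name.encode("utf-8")
    return len(encoded) <= max_bytes
```

For example, the three-character name "日本語" encodes to 9 bytes in UTF8, so it fits a 9-byte limit but would be rejected by a 8-byte limit even though it is only three characters long.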

The quoting character for the post-processing step is the '\' character. All mappings are listed in the table below.

Table 1.1. Character Quoting in UTF-8

Quoted Character    Unquoted Character
\b                  " " (space, ASCII 32)
\c                  "," (comma, ASCII 44)
\\                  "\" (backslash, ASCII 92)
\r                  CR (carriage return, ASCII 13)
\n                  LF (line feed, ASCII 10)
\t                  TAB (horizontal tab, ASCII 9)
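
The mappings in Table 1.1 can be applied with a pair of small functions. This is a sketch under stated assumptions, not a reference implementation: the names QUOTE_MAP, quote and unquote are hypothetical, and passing an unrecognised backslash sequence through unchanged is an assumption, since the table does not say how such input should be handled. Because every quoted character is ASCII, working on the decoded text gives the same result as working on the UTF8 bytes.

```python
# Unquoted character -> quoted form, as listed in Table 1.1.
QUOTE_MAP = {
    " ": "\\b",
    ",": "\\c",
    "\\": "\\\\",
    "\r": "\\r",
    "\n": "\\n",
    "\t": "\\t",
}

# Letter following the backslash -> unquoted character.
UNQUOTE_MAP = {quoted[1]: plain for plain, quoted in QUOTE_MAP.items()}


def quote(text: str) -> str:
    """Replace each special character with its backslash-quoted form."""
    return "".join(QUOTE_MAP.get(ch, ch) for ch in text)


def unquote(text: str) -> str:
    """Reverse quote(). Unknown sequences are kept as-is (assumption)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in UNQUOTE_MAP:
            out.append(UNQUOTE_MAP[text[i + 1]])
            i += 2  # consume the backslash and the letter after it
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

The backslash itself has to be in the table so that quoting is reversible: a literal backslash followed by "b" is quoted as two backslashes and a "b", which unquotes back to the original two characters rather than to a space.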

IRCX clients view UTF8-encoded UNICODE strings in their native form. So that non-IRCX clients can interoperate with UNICODE nicks, the server translates each UNICODE nick into a usable form before sending it to a non-IRCX client. This form is a simple hex format preceded by a '^' character. A non-IRCX client can use the hex-format nickname to refer to the IRCX/UNICODE user.
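
The text above does not define the hex format in detail, so the following is purely illustrative. It assumes the form is '^' followed by the upper-case hexadecimal digits of the nick's UTF8 bytes; a real server may use a different layout. The names to_hex_nick and from_hex_nick are hypothetical.

```python
def to_hex_nick(nick: str) -> str:
    """Render a UNICODE nick for a non-IRCX client.

    Assumption: '^' followed by the hex digits of the UTF8 bytes.
    """
    return "^" + nick.encode("utf-8").hex().upper()


def from_hex_nick(hex_nick: str) -> str:
    """Recover the UNICODE nick from the assumed hex form."""
    if not hex_nick.startswith("^"):
        raise ValueError("hex-format nick must start with '^'")
    return bytes.fromhex(hex_nick[1:]).decode("utf-8")
```

Whatever the exact layout, the translation has to be reversible in this way, because the server must map the hex-format nick sent by a non-IRCX client back to the UNICODE user it names.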