Arnt Gulbrandsen

The one-minute guide to implementing unicode email addresses

The unicode email address extensions are pleasantly simple to implement. Here is an overview of the RFCs and some notes I made while doing my first implementations; this posting is a very brief description of the protocol and format extensions involved. Despite its brevity it's nearly complete, because these extensions are so simple.

Mail message format: UTF8 is now permitted everywhere. Messages can use plain UTF8 instead of RFC2047 encoding, quoted-printable and so on.

To: Jøran Øygårdvær <jøran@blåbærsyltetøy.gulbrandsen.priv.no>
Subject: Høy på pæra
Content-Type: text/plain; charset=utf-8

Gørrlei av eksempler.

No encoding is necessary anywhere. Encoding is permitted but not necessary. The message above lacks From and Date; apart from that it's correct.
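As a rough illustration, here is one way to produce such a message with Python's standard email package. This is a sketch, not part of the RFCs: the function name build_message is made up for this example, and it relies on the SMTPUTF8 policy object to write headers as raw UTF8 rather than RFC2047 encoded words.

```python
from email.message import EmailMessage
from email.policy import SMTPUTF8


def build_message() -> bytes:
    """Build the example message and serialize it with raw UTF-8 headers.

    build_message is a hypothetical helper name used only for this sketch.
    """
    msg = EmailMessage()
    msg["To"] = "Jøran Øygårdvær <jøran@blåbærsyltetøy.gulbrandsen.priv.no>"
    msg["Subject"] = "Høy på pæra"
    # From and Date are still missing, exactly as in the example above.
    msg.set_content("Gørrlei av eksempler.\n")
    # The SMTPUTF8 policy (utf8=True) leaves non-ASCII header text unencoded.
    return msg.as_bytes(policy=SMTPUTF8)


raw = build_message()
```

The bytes in raw can be handed to an SMTP client as they are, provided the server accepts SMTPUTF8.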

Sending mail using SMTP: The server advertises the SMTPUTF8 extension, the MAIL FROM command includes the argument SMTPUTF8, and the email addresses can then use UTF8.

$ telnet mx.example.com 25
Trying 2001:6d8::4269…
Connected to mx.example.com
Escape character is '^]'.
220 mx.example.com ESMTP Postfix (3.0.0)
ehlo myhostname
250-mx.example.com
250-PIPELINING

250 SMTPUTF8
mail from:<> smtputf8
250 2.1.0 Ok
rcpt to:<jøran@blåbærsyltetøy.gulbrandsen.priv.no>
250 2.1.5 Ok
data
354 End data with <CR><LF>.<CR><LF>

Note that the EHLO argument is sent before the client knows whether the server supports SMTPUTF8. It's best to use ASCII-only EHLO arguments.

The SMTPUTF8 argument to MAIL FROM has two purposes: Notify the mail server that one or more addresses may contain UTF8, and make sure that the recipient software does not receive a message it will not be able to parse.

Thus, if you send a message to आर्न्ट@यूनिवर्सल.भारत with a cc to example@example.com and the mail software at example.com does not support SMTPUTF8, then only आर्न्ट@यूनिवर्सल.भारत will receive the message. The mail server for example.com will reject the message. This is intentional.
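The client side of this can be sketched with Python's smtplib. The helper names needs_smtputf8 and send, and the EHLO name client.example.com, are assumptions made for this example. It also assumes the message uses CRLF line endings, and that the client asks for SMTPUTF8 only when an envelope address or a header actually contains non-ASCII text.

```python
import smtplib


def needs_smtputf8(sender: str, recipients: list[str], message: bytes) -> bool:
    """Return True if an envelope address or the header block is non-ASCII.

    Hypothetical helper; assumes headers end at the first CRLF CRLF.
    """
    headers, _, _ = message.partition(b"\r\n\r\n")
    addresses_ascii = all(addr.isascii() for addr in [sender, *recipients])
    return not (addresses_ascii and headers.isascii())


def send(host: str, sender: str, recipients: list[str], message: bytes) -> dict:
    """Send message, adding SMTPUTF8 to MAIL FROM only when it is needed."""
    # local_hostname becomes the EHLO argument, so it is kept ASCII-only.
    with smtplib.SMTP(host, 25, local_hostname="client.example.com") as smtp:
        smtp.ehlo()
        options = []
        if needs_smtputf8(sender, recipients, message):
            if not smtp.has_extn("smtputf8"):
                # Better to fail here than to deliver something unparseable.
                raise smtplib.SMTPNotSupportedError("server lacks SMTPUTF8")
            options.append("SMTPUTF8")
        return smtp.sendmail(sender, recipients, message, mail_options=options)
```

Failing early when the server lacks SMTPUTF8 mirrors the intentional rejection described above: the message is refused rather than handed to software that cannot parse it.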

An MTA needs to do an IDN conversion (e.g. from blåbærsyltetøy.gulbrandsen.priv.no to xn--blbrsyltety-y8ao3x.gulbrandsen.priv.no) as part of MX lookup. A client that connects to its local server doesn't need even that.
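A minimal sketch of that conversion in Python, using the built-in idna codec. The function name ascii_domain is made up for this example. Note that the built-in codec implements the older IDNA 2003 rules, so an MTA that needs IDNA 2008 behaviour would have to use something else.

```python
def ascii_domain(address: str) -> str:
    """Return the domain of an email address in ASCII form for MX lookup.

    Hypothetical helper; uses Python's built-in IDNA 2003 codec.
    """
    # Split at the last "@" so a quoted local part containing "@" is not cut.
    _, _, domain = address.rpartition("@")
    return domain.encode("idna").decode("ascii")


print(ascii_domain("jøran@blåbærsyltetøy.gulbrandsen.priv.no"))
```

Only the domain is converted. The local part stays UTF8 and is passed on untouched.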

Access using IMAP: The server advertises the ENABLE extension and the client sends ENABLE UTF8=ACCEPT (that's legal even if the server advertises only ENABLE). Once the server acknowledges having enabled UTF8=ACCEPT, both server and client can use UTF8 in any quoted string, including folder names, search strings and addresses.

$ telnet imap.example.com 143
Trying 2001:6d8::6942…
Connected to imap.example.com.
Escape character is '^]'.
* OK [CAPABILITY … ENABLE …
a login arnt pils
a OK [CAPABILITY … ENABLE …UTF8=ACCEPT …
b enable utf8=accept
* ENABLED UTF8=ACCEPT
b OK done
c select "Gørrlei"
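The same session can be sketched with Python's imaplib, which has an enable() method. The helper names utf8_accepted and open_mailbox are assumptions made for this example. The mailbox name is quoted naively, which assumes it contains no double quotes or backslashes.

```python
import imaplib


def utf8_accepted(enabled_data: list) -> bool:
    """Return True if the untagged ENABLED response lists UTF8=ACCEPT.

    Hypothetical helper; imaplib returns [None] when no ENABLED line arrived.
    """
    return any(
        item is not None and b"UTF8=ACCEPT" in item.upper()
        for item in enabled_data
    )


def open_mailbox(host: str, user: str, password: str, mailbox: str) -> imaplib.IMAP4:
    """Log in, enable UTF8=ACCEPT, then select a possibly non-ASCII mailbox."""
    imap = imaplib.IMAP4(host, 143)
    imap.login(user, password)
    # ENABLE must be sent after login and before SELECT.
    typ, data = imap.enable("UTF8=ACCEPT")
    if typ != "OK" or not utf8_accepted(data):
        imap.logout()
        raise RuntimeError("server did not enable UTF8=ACCEPT")
    # From here on imaplib sends string arguments as UTF-8.
    imap.select(f'"{mailbox}"')
    return imap
```

Checking the ENABLED response rather than just the tagged OK matters, because a server may answer OK without having enabled anything.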

Testing: Gmail supports this for SMTP, IMAP and webmail. The jøran@… address is an autoresponder; you can send mail to it and will receive a reply in a few seconds. Blåbærsyltetøy means blueberry jam and includes all three of the special letters used in Norwegian, æ, ø and å, so it's often used as a test word.

There are more details, but this is 90% of what's needed to write a correct implementation.