Arnt Gulbrandsen
2017-12-13

The one-minute guide to implementing Unicode email addresses

The Unicode email address extensions are pleasantly simple to implement. Here is an overview of the RFCs and some notes I made while doing my first implementations. This posting is a very brief description of the protocol and format extensions involved; despite its brevity it's nearly complete, because these extensions are so simple.

Mail message format: UTF8 is now permitted everywhere. Messages no longer need RFC 2047 encoding, quoted-printable and the like; they can contain UTF8 directly.

    To: Jøran Øygårdvær <jøran@blåbærsyltetøy.gulbrandsen.priv.no>
    Subject: Høy på pæra
    Content-Type: text/plain; charset=utf-8

    Gørrlei av eksempler.

No encoding is necessary anywhere. The message above lacks From and Date; apart from that it's correct.
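As a rough illustration of the format, here is a minimal sketch that builds a similar message with Python's standard email package. The From address is a placeholder I made up, and build_message is just an illustrative name; the SMTPUTF8 policy and the cte argument are part of the standard library.

```python
from email.message import EmailMessage
from email.policy import SMTPUTF8
from email.utils import formatdate


def build_message() -> EmailMessage:
    """Build a message whose headers and body are plain UTF-8."""
    # The SMTPUTF8 policy has utf8=True, so headers are written as raw
    # UTF-8 instead of RFC 2047 encoded-words.
    msg = EmailMessage(policy=SMTPUTF8)
    msg["From"] = "Arnt <arnt@example.com>"  # placeholder, not from the post
    msg["To"] = "Jøran Øygårdvær <jøran@blåbærsyltetøy.gulbrandsen.priv.no>"
    msg["Subject"] = "Høy på pæra"
    msg["Date"] = formatdate(localtime=True)
    # cte="8bit" keeps the body as UTF-8 rather than base64 or
    # quoted-printable.
    msg.set_content("Gørrlei av eksempler.\n", cte="8bit")
    return msg


if __name__ == "__main__":
    import sys

    # Write the raw bytes so the output does not depend on the terminal's
    # text encoding.
    sys.stdout.buffer.write(build_message().as_bytes())
```

The serialized message should contain no "=?" encoded-word markers, which is an easy thing to check in a test.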

Sending mail using SMTP: The server advertises the SMTPUTF8 extension, the MAIL FROM command includes the argument SMTPUTF8, and the email addresses can then use UTF8.

    $ telnet mx.example.com 25
    Trying 2001:6d8::4269…
    Connected to mx.example.com
    Escape character is '^]'.
    220 mx.example.com ESMTP Postfix (3.0.0)
    ehlo myhostname
    250-mx.example.com
    250-PIPELINING
    …
    250 SMTPUTF8
    mail from:<> smtputf8
    250 2.1.0 Ok
    rcpt to:<jøran@blåbærsyltetøy.gulbrandsen.priv.no>
    250 2.1.5 Ok
    data
    354 End data with <CR><LF>.<CR><LF>

Note that the EHLO argument is sent before the client knows whether the server supports SMTPUTF8. It's best to use ASCII-only EHLO arguments.

The SMTPUTF8 argument to MAIL FROM has two purposes: Notify the mail server that one or more addresses may contain UTF8, and make sure that the recipient software does not receive a message it will not be able to parse.
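The decision about when to add the argument can be sketched as a small helper. This is only an illustration: mail_from_command and its parameters are names I invented, and the utf8_headers flag stands for "the message itself contains UTF8 in its header fields", which the caller has to determine.

```python
def mail_from_command(sender: str, recipients: list[str],
                      server_extensions: set[str],
                      utf8_headers: bool = False) -> str:
    """Build the MAIL FROM line, adding SMTPUTF8 when the message needs it."""
    # SMTPUTF8 is needed if any envelope address is non-ASCII, or if the
    # caller says the message headers contain UTF8.
    needs_utf8 = utf8_headers or any(
        not addr.isascii() for addr in [sender, *recipients]
    )
    if not needs_utf8:
        return f"MAIL FROM:<{sender}>"
    # Extension keywords are case-insensitive, so normalise before checking.
    if "SMTPUTF8" not in {ext.upper() for ext in server_extensions}:
        raise RuntimeError("server does not advertise SMTPUTF8")
    return f"MAIL FROM:<{sender}> SMTPUTF8"


if __name__ == "__main__":
    print(mail_from_command("", ["jøran@blåbærsyltetøy.gulbrandsen.priv.no"],
                            {"PIPELINING", "SMTPUTF8"}))
```

In real Python code there is usually no need to build the command by hand: smtplib's sendmail() accepts mail_options=["SMTPUTF8"], and send_message() adds the option itself when it sees non-ASCII addresses.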

Thus, if you send a message to आर्न्ट@यूनिवर्सल.भारत with a cc to example@example.com and the mail software at example.com does not support SMTPUTF8, then only आर्न्ट@यूनिवर्सल.भारत will receive the message. The mail server for example.com will reject the message. This is intentional.
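The outcome in that scenario can be modelled in a few lines. This is a toy model under stated assumptions, not how an MTA is written: split_recipients is a hypothetical name, and the set of domains whose servers advertise SMTPUTF8 is supplied by hand, where a real MTA learns it from each server's EHLO response.

```python
def split_recipients(recipients: list[str],
                     smtputf8_domains: set[str]) -> tuple[list[str], list[str]]:
    """Return (accepted, rejected) recipients for one message."""
    # One non-ASCII address is enough to make the whole message an
    # SMTPUTF8 message, including the copies sent to ASCII-only addresses.
    needs_utf8 = any(not rcpt.isascii() for rcpt in recipients)
    accepted, rejected = [], []
    for rcpt in recipients:
        domain = rcpt.rpartition("@")[2]
        if needs_utf8 and domain not in smtputf8_domains:
            rejected.append(rcpt)
        else:
            accepted.append(rcpt)
    return accepted, rejected


if __name__ == "__main__":
    ok, failed = split_recipients(
        ["आर्न्ट@यूनिवर्सल.भारत", "example@example.com"],
        {"यूनिवर्सल.भारत"},  # assumed: only this domain's server supports SMTPUTF8
    )
    print("delivered:", ok)
    print("rejected:", failed)
```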

An MTA needs to do an IDN conversion (e.g. from blåbærsyltetøy.gulbrandsen.priv.no to xn--blbrsyltety-y8ao3x.gulbrandsen.priv.no) as part of MX lookup. A client that connects to its local server doesn't need even that.
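For illustration, the conversion can be done with Python's built-in idna codec. The function names are mine, and note the caveat: that codec implements the older IDNA 2003 rules, so a production MTA may prefer a library that follows the current IDNA specification. Only the domain is converted; the local part stays UTF8.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert a possibly non-ASCII domain to its ASCII (xn--) form."""
    # Assumption: the stdlib "idna" codec (IDNA 2003) is good enough here.
    return domain.encode("idna").decode("ascii")


def mx_lookup_name(address: str) -> str:
    """Return the domain to use for the MX lookup of an email address."""
    # Split on the last "@": everything after it is the domain.
    domain = address.rpartition("@")[2]
    return to_ascii_domain(domain)


if __name__ == "__main__":
    print(mx_lookup_name("jøran@blåbærsyltetøy.gulbrandsen.priv.no"))
```

For the address used in this post the result should be the xn-- name quoted above.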

Access using IMAP: The server advertises the ENABLE extension, the client sends ENABLE UTF8=ACCEPT (that's legal even if the server advertises only ENABLE), the server acknowledges having enabled UTF8=ACCEPT, and from that point, both server and client can use UTF8 for any quoted string, including folder names, search strings and addresses.

    $ telnet imap.example.com 143
    Trying 2001:6d8::6942…
    Connected to imap.example.com.
    Escape character is '^]'.
    * OK [CAPABILITY … ENABLE …
    a login arnt pils
    a OK [CAPABILITY … ENABLE …UTF8=ACCEPT …
    b enable utf8=accept
    * ENABLED UTF8=ACCEPT
    b OK done
    c select "Gørrlei"
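The client-side logic in that exchange is small enough to sketch as two helpers. Both function names are hypothetical, and the parsing is deliberately naive (whitespace splitting only); it is meant to show the two decisions involved, not to be a real IMAP parser.

```python
from typing import Optional


def enable_command(tag: str, capabilities: set[str]) -> Optional[str]:
    """Return the ENABLE command to send, or None if ENABLE is unavailable."""
    # As noted above, the server only needs to advertise ENABLE; it does
    # not have to list UTF8=ACCEPT for the client to ask for it.
    if "ENABLE" in {cap.upper() for cap in capabilities}:
        return f"{tag} ENABLE UTF8=ACCEPT"
    return None


def utf8_enabled(responses: list[str]) -> bool:
    """Check whether the server acknowledged UTF8=ACCEPT."""
    for line in responses:
        words = line.upper().split()
        # The acknowledgement is an untagged "* ENABLED ..." response.
        if words[:2] == ["*", "ENABLED"] and "UTF8=ACCEPT" in words[2:]:
            return True
    return False


if __name__ == "__main__":
    print(enable_command("b", {"IMAP4rev1", "ENABLE"}))
    print(utf8_enabled(["* ENABLED UTF8=ACCEPT", "b OK done"]))
```

Only after utf8_enabled() returns true should the client send UTF8 in quoted strings. Python's own imaplib wraps the same step in IMAP4.enable("UTF8=ACCEPT").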

Testing: Gmail supports this for SMTP, IMAP and webmail. The jøran@… address is an autoresponder; you can send mail to it and will receive a reply in a few seconds. Blåbærsyltetøy means blueberry jam and includes all three of the special letters used in Norwegian (æ, ø and å), so it's often used as a test word.

There are more details, but this is 90% of what's needed to write a correct implementation.