Implementation notes about unicode mail

I've implemented unicode mail three times now: in Postfix (paid for by CNNIC and not yet integrated), in aox, and lastly in an old mail reader I'm porting from the Zaurus PDA to Android (unreleased as yet; send me mail if you'd like beta access). This is mostly a random collection of notes and remarks I gathered while writing the code.

The specification was produced by an IETF working group called EAI (short for email address internationalisation). The WG produced two generations of RFCs: first an experimental series, which I ignore, then a revised, simplified and improved series. These notes cover the second generation, which takes the general position that unicode mail is only sent to recipients who understand it. There is no conversion during transport, and (almost) no fallback to ASCII.

RFC 6530 is an overview/introduction. It points to the other documents, and has some extra text. Worth reading.

6531 describes how unicode addresses are used with SMTP: MAIL FROM, RCPT TO and VRFY accept UTF8 addresses, and there's a safeguard to provoke a syntax error in case a unicode message body would otherwise reach someone who cannot accept it.
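A minimal sketch of that safeguard as seen from the client side, assuming the client already has the list of extensions the server advertised in its EHLO response. The function names and arguments are hypothetical; only the SMTPUTF8 keyword comes from 6531. The point is that a unicode message is either sent with the SMTPUTF8 parameter or not sent at all.

```python
from typing import Iterable


def needs_smtputf8(sender: str, recipients: Iterable[str], headers_are_utf8: bool) -> bool:
    """True if the envelope or the message headers contain non-ASCII text."""
    addresses = [sender, *recipients]
    return headers_are_utf8 or not all(a.isascii() for a in addresses)


def mail_from_command(sender: str, recipients: Iterable[str],
                      headers_are_utf8: bool,
                      server_extensions: Iterable[str]) -> str:
    """Build the MAIL FROM line, or refuse if the server cannot take the message."""
    if not needs_smtputf8(sender, recipients, headers_are_utf8):
        # Plain ASCII mail: nothing changes.
        return f"MAIL FROM:<{sender}>"
    advertised = {ext.upper() for ext in server_extensions}
    if "SMTPUTF8" not in advertised:
        # No conversion and no fallback: the message must not go to this server.
        raise ValueError("server does not advertise SMTPUTF8")
    return f"MAIL FROM:<{sender}> SMTPUTF8"
```

A real client would bounce or requeue the message in the refusal case rather than raise an exception; the exception just keeps the sketch short.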

6531 and current practice combine to require that while mail addresses can contain UTF8, mail server hostnames should not. jøran@blåbærsyltetøy.gulbrandsen.priv.no works as an address, but the hostname of the server for that address cannot be e.g. mta.blåbærsyltetøy.gulbrandsen.priv.no, it has to be something like eai-test.gulbrandsen.priv.no instead. The main reason for this is that some servers expect that a client's EHLO argument be its hostname, and the EHLO argument is used before the SMTP client knows whether the server supports unicode. It's possible to use xn--blbrsyltety-y8ao3x for EHLO (and Received), but I fear that some servers will do a DNS lookup which automatically decodes to the UTF8 form, and then give the message a spam penalty because of the mismatch. It's better to just avoid that issue.
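If you want to check a configured server name against that advice, something like the following would do. It is only a sketch: the function name is made up, and the rule it enforces (plain ASCII, no xn-- labels) is the cautious reading of the paragraph above, not a requirement quoted from 6531.

```python
def safe_ehlo_name(hostname: str) -> bool:
    """True if hostname is plain ASCII and contains no xn-- label."""
    if not hostname.isascii():
        # UTF8 hostnames cannot be used before the server has announced support.
        return False
    labels = hostname.split(".")
    # xn-- labels are legal, but risk a mismatch if the server decodes them.
    return not any(label.lower().startswith("xn--") for label in labels)
```

With this, eai-test.gulbrandsen.priv.no passes, while both the UTF8 and the xn-- spellings of mta.blåbærsyltetøy.gulbrandsen.priv.no are flagged.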

6532 describes the changes to the message format: UTF8 is allowed in addresses (both the localpart and the domain), in subjects, in attachment filenames, etc. Quoted-printable is a thing of the past. It's good work. Procmail handles the new extensions despite being written in 1991, which I interpret as a sign that the changes fit well into the existing architecture.
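To see what such a message looks like on the wire, here is a small sketch using Python's standard email package, whose SMTPUTF8 policy writes headers as raw UTF8 instead of encoded-words. The addresses and subject are just sample values, and this is one library's take on 6532, not a reference implementation.

```python
from email import policy
from email.message import EmailMessage

# policy.SMTPUTF8 tells the generator to emit headers as UTF8.
msg = EmailMessage(policy=policy.SMTPUTF8)
msg["From"] = "jøran@blåbærsyltetøy.gulbrandsen.priv.no"
msg["To"] = "someone@example.com"
msg["Subject"] = "Blåbærsyltetøy på brødskiva"
msg.set_content("Plain ASCII body for the sake of the example.\n")

# The serialised message carries the address and subject as UTF8 bytes.
raw = msg.as_bytes()
```

Serialising the same message with policy.SMTP instead is a quick way to compare against the legacy form.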

6533 defines the error messages, plus the ORCPT argument used in SMTP. It is the worst of the series. This document defines quite a lot of things I can't imagine how to test or use.

For example: 6531 defines a fine mechanism for ensuring that unicode messages and addresses stay within the set of servers that support it, but 6533 defines an encoding for when addresses stray, which can only happen through noncompliant servers, as far as I can tell.

The other RFCs in the series assume that compliant parsers accept UTF8 addresses in rfc822 email messages, but 6533 defines three new MIME types rather than making the same assumption. And it adds syntax for using UTF8 addresses without using the new MIME types, too.

All of the RFCs use UTF8 all the time and for everything, except that 6533 uses both UTF8 and a different encoding. A unique encoding, not like any I've seen before.

6854 is a tiny message format update to let From use the same address syntax as To and Cc, including From: unicode-address:;. It's not really a part of this document series; it just happened to be needed by 6857 and 6858.

6855 is for IMAP; it's great. Like 6531, it simply lets you use UTF8 in most contexts where you expect to. There's very little optional about it, and the mandatory parts are fairly easy to implement.

6856 is for POP3; it's also great, for the same reasons as 6855: There's just one way to implement it, and that one way lacks unnecessary complications.

6857 is the ambitious way to present unicode mail to unextended IMAP/POP clients. Mail is stored using unicode and converted (as far as possible) to legacy syntax at read time.

6858 is the simple way to present unicode mail to unextended IMAP/POP clients. I wrote that, because I thought 6857-in-spe was too much work and offered too many opportunities for error. In my opinion 6858 is much simpler and delivers results in the same class as 6857: Much better than nothing, but not perfect. Both are good enough to read the mail, neither is good enough to reply.

IETF working groups don't often define two RFCs for the same task. In this case, the general feeling was that one size did not fit all, and that having two specifications was the best way to cater to everyone's taste.

IMAP/POP server implementers should implement either 6857 or 6858, depending on how much effort they want to put into it.

6783 is about mailing lists. I haven't read it yet.

There aren't any test suites or organised interoperability testing. Send me mail if you're interested in such things.

The DNS does not support UTF8 in domains directly, so mail servers/clients must convert e.g. blåbærsyltetøy to xn--blbrsyltety-y8ao3x before doing the MX lookup to find the server. The xn-- form is not used anywhere else in internet mail.
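A sketch of that conversion, assuming Python's built-in idna codec is good enough for the domains in question. That codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a recommendation; the function name is made up.

```python
def mx_lookup_name(domain: str) -> str:
    """Return the ASCII (xn--) form of a domain, for use in a DNS query only."""
    # The codec converts label by label; ASCII labels pass through unchanged.
    return domain.encode("idna").decode("ascii")


# Use the result for the MX lookup and nowhere else.
query_name = mx_lookup_name("blåbærsyltetøy.gulbrandsen.priv.no")
```

The UTF8 form stays in the envelope and the message; only the resolver sees query_name.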

The xn-- form is confusing. The EAI documents only mention it when discussing MX lookups, so implementers (and domain owners) are free to use …@xn--… in email addresses, and equally free not to. Owning a domain does not require using it in any particular way, after all.

Ned Freed (a noted IETF contributor) argues that since xn-- is a DNS concept and xn--blbrsyltety-y8ao3x is equal to blåbærsyltetøy as defined by the DNS, good implementations ought to accept it, much as they accept BLÅBÆRSYLTETØY.

My own opinion is that Ned is right within the DNS scope, but he's allowing the tail to wag the dog. The DNS' rules do not bind higher layers, so e.g. a mail message processor does not have to accept Cc: jøran@xn--blbrs…. I think mail parsers are better if they match human users' understanding of equality, and the same applies to all software subsystems in mail software, except the DNS client.

My advice is twofold: Convert UTF8 to xn-- only when you're about to do an MX lookup, and avoid using xn-- domains where users can see them. Never send an xn-- domain to anyone, and if you have to accept one on input, convert it to UTF8 before showing it to the user.
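The input half of that advice could look roughly like this. Again a sketch: display_address is a hypothetical helper, and it relies on Python's built-in idna codec (IDNA 2003) to do the decoding.

```python
def display_address(address: str) -> str:
    """Return address with any xn-- domain labels converted to UTF8."""
    localpart, at, domain = address.rpartition("@")
    if not at:
        # Not an address with a domain; show it untouched.
        return address
    try:
        shown = domain.encode("ascii").decode("idna")
    except UnicodeError:
        # Already UTF8, or not valid xn--: leave the domain as it was.
        shown = domain
    return f"{localpart}@{shown}"
```

Addresses that already use the UTF8 form pass through unchanged, so it is safe to call this on everything that is about to be displayed.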

Otherwise users will come to accept that xn--blbrsyltety-45abg3 means blåbærsyltetøy whatever the last six characters are, and that's rather too phisher-friendly for my taste.