Arnt Gulbrandsen

The one-minute guide to implementing unicode email addresses

The unicode email address extensions are pleasantly simple to implement. Here is an overview of the RFCs and some notes I made while doing my first implementations; this posting is a very brief description of the protocol and format extensions involved. Despite its brevity it's nearly complete, because these extensions are so simple.

Mail message format: UTF8 is now permitted everywhere. Messages no longer need RFC 2047 encoding, quoted-printable and the like; they can simply use UTF8.

To: Jøran Øygårdvær <jøran@blåbærsyltetøy.gulbrandsen.priv.no>
Subject: Høy på pæra
Content-Type: text/plain; charset=utf-8

Gørrlei av eksempler.

No encoding is necessary anywhere; it is still permitted, just not required. The message above lacks From and Date; apart from that it's correct.
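For readers who want to produce such a message from code, here is a minimal sketch using Python's standard `email` package and its `SMTPUTF8` policy. The From address is a placeholder of my own (the example above has none), and the variable names are arbitrary.

```python
from email import policy
from email.message import EmailMessage
from email.utils import formatdate

# policy.SMTPUTF8 makes the serializer write headers as raw UTF-8
# instead of falling back to RFC 2047 encoded-words.
msg = EmailMessage(policy=policy.SMTPUTF8)
msg["From"] = "Example Sender <sender@example.com>"  # placeholder, not from the post
msg["To"] = "Jøran Øygårdvær <jøran@blåbærsyltetøy.gulbrandsen.priv.no>"
msg["Date"] = formatdate(localtime=True)
msg["Subject"] = "Høy på pæra"
msg.set_content("Gørrlei av eksempler.\n")  # text/plain body in UTF-8

# The bytes that would be handed to an SMTPUTF8-capable server.
raw = msg.as_bytes()
```

The resulting bytes contain the subject, address and body as plain UTF-8, which is the whole point of the extension.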

Sending mail using SMTP: The server advertises the SMTPUTF8 extension, the MAIL FROM command includes the argument SMTPUTF8, and the email addresses can then use UTF8.

$ telnet mx.example.com 25
Trying 2001:6d8::4269…
Connected to mx.example.com
Escape character is '^]'.
220 mx.example.com ESMTP Postfix (3.0.0)
ehlo myhostname
250-mx.example.com
250-PIPELINING

250 SMTPUTF8
mail from:<> smtputf8
250 2.1.0 Ok
rcpt to:<jøran@blåbærsyltetøy.gulbrandsen.priv.no>
250 2.1.5 Ok
data
354 End data with <CR><LF>.<CR><LF>

Note that the EHLO argument is sent before the client knows whether the server supports SMTPUTF8. It's best to use ASCII-only EHLO arguments.

The SMTPUTF8 argument to MAIL FROM has two purposes: Notify the mail server that one or more addresses may contain UTF8, and make sure that the recipient software does not receive a message it will not be able to parse.

Thus, if you send a message to आर्न्ट@यूनिवर्सल.भारत with a cc to example@example.com and the mail software at example.com does not support SMTPUTF8, then only आर्न्ट@यूनिवर्सल.भारत will receive the message. The mail server for example.com will reject the message. This is intentional.
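The decision a sending client has to make can be sketched roughly as follows. This is an illustration in Python, not code from any of the servers mentioned here; the function names and the `server_extensions` set (the keywords from the EHLO reply) are my own assumptions.

```python
def needs_smtputf8(envelope_addresses, header_text=""):
    """Return True if an address or the message header contains non-ASCII."""
    return any(not text.isascii() for text in (*envelope_addresses, header_text))


def mail_from_command(sender, recipients, server_extensions, header_text=""):
    """Build the MAIL FROM line, adding SMTPUTF8 only when it is needed.

    server_extensions is the set of EHLO keywords, uppercased (hypothetical input).
    """
    if not needs_smtputf8([sender, *recipients], header_text):
        return f"MAIL FROM:<{sender}>"
    if "SMTPUTF8" not in server_extensions:
        # There is no downgrade: this server cannot be given the message.
        raise ValueError("server does not support SMTPUTF8")
    return f"MAIL FROM:<{sender}> SMTPUTF8"


# Mirrors the telnet session above: null sender, unicode recipient.
command = mail_from_command(
    "", ["jøran@blåbærsyltetøy.gulbrandsen.priv.no"], {"PIPELINING", "SMTPUTF8"}
)
```

In real Python code, `smtplib.SMTP.has_extn("smtputf8")` reports whether the server advertised the extension, and `sendmail()` accepts `mail_options=["SMTPUTF8"]`.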

An MTA needs to do an IDN conversion (e.g. from blåbærsyltetøy.gulbrandsen.priv.no to xn--blbrsyltety-y8ao3x.gulbrandsen.priv.no) as part of MX lookup. A client that connects to its local server doesn't need even that.
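As a sketch of that conversion, Python's built-in `idna` codec can map the domain to its ASCII form. Note the assumption: the built-in codec implements the older IDNA 2003 rules, which give the same result for this domain; the function name is mine.

```python
def domain_for_mx_lookup(address):
    """Return the ASCII form of the domain part of an email address."""
    _localpart, _, domain = address.rpartition("@")
    # Built-in codec, IDNA 2003. Only the domain is converted; the
    # localpart is never touched.
    return domain.encode("idna").decode("ascii")


ascii_domain = domain_for_mx_lookup("jøran@blåbærsyltetøy.gulbrandsen.priv.no")
```

The result is used only for the DNS query. The address sent in RCPT TO stays in UTF8.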

Access using IMAP: The server advertises the ENABLE extension, the client sends ENABLE UTF8=ACCEPT (that's legal even if the server advertises only ENABLE), the server acknowledges having enabled UTF8=ACCEPT, and from that point, both server and client can use UTF8 for any quoted string, including folder names, search strings and addresses.

$ telnet imap.example.com 143
Trying 2001:6d8::6942…
Connected to imap.example.com.
Escape character is '^]'.
* OK [CAPABILITY … ENABLE …
a login arnt pils
a OK [CAPABILITY … ENABLE …UTF8=ACCEPT …
b enable utf8=accept
* ENABLED UTF8=ACCEPT
b OK done
c select "Gørrlei"
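The client side of that exchange is small. Here is a rough Python sketch of the two decisions involved; the function names are hypothetical and the inputs are plain strings rather than a live connection.

```python
def enable_utf8_command(tag, capabilities):
    """Return the ENABLE command to send, or None if the server lacks ENABLE."""
    caps = {c.upper() for c in capabilities}
    if "ENABLE" not in caps:
        return None
    # Legal even if UTF8=ACCEPT itself was not advertised.
    return f"{tag} ENABLE UTF8=ACCEPT"


def utf8_was_enabled(untagged_responses):
    """True if an untagged ENABLED response lists UTF8=ACCEPT."""
    for line in untagged_responses:
        words = line.upper().split()
        if words[:2] == ["*", "ENABLED"] and "UTF8=ACCEPT" in words[2:]:
            return True
    return False


command = enable_utf8_command("b", ["IMAP4rev1", "ENABLE"])
enabled = utf8_was_enabled(["* ENABLED UTF8=ACCEPT"])
```

Only after the server has confirmed with ENABLED should the client send UTF8 in quoted strings. Python's own `imaplib.IMAP4.enable("UTF8=ACCEPT")` wraps the same command.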

Testing: Gmail supports this for SMTP, IMAP and webmail. The jøran@… address is an autoresponder: you can send mail to it and will receive a reply in a few seconds. Blåbærsyltetøy means blueberry jam and includes all three special letters used in Norwegian, æ, ø and å, so it's often used as a test word.

There are more details, but this is 90% of what's needed to write a correct implementation.

Using procmail as an autoresponder

Procmail is old and almost forgotten, but still works well. This short script is the autoresponder for jøran@blåbærsyltetøy.gulbrandsen.priv.no:

:0 c
/tmp/jøranmail

:0
* !^FROM_DAEMON
* !^X-Loop:
* !^Auto-Submitted:
| (formail -r -t -I"From: Jøran Øygårdvær " -A"Auto-Submitted: auto-replied" -A"Mime-Version: 1.0" -A"Content-Type: text/plain; charset=utf-8" -A"Content-Transfer-Encoding: 8bit" ; echo "Liker du blåbærsyltetøy? Jeg synes blåbærsyltetøy er veldig godt." ; echo ; echo "-- " ; echo "Jøran") | /usr/sbin/sendmail -t

The first clause stores all incoming mail in /tmp/jøranmail just in case it's needed for debugging. The second clause filters out three kinds of mail that should not receive an autoreply. For messages that pass all three hurdles, it runs formail with many arguments to create an EAI-compliant autoreply header, echo to write a brief reply, and sends the result back.
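For those who don't read procmail fluently, the three hurdles look roughly like this in Python. This is a loose approximation: procmail's `^FROM_DAEMON` is a much broader pattern than the small sets below, which are my own simplification.

```python
from email import message_from_string
from email.utils import parseaddr

# Rough stand-in for procmail's ^FROM_DAEMON; the real pattern matches more.
DAEMON_LOCALPARTS = {"mailer-daemon", "postmaster", "daemon", "mailer", "root"}
BULK_PRECEDENCE = {"bulk", "junk", "list"}


def should_autoreply(msg):
    """Return True if msg passes the three loop-prevention checks."""
    # Header lookups on a Message object are case-insensitive.
    if "X-Loop" in msg or "Auto-Submitted" in msg:
        return False
    if str(msg.get("Precedence", "")).strip().lower() in BULK_PRECEDENCE:
        return False
    _name, addr = parseaddr(str(msg.get("From", "")))
    localpart = addr.rpartition("@")[0].lower()
    return localpart not in DAEMON_LOCALPARTS


human = message_from_string("From: Someone <someone@example.com>\n\nHello\n")
bounce = message_from_string("From: MAILER-DAEMON@example.com\n\nFailed\n")
```

Here `should_autoreply(human)` is true and `should_autoreply(bounce)` is false, so a bounce never triggers another reply.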

It may not be terribly readable, but it's brief and reliable. That glass is two thirds full, not one third empty.

A new email address

I think mail to आर्न्ट@यूनिवर्सल.भारत should now land in my inbox... wonder how long it'll take before the first spammer manages to spam that address.

Three programs, one feature

It's not something one does often, but I've implemented the same feature in three different programs. Not very different, all are written in the same programming language for the same platform, and all are servers.

Same platform, same language, same task, same developer... you would think the three patches would end up looking similar? They did not, not at all.

The feature I wrote is support for using UTF8 on SMTP, which I've implemented for Postfix, Sendmail and Qmail, which all run on Linux/POSIX systems. I tried to follow the code style for each of them, and surprised myself at how different my code looked.

One patch is well-engineered, prim and proper.

The next is for an amorphous blob of software. The patch is itself amorphous, and makes functions that were already too long even longer. Yet it's half as long as the first patch. The two are, in my own judgment, about equally good: one wins on length, the other on readability, and they're roughly tied overall. This surprised me not a little.

The third is a short, readable patch which one might call an inspired hack. It's much smaller than the others and easily wins on readability too.

It wasn't supposed to be like that, was it? Good engineering shouldn't give the most verbose patch, and the hack shouldn't be the most lucid of the three.

I see two things here:

First: Proper engineering has its value, but perhaps not as much as common wisdom says. Moderately clean code offers almost all of the value of really clean code.

Second: A small program, such as the MVPs that are so fashionable these days, is easy to work with. But ease of modification isn't all: the smallest among the three servers has fallen out of use because the world changed and it stopped being viable.

Some random verbiage on each of the three servers and patches: […More…]

Use UTF8 or Punycode for email addresses?

Unicode addresses in email, such as مثال@مثال.السعودية, can be written using either Punycode or UTF8. (Or, if you're feeling inventive, in another manner you invent.) Which is best?

UTF8 looks like this: From: Arabic Example <مثال@مثال.السعودية>. Punycode might look like this (if it were legal, see below): From: Arabic Example <xn--mgbh0fb@xn--mgbh0fb.xn--mgberp4a5d4ar>

The answer follows from two of the design goals for the unicode email extensions:

  1. Allow UTF8 everywhere
  2. Extend email, don't restrict it

RFC 821 and its successors do not contain any rules such as you MUST NOT put the letter n next to an x, so Punycode is allowed. EAI allows Punycode by virtue of not forbidding what was previously allowed. But the right way is to use UTF8 everywhere. Use UTF8 in the subject field, in the body text, in the address… everywhere! That's allowed, it's a design goal, and it's better than Punycode for four reasons.

First, it's simpler than using Punycode in addresses, 2047 encoding in the subject text and qp/b64 encoding in the body text.

Second, it's very, very readable. A surprising amount of legacy software does the right thing if you send it UTF8, and that goes for humans who read email source too.

Third, Punycode's interpretation is only specified for domains, and if rumour is to be believed, people are using two incompatible encodings for the localpart. (In the example above, the second and third instances of xn-- are specified, but the first is not.) You're permitted to send a punycoded localpart to anyone, but the recipient is not required to interpret it in the way you intend and most do not.

Again, nothing in the standards requires the receiving software to understand what you mean with <xn--mgbh0fb@… The punycode example above will only work by luck.

Fourth, sending Punycode habituates users to accept random hex blobs in addresses. A phisher's dream.

So use UTF8 everywhere in the message. Mapping to Punycode is necessary when doing the MX lookup in order to transmit the message, but only then.
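The asymmetry between domain and localpart shows up clearly in code. In this Python sketch (my own illustration, using the built-in IDNA 2003 codec; the function name is hypothetical), the domain can be mapped back to UTF8 because that mapping is specified, while the localpart has to be left alone.

```python
def to_display_form(address):
    """Convert a Punycode domain to UTF8 but leave the localpart untouched."""
    localpart, at, domain = address.rpartition("@")
    if not domain.isascii():
        return address  # already UTF8, nothing to do
    # Labels without the xn-- prefix pass through unchanged.
    unicode_domain = domain.encode("ascii").decode("idna")
    # No standard says how to decode an xn-- localpart, so it stays as-is.
    return f"{localpart}{at}{unicode_domain}"


shown = to_display_form("xn--mgbh0fb@xn--blbrsyltety-y8ao3x.gulbrandsen.priv.no")
```

The domain comes back as blåbærsyltetøy.gulbrandsen.priv.no, but the localpart is still xn--mgbh0fb, because any decoding of it would be a guess.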

An aside on implementing IETF extensions

Sometimes implementing a new IETF extension RFC makes sense, but quite often not, because being among the first to implement rarely makes sense. Unicode mail is like that: Why bother to implement unicode addresses if you can't send mail to anyone and no one can send mail to you?

Well, Gmail has done it, and in hindsight, of course it makes sense. Half the world's email is on Gmail, so their first customer to use it can already exchange mail with half the world. Google only has to sell Google Apps for Business to some Chinese and Japanese companies who currently use Exchange, and the investment is paid for.

Update: Those first customers don't even have to use unicode addresses. Being the first to use a unicode address is perhaps not much fun, but being able to receive mail from that first user is 99% benefit.

Implementation notes about unicode mail

I've implemented unicode mail three times now; in Postfix (paid for by CNNIC and not yet integrated), in aox and lastly in an old mail reader I'm porting from the Zaurus PDA to Android (unreleased as yet, send me mail if you'd like beta access). This is mostly a random collection of notes and remarks I collected while writing the code.

The specification was produced by an IETF working group called EAI (short for email address internationalisation). The WG produced two generations of RFCs. First, an experimental series which I ignore, then a revised, simplified and improved series. This covers the second generation, which takes the general position that unicode mail is only sent to recipients who understand it. There is no conversion during transport, and (almost) no fallback to ASCII.

RFC 6530 is an overview/introduction. It points to the other documents, and has some extra text. Worth reading.

6531 describes how unicode addresses are used with SMTP: MAIL FROM, RCPT TO and VRFY accept UTF8 addresses, and there's a safeguard to provoke a syntax error in case a unicode message body would otherwise reach someone who cannot accept it.

A unicode email autoresponder

I've set up a test address for the SMTPUTF8 extension created by the IETF EAI working group.

If you send mail to jøran@blåbærsyltetøy.gulbrandsen.priv.no Jøran will send you a stock reply, which you can use to test that unicode mail works in both directions.

For the moment you must be able to send via IPv6. Jøran can send the reply back via either IPv4 or IPv6, but you have to send the initial message via IPv6. I intend to add a v4-capable secondary MX later.

I have or can arrange other testing too; send me mail if you're interested.

Test messages for unicode mail addresses (EAI)

EAI is a set of RFCs to enable unicode email addresses. jøran@example.com and even jøran@blåbærsyltetøy.no are syntactically valid email addresses. There are RFCs to extend the email message syntax, to transmit these messages via SMTP, access them via POP and IMAP, and to provide read access by unextended IMAP/POP clients.

I wrote a set of test messages for EAI this morning and put them on GitHub. Feel free to send me extensions and corrections.

Email address internationalisation

EAI is a set of RFCs that provide non-ASCII email addresses, such as pål@eksempel.no. I looked at them with a view to implementing that in Archiveopteryx.

The good news: It's simple and sane.

The bad news: I can tell it's possible to spend a lot of time arguing about minor side issues.

On good and bad RFCs

The worst of the nine RFCs I have written is doubtlessly 5465, IMAP NOTIFY, which should have been good but is a disaster. Its main characteristics are that it's complex (both in terms of number of rules in the RFC and the number of features needed in a server), that it's much more complex than the first of its input documents, and that no one implements it.

My best may be 4978, IMAP COMPRESS=DEFLATE, which is much shorter than 5465, roughly as complex as its first draft version was, contains an informative section with implementation advice, and is widely implemented.