Use UTF8 or Punycode for email addresses?
Unicode addresses in email, such as مثال@مثال.السعودية, can be written using either Punycode or UTF8. (Or, if you're feeling inventive, in some other manner of your own devising.) Which is best?
UTF8 looks like this:
From: Arabic Example <مثال@مثال.السعودية>
Punycode might look like this (if it were fully specified, see below):
From: Arabic Example <xn--mgbh0fb@xn--mgbh0fb.xn--mgberp4a5d4ar>
The answer follows from two of the design goals for the Unicode email extensions:
- Allow UTF8 everywhere
- Extend email, don't restrict it
RFC 821 and its successors do not contain any rules such as "you MUST NOT put the letter n next to an x", so Punycode is allowed. EAI (the email address internationalization extensions) allows Punycode by virtue of not forbidding what was previously allowed. But the right way is to use UTF8 everywhere. Use UTF8 in the subject field, in the body text, in the address… everywhere! That's allowed, it's a design goal, and it's better than Punycode for four reasons.
First, it's simpler than using Punycode in addresses, RFC 2047 encoding in the subject text and quoted-printable or base64 encoding in the body text.
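To make the difference concrete, here is a minimal sketch using Python's standard email package, which is this example's choice and not something the extensions prescribe. It serializes the same subject once under the library's traditional SMTP policy and once under its SMTPUTF8 policy. The helper name build_message and the sample subject are assumptions made for illustration.

```python
from email import policy
from email.message import EmailMessage


def build_message(mail_policy, sender=None) -> bytes:
    """Serialize a small message under the given policy (hypothetical helper)."""
    msg = EmailMessage(policy=mail_policy)
    if sender is not None:
        msg["From"] = sender
    msg["Subject"] = "مثال"
    # The body is only here to make the message complete; the library
    # chooses its transfer encoding.
    msg.set_content("مثال\n")
    return msg.as_bytes()


# Traditional policy: the non-ASCII subject becomes an RFC 2047 encoded word.
legacy = build_message(policy.SMTP)

# SMTPUTF8 policy: headers, including the address, are written as raw UTF8.
utf8 = build_message(policy.SMTPUTF8, sender="Arabic Example <مثال@مثال.السعودية>")

for line in utf8.decode("utf-8").splitlines():
    print(line)
```

Under the SMTPUTF8 policy the printed headers read the same as what the sender typed, whereas the legacy output carries the subject as an encoded word starting with =?utf-8?.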
Second, it's very, very readable. A surprising amount of legacy software does the right thing if you send it UTF8, and that goes for humans who read email source too.
Third, Punycode's interpretation is only specified for domains, and if rumour is to be believed, people are using two incompatible encodings for the localpart. (In the example above, the second and third instances of xn-- are specified, but the first is not.) You're permitted to send a punycoded localpart to anyone, but the recipient is not required to interpret it in the way you intend and most do not.
Again, nothing in the standards requires the receiving software to understand what you mean by <xn--mgbh0fb@… The Punycode example above will work only by luck.
Fourth, sending Punycode habituates users to accept random-looking blobs in addresses. A phisher's dream.
So use UTF8 everywhere in the message. Mapping to Punycode is necessary when doing the MX lookup in order to transmit the message, but only then.
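As a rough sketch of that last step, the following Python converts only the domain of an address to its Punycode form for the DNS lookup and leaves the localpart alone. The function name split_for_lookup is made up for this example. Note also that Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat this as an illustration, not as production-grade IDNA handling.

```python
def split_for_lookup(address: str) -> tuple[str, str]:
    """Return (localpart, ascii_domain) for a UTF8 address (hypothetical helper).

    Only the domain is mapped to Punycode; the localpart is returned
    untouched, because no standard defines a Punycode form for it.
    """
    localpart, _, domain = address.rpartition("@")
    # Built-in codec, IDNA 2003 rules: an assumption that is good enough for a sketch.
    ascii_domain = domain.encode("idna").decode("ascii")
    return localpart, ascii_domain


localpart, lookup_name = split_for_lookup("مثال@مثال.السعودية")
print(lookup_name)  # the name to use for the MX query
print(localpart)    # still UTF8, exactly as written
```

The message itself keeps the UTF8 address. The ASCII name exists only for the duration of the lookup.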