Detecting character encodings

Archiveopteryx often needs to massage incoming mail to make it syntactically valid. 99% may be valid, but 1% is still a lot. One of the chores is to guess how a message is encoded: Unicode, ISO-8859-x or what? For that, Archiveopteryx uses a novel and good algorithm.

Almost every text uses a few very common words. Almost, every, text and so on are all among the 10,000 most common English words. If the common words in a language use unusual (non-ASCII) letters, then texts written in that language will generally contain these words, and therefore these letters.

Archiveopteryx contains lists of common words for six languages, in which each word is encoded using each of the character encodings commonly used for that language. Taking German as an example, für and groß are on the list, encoded using ISO 8859-1, Mac-Roman and IBM Code Page 437. These two words account for six list entries.
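To make the shape of that list concrete, here is a minimal sketch in Python. It is not the Archiveopteryx code. The names COMMON_WORDS, ENCODINGS and build_table are made up for illustration, and the codec names are the ones Python's standard library uses for ISO 8859-1, Mac-Roman and Code Page 437.

```python
# Illustrative sample data only; the real lists are far larger.
COMMON_WORDS = {"de": ["für", "groß"]}
ENCODINGS = {"de": ["iso-8859-1", "mac_roman", "cp437"]}


def build_table(words, encodings):
    """Map each encoded byte form of a word to the encodings that produce it."""
    table = {}
    for lang, word_list in words.items():
        for word in word_list:
            for enc in encodings[lang]:
                try:
                    encoded = word.encode(enc)
                except UnicodeEncodeError:
                    # This encoding cannot represent the word; skip it.
                    continue
                table.setdefault(encoded, set()).add(enc)
    return table


TABLE = build_table(COMMON_WORDS, ENCODINGS)
```

With these two words and three encodings, TABLE ends up with six distinct byte strings as keys, matching the six list entries counted above.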

When Archiveopteryx needs to guess how a particular message is encoded, it takes each word that contains non-ASCII bytes and looks it up in that list. If the message is in German, the chances are good that it'll contain für, groß or another word from the list. If Archiveopteryx sees that one or more words from its list are used, and that the list entries in question use Mac-Roman, then it concludes that the message uses Mac-Roman, and stores it accordingly.
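The lookup step might be sketched like this in Python. Again, this is an illustration under my own assumptions rather than the real code: KNOWN_FORMS and guess_encoding are hypothetical names, the word splitting is deliberately crude, and picking the encoding with the most matching words is one plausible way to turn "one or more words" into a decision.

```python
import re
from collections import Counter

# Tiny illustrative table: encoded byte form -> encoding name.
# These six forms happen to be distinct; a real table must handle collisions.
KNOWN_FORMS = {
    word.encode(enc): enc
    for word in ("für", "groß")
    for enc in ("iso-8859-1", "mac_roman", "cp437")
}


def guess_encoding(data: bytes):
    """Return the encoding whose listed words occur most often, or None."""
    votes = Counter()
    # Treat runs of ASCII letters and high bytes as words.
    for word in re.findall(rb"[A-Za-z\x80-\xff]+", data):
        if all(byte < 0x80 for byte in word):
            continue  # pure ASCII words say nothing about the encoding
        # bytes.lower() only touches ASCII letters, which is enough here.
        enc = KNOWN_FORMS.get(word.lower())
        if enc is not None:
            votes[enc] += 1
    return votes.most_common(1)[0][0] if votes else None


message = "Das ist für dich".encode("mac_roman")
guessed = guess_encoding(message)
```

When no listed word appears, the function returns None, which is where the other approaches would have to take over.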

The full list in Archiveopteryx contains about 6,000 words and a total of 20,283 encoded forms. If you look closer at the actual code, you'll see that it also tries several other approaches, not only this one. What I've described here is just the novel aspect, the one worth writing about.

(I now toast myself for having struck a six-year-old item off my todo list.)