Arnt Gulbrandsen

Detecting character encodings

Archiveopteryx often needs to massage incoming mail to make it syntactically valid. 99% may be valid, but 1% is still a lot. One of the chores is to guess how a message is encoded: Unicode, ISO-8859-x or what? For that Archiveopteryx uses a novel and good algorithm.

Almost every text uses a few very common words. Almost, every, text and so on are all among the 10,000 most common English words. If the common words in a language use unusual letters, then texts written in that language will generally contain these words, and these letters.

Archiveopteryx contains lists of common words for six languages, in which each word is encoded using each of the character encodings commonly used for that language. Taking German as an example, für and groß are on the list, encoded using ISO 8859-1, Mac-Roman and IBM Code Page 437. These two words account for six list entries.

When Archiveopteryx needs to guess how a particular message is encoded, it takes each non-ASCII word and looks it up in that list. If the message is in German, the chances are good that it'll contain für, groß or another word from the list. If Archiveopteryx sees that one or more words from its list are used, and that the list entries in question use Mac-Roman, then it concludes that the message uses Mac-Roman, and stores it accordingly.
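As a rough illustration of the lookup, here is a minimal Python sketch. It is not the Archiveopteryx code: the two-word list, the three encodings and the names build_table and guess_encoding are assumptions made up for this example, and the tie-breaking is simply whatever comes first.

```python
from collections import Counter

# Hypothetical miniature word list: just the two German words from the text.
COMMON_WORDS = ["für", "groß"]
# Python codec names for ISO 8859-1, Mac-Roman and IBM Code Page 437.
ENCODINGS = ["iso-8859-1", "mac_roman", "cp437"]


def build_table(words, encodings):
    """Map each encoded form (bytes) to the set of encodings that produce it."""
    table = {}
    for word in words:
        for enc in encodings:
            table.setdefault(word.encode(enc), set()).add(enc)
    return table


def guess_encoding(raw, table):
    """Return the encoding with the most matching non-ASCII words, or None."""
    votes = Counter()
    for token in raw.split():
        # Drop ASCII punctuation stuck to the word, e.g. a trailing comma.
        token = token.strip(b".,;:!?()\"'")
        if not token or token.isascii():
            continue  # only non-ASCII words say anything about the encoding
        for enc in table.get(token, ()):
            votes[enc] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


TABLE = build_table(COMMON_WORDS, ENCODINGS)
```

With this table, guess_encoding("Das ist für dich".encode("mac_roman"), TABLE) picks mac_roman, and a message with no listed word yields None, which is where other approaches would have to take over.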

The full list in Archiveopteryx contains about 6,000 words and a total of 20,283 encoded forms. If you look closer at the actual code, you'll see that it also tries several other approaches, not only this one. What I've described here is just the novel aspect, the one worth writing about.

(I now toast myself for having struck a six-year-old item off my todo list.)


Ubuntu 9.10 on the Lifebook S4572

Sad to say, but I recently installed Ubuntu 9.10 (Karmic Koala) on a Fujitsu Siemens Lifebook S4572. I installed the minimal system followed by xubuntu-desktop and gcompris: Xfce is supposed to be better for small boxes and gcompris is the whole point. (more…)


The inverted pyramid

The inverted pyramid is how journalists are taught to write typical news stories. It puts the most newsworthy information at the top, and then the remaining information follows in order of importance, with the least important at the bottom. (The quote is from Chip Scanlan.)

This is the right way to write most API documentation. (more…)


Companion techniques

Another finding-my-thoughts posting. Documenting helps fix bugs early, but what are its companion techniques?

Polishing the code all the time. Make it look good. Beauty is truth, truth is beauty. (more…)

Comments on the LWN thread

LWN ran an article on Archiveopteryx. Some points.

It's BSD-licensed, not OSL-licensed. It was OSL-licensed until last year.

One commenter opined that we might not have tested big mailboxes. Well, we have. Not sure exactly how big. We've routinely tested up to a million, bigger occasionally. At a million it's quite simple: Most mail readers fall over. (more…)


Reading someone's documentation

Written with doxygen. It's funny to see how early qdoc syntax is still there. Features I added because they seemed to make sense, then discarded later, when I saw they didn't carry their load.

And there they are, still in use. Maybe funny isn't the right word.


Popcorn Hour A-110

The Popcorn Hour A-110 is a small silent box that can play ISOs and other movies scaled up to 1920×1080 resolution. We have one, diskless, connected to our NAS via Gigabit Ethernet, (more…)


From: Charlie Root <root@…> — but which Charlie?

I hate it when different, independent computers all send me mail from my close friend Charlie Root. Here's an aox hack to ease the pain. (more…)