Monday, December 31, 2007

I Can't Believe It's Not Bodah

A hilariously terrible first date (via Universal Hub.)

Wrong Dates in iCal Birthday Calendar

To keep track of people's birthdays, I use Mac OS X's Birthday Calendar feature of Address Book/iCal. I was going through my calendar the other day, and I noticed that a birthday which I knew was sometime in January wasn't showing up. It was on the corresponding Address Book contact, though. I deleted the birthday from this contact and reentered it, which fixed that entry, but on the suspicion that more birthdays might be missing, I flipped through my calendar and found:

Address Book says Mar 23, iCal says Mar 21

The Address Book birthday field has the misfeature that it forces a year to be specified. What a rude thing for Address Book to be asking! Anyway, I'd arbitrarily picked year 1 for the year for any contacts whose birth years I didn't know. Maybe, I thought, the Gregorian reform was throwing things off. However, changing the year to 1900 didn't help matters, and in fact made them worse:

Address Book says Mar 23, iCal says June 23

Turning the birthday calendar off (which wipes out iCal's backing store for the calendar) and on didn't help matters. A web search turned up some other people having the same problem, but the only useful solution they came up with was deleting and recreating entire contacts by hand.

I wanted to see if the raw data was wrong in Address Book's database. Address Book uses Core Data in a way that makes the database difficult to work with at the SQLite command-line level, so instead I hacked /Developer/Examples/Python/PyObjC/AddressBook/Scripts/ to emit the birthday field by adding ('Birthday', AddressBook.kABBirthdayProperty) to FIELD_NAMES and the following to encodeField:

    elif isinstance(value, AppKit.NSCalendarDate):
        return value.descriptionWithCalendarFormat_("%Y-%m-%d")

It turns out that a number of entries had negative years, e.g. -1900-03-23 instead of 1900-03-23. I'm not sure how this happened, but here's a script (which you can download) to fix it:

# Fix negative birthday years in Address Book.
# This work is hereby released into the Public Domain.
import AddressBook
import AppKit

def personName(person):
    return "%s %s" % (
        person.valueForProperty_(AddressBook.kABFirstNameProperty),
        person.valueForProperty_(AddressBook.kABLastNameProperty))

def formatDate(date):
    return date.descriptionWithCalendarFormat_("%Y-%m-%d")

def fixBirthday(birthday):
    # Returns a corrected date, or None if the birthday needs no fix.
    year = int(birthday.descriptionWithCalendarFormat_("%Y"))
    if year < 0:
        # A year of -N should have been N, so shift forward by 2N years.
        return birthday.dateByAddingYears_months_days_hours_minutes_seconds_(
            -year * 2, 0, 0, 0, 0, 0)
    return None

def fixPersonBirthday(person):
    birthdayProp = AddressBook.kABBirthdayProperty

    birthday = person.valueForProperty_(birthdayProp)
    if birthday is None: return

    fixedBirthday = fixBirthday(birthday)
    if fixedBirthday is not None:
        print("Fixing up %s: %s -> %s" % (
            personName(person), formatDate(birthday), formatDate(fixedBirthday)))
        person.setValue_forProperty_(fixedBirthday, birthdayProp)

book = AddressBook.ABAddressBook.sharedAddressBook()

for person in book.people():
    fixPersonBirthday(person)

# Write the changes back to the Address Book database.
book.save()

Friday, December 28, 2007

Internationalization of Names

Names are complicated

What's in a name? The answer turns out to vary quite widely around the world. When an English-language form, either electronic or paper, asks for a person's name, it usually provides separate fields for first and last name, and sometimes middle name or middle initial. Aristotle Pagaltzis linked to a post by Jim Clark on Thai names, demonstrating that this approach, or even the alternative "given name, family name", falls down pretty quickly outside the English-speaking world. Thai names consist of:

  • A given name, similar to the English first name, except that it must come from a list of government-approved names;
  • A family name, which is also government-regulated; all people with the same family name are related, and new Thai citizens must select an unused name. Like all non-namespaced identifiers (domain names, instant messenger handles, user names on popular web services), the good short ones are taken; and
  • A chue len, which is typically translated as nickname, but according to Mr. Clark is more like an informal given name; it's selected by one's parents or close relatives early in life (though not necessarily at birth).

The obvious mapping of Thai name components onto English, (given name, family name, chue len) → (first name, last name, nickname), doesn't work very well. Consider the former prime minister Thaksin Shinawatra, chue len Meow. His (romanized; more on that later) legal name is Thaksin Shinawatra. If addressing him politely, I would refer to him as Khun Thaksin.1 Note that this is {honorific} {given name}, not {honorific} {family name}; in other words, Mr. Matthew as opposed to Mr. Sachs. His friends and family will call him Meow, not Thaksin or Shinawatra.

A further wrinkle is that when sorting a list of Thai names, the given name, not the family name, should be the sort key. Then there's also the matter that Thaksin Shinawatra, aka Meow isn't really the gentleman's name at all; it's ทักษิณ ชินวัตร, aka แม้ว. There are several standard romanizations for Thai, and whichever one the named individual prefers is considered canonical. There are also other quirks involved in the Thai script form of a name, like the lack of whitespace between the honorific and the given name.

Non-Thai complications

Then there are whole sets of different requirements for other kinds of names. The comments on Jim Clark's blog entry, and this post by Richard Ishida, who's in charge of i18n issues for the W3C, give some other good examples.

  • Russian and Icelandic have gender suffixes on the family name (Fuzaylova for a woman, Fuzaylov for a man; Fjalar Jónsson vs. Katrín Jónsdóttir.)
  • Russian has nicknames (which, like Thai "nicknames", are much more widely used than English nicknames) which are usually (always?) systematically derivable from their given names; Vladimir → Vova.
  • Scandinavian given names typically include spaces, and convention varies as to how acceptable it is to refer to Hans Christian Andersen as Hans vs. Hans Christian. This isn't unheard of in the southern United States, either -- Billy Jean, &c. In some parts of Europe, these multipart given names are hyphenated, as in the Austrian Hans-Christian or the French Jean-Claude.
  • In France and Italy, names can have a comma which essentially divides a series of first names from a series of middle names; in France, the middle names are rarely used outside of legal contexts, while in Italy, the middle names aren't used in legal contexts. A Mario, Alberto Giovanni Rossi would have a legal name of Mario Rossi in Italy, whereas a French Jean, Christophe Dupond would be commonly known as Jean Dupond but legally Jean, Christophe Dupond.
  • Many countries use patronymics instead of stable family names, so a set of related people won't have the same family name.
  • Many Chinese take arbitrary western nicknames for ease of communicating with westerners.
  • Chinese names also have generational markers, so a set of siblings will all have the same "middle" name, and names are written {family}{generational}{given} in Chinese script.

So what?

How much of this do we really need to worry about? When I say that Thai names should be sorted by given name, should, of course, is a horribly loaded term. If an American border control agent pulls up a list of people who have entered the country at a particular point, they probably want the sort key to be Shinawatra, not Thaksin. Mapping (given, family) → (first, last) is also probably fine for this application. So when, exactly, does the extra information need to be preserved?

Some reasons that a system might be interested in a name, or parts of a name, are:

  • Correlating records with other systems
  • Displaying people's names
  • Addressing people in writing ("Dear Mr. Sachs,", "Welcome, Matthew!") or on the phone
  • Identifying people ("To look up your records, enter your name")
  • Searching for people (on, say, a social networking site)
  • Sorting a list of people

For most English applications that don't cater to a large international audience, it might be "good enough" to simply have a flat name field where users can enter either arbitrary names or at least their romanizations.2 A flat name field is much more flexible. Since you probably need to support substring searches anyway, it doesn't lose anything as far as searching's concerned.

If you want to sort by last name, or communicate with other systems that take a (first name, last name) tuple, it might be good enough to just split off the last whitespace-separated token and treat that as the last name.3 If that's not good enough, a pair of (first names, last name) or (given names, family name) inputs may be called for, but characters such as spaces and apostrophes (O'Flannagan) should be valid. If your application wants to try to automatically derive a secondary form of address from the name entered, maybe it shouldn't. Is the ability to have form letters say Mr. Sachs as opposed to Matthew Sachs really worth the faux pas of Mr. Shinawatra? I guess it depends on how international your audience is; you could always ask for multiple forms of address.4
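As a rough sketch of the "split off the last token" approach, here is one way it might look in Python. The function name split_name and the PARTICLES set are made up for illustration (the particle list is the idea from footnote 3, and the entries other than "de" are my own guesses at plausible members), so treat this as a starting point rather than a recommendation.

```python
# Hypothetical set of lowercase tokens that get folded into the last name.
# Only "de" comes from the text; the others are illustrative assumptions.
PARTICLES = {"de", "van", "von", "di"}

def split_name(full_name):
    """Split a flat name into a (first names, last name) pair.

    The last whitespace-separated token is the last name, plus any
    particles immediately before it, as long as at least one token is
    left over for the first name.
    """
    tokens = full_name.split()
    if not tokens:
        return ("", "")
    split_at = len(tokens) - 1
    while split_at > 1 and tokens[split_at - 1].lower() in PARTICLES:
        split_at -= 1
    return (" ".join(tokens[:split_at]), " ".join(tokens[split_at:]))

def sort_by_last_name(names):
    """Sort flat names using the derived last name as the sort key."""
    return sorted(names, key=lambda n: split_name(n)[1].casefold())
```

Note that this happily turns Thaksin Shinawatra into (Thaksin, Shinawatra) and sorts him under S, which is exactly the behavior that's wrong for a Thai audience and probably fine for an American one.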

For applications that want to really get localized names right, like a system-wide address book or a global social networking site, a more complex approach is called for. For instance, the Mac OS X address book framework knows about the address formats for various countries; it could extend that functionality to support different name formats. It has some rudimentary support for this, in that an individual address book entry can have a set of name ordering flags associated with it, either first name first or last name first (sic); name fields are fixed at title, first name, middle name, last name, suffix, nickname, maiden name, and phonetic (first, middle, last) name.

Per-country address format support doesn't change which fields exist, but it changes the order they're displayed in. Per-country name format would need to be more complicated. A Name (which a person might have more than one of with different NameFormats) might consist of:

  • NameFormat, defining the (country, language) associated with the name (e.g. en.US) and the set of available NameComponents
  • A list of (NameComponent, Value, (optional) PhoneticValue)

The system could provide functions like:

  • int Name.compareWith(Name)
  • String Name.representation(NAME_REPRESENTATION) where NAME_REPRESENTATION is one of:
  • Name Name.convertTo(NameFormat) would try to convert to a different name representation using automated rules for things like romanization.
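To make the idea a little more concrete, here is a minimal Python sketch of what a Name and NameFormat might look like. Everything in it is an assumption for illustration: the field names (sort_order, display_order), the component names ("given", "family"), and the use of dictionaries instead of a list of triples are mine, not any real framework's API. It only sketches comparison and a single full display form, and leaves conversion between formats out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass(frozen=True)
class NameFormat:
    locale: str                     # e.g. "en.US" or "th.TH"
    components: Tuple[str, ...]     # the available NameComponents
    sort_order: Tuple[str, ...]     # components used for sorting, most significant first
    display_order: Tuple[str, ...]  # components shown in the full form, in order

@dataclass
class Name:
    name_format: NameFormat
    values: Dict[str, str]                                # NameComponent -> Value
    phonetic: Dict[str, str] = field(default_factory=dict)  # optional PhoneticValue

    def sort_key(self):
        # Case-insensitive tuple of the components this format sorts by.
        return tuple(self.values.get(c, "").casefold()
                     for c in self.name_format.sort_order)

    def compare_with(self, other):
        # Returns -1, 0, or 1, like a C-style comparison function.
        a, b = self.sort_key(), other.sort_key()
        return (a > b) - (a < b)

    def representation(self):
        # Only one (hypothetical) full display form is sketched here.
        return " ".join(self.values[c]
                        for c in self.name_format.display_order
                        if c in self.values)

# Hypothetical formats: US names sort by family name, Thai names by given name.
EN_US = NameFormat("en.US", ("given", "family"), ("family", "given"), ("given", "family"))
TH_TH = NameFormat("th.TH", ("given", "family", "chue len"), ("given", "family"), ("given", "family"))

matthew = Name(EN_US, {"given": "Matthew", "family": "Sachs"})
thaksin = Name(TH_TH, {"given": "Thaksin", "family": "Shinawatra", "chue len": "Meow"})
```

With these definitions, thaksin.sort_key() starts with the given name while matthew.sort_key() starts with the family name, so a mixed list sorts each person by whatever their own format says is significant. Whether comparing keys across formats like that is actually sensible is an open question.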

  1. Khun is a generic honorific roughly akin to Mr./Ms./Mrs. There might be a better one to use for a (former) Prime Minister. This list includes ones for teacher, aunt, sister, older person, and younger person, but suggests that khun is always used when addressing someone formally.
  2. In part two of his post Mr. Ishida recommends that applications that expect ASCII input specify it; detecting and erroring on input in unsupported scripts is probably sufficient.
  3. It might be worth having a list of tokens which will also get treated as part of the last name, such as de, with this approach.
  4. "Enter your name and how you'd like to be addressed:" ?

Wednesday, December 26, 2007

Migrating a wiki from Trac to MediaWiki

I'd set up a Trac installation for wedding planning, instead of using MediaWiki (the system Wikipedia uses, which I already had a couple of installations of), since we wanted both a wiki (venue data, possible honeymoon destinations, guest lists... shut up, it's useful!) and a ticket system (useful for tracking things like thank-you notes and being able to assign specific ones to either Liz or myself).

However, Dreamhost doesn't support mod_python, so pages were taking way too long to load. I decided to switch over to MediaWiki for the wiki part and just use my existing Bugzilla installation for ticket tracking. Hence, a new script over on the code page, trac2mw. Our wiki was fairly tiny, so caveat user. I didn't bother having it migrate tickets or attachments, since we didn't have any data there that was worth preserving. The input format, a MySQL XML dump, probably isn't ideal for a lot of people (since Trac runs on SQLite by default.) It does fix up the wiki page syntax (the parts of it we were using, at least), though.
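To give a flavor of what "fixing up the wiki page syntax" involves, here is a small Python sketch of a few Trac-to-MediaWiki rewrites. This is not the trac2mw script, and the function name trac_to_mediawiki is made up; it assumes only three differences matter (Trac's [wiki:Page label] links, {{{ }}} preformatted blocks on their own lines, and indented bullets), and it flattens nested bullets to a single level.

```python
import re

# [wiki:PageName] or [wiki:PageName some label]
_WIKI_LINK = re.compile(r"\[wiki:(\S+?)(?:\s+([^\]]+))?\]")
# Trac bullets are indented (" * item"); MediaWiki wants "* item" at column 0.
_BULLET = re.compile(r"^\s+\*\s+")

def _link(match):
    page, label = match.group(1), match.group(2)
    if label:
        return "[[%s|%s]]" % (page, label)
    return "[[%s]]" % page

def trac_to_mediawiki(text):
    """Rewrite a handful of Trac wiki constructs as MediaWiki markup."""
    out = []
    in_pre = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "{{{" and not in_pre:
            out.append("<pre>")
            in_pre = True
        elif stripped == "}}}" and in_pre:
            out.append("</pre>")
            in_pre = False
        elif in_pre:
            # Leave preformatted content untouched.
            out.append(line)
        else:
            line = _BULLET.sub("* ", line)
            out.append(_WIKI_LINK.sub(_link, line))
    return "\n".join(out)
```

Headings and bold/italic quoting are left alone here on the assumption that the two syntaxes agree closely enough for a small wiki; anything fancier (tables, macros, nested lists) would need real handling.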

Monday, December 17, 2007

Less Edward Tufte, More Don Martin

A New York Times blog post on holiday tipping linked to a gem from the Times archives, its own ancestor from 1911.

The most striking feature of the article, which appeared on page six of the magazine section, is the large political cartoon-like illustration in the center (drawn by Reginald Russom, who evidently went on to help found what later became the Australian Cartoonists' Association.) From what I've noticed, while the Times Magazine still employs plenty of illustrations, they're mostly charts and graphs; when there's a lead image that's not a more or less realist photograph of the article's subject, it tends to be a photo like this one.

I love how one old newspaper article can shed light on:

  • Other concerns of the period (the legality of a state (or city?)-wide income tax was being argued before the State Supreme Court)
  • Typical incomes and wages (a bit over $1M/yr in 2006 dollars is their example income for a "well-bred" New Yorker)
  • Types of service-sector employees one might utilize (such as elevator boy, charwoman, furnaceman, telephone operator, milkman, and stenographer, in addition to less remarkable professions)
  • Things that one might fear malfunctioning in an apartment (how little some things change; here we have the electric buzzer, hot water, windows (by the glass being broken, not routine mechanical failure), and mail delivery)

Maybe this is still routine in Manhattan, at least in the more highfalutin co-ops, but I also found it noteworthy that the building's management was expected to send you candidates if you wanted to sublet your apartment (but watch out; if you anger your super by not tipping around Christmas, he might send "several negroes and a Chinaman" your way!)

When I first got Times archives access (by subscribing to TimesSelect back in the day), I trawled the archives; there's a lot of good stuff there. If anyone else has a favorite, I'd love to hear about it in the comments.

Sunday, December 2, 2007

The Superest

I've been enjoying The Superest, an ongoing game of "My Team, Your Team"; one player draws a superhero, the next draws a superhero that can defeat that one, repeat. (Via John Gruber.)