Friday, December 28, 2007

Internationalization of Names

Names are complicated

What's in a name? The answer turns out to vary quite widely around the world. When an English-language form, either electronic or paper, asks for a person's name, it usually provides separate fields for first and last name, and sometimes middle name or middle initial. Aristotle Pagaltzis linked to a post by Jim Clark on Thai names, demonstrating that this approach, or even the alternative "given name, family name", falls down pretty quickly outside the English-speaking world. Thai names consist of:

  • A given name, similar to the English first name, except that it must come from a list of government-approved names;
  • A family name, which is also government-regulated; all people with the same family name are related, and new Thai citizens must select an unused name. Like all non-namespaced identifiers (domain names, instant messenger handles, user names on popular web services), the good short ones are taken; and
  • A chue len, which is typically translated as nickname, but according to Mr. Clark is more like an informal given name; it's selected by one's parents or close relatives early in life (though not necessarily at birth).

The obvious mapping of Thai name components onto English, (given name, family name, chue len) → (first name, last name, nickname), doesn't work very well. Consider the former Thai prime minister Thaksin Shinawatra, chue len Meow. His (romanized; more on that later) legal name is Thaksin Shinawatra. If addressing him politely, I would refer to him as Khun Thaksin.1 Note that this is {honorific} {given name}, not {honorific} {family name}; in other words, Mr. Matthew as opposed to Mr. Sachs. His friends and family will call him Meow, not Thaksin or Shinawatra.
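To make the difference concrete, here is a minimal Python sketch of picking a polite form of address. The Person record, its convention field, and the polite_address helper are hypothetical names made up for this example; the only facts it encodes are the two patterns described above, {honorific} {given name} for Thai and {honorific} {family name} for English.

```python
from dataclasses import dataclass


@dataclass
class Person:
    given: str
    family: str
    honorific: str
    convention: str  # hypothetical tag for this sketch: "thai" or "english"


def polite_address(person: Person) -> str:
    """Return the polite written form of address for a person."""
    if person.convention == "thai":
        # Thai usage: honorific followed by the given name.
        return f"{person.honorific} {person.given}"
    # English usage: honorific followed by the family name.
    return f"{person.honorific} {person.family}"


print(polite_address(Person("Thaksin", "Shinawatra", "Khun", "thai")))  # Khun Thaksin
print(polite_address(Person("Matthew", "Sachs", "Mr.", "english")))     # Mr. Sachs
```

A real system would need far more than a two-way switch, but even this much shows that the form of address can't be derived from (first name, last name) alone.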

A further wrinkle is that when sorting a list of Thai names, the given name, not the family name, should be the sort key. Then there's also the matter that Thaksin Shinawatra, aka Meow, isn't really the gentleman's name at all; it's ทักษิณ ชินวัตร, aka แม้ว. There are several standard romanizations for Thai, and whichever one the named individual prefers is considered canonical. There are also other quirks involved in the Thai script form of a name, like the lack of whitespace between the honorific and the given name.
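As a rough sketch of the sorting point, the following Python picks the sort key per name rather than assuming family name first. The (given, family, convention) tuples and the "thai"/"english" tags are assumptions made for illustration, and Alice Smith is a made-up entry added so that the two orderings actually differ.

```python
def sort_key(person):
    """Build a sort key that respects each name's own convention."""
    given, family, convention = person
    if convention == "thai":
        # Thai names sort by given name first.
        return (given.casefold(), family.casefold())
    # English names sort by family name first.
    return (family.casefold(), given.casefold())


people = [
    ("Thaksin", "Shinawatra", "thai"),
    ("Alice", "Smith", "english"),  # made-up entry for the example
    ("Matthew", "Sachs", "english"),
]

ordered = [f"{given} {family}" for given, family, _ in sorted(people, key=sort_key)]
print(ordered)  # ['Matthew Sachs', 'Alice Smith', 'Thaksin Shinawatra']
```

Sorting the same list purely by family name would put Shinawatra between Sachs and Smith instead of filing Thaksin under T.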

Non-Thai complications

Then there are whole sets of different requirements for other kinds of names. The comments on Jim Clark's blog entry, and this post by Richard Ishida, who's in charge of i18n issues for the W3C, give some other good examples.

  • Russian and Icelandic have gender suffixes on the family name (Fuzaylova for a woman, Fuzaylov for a man; Fjalar Jónsson vs. Katrín Jónsdóttir).
  • Russian has nicknames which, like Thai "nicknames", are much more widely used than English nicknames, and which are usually (always?) systematically derivable from the given name; Vladimir → Vova.
  • Scandinavian given names typically include spaces, and convention varies as to how acceptable it is to refer to Hans Christian Andersen as Hans vs. Hans Christian. This isn't unheard of in the southern United States, either -- Billy Jean, &c. In some parts of Europe, these multipart given names are hyphenated, as in the Austrian Hans-Christian or the French Jean-Claude.
  • In France and Italy, names can have a comma which essentially divides a series of first names from a series of middle names; in France, the middle names are rarely used outside of legal contexts, while in Italy, the middle names aren't used in legal contexts. A Mario, Alberto Giovanni Rossi would have a legal name of Mario Rossi in Italy, whereas a French Jean, Christophe Dupond would be commonly known as Jean Dupond but legally Jean, Christophe Dupond.
  • Many countries use patronymics instead of stable family names, so a set of related people won't have the same family name.
  • Many Chinese take arbitrary western nicknames for ease of communicating with westerners.
  • Chinese names also have generational markers, so a set of siblings will all have the same "middle" name, and names are written {family}{generational}{given} in Chinese script.

So what?

How much of this do we really need to worry about? When I say that Thai names should be sorted by given name, should, of course, is a horribly loaded term. If an American border control agent pulls up a list of people who have entered the country at a particular point, they probably want the sort key to be Shinawatra, not Thaksin. Mapping (given, family) → (first, last) is also probably fine for this application. So when, exactly, does the extra information need to be preserved?

Some reasons that a system might be interested in a name, or parts of a name, are:

  • Correlating records with other systems
  • Displaying people's names
  • Addressing people in writing ("Dear Mr. Sachs,", "Welcome, Matthew!") or on the phone
  • Identifying people ("To look up your records, enter your name")
  • Searching for people (on, say, a social networking site)
  • Sorting a list of people

For most English applications that don't cater to a large international audience, it might be "good enough" to simply have a flat name field where users can enter either arbitrary names or at least their romanizations.2 A flat name field is much more flexible than separate first and last name fields. Since you probably need to support substring searches anyway, it doesn't lose anything as far as searching is concerned.
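Here is a minimal sketch of substring search over a flat name field, in Python. The normalize and search helpers are hypothetical; the choice of NFKC normalization plus case folding is my assumption about what a reasonable baseline looks like, not something the approach requires.

```python
import unicodedata
from typing import Iterable, List


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms and case so comparisons are lenient."""
    return unicodedata.normalize("NFKC", text).casefold()


def search(names: Iterable[str], query: str) -> List[str]:
    """Return every name that contains the query as a substring."""
    needle = normalize(query)
    return [name for name in names if needle in normalize(name)]


names = [
    "Thaksin Shinawatra",
    "ทักษิณ ชินวัตร",
    "Katrín Jónsdóttir",
    "Hans Christian Andersen",
]

print(search(names, "christian"))  # ['Hans Christian Andersen']
print(search(names, "ชินวัตร"))     # ['ทักษิณ ชินวัตร']
```

Note that this does not strip accents, so a search for "katrin" would miss Katrín; whether that is desirable depends on the audience.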

If you want to sort by last name, or communicate with other systems that take a (first name, last name) tuple, it might be good enough to just split off the last whitespace-separated token and treat that as the last name.3 If that's not good enough, a pair of (first names, last name) or (given names, family name) inputs may be called for, but characters such as spaces and apostrophes (O'Flannagan) should be valid. If your application wants to try to automatically derive a secondary form of address from the name entered, maybe it shouldn't. Is the ability to have form letters say Mr. Sachs as opposed to Matthew Sachs really worth the faux pas of Mr. Shinawatra? I guess it depends on how international your audience is; you could always ask for multiple forms of address.4
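The "split off the last token" heuristic, together with the particle list suggested in the third footnote, might look something like this in Python. The split_name helper is hypothetical, and the PARTICLES set is illustrative only: "de" comes from the footnote, while "van" and "von" are my additions.

```python
# Tokens that get pulled into the last name when they precede it.
# Illustrative, not exhaustive: only "de" is mentioned in the text.
PARTICLES = {"de", "van", "von"}


def split_name(full_name: str) -> tuple:
    """Split a flat name into (first names, last name) by whitespace."""
    tokens = full_name.split()
    if not tokens:
        return ("", "")
    # Start by treating only the final token as the last name.
    start = len(tokens) - 1
    # Absorb preceding particles, always leaving at least one first-name token.
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return (" ".join(tokens[:start]), " ".join(tokens[start:]))


print(split_name("Matthew Sachs"))            # ('Matthew', 'Sachs')
print(split_name("Charles de Gaulle"))        # ('Charles', 'de Gaulle')
print(split_name("Hans Christian Andersen"))  # ('Hans Christian', 'Andersen')
```

Because the split is on whitespace only, apostrophes and hyphens pass through untouched, so O'Flannagan stays in one piece. A single-token name ends up entirely in the last-name slot, which may or may not be what the receiving system wants.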

For applications that want to really get localized names right, like a system-wide address book or a global social networking site, a more complex approach is called for. For instance, the Mac OS X address book framework knows about the address formats for various countries; it could extend that functionality to support different name formats. It has some rudimentary support for this, in that an individual address book entry can have a set of name ordering flags associated with it, either first name first or last name first (sic); name fields are fixed at title, first name, middle name, last name, suffix, nickname, maiden name, and phonetic (first, middle, last) name.

Per-country address format support doesn't change which fields exist, but it changes the order they're displayed in. Per-country name format support would need to be more complicated. A Name (a person might have more than one, each with a different NameFormat) might consist of:

  • NameFormat, defining the (country, language) associated with the name (e.g. en.US) and the set of available NameComponents
  • A list of (NameComponent, Value, (optional) PhoneticValue)

The system could provide functions like:

  • int Name.compareWith(Name)
  • String Name.representation(NAME_REPRESENTATION) where NAME_REPRESENTATION is one of:
  • Name Name.convertTo(NameFormat), which would try to convert to a different name representation using automated rules for things like romanization.
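To get a feel for the shape of such an API, here is a small Python sketch. Everything in it is an assumption layered on the outline above: the Representation values are placeholders (the outline does not say what the real set would be), the sort_order field and the "th.TH" identifier are invented for the example, and convertTo is left out because romanization rules are well beyond a sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Representation(Enum):
    # Placeholder values; the real set of representations is not specified.
    FULL = "full"
    SORT_KEY = "sort_key"


@dataclass(frozen=True)
class NameFormat:
    locale: str                  # e.g. "en.US"
    components: Tuple[str, ...]  # available components, in display order
    sort_order: Tuple[str, ...]  # components in sort-priority order (assumed field)


@dataclass
class NamePart:
    component: str
    value: str
    phonetic: Optional[str] = None


@dataclass
class Name:
    format: NameFormat
    parts: List[NamePart]

    def _value(self, component: str) -> Optional[str]:
        """Look up the value stored for one component, if any."""
        for part in self.parts:
            if part.component == component:
                return part.value
        return None

    def representation(self, kind: Representation) -> str:
        """Join component values in display order or sort order."""
        if kind is Representation.FULL:
            order = self.format.components
        else:
            order = self.format.sort_order
        values = (self._value(component) for component in order)
        return " ".join(v for v in values if v)

    def compare_with(self, other: "Name") -> int:
        """Return -1, 0, or 1, comparing each name by its own sort key."""
        a = self.representation(Representation.SORT_KEY).casefold()
        b = other.representation(Representation.SORT_KEY).casefold()
        return (a > b) - (a < b)


EN_US = NameFormat("en.US", ("given", "family"), ("family", "given"))
TH_TH = NameFormat("th.TH", ("given", "family"), ("given", "family"))  # hypothetical

thaksin = Name(TH_TH, [NamePart("given", "Thaksin"), NamePart("family", "Shinawatra")])
matthew = Name(EN_US, [NamePart("given", "Matthew"), NamePart("family", "Sachs")])

print(thaksin.representation(Representation.SORT_KEY))  # Thaksin Shinawatra
print(matthew.representation(Representation.SORT_KEY))  # Sachs Matthew
print(matthew.compare_with(thaksin))                    # -1
```

Comparing plain folded strings is a simplification; proper collation would have to take the language of each name into account.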

  1. Khun is a generic honorific roughly akin to Mr./Ms./Mrs. There might be a better one to use for a (former) Prime Minister. This list includes ones for teacher, aunt, sister, older person, and younger person, but suggests that khun is always used when addressing someone formally.
  2. In part two of his post Mr. Ishida recommends that applications that expect ASCII input specify it; detecting and erroring on input in unsupported scripts is probably sufficient.
  3. It might be worth having a list of tokens which will also get treated as part of the last name, such as de, with this approach.
  4. "Enter your name and how you'd like to be addressed:" ?