At 12:36 PM 6/5/2002 -0400, Bartek Plichta wrote:

>This is in response to the question about the languages listed in the Text
>Tech Schema. While ISO 639-2 seems to be a good choice for a set of standard
>language codes, I would like to alert the METS community to the possibility
>of using the Ethnologue set of language codes (,
>not as a replacement, but rather as an alternative (via mapping) to ISO

I did consider using Ethnologue instead of ISO 639-2.  It is obviously more
fine-grained.  Its license terms are also reasonable (basically 'free, but
please give us credit').  There is the argument to be made that 'It's good,
but it's not a *standard*', but I'm not horribly concerned about that.

Ultimately, what brought me down on the side of ISO 639-2 was the thought
that the level of detail provided by Ethnologue, while useful for linguists,
is probably overkill for a technical metadata set.  Technical reasons for
wanting to know a language include wanting to know what rules should be used
in rendering it for display, what fonts/scripts to use in rendering it, what
stemming algorithms are appropriate in enabling retrieval, etc.  Many of the
distinctions Ethnologue makes are not particularly relevant at this level.
Moreover, if we insist on Ethnologue, we also insist on people having the
expertise on hand to make the fine-grained distinctions between languages
accurately.  To some degree, that presents a fairly high knowledge barrier
to its use.

I considered the alternative of not specifying a particular set of language
codes and allowing people to indicate what codeset they were using, e.g.
<language code="JER" name="Jere" codesource="Ethnologue"/> and
perhaps limiting the allowable codesource values to Ethnologue and ISO 639-2
(we could debate the inevitable desire to have 'other' as
yet-another-codesource value).  But I decided there was something to be said
in this case for trying to press for universal agreement on a particular
codeset to use for this purpose.  That is obviously a debatable choice and I
would be interested in hearing people's thoughts.
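
For anyone who wants to see what the codesource alternative would mean for
a consumer of the metadata, here is a rough sketch in Python.  It assumes
the attribute names from the example element above (code, name,
codesource) and assumes the allowable codesource values are limited to
exactly "Ethnologue" and "ISO 639-2".  The function name parse_language is
made up for illustration and is not part of any schema or library.

```python
import xml.etree.ElementTree as ET

# Assumption: only these two code sources are accepted, as floated above.
ALLOWED_CODESOURCES = {"Ethnologue", "ISO 639-2"}


def parse_language(xml_text):
    """Parse one <language> element and return (codesource, code, name).

    Hypothetical helper: raises ValueError if the element is not
    <language>, if codesource is missing or not in the allowed set,
    or if the code attribute is missing or empty.
    """
    elem = ET.fromstring(xml_text)
    if elem.tag != "language":
        raise ValueError(f"expected <language>, got <{elem.tag}>")
    codesource = elem.get("codesource")
    if codesource not in ALLOWED_CODESOURCES:
        raise ValueError(f"unsupported codesource: {codesource!r}")
    code = elem.get("code")
    if not code:
        raise ValueError("missing code attribute")
    # name is treated as optional here; it is None when absent.
    return codesource, code, elem.get("name")


example = '<language code="JER" name="Jere" codesource="Ethnologue"/>'
print(parse_language(example))
```

Note that every consumer would have to branch on codesource before it
could interpret the code value at all, which is part of the cost of not
settling on a single codeset.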

>I also think it might be worthwhile adding another language element to the
>schema. This would provide a way to capture the distinction between the
>language of the resource (e.g., 'language') and the language that the
>resource describes (e.g., 'subject.language'), unless, of course, the schema
>can already handle that. An example of that could be a Fulfulde-French
>dictionary, where the alternative name "Pulaar" is preferred (using
>Ethnologue codes):

I can see the importance of having that kind of information, but it really
strikes me as descriptive metadata, and not technical.

