Okay, so take the documents of the ISO 15924 committee, which are bilingual:

<titleInfo>
        <title>Information technology – Code for the representation of the names of scripts, FDIS</title>
</titleInfo>
<titleInfo>
        <title>Technologies de l’information – Code pour la représentation des noms d’écritures, FDIS</title>
</titleInfo>
<identifier type="uri">http://www.evertype.com/standards/iso15924/document/fdis15924.pdf</identifier>

You know, and I know, that one of these is English and the other French. Further, at one level, it's beside the point what languages these are! They are simply the two equally valid titles of the one work.

But look at the identifier tag. It is equally obvious that the identifier is a URI (URL). That http:// gives it away every time ;-)
Yet we still include the type. Why? Three reasons, I would suggest:
1) tradition: from the early days of digital metadata and metalanguages, redundant information has always been provided when possible, perhaps to encourage
2) data granularity: what is obvious to one person may not be to another, and the more forms or basic elements you can provide for any one piece of data, the more likely it is to make sense to the broadest range of people/user agents; and this granularity or redundancy allows
3) data checking, referential integrity and so on (as well as some level of data backup).

So the type attribute is placed in the identifier tag, and a user agent may check that the data matches the type attribute and really is a URI. Or, if http://, gopher://, file:///, mailto: etc. is not present, it may still pass the information to a browser, because it's been told it's a URI. The browser is likely to add the http:// automatically and see what happens.
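
A rough sketch of that check, in Python, just to make the idea concrete. The function name, the list of known schemes and the default of "http" are my own assumptions for illustration; nothing here is prescribed by MODS.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumption: a small, illustrative set of schemes the user agent recognises.
KNOWN_SCHEMES = {"http", "https", "ftp", "gopher", "file", "mailto"}


def resolve_identifier(xml_fragment, default_scheme="http"):
    """Return the identifier text, repaired if it claims to be a URI.

    Hypothetical helper: if type="uri" but no known scheme is present,
    prepend a default scheme, much as a browser would.
    """
    elem = ET.fromstring(xml_fragment)
    value = (elem.text or "").strip()
    if elem.get("type") != "uri":
        # Not declared as a URI, so leave the data alone.
        return value
    if urlparse(value).scheme.lower() in KNOWN_SCHEMES:
        # The data agrees with the type attribute.
        return value
    # The type attribute says URI, the data lacks a scheme: trust the attribute.
    return f"{default_scheme}://{value}"
```

With the identifier from the record above, the value comes back unchanged; with the http:// stripped off, the type attribute is what lets the agent put it back.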

Or, in the case of the titles... say the title with the attribute xml:lang="fr" comes through intact, but of the other title only the tag and attribute (xml:lang="en") come through. A search on the French title finds nothing, but with just the attribute information, a search in an English-language US library catalogue for "technology, information, code, representation, names, scripts" produces the record. Voilà ;-) Information redundancy/granularity has won the day!!!
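
To illustrate, here is a small Python sketch of how a user agent might notice that an English title exists even though its text was lost. The <record> wrapper, the function names and the "und" fallback for untagged titles are assumptions of mine, not part of any actual record.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical damaged record: the French title survived, the English one
# arrived as a bare tag plus attribute.
RECORD = """<record>
  <titleInfo><title xml:lang="en"></title></titleInfo>
  <titleInfo><title xml:lang="fr">Technologies de l’information – Code pour la représentation des noms d’écritures, FDIS</title></titleInfo>
</record>"""


def titles_by_language(record_xml):
    """Map each title's xml:lang value to its (possibly empty) text."""
    root = ET.fromstring(record_xml)
    titles = {}
    for title in root.iter("title"):
        lang = title.get(XML_LANG, "und")  # assumption: "und" when untagged
        titles[lang] = (title.text or "").strip()
    return titles


def languages_missing_text(titles):
    """Languages for which a title was declared but no text came through."""
    return sorted(lang for lang, text in titles.items() if not text)
```

Here languages_missing_text(titles_by_language(RECORD)) reports that an English title was declared but is empty, which is the hint that an English-language catalogue is worth searching.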

Far-fetched situation, I know, but illustrative of why language tags are important, at least in certain kinds of works.

Now, in the case of "Siddhartha"... leave out the lang attribute altogether or give it the xml:lang="en" attribute and value to indicate that this is meant as an English title.

In the Andy Warhol example, <title xml:lang="de">, <title type="alternative" xml:lang="en">, and <language authority="iso639-2b">eng</language> together give a great summary of the basic ironies of this work!

Any attempt to categorize material, as you know, is fraught with "exceptions" because human data is always more messy and more subtle than any artificial data structure. But just because there are always exceptions doesn't (and shouldn't) stop us from trying. If it did we wouldn't be having this conversation <grin>.

So again, Unicode is great. The case I presented, with a word in Greek script within the title, is obvious in a Unicode system. The <span> with language/script attributes is not necessary within a fully Unicode cataloguing system. But adding this granularity to the data allows the citation software that has downloaded the catalogue record to render every <span> that carries language or script attributes in italics, if the style manual calls for it.
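
As a sketch of what such citation software might do (Python again; the function name, the choice of <i> as output markup and the sample title in the usage note are all made up for illustration):

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def render_title(title_xml):
    """Flatten a title, wrapping language/script-tagged spans in <i>...</i>.

    Hypothetical renderer for the proposed <span> markup; a real style
    manual might ask for something other than italics.
    """
    title = ET.fromstring(title_xml)
    parts = [title.text or ""]
    for child in title:
        text = "".join(child.itertext())
        tagged = child.tag == "span" and any(
            child.get(attr) for attr in (XML_LANG, "lang", "script")
        )
        parts.append(f"<i>{text}</i>" if tagged else text)
        # Text following the child element belongs to the title itself.
        parts.append(child.tail or "")
    return "".join(parts)
```

Given an invented title such as <title xml:lang="en">The meaning of <span xml:lang="el" script="Grek">λόγος</span> in early texts</title>, only the Greek word ends up in italics; a span without language or script attributes passes through as plain text.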

As in all cases, a particular library may choose not to use the language attributes at all. It will still be legal MODS data. It will save the encoders time (and so the library will save money), and the bibliographers will spend more time adding the attributes themselves! Likewise, the <span> tag could be ignored entirely, but the users of your library will certainly appreciate the additional data when it comes to forming bibliographies. And maybe the library catalogue will become a Rosetta Stone a hundred millennia from now, allowing the reconstruction of dozens of dead languages because all the translations and pieces of other languages are properly labelled!!!!!!!!!!!!!!!!!

Doug (only half kidding)

On Thu, 17 Jul 2003 10:44:50 -0700
Karen Coyle wrote:

> At 10:33 AM 7/17/2003 -0500, you wrote:
> >But if we are going to rely on unicode alone, then we might as well drop 
> >the lang, xml:lang and script attributes completely!
> 
> Unfortunately - or otherwise - you are talking to the person (moi) who 
> argued against xml:lang in descriptive fields (which are those that are 
> copied from the piece, such as author, title, publisher, etc.). What would 
> you do with "Italian Cuisine" or "The Tao of Pooh"? What about a book title 
> like: "Siddhartha"? I don't think we want folks to have to determine if a 
> word that originates in another language is or isn't now considered part of 
> English. And I also don't think we can expect people to make these 
> distinctions for works in languages other than their own. Do you exclude 
> proper nouns? Can you even positively determine what is a proper noun? 
> Sometimes this is easy:
>    Andy Warhol : Ausstellung der Deutschen Gesellschaft für Bildende Kunst
> Sometimes less so:
>    On the effects of gypsum, or plaster of paris, as a manure;
>