Unicode is a marvellous thing. But if we are going to rely on Unicode alone, then we might as well drop the lang, xml:lang and script attributes completely! Those attributes are in place to give the data greater granularity. There is never a guarantee that the data will remain within a Unicode framework at all times; languages are separate from scripts, and the two only coincidentally match some of the time; and the end user may have no idea what a particular language actually is without some further indication.
The second issue looks like a showstopper as you describe it, Karen. But I don't think it needs to be when we are talking about XML. The beauty of XML is that the user agent is relatively free to render material only when it understands the material, and to ignore it otherwise. The material remains in place and can be rendered differently by a different user agent.
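A minimal sketch of that behaviour, using Python's standard xml.etree.ElementTree. The <title> and <span type="..."> markup, the type values 'taxon' and 'mystery', and the render function are all hypothetical illustrations, not part of MODS. Each "user agent" is just a table of the span types it understands; anything else is passed through as plain text, and the source record is never altered.

```python
import xml.etree.ElementTree as ET

# Hypothetical title markup; the element and attribute names are assumptions.
TITLE = ('<title>The ecology of <span type="taxon">Quercus robur</span> '
         'in <span type="mystery">lowland</span> Britain</title>')


def render(elem, known):
    """Flatten an element to text, styling only the span types in `known`."""
    parts = [elem.text or ""]
    for child in elem:
        inner = render(child, known)
        # Look up a styling function only for <span> elements; unknown
        # types (or other tags) fall through and are rendered as plain text.
        style = known.get(child.get("type")) if child.tag == "span" else None
        parts.append(style(inner) if style else inner)
        parts.append(child.tail or "")
    return "".join(parts)


# Two hypothetical user agents: one understands nothing, one knows 'taxon'.
plain_agent = {}
citing_agent = {"taxon": lambda text: "*" + text + "*"}

if __name__ == "__main__":
    root = ET.fromstring(TITLE)
    print(render(root, plain_agent))
    print(render(root, citing_agent))
```

Both agents read the same record; only the second marks the taxonomic name, and neither is tripped up by the type value it does not recognise.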
Further, I know I conflated a couple of different 'deeply code[d] semantics' issues in the one set of examples. My initial concern is citations! I am not a librarian, though I have worked in data management in acquisitions in an academic library, and my mother was a librarian. I want to be able to cite material properly in papers I write. For me to be able to do this, and for life scientists to do so (the other example that has come up on the OpenOffice.org bibliographic mailing list), we need some way of marking a separate work quoted in a title, and taxonomic names of plants and animals. A <span> tag with a type attribute with uncontrolled (for now) values would work adequately for this. Controlled values, or an authority list of values, would help data interchange but are not critical. The advantage of XML is that it can be read and edited by a human in a simple editor: if I have downloaded records from several different sources and one uses a type value of 'otherwork' and another 'derivedfrom', I can easily edit one to match the other, or have my citing software apply the same formatting to both.
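As a rough sketch of the 'otherwork' versus 'derivedfrom' clean-up, here is how citing software might reconcile the two values with Python's standard xml.etree.ElementTree. The sample records, the TYPE_ALIASES table and the normalise_types function are illustrative assumptions only; nothing here is defined by MODS.

```python
import xml.etree.ElementTree as ET

# Hypothetical records from two sources that disagree on the type value.
RECORD_A = '<title>A Reading of <span type="otherwork">Moby-Dick</span></title>'
RECORD_B = ('<title>Notes on <span type="derivedfrom">Hamlet</span>'
            ' and its sources</title>')

# Assumed local mapping: each foreign value -> the value I prefer to keep.
TYPE_ALIASES = {"derivedfrom": "otherwork"}


def normalise_types(xml_text, aliases=TYPE_ALIASES):
    """Return the record with every <span> type rewritten via the alias table."""
    root = ET.fromstring(xml_text)
    for span in root.iter("span"):
        old = span.get("type")
        if old in aliases:
            span.set("type", aliases[old])
    # encoding="unicode" makes tostring return a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(normalise_types(RECORD_A))
    print(normalise_types(RECORD_B))
```

With uncontrolled values the alias table has to be maintained by hand; an authority list would simply make it unnecessary.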
The other deeply coded semantic issue is content from other XML namespaces, or binary information. Binary information could be embedded directly, with an encoding attribute specifying the encoding used (makes for an ugly title!), or a link attribute could be used to link to a digital form of the binary information. The namespace issue could simply involve nesting an <extensions> tag within the <span> tag, or using the <extensions> tag in place of the <span> tag. The MathML chemical formula that is a part of the title of a chemistry work can then quite easily be a part of the title.
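A sketch of both options in one hypothetical title, again with Python's standard library. The encoding attribute, the type values, the placement of <extensions> inside <span>, and the choice of base64 are all assumptions made for illustration; only the MathML namespace URI is a real, fixed value.

```python
import base64
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"  # the W3C MathML namespace
RAW = bytes([0, 1, 2, 255])                     # stand-in binary payload


def build_title():
    """Build a title holding a namespaced fragment and an encoded blob."""
    title = ET.Element("title")
    title.text = "Properties of "

    # Option 1: foreign-namespace markup nested in <span><extensions>.
    span = ET.SubElement(title, "span", {"type": "formula"})
    ext = ET.SubElement(span, "extensions")
    math = ET.SubElement(ext, "{%s}math" % MATHML)
    mi = ET.SubElement(math, "{%s}mi" % MATHML)
    mi.text = "x"
    span.tail = " and "

    # Option 2: binary data carried as text, labelled with its encoding.
    blob = ET.SubElement(title, "span", {"type": "binary", "encoding": "base64"})
    blob.text = base64.b64encode(RAW).decode("ascii")
    return title


def decode_binary(title):
    """Return the decoded bytes of every span marked encoding='base64'."""
    return [base64.b64decode(s.text)
            for s in title.iter("span")
            if s.get("encoding") == "base64"]


if __name__ == "__main__":
    xml_bytes = ET.tostring(build_title())
    print(xml_bytes.decode("ascii"))
    print(decode_binary(ET.fromstring(xml_bytes)))
```

A user agent that knows nothing of MathML or base64 can skip both spans and still show the surrounding title text.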
Personally, I would be happy with just a basic <span> tag for next week! ;-) (I don't ask for much :-) A more controlled and detailed tag, or set of tags, could follow in v. 2.2 or beyond.
On Thu, 17 Jul 2003 07:04:11 -0700
Karen Coyle wrote:
> It looks like two things are happening in your examples -- one is that
> you are identifying different character sets ("scripts"), and another is
> that you are identifying quoted works in book titles.
> The first, changing scripts, should not be necessary if you are coding
> the entire record in Unicode. All of the scripts are available to you in
> that encoding (and it is the default encoding for XML).
> The second is something that we haven't ever done in library metadata
> but I can see will come up for many -- being able to deeply code
> semantics within the fields of a metadata record. That's a huge leap
> from where we are today and I don't think we could add this to the
> standard quickly -- different communities will each have their own needs
> (just imagine what the mathematicians will want -- embedded LaTeX or
> MathML). It may be that inter-field encoding will have to lie outside of
> the MODS standard as a purely practical decision.
> OK, that's my gut reaction. Others?