Mike Ferrando wrote:
> One thing that really troubles me about the database approach is the
> attributes. I find that most people that use the database approach
> simply do not use attributes in their code at all.
This is a design problem and a human problem (data entry), not a problem
with the technology itself. If a database doesn't provide you with that
information, it is a fault of the design, not of the database.
People can do horrible XML markup too. In EAD, which is a loose DTD, you
can have a valid document that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ead SYSTEM "ead.dtd">
Yes - the only thing I *had* to enter other than mark-up was the level
attribute. And I have a valid document. Again, this is a human problem -
not a technology problem.
In fact, relational databases have datatyping (which is part of why
searching and indexing can be so efficient). Data entry can also be set
up so that the user is forced to use the correct datatype, or so that
input is validated against a regular expression. For example, you can
specify that the titleproper contain at least 5 characters and no numbers.
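To make the titleproper example concrete, here is a minimal sketch, assuming SQLite driven from Python's standard sqlite3 module. The table name finding_aid and the helper try_insert are made up for illustration; other database systems spell such constraints differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; the column name simply echoes EAD's <titleproper>.
conn.execute(
    """
    CREATE TABLE finding_aid (
        id          INTEGER PRIMARY KEY,
        titleproper TEXT NOT NULL
            CHECK (length(titleproper) >= 5)        -- at least 5 characters
            CHECK (titleproper NOT GLOB '*[0-9]*')  -- no digits anywhere
    )
    """
)


def try_insert(title):
    """Return True if the database accepts the title, False if a CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO finding_aid (titleproper) VALUES (?)", (title,)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, try_insert("Smith Family Papers") succeeds, while try_insert("Abc") and try_insert("Papers 1900") are refused by the database itself, whatever the data entry form does.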
XML Schema allows for this sort of thing, but since EAD isn't
officially in XML Schema, and significant work would have to be done to
develop the required datatypes, it isn't really of much use yet in
forcing normalized practice.
So indeed, if normalization is your goal, databases win hands down.
Almost all encoding in EAD has to be done as established best practice
at a local level.
In summary, in either case this is a human design problem, not a problem
with the fundamental technology. And XML doesn't provide a more rigorous
solution and may in fact provide fewer tools than an RDBMS to enforce solutions.
> Those that do usually have fields that are designated by some type of
> standard. For us that would be MARC. However, even if that is a
> better scenario, I cringe to think of typing in dates twice or even
> three times in order to create a normalized string.
In a database, in fact, you need to represent a date only once - you don't
have to provide a lot of different forms. A date is stored as the
datatype DATE, which is a numeric representation of the date (not a
string). Retrieval on dates entered as DATE types is incredibly fast
(computers like numbers better than strings :)). On output you transform
the date to the form that you wish to display or manipulate. One entry
and infinite display possibilities: MARC, AACR2, ISO 8601, anything you
can possibly imagine. And the tools to format the date are usually
already a part of any halfway functional RDBMS. So you don't have to
bake your own.
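A rough sketch of "one entry, many outputs", again assuming SQLite through Python's sqlite3 module. One caveat: SQLite has no dedicated DATE type and keeps the value as ISO 8601 text here, whereas a server RDBMS would normally use a native DATE column. The table name unit and the three output formats are only illustrations, not the MARC or AACR2 forms themselves.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unit (id INTEGER PRIMARY KEY, unitdate TEXT NOT NULL)")

# The date is entered exactly once.
conn.execute("INSERT INTO unit (unitdate) VALUES (?)", ("2004-12-22",))

# Two alternative forms produced by the database's own formatting function.
compact, european = conn.execute(
    "SELECT strftime('%Y%m%d', unitdate), strftime('%d/%m/%Y', unitdate) FROM unit"
).fetchone()

# A human-readable form produced in the application from the same stored value.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
stored = date.fromisoformat(conn.execute("SELECT unitdate FROM unit").fetchone()[0])
display = f"{MONTHS[stored.month - 1]} {stored.day}, {stored.year}"

print(compact)   # 20041222
print(european)  # 22/12/2004
print(display)   # December 22, 2004
```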
> If these types of attributes are to be done as the data is entered,
> this would put an incredible burden on the data collector.
> Further, without the EAD tag set, it seems to me that a search engine
> would really have to do double duty to get to the data (AACR2 date;
> ISO 8601 date).
And not in XML??? Currently, if you want your dates to be machine processable
(thus searchable, sortable, etc.), you have to enter a normal attribute or
place a normalized form in the field:
<unitdate normal="2004-12-22">December 22, 2004</unitdate>
In the latter case (a normalized form as the element content) you will
probably find yourself writing a somewhat complex stylesheet to parse the
date and display it for humans.
So RDBMS: one date, many outputs.
In XML: many dates (some for humans and some for machines) *or* snarly
XSLT to make it look human readable. And you cannot make the DTD
ensure that the user entered the date correctly (but a DB schema can).
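Because a DTD has no date datatype, any check on the normal attribute has to live in application code. A minimal sketch in Python follows; the helper name check_unitdate is made up, and it only handles a single calendar day in YYYY-MM-DD form, not date ranges.

```python
import xml.etree.ElementTree as ET
from datetime import date


def check_unitdate(xml_fragment):
    """Return the date held in @normal, or None if it is missing or malformed."""
    elem = ET.fromstring(xml_fragment)
    normal = elem.get("normal")
    if normal is None:
        return None
    try:
        return date.fromisoformat(normal)
    except ValueError:
        # The DTD would have accepted this value; only this extra check rejects it.
        return None
```

Given the unitdate element shown above, check_unitdate returns date(2004, 12, 22). A fragment such as <unitdate normal="22 Dec 2004">December 22, 2004</unitdate> still passes DTD validation, but returns None here.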
> Finally, I would think that crawling through a proprietary software
> would be much more difficult than an XML document.
Not sure I understand this.
First, not all relational databases are proprietary.
Second, as Richard points out, they have very robust and capable
indexing and searching. If you start thinking about large quantities of
data, relational databases beat XML databases hands down.
True, you can't open up a database in any old text editor, because the
data is (usually) stored in a binary format that isn't readable in a text
editor (that doesn't mean it is proprietary). And in point of fact, you
can't open up an XML database file in a text editor either - it will also
be in a binary format that facilitates searching and retrieval.
But reading the raw data isn't why people use databases in the first
place. They are used to create efficiencies in data entry, reduce
redundancy, provide normalization and provide efficient and effective
searching of the data.
> These are my reasons for sticking with XML rather than databases. I
> see a separation between data collection software and mark up/display
> of that data. Mapping the datatypes seems to be the key, but context
> (hierarchy) conveys information I would not want to try to capture in
> a database format.
I can agree that context is important to represent in archival
collections. And I think it is a very hard schema design problem to
capture the richness that EAD can capture. But it is possible. I am not
saying it is the best solution - only that it is possible.
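As a very small illustration that hierarchy can be represented at all, here is a sketch assuming SQLite (with recursive query support) through Python's sqlite3 module. The component table, its columns, and the sample rows are invented, and this comes nowhere near the full richness of EAD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE component (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES component(id),  -- NULL for the top level
        level     TEXT NOT NULL,                     -- e.g. collection, series, file
        unittitle TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO component (id, parent_id, level, unittitle) VALUES (?, ?, ?, ?)",
    [
        (1, None, "collection", "Example Family Papers"),
        (2, 1, "series", "Correspondence"),
        (3, 2, "file", "Letters, 1900-1910"),
        (4, 1, "series", "Photographs"),
    ],
)


def context_of(component_id):
    """Return the titles from the top-level component down to the given one."""
    rows = conn.execute(
        """
        WITH RECURSIVE ancestors(id, parent_id, unittitle, depth) AS (
            SELECT id, parent_id, unittitle, 0 FROM component WHERE id = ?
            UNION ALL
            SELECT c.id, c.parent_id, c.unittitle, a.depth + 1
            FROM component c JOIN ancestors a ON c.id = a.parent_id
        )
        SELECT unittitle FROM ancestors ORDER BY depth DESC
        """,
        (component_id,),
    ).fetchall()
    return [title for (title,) in rows]
```

Here context_of(3) walks up the parent links and returns the collection, series, and file titles in order, which is the kind of context a nested EAD document carries implicitly.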
I think it is important to make a clear distinction among the
technology, the format, and the particular instances of schema or DTD design.
A relational database is a set of technologies that optimize storage and
retrieval and transactions of data.
The data that is entered into a database could also be stored in a CSV
file (which could be read by a text editor) just as XML is stored in a
file that can be read by a text editor. However, we generally store the
data indexed in a binary format.
A particular database schema that provides the set of data fields, their
relations and their datatypes may be well designed or poorly designed -
but that doesn't negate the underlying capabilities of an RDBMS.
For our purposes, XML is a format that can be read by a text editor -
not a set of technology tools (any XML geek right now would cringe over
this simplification but for our purposes it will do).
There are XML databases that have similar (though less efficient at the
moment) capabilities - for storage and retrieval - but their files too
are in a binary format. In fact, if you enter data directly into the XML
database, it may never be represented as a "readable" XML file until
exported. In that sense it is just like the data in the relational
database - until serialized as text.
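To illustrate the "until serialized as text" point, here is a small sketch, assuming SQLite and Python's standard xml.etree module. The component table and the export_component helper are hypothetical, and the output is only an EAD-like fragment, not a valid finding aid.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE component (level TEXT NOT NULL, unittitle TEXT NOT NULL)")
conn.execute("INSERT INTO component VALUES (?, ?)", ("file", "Letters, 1900-1910"))


def export_component(row):
    """Serialize one (level, unittitle) row as an EAD-like <c> element."""
    level, unittitle = row
    c = ET.Element("c", {"level": level})
    did = ET.SubElement(c, "did")
    ET.SubElement(did, "unittitle").text = unittitle
    return ET.tostring(c, encoding="unicode")


# Inside the database the row is binary; it becomes readable markup only here.
xml_text = export_component(
    conn.execute("SELECT level, unittitle FROM component").fetchone()
)
print(xml_text)
```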
You can have a good XML Schema/DTD or a bad one. You can have one that
enforces a fair amount of rigor - but until XML schemas are widely used
you will not have one that enforces datatyping.
Relational Database Management Systems can be compared to XML Database
systems in terms of capabilities.
DB Schemas can be compared to DTDs and XML Schemas.
But comparing XML to an RDBMS doesn't make sense. It is comparing apples and
oranges.
PS - Richard, my address book example was indeed too simple. Perhaps a
personnel database would have been more apt.
> Mike Ferrando
> Library Technician
> Music Division
> Library of Congress
> Washington, DC
> --- Richard Davis wrote:
>>Liz's post was very clear and interesting. I just wanted to
>>couple of points:
>>Elizabeth Shaw wrote:
>>>It is not that archival data is particularly unique but it is
>>>that highly nested linear data stored across many fields in a
>>>database is more difficult to retrieve and reconstruct.
>>It hadn't occurred to me that anyone might think of storing every
>><emph> in its own field or row. As you suggest, it doesn't sound
>>something to be recommended, nor does it seem very relational.
>>>But I would argue that perhaps the archival community should move
>>>away from the notion of thinking of its data as a linear
>>>If you move away from that notion, then storing the discrete data
>>>elements that describe a collection and its component parts in a
>>>database begins to make more sense.
>>This is the approach I've taken, and still favour, at least for the
>>being. The finding aids I've dealt with are ISAD(G) based, and all
>>seemed strongly field-oriented. Within each component field,
>>granularity is preserved by using the markup for the equivalent EAD
>>element. At the moment, little further use is made of this markup,
>>except for transformation to HTML. But valid and meaningful EAD can
>>easily be reconstituted, offline or on-the-fly, for transmission
>>web, for indexing, or for when the ultimate killer EAD app is ready
>>>Although data may be indexed by elements for searching purposes,
>>>is usually retrieved as a chunk (with all the internal tagging
>>This point is important, and often overlooked. Indexing is
>>to any DBMS (including XML). None of it works at all, except in
>>In modern RDBMS, indexing is exceedingly well implemented: for that
>>speed and reliability alone, it's likely to be worth compromising
>>absolute integrity of a logical design. And, lurching back to the
>>MySQL's indexing features work extremely well, and include the
>>"fulltext" indexing, which makes it very attractive for storing
>>MySQL has long lacked some core relational features. For example,
>>had to implement referential integrity at application level, which
>>(like Bartleby) I'd prefer not to. MySQL makes up for that by being
>>fast, and free, and well-supported - though I understand PostgreSQL
>>also free, and fully relational, and performs well.
>>>Depends on what you want to do. I will put my address book in a
>>>relational database anyday but when I want to search Shakespeare
>>> me XML.
>>At first I agreed with you, but then I had second thoughts: XML
>>address books very well, probably more so than a heavyweight RDBMS
>>unless your address book is Yellow Pages! On the other hand, a
>>network of multi-level descriptive records seems eminently suitable
>>the relational treatment.
>>\ Richard M Davis
>>/ Digital Archives Specialist
>>\ University of London Computer Centre
>>/ Tel: +44 (0) 20 7692 1350