On Fri, 2003-07-11 at 17:40, Bruce D'Arcus wrote:
>
> First, so how are you guys dealing with MODS records (at the University
> of California; right?)? Are you using Cheshire, or just flat xml files?
I don't know the details of the file format. For that, you should
contact [log in to unmask]. I do know that the MODS records are
actually encapsulated in METS records, and that we use swish-e to index
the records, so the indexing is similar to that used for full-text
documents. No DBMS is involved at this time, but the number of records
is still small (<100K).
>
> Second, would it be reasonable to interpret your comments as meaning:
>
> - it is impossible to map MODS to a RDBMS without any loss
It's not a question of loss, it's a question of appropriateness and
efficiency. RDBMSs are designed for data with a great deal of repetition
in data elements and a lot of one-to-many relationships between data
elements. Bib records have little repetition and there's little
efficiency that can be gained in the few one-to-many situations that
exist. In essence, there isn't enough restriction and regularity in the
data to take advantage of an RDBMS architecture.
>
> - that this makes it a non-starter for libraries who are critically
> concerned with every piece of detail
No, it's the same problem with MARC records -- it has to do with the
nature of the data, not the format the data is stored in (MARC, XML,
etc.).
>
> - that it is conceivable it might be appropriate for end-user-oriented
> software of the sort I'm interested in?
I assume you'll have the same problems, but that doesn't mean you won't
use an RDBMS -- it just means that there will be very little
"normalization" in your database, but you will probably use the DBMS
technology to facilitate searching of some fields. If your database gets
large, however, you'll need to use other design principles to make the
searching viable. For example, our database of 23 million bib records
(MARC, not MODS) uses Oracle, but it doesn't use Oracle for keyword
searching -- searches would take hours. Instead, it uses special
technology that creates bit strings for keywords. I suspect this is
similar to the technology used in the many web search engines.
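To illustrate the general idea of keyword bit strings, here is a minimal Python sketch. It is not a description of how our system is actually built: the function names, the sample records, and the use of a Python integer as the bit string are all assumptions made for the example. Each keyword gets one bit string, where bit i is set if record i contains that keyword, so a multi-keyword search becomes a bitwise AND rather than a scan of the records.

```python
from collections import defaultdict


def build_index(records):
    """Map each keyword to an int used as a bit string.

    Bit i of index[word] is set when record i contains the word.
    """
    index = defaultdict(int)
    for i, text in enumerate(records):
        for word in set(text.lower().split()):
            index[word] |= 1 << i
    return index


def search_all(index, keywords, n_records):
    """Return positions of records that contain every keyword."""
    bits = (1 << n_records) - 1  # start with every record selected
    for kw in keywords:
        # .get avoids adding unknown keywords to the index
        bits &= index.get(kw.lower(), 0)
    return [i for i in range(n_records) if (bits >> i) & 1]


# Hypothetical sample titles, for illustration only.
records = [
    "Extending relational database management systems",
    "Information retrieval systems",
    "Relational database design",
]
index = build_index(records)
hits = search_all(index, ["relational", "database"], len(records))
print(hits)  # [0, 2]
```

A production system would presumably compress the bit strings and keep them on disk, but the AND step is the part that avoids touching the records themselves.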
For more than you ever wanted to know about this topic, here's the
citation for Clifford Lynch's doctoral dissertation which is precisely
on the topic of bibliographic data and database design. Although many
are aware that his research proved important points about relational
databases and info retrieval, you could always be the first person to
actually read what he says ;-) I sure haven't been able to get through
it:
PT Dissertation
PT Book
AU Lynch, Clifford A.
MT Extending relational database management systems for information
   retrieval applications /
DP 1987.
NT Thesis (Ph. D. in Computer Science)--University of California,
   Berkeley, Dec. 1987.
PH v, 239 leaves : ill. ; 28 cm.
LO UC Berkeley Engin T7.6 .L988 UCB
   UC Berkeley Main Z667.5 .L92 UCB
kc