Lacking access to the full OCLC catalog, I did experiments using the
scriblio LOC bib records, which were roughly contemporaneous with the UNT
data.

Because we had access to the full records, we looked at the bottom ten tags
in terms of record frequency; even though these were LC data, every one of
those tags was obsolete, non-existent, or incorrect. In some cases the error
was clearly the result of fat-fingering a digit: the first indicator was
really part of the tag, and the intended tag was obvious from the value.
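
For concreteness, here's a minimal sketch of the kind of record-frequency
tally I mean, reading the raw MARC21/ISO 2709 binary directly; the file name
is a placeholder and there's no error handling.

    from collections import Counter

    def iter_records(path):
        """Yield raw MARC21/ISO 2709 records (as bytes) from a file of concatenated records."""
        with open(path, "rb") as fh:
            while True:
                leader = fh.read(24)
                if len(leader) < 24:
                    break
                reclen = int(leader[:5])            # record length: first 5 ASCII digits of the leader
                yield leader + fh.read(reclen - 24)

    def tags_in(record):
        """Pull the 3-character tags out of the record directory."""
        base = int(record[12:17])                   # base address of data
        directory = record[24:base - 1]             # directory entries, minus the 0x1E terminator
        for i in range(0, len(directory), 12):      # each entry: 3-byte tag, 4-byte length, 5-byte offset
            yield directory[i:i + 3].decode("ascii", "replace")

    record_freq = Counter()
    for rec in iter_records("records.mrc"):         # placeholder file name
        record_freq.update(set(tags_in(rec)))       # set(): count each tag at most once per record

    print(record_freq.most_common()[:-11:-1])       # the ten least frequent tags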

One critical thing is to look at conditional tag/subfield probabilities
rather than raw ones; some fields that are rare in absolute terms are much
more likely to occur given the presence of certain other tags. The OCLC
study conditioned on the record type from the leader, but otherwise used
absolute frequencies.
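
And a sketch of the conditioning itself, reusing the iter_records/tags_in
helpers from the sketch above (the tags named in the final comment are
purely illustrative):

    from collections import Counter
    from itertools import permutations

    n_records = 0
    marginal = Counter()        # number of records containing tag A
    pairs = Counter()           # number of records containing both A and B

    for rec in iter_records("records.mrc"):
        tags = set(tags_in(rec))
        n_records += 1
        marginal.update(tags)
        pairs.update(permutations(tags, 2))     # ordered (A, B) pairs co-occurring in one record

    def p(tag):
        """Raw probability that a record contains `tag`."""
        return marginal[tag] / n_records

    def p_given(tag, other):
        """Probability of `tag` given that `other` is present in the record."""
        return pairs[(tag, other)] / marginal[other] if marginal[other] else 0.0

    # e.g. p("886") may be tiny while p_given("886", "880") is much larger,
    # if those two tags happen to travel together in the corpus.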

One tidbit: the amount of S/W information in the leader of an LC record is
approximately 7 bits.
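
Reading "S/W information" as empirical Shannon entropy over the coded leader
positions, one way to measure it is roughly the following; this is a sketch
rather than the exact methodology, and the path is a placeholder.

    import math
    from collections import Counter

    leader_counts = Counter()
    with open("records.mrc", "rb") as fh:            # placeholder path
        while True:
            leader = fh.read(24)
            if len(leader) < 24:
                break
            # positions 5-9 and 17-19 hold the coded values; the rest of the
            # leader is lengths, offsets, and constants
            leader_counts[bytes(leader[i] for i in (5, 6, 7, 8, 9, 17, 18, 19))] += 1
            fh.seek(int(leader[:5]) - 24, 1)         # skip ahead to the next record

    total = sum(leader_counts.values())
    bits = -sum(c / total * math.log2(c / total) for c in leader_counts.values())
    print(round(bits, 1), "bits per leader")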

Also, the binary format is annoyingly textual. After the first record in a
file, alignment is effectively random, so it's hard to make use of SIMD.
It's also hard to use multiple processors/cores: you have to seek into the
file near where you want to divide it, then hunt for an end-of-record
marker. These issues really only matter if you're processing millions of
records repeatedly, in which case you're probably going to transcode the
records into a more efficient format. If people are interested, I can post
some notes.
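
For what it's worth, the seek-and-hunt split looks roughly like this; the
chunk count and buffer size are arbitrary, and this is only a sketch.

    import os

    RT = b"\x1d"                 # ISO 2709 record terminator

    def chunk_offsets(path, n_chunks, bufsize=1 << 16):
        """Byte offsets that split the file on record boundaries (roughly evenly)."""
        size = os.path.getsize(path)
        offsets = [0]
        with open(path, "rb") as fh:
            for k in range(1, n_chunks):
                fh.seek(k * size // n_chunks)        # jump near the desired split point
                pos = fh.tell()
                while True:                          # hunt for the next end-of-record marker
                    buf = fh.read(bufsize)
                    if not buf:
                        break
                    i = buf.find(RT)
                    if i >= 0:
                        offsets.append(pos + i + 1)  # next record starts right after 0x1D
                        break
                    pos += len(buf)
        offsets.append(size)
        return sorted(set(offsets))

    # Each adjacent pair of offsets can then be handed to its own process,
    # which reads and parses only the records inside that byte range.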

Also, in most setups, the binary format has so little entropy that it is
faster to read gzipped data off storage and decompress it than to read the
raw data. It's nothing like the "compression opportunities" in marc-xml,
though.
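
If you want to check that on your own setup, a quick-and-dirty comparison is
something like the following; the paths are placeholders, and for a fair
test you'd want cold caches and the same data on both sides.

    import gzip, time

    def drain(fh, bufsize=1 << 20):
        """Read a file object to EOF, returning the number of bytes seen."""
        n = 0
        while True:
            buf = fh.read(bufsize)
            if not buf:
                return n
            n += len(buf)

    for opener, path in ((open, "records.mrc"), (gzip.open, "records.mrc.gz")):
        t0 = time.perf_counter()
        with opener(path, "rb") as fh:
            n = drain(fh)
        print(path, n, "bytes in", round(time.perf_counter() - t0, 2), "s")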

Simon