Lacking access to the full OCLC catalog, I did experiments using the scriblio LOC bib records, which were roughly contemporaneous with the UNT data.

Because we had access to the full records, we looked at the bottom ten tags by record frequency; even though they were LC data, every one of those tags was either obsolete, non-existent, or incorrect. In some cases the error was clearly the result of fat-fingering a digit: the first indicator was really part of the tag, and the intended tag was obvious from the value.
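
That kind of count is easy to reproduce. Here's a rough sketch using pymarc; the file name and the library are just my choices for the sketch, not necessarily what we used:

    from collections import Counter

    from pymarc import MARCReader

    record_count = 0
    tag_record_freq = Counter()   # number of records containing each tag at least once

    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:    # skip records the reader couldn't parse
                continue
            record_count += 1
            tag_record_freq.update({field.tag for field in record.get_fields()})

    # The rarest ten tags by record frequency: the candidates for obsolete,
    # non-existent, or fat-fingered tags.
    for tag, n in sorted(tag_record_freq.items(), key=lambda kv: kv[1])[:10]:
        print(tag, n, "{:.6%}".format(n / record_count))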

One thing it is critical to do is look at conditional tag/subfield probabilities rather than raw ones; some fields that are rare in absolute terms are much more likely to occur given the presence of other tags. The OCLC study conditions on record type from the leader, but otherwise uses absolute frequencies.
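
Concretely, the difference between the two views looks something like this; the tag pair is purely illustrative, not a finding:

    from pymarc import MARCReader

    given_tag, target_tag = "651", "043"   # illustrative pair only
    records = with_given = with_target = with_both = 0

    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:
                continue
            records += 1
            tags = {field.tag for field in record.get_fields()}
            has_given = given_tag in tags
            has_target = target_tag in tags
            with_given += has_given
            with_target += has_target
            with_both += has_given and has_target

    # Raw probability vs. probability conditioned on the other tag being present.
    print("P(%s)       = %.4f%%" % (target_tag, 100.0 * with_target / records))
    if with_given:
        print("P(%s | %s) = %.4f%%" % (target_tag, given_tag, 100.0 * with_both / with_given))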

One tidbit: the amount of S/W information in the leader of an LC record is approximately 7 bits.
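
A crude way to reproduce that kind of figure (taking it as Shannon information) is to sum per-position entropies over the non-derivable leader positions; treating the positions independently ignores correlations, so it's a slight overestimate. Roughly:

    import math
    from collections import Counter

    from pymarc import MARCReader

    # Positions 00-04 (record length) and 12-16 (base address of data) are
    # derivable from the record itself, so they're skipped here.
    INFORMATIVE = [i for i in range(24) if not (0 <= i <= 4 or 12 <= i <= 16)]

    counts = {i: Counter() for i in INFORMATIVE}
    records = 0

    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:
                continue
            records += 1
            leader = str(record.leader)
            for i in INFORMATIVE:
                counts[i][leader[i]] += 1

    def entropy_bits(counter, total):
        return -sum((n / total) * math.log2(n / total) for n in counter.values())

    total_bits = sum(entropy_bits(counts[i], records) for i in INFORMATIVE)
    print("~%.1f bits across the leader (upper bound)" % total_bits)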

The binary format is also annoyingly textual. After the first record in a file, alignment is effectively random, so it's hard to make use of SIMD. It's also hard to use multiple processors/cores: you have to seek into the file near where you want to divide it, then hunt for an end-of-record marker (a rough sketch of that split follows below).
These issues really only matter if you're processing millions of records repeatedly, in which case you're probably going to transcode the records into a more efficient format. If people are interested, I can post some notes.
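
The split itself is only a few lines; roughly (file name and chunk count are illustrative):

    import os

    RECORD_TERMINATOR = b"\x1d"   # MARC21 end-of-record marker

    def chunk_ranges(path, chunks):
        """Byte ranges that each start on a record boundary."""
        size = os.path.getsize(path)
        starts = [0]
        with open(path, "rb") as fh:
            for i in range(1, chunks):
                guess = i * size // chunks
                fh.seek(guess)
                # Hunt forward for the end of the record we landed in.
                buf = b""
                while RECORD_TERMINATOR not in buf:
                    block = fh.read(64 * 1024)
                    if not block:
                        break
                    buf += block
                cut = buf.find(RECORD_TERMINATOR)
                if cut != -1:
                    starts.append(guess + cut + 1)
        starts.append(size)
        return list(zip(starts, starts[1:]))

    # Each (start, end) range can be handed to a separate process, which seeks
    # to start and parses records until it reaches end.
    print(chunk_ranges("records.mrc", 8))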

Also, in most setups the binary format has so little entropy that it is faster to read gzipped data off storage and decompress it than to read the raw data. It's nothing like the "compression opportunities" in MARC-XML, though.
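
If you want to check that on your own storage, something like this is enough; it only measures streaming plus decompression, not parsing, and cache state matters, so treat the numbers loosely (file names are illustrative):

    import gzip
    import time

    def scan(open_fn, path):
        """Stream the file and count record terminators; returns (records, seconds)."""
        start = time.perf_counter()
        records = 0
        with open_fn(path, "rb") as fh:
            while True:
                block = fh.read(1 << 20)
                if not block:
                    break
                records += block.count(b"\x1d")
        return records, time.perf_counter() - start

    print("raw:  %d records in %.2fs" % scan(open, "records.mrc"))
    print("gzip: %d records in %.2fs" % scan(gzip.open, "records.mrc.gz"))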

Simon