I have also written a general-purpose reader in Java for ISO 2709 based
streams (MARC, UNIMARC, MAB etc.). I was able to make it robust against
all the inconsistencies that I am aware of. For example, in Germany, Pica
MARC files contain a line feed between records, which is illegal. Or the
Unicode UTF-8 characters are encoded in decomposed form. I plan to use
this reader to explore Bibframe conversions in the future. It's open
source (Affero GPL).
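To illustrate the two quirks mentioned above, here is a minimal sketch (not the actual code of my reader). It assumes the data is UTF-8 and already in memory as a byte array; the class and method names are made up for this example. It skips stray line breaks between records and composes decomposed characters to NFC.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LenientIso2709 {
    // ISO 2709 record terminator (ASCII group separator).
    static final int RECORD_TERMINATOR = 0x1D;

    // Splits a buffer into records, tolerating CR/LF between records.
    // Assumption: the payload is UTF-8. Bytes after the last record
    // terminator (an incomplete record) are silently dropped.
    static List<String> splitRecords(byte[] data) {
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (byte raw : data) {
            int b = raw & 0xFF;
            // A line break before the first byte of a record is illegal
            // in ISO 2709, but it occurs in practice, so skip it.
            if (buf.size() == 0 && (b == '\n' || b == '\r')) {
                continue;
            }
            buf.write(b);
            if (b == RECORD_TERMINATOR) {
                String text = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                // Compose decomposed sequences (e.g. e + U+0301) to NFC.
                records.add(Normalizer.normalize(text, Normalizer.Form.NFC));
                buf.reset();
            }
        }
        return records;
    }
}
```

Note that normalizing changes byte lengths, so the record length in the leader and the directory offsets no longer match the normalized text. A real reader would parse the directory first and normalize the field values afterwards.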
Is there a source of freely available, representative sets of
bibliographic records from the MARC format family that could help
developers with quality tests? I only know of the few example records in
the marc4j source distribution.
A year ago, the GND started being delivered in RDF with unescaped IRI
characters. I reported the issue, and it should have been fixed quite a
while ago. As a consequence, I wrote my own Java RDF Turtle parser that
can even handle broken IRIs. Yes, most of the RDF Turtle parsers out
there are flaky. The same holds for RDF Turtle writers.
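As a rough idea of what repairing a broken IRI can look like, here is a small sketch (not taken from my parser; the names are hypothetical). It percent-encodes the characters that the Turtle grammar forbids inside <...>: everything up to U+0020 and the characters < > " { } | ^ ` \. It assumes the IRI text has already been extracted from the surrounding angle brackets.

```java
// Hypothetical helper, for illustration only.
final class IriRepair {
    // Characters that Turtle's IRIREF production does not allow unescaped.
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    // Percent-encodes forbidden characters and leaves everything else,
    // including existing %XX sequences and non-ASCII characters, untouched.
    static String escapeIri(String iri) {
        StringBuilder sb = new StringBuilder(iri.length() + 8);
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                // All forbidden characters are ASCII, so one byte each.
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, escapeIri("http://example.org/a b|c") yields "http://example.org/a%20b%7Cc". Whether silently repairing or rejecting such input is the better choice depends on the application.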
Best regards,
Jörg
On 05.02.13 18:48, Tom Emerson wrote:
> Riley, Charles writes:
>> Also, too many programmers have to understand raw "marc", because too
>> much code produces broken records, and there are too many
> [...]
>
> Indeed: I've written a general purpose library for reading Z39.2 /
> ISO-2709 encoded files and it is rife with hooks and special cases to
> deal with the records we get from data providers. Supporting all the
> possible variants is a nightmare (MARC-21, UniMARC, CMARC, CNMARC,
> KORMARC, *MARC) and the inconsistencies and invalid crap I see drives me
> to distraction.
>
> Then again, RDF isn't free from that kind of thing. I've been working
> with the latest publicly available GND authority file (several
> gigabytes of Turtle encoded RDF) from the DNB and I ended up having to
> globally filter out one particular predicate because the values commonly
> contained invalid IRIs that made Jena's TDB import barf.
>
>> The character used for a field delimiter on one system, ǂ, is the
>> alveolar click letter used in print in Khoesan languages, supported in
>> ISO 6438 and therefore, by extension, in UNIMARC.
>>
>> Other systems use ‡ as the field delimiter.
> Was this intentional? U+01C2 and U+2021 could be easily confused if the
> font is lame enough.