Bernhard Eversberg writes:
> Ah, I'm sorry, I mixed up the figures! The RDF file is the larger one.
> Is there, by the way, any software that would convert RDF into Turtle,
> and also change the &#nnn; entity notations into UTF-8?
The Apache Jena rdfcat utility will do this, assuming the inputs are
valid. With the previous GND dump there were a handful of Turtle
statements that were invalid.
Converting the entities to UTF-8 is a bigger issue. Again, in the
previous release, the data often contained entities that are not part of
the W3C's list of XML entities, including:
- &nsb; - ISO 6630 control for NON-SORTING CHARACTER(s), BEGIN -> U+0098
- &nse; - ISO 6630 control for NON-SORTING CHARACTER(s), END -> U+009C
- &ptacc; - U+0323 COMBINING DOT BELOW, "punct als accent"
It was also common for &nse; to appear in the data with the trailing ';'
missing. There were also cases where a space was missing after an
ampersand, leading to failures when attempting to decode entities, e.g.,
"Pietsch, Heinz Dieter &Getty-Ulan"
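
For what it's worth, here is a minimal Python sketch of how the entity cleanup described above could be approached. It is not what any existing tool does; the names decode_entities and CUSTOM_ENTITIES are made up for illustration, and the table only covers the three entities listed above. It assumes you are working on already-extracted literal values rather than raw XML. Numeric references must be well formed, the three custom names are accepted with or without the trailing ';', and anything else starting with '&' is left untouched.

```python
import re

# Hypothetical table covering only the non-W3C entities listed above;
# extend it as further oddities turn up in the data.
CUSTOM_ENTITIES = {
    "nsb": "\u0098",    # NON-SORTING CHARACTER(S), BEGIN
    "nse": "\u009c",    # NON-SORTING CHARACTER(S), END
    "ptacc": "\u0323",  # COMBINING DOT BELOW
}

# Longest names first so that one name can never shadow a longer one.
_NAMES = "|".join(sorted(CUSTOM_ENTITIES, key=len, reverse=True))

ENTITY_RE = re.compile(
    r"&#(?P<dec>[0-9]+);"              # decimal reference, e.g. &#228;
    r"|&#[xX](?P<hex>[0-9A-Fa-f]+);"   # hexadecimal reference, e.g. &#xE4;
    r"|&(?P<name>" + _NAMES + r");?"   # known custom name, ';' optional
)


def decode_entities(text):
    """Replace numeric and known custom entities with the characters they denote."""

    def replace(match):
        try:
            if match.group("dec") is not None:
                return chr(int(match.group("dec")))
            if match.group("hex") is not None:
                return chr(int(match.group("hex"), 16))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the original text.
            return match.group(0)
        return CUSTOM_ENTITIES[match.group("name")]

    return ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    print(decode_entities("M&#252;ller, &nsb;Der&nse Mann"))
    print(decode_entities("Pietsch, Heinz Dieter &Getty-Ulan"))
```

Accepting the custom names without the ';' is a trade-off: it recovers the truncated &nse cases, but it would also rewrite ordinary text that happens to start with "&nse", so the results are worth spot-checking.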
> And it would be really nice to learn why there are these differences
> instead of a stable download format. Wonder what it will be the
> next time...
Yes, I'd prefer Turtle over RDF/XML, since all of my tools for processing
this expect Turtle already. :-)
Principal Software Engineer, Search