Riley, Charles writes:
[regarding preprocessing of GND data]
> Will you share some good examples of this, to show what's going on?
A quick example, so as not to annoy others with non-BIBFRAME-related
noise.
The gnd:homepage property sometimes contains invalid IRIs, e.g.,
gnd:homepage <http://www.kurszentrum-ballenberg.ch/cms/cms.asp?page=213&p=ASP\Pg213.asp> ;
the backslash is not a legal character in an IRI. Even if there aren't
many of these, it is very frustrating to have a data import killed
partway through because of them.
Note that the live GND does not show this invalid URL, so I suspect the
data file (which is from October 2012) just needs to be updated.
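A rough pre-check along these lines could flag such lines before the
import starts. This is only a sketch in Python. It assumes the dump is
Turtle-like text with IRIs written between < and > on a single line, and
the function name find_bad_iris is made up for illustration. The
character list follows the Turtle IRIREF production; the crude pattern
will also report any < ... > that happens to occur inside a literal.

```python
import re

# Characters the Turtle IRIREF production forbids between < and >:
# control characters and space (U+0000-U+0020) plus < > " { } | ^ ` \
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')
# \uXXXX and \UXXXXXXXX escapes are legal, so drop them before checking.
_UCHAR = re.compile(r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}')
# Rough tokenizer (an assumption): anything between < and > on one line.
_IRIREF = re.compile(r'<([^<>\n]*)>')


def find_bad_iris(lines):
    """Yield (line_number, iri) for each IRI containing a forbidden character."""
    for lineno, line in enumerate(lines, start=1):
        for match in _IRIREF.finditer(line):
            iri = match.group(1)
            if _FORBIDDEN.search(_UCHAR.sub('', iri)):
                yield lineno, iri
```

Running it over an open file, e.g. `for n, iri in find_bad_iris(f): print(n, iri)`,
lists the offending lines so they can be fixed or dropped before loading.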
One other thing that got me was the use of the Unicode Combining
Grapheme Joiner (U+034F). No normalization form removes this character,
so if you're working with the text you need to remove it yourself. For
example,
0047 G
006F o
0072 r
0062 b
0061 a
010D č
0065 e
034F <combining grapheme joiner>
0308 <combining diaeresis>
0076 v
Neither NFC nor NFKC will create ë from the sequence 0065 034F 0308,
because the joiner sits between the base letter and the diaeresis and
blocks composition. Of course, that doesn't cause a loading issue.
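For anyone who wants to see this for themselves, here is a small Python
sketch using the standard unicodedata module. The helper name
strip_cgj_and_normalize is made up for illustration, and it assumes the
joiner carries no meaning you want to keep (it is sometimes inserted on
purpose, e.g. for sorting), so check your data before stripping it
wholesale.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def strip_cgj_and_normalize(text, form="NFC"):
    """Remove every U+034F, then apply the requested normalization form."""
    return unicodedata.normalize(form, text.replace(CGJ, ""))


# The sequence from the example: G o r b a č e <CGJ> <diaeresis> v
raw = "Gorba\u010de\u034f\u0308v"

# Normalization alone leaves e, joiner and diaeresis as three code points.
still_split = "\u00eb" not in unicodedata.normalize("NFC", raw)

# With the joiner removed first, e + diaeresis composes to U+00EB.
cleaned = strip_cgj_and_normalize(raw)
```

Here still_split comes out True, and cleaned contains the single
character ë (U+00EB) in place of the three-code-point sequence.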
OK, enough of this. Sorry everyone.
-tree