In the software I write for consolidating German library catalog data,
I chose the following lines of thought to deal with identifiers:

All identifier fields from legacy catalog records are considered "vague" /
"opaque", without semantics. They were once recorded as plain strings and
must be parsed into an internal representation (which is not a new task).

Finding the representation can be divided into subtasks for:
- identifiers imported into a catalog from sources outside the control of
the cataloging library ("global" identifiers, regional/local identifiers,
standard numbers)
- identifiers from other libraries (authority files, cross-links)
- identifiers created for use only in the library's own catalog (i.e.
"system numbers")

Some methods I use for disambiguation are:
- namespacing by following Linked Data rules, that is, minting URIs from
opaque identifiers. This just works well for "system numbers" and for
identifiers which are under "my control"
- namespacing by rules made by librarians (e.g. prefixing with ISIL)
- considering the import context (for finding the sender of the received
data, by file or by API)
- checksumming (an identifier can be assigned to a certain identifier type
when its checksum is found to be correct, for example ZDB-ID)
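
To illustrate the checksumming step, here is a minimal sketch in Python. It
is not the code of the software described here. It assumes the ZDB-ID check
digit rule is: weight the digits from right to left starting at 2, sum them,
take the sum modulo 11, and write a remainder of 10 as "X". The function
names and the length limit in the pattern are hypothetical, and the rule
should be checked against the ZDB documentation before relying on it.

```python
import re

# Assumed shape: digits, an optional hyphen, then a check character.
ZDB_ID_PATTERN = re.compile(r"^(\d{1,10})-?([0-9Xx])$")


def zdb_check_digit(digits: str) -> str:
    """Compute the check character under the assumed modulo-11 rule."""
    # The rightmost digit gets weight 2, the next one weight 3, and so on.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=2))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def looks_like_zdb_id(raw: str) -> bool:
    """Return True if the string has ZDB-ID shape and a matching checksum."""
    match = ZDB_ID_PATTERN.match(raw.strip())
    if match is None:
        return False
    body, check = match.groups()
    return zdb_check_digit(body) == check.upper()


if __name__ == "__main__":
    for candidate in ["120714-3", "120714-4", "n 96055058"]:
        print(candidate, looks_like_zdb_id(candidate))
```

A passing checksum only makes the type plausible. Short numbers can pass by
chance, so the import context still matters.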

In the search engine index, I store at least three forms of each
identifier, side by side:
-- the original form
-- the canonical form
-- the display form

The purpose of the original form is preservation/archival, so that future
processing routines can still be applied to the form that was once entered
or imported into the catalog.

The purpose of the canonical form is to support programmatic identifier
equivalence, search/faceting, match/merge, and linking to other resources.

The purpose of the display form(s) is to show the user the expected,
context-dependent output (e.g. displaying ISBN-13 or ISBN-10 or both,
adding a prefix for the type, adding ISIL as prefix, showing URIs etc).
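
The three forms can be pictured with a small Python sketch. This is an
illustration only. The class name, field names and the choice of ISBN-13
without hyphens as the canonical form are assumptions, and the input ISBN is
not validated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedIdentifier:
    original: str   # exactly as entered or imported, kept for archival
    canonical: str  # normalized form used for equivalence and linking
    display: str    # what the user gets to see


def isbn13_check_digit(first12: str) -> str:
    """Standard ISBN-13 check digit: weights 1 and 3 alternate."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def index_isbn(raw: str) -> IndexedIdentifier:
    """Build the three side-by-side forms for one ISBN string."""
    compact = raw.replace("-", "").replace(" ", "").upper()
    if len(compact) == 10:
        # ISBN-10 becomes ISBN-13: prefix 978, drop old check digit, recompute.
        core = "978" + compact[:9]
        compact = core + isbn13_check_digit(core)
    elif len(compact) != 13:
        raise ValueError(f"not an ISBN-10 or ISBN-13: {raw!r}")
    return IndexedIdentifier(original=raw, canonical=compact,
                             display="ISBN " + compact)


if __name__ == "__main__":
    print(index_isbn("0-306-40615-2"))
```

With this, "0-306-40615-2" and "978-0-306-40615-7" get the same canonical
form while each keeps its own original form.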

When search terms are entered, I analyze them to see whether identifier
types can be recognized automatically (e.g. ISBN, ISSN). Internally, a
recognized identifier phrase is converted to the canonical form
("normalizing the input"). This may also use contextual information from the
environment (e.g. whether a library ISIL is known for the user who is
searching). This means only one hit is expected in the search result (if
there are more hits, something is surely wrong with the indexed catalog
data).
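
A sketch of such recognition in Python, using the standard ISBN and ISSN
check digit rules. The function names, the stripping of an "ISBN"/"ISSN"
label, and the canonical forms chosen (ISBN-13 without hyphens, ISSN as
NNNN-NNNN) are assumptions for illustration. Context such as the user's ISIL
is left out.

```python
import re
from typing import Optional, Tuple


def isbn10_valid(s: str) -> bool:
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    # Weights 10 down to 1; "X" counts as 10 and may only be the last character.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def isbn13_valid(s: str) -> bool:
    if not re.fullmatch(r"\d{13}", s):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def issn_valid(s: str) -> bool:
    if not re.fullmatch(r"\d{7}[\dX]", s):
        return False
    # Weights 8 down to 1; "X" counts as 10.
    total = sum((8 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def recognize(term: str) -> Optional[Tuple[str, str]]:
    """Return (identifier type, canonical form), or None for ordinary text."""
    stripped = re.sub(r"^(ISBN|ISSN)[:\s]*", "", term.strip(), flags=re.I)
    compact = re.sub(r"[\s-]", "", stripped).upper()
    if isbn13_valid(compact):
        return ("isbn", compact)
    if isbn10_valid(compact):
        core = "978" + compact[:9]
        weighted = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(core))
        return ("isbn", core + str((10 - weighted % 10) % 10))
    if issn_valid(compact):
        return ("issn", compact[:4] + "-" + compact[4:])
    return None


if __name__ == "__main__":
    for term in ["0-306-40615-2", "issn 0028-0836", "linked data"]:
        print(repr(term), "->", recognize(term))
```

Terms that return None go to the normal full-text search. Recognized terms
can be sent as an exact match against the canonical identifier field.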

URIs created from the canonical form are the preferred format for creating
HTML links in the display of a Linked Open Data environment, so I can link
to other web resources, e.g. to the Zeitschriftendatenbank (ZDB) site
ld.zdb-services.de if I deal with a ZDB-ID. I could even copy the resources
from ZDB to a local site and redirect my ZDB-ID links to a local mirror.
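
Minting link URIs from canonical forms can be sketched as a lookup of URI
templates in Python. The id.loc.gov pattern is taken from the example quoted
below. The path "/resource/" on ld.zdb-services.de, the type keys, and the
mirror host mirror.example.org are assumptions for illustration only.

```python
from typing import Dict, Optional
from urllib.parse import quote

# Identifier type -> URI template. The ZDB path is an assumed pattern.
URI_TEMPLATES: Dict[str, str] = {
    "zdb": "http://ld.zdb-services.de/resource/{id}",
    "lc-names": "http://id.loc.gov/authorities/names/{id}",
}


def mint_uri(id_type: str, canonical: str,
             overrides: Optional[Dict[str, str]] = None) -> str:
    """Build a link URI from a canonical identifier.

    `overrides` lets a deployment point an identifier type at a local
    mirror without touching the stored canonical form.
    """
    templates = {**URI_TEMPLATES, **(overrides or {})}
    # Percent-encode anything that is not safe inside a path segment.
    return templates[id_type].format(id=quote(canonical, safe=""))


if __name__ == "__main__":
    print(mint_uri("zdb", "120714-3"))
    # Hypothetical local mirror for ZDB resources:
    mirror = {"zdb": "http://mirror.example.org/zdb/{id}"}
    print(mint_uri("zdb", "120714-3", overrides=mirror))
```

Because only the template changes, switching between the remote site and a
local mirror does not require reindexing.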

I do everything to avoid URNs. Using them would mean I had to implement
resolving systems. I do not think that resolving systems can be a reliable
solution. Historically, they have failed (there are reasons why e.g. Google
never needed a URN resolver).

I also avoid persistent identifiers, as they are a constant source of
trouble: e.g. the http://purl.org namespace is unreliable because the OCLC
server may be down, and DOIs fail when publisher servers are down. This is
not a matter of practice, it is built into the design.

To me, Linked Data/Semantic Web technology is superseding URN/persistent
identifier concepts. This is finally a distributed and stable technology.
No central authority, no bottlenecks, no more trouble with server
reliability.

Jörg



On Sat, Jul 19, 2014 at 6:06 PM, Karen Coyle <[log in to unmask]> wrote:

>
> On 7/18/14, 3:34 PM, Robert Sanderson wrote:
>
>
>  Or an HTTP space such as Jeff's suggested purl.org.
>
> This may be a question for Jeff ... must PURLs re-direct to a non-PURL
> URL? - If so, then in any case one will need a conformant non-PURL URL for
> the identifiers.
>
> Taking Ray's example “info:bibframe\publisherNumber\ 256A090” - that could
> be expressed as "http://bibframe.org/publisherNumber/256A090"
> <http://bibframe.org/publisherNumber/256A090>. I rather doubt that it
> makes sense to create a PURL for every identifier value, although I like
> the idea that one could re-direct to a more authoritative URL when the
> relevant agency actually instantiates a URL form of the identifier scheme.
>
> There's another issue, which is that the "identifiers" in the records
> today aren't normalized. As Thomas Berger points out, already the LCNA
> identifier has a different form when encoded in a URL:
>
> MARC: $a n 96055058
> URL: http://id.loc.gov/authorities/names/n96055058
>
> I suspect that Ray's publisher number example has been normalized. Some of
> the schemes are quite awkward in form, using varying punctuation:
>
>   074 ## $a 277-A-2 (MF)
>
> and sometimes being multi-part, such as:
>
>   017 ## $a EU781596 $b U.S. Copyright Office
>   017 ## $a DL 80-0-1524 $b Bibliothèque nationale du Québec
>   017 ## $a PA1116341 $b U.S. Copyright Office $d 20020703
>
> Some of us have the experience of developing search algorithms for these
> identifiers, but search is considerably different from minting a URI - to
> begin with, the usage of these in library systems does not require them to
> be unique; occasionally two normalize to the same string.
>
> What I think we are forgetting here is how we use these various codes and
> numbers. Essentially they are searched and displayed. In the future we may
> be using them for linking. This means that if they are "converted" to URLs,
> they will still need human-readable labels, and some thought must be given
> to how (if?) they can be made searchable.
>
> kc
>
> --
> Karen [log in to unmask] http://kcoyle.net
> m: 1-510-435-8234
> skype: kcoylenet
>
>