In the software I write for consolidating German library catalog data, I follow these lines of thought when dealing with identifiers:

All identifier fields from legacy catalog records are considered "vague" / "opaque", without semantics. They were once recorded as plain strings and must be parsed into an internal representation (which is not a new task).

Finding the representation can be divided into subtasks for:
- identifiers imported into a catalog out of control of the cataloging library ("global" identifiers, regional/local identifiers, standard numbers)
- identifiers from other libraries (authority files, cross linkings)
- identifiers created for use only in the library's own catalog (i.e. "system numbers")

Some methods I use for disambiguation are
- namespacing by following Linked Data rules, that is, minting URIs from opaque identifiers. This just works well for "system numbers" and for identifiers which are under "my control"
- namespacing by rules made by librarians (e.g. prefixing with ISIL)
- considering the import context (for finding the sender of the received data, by file or by API)
- checksumming (identifiers can be assigned to certain identifier types when checksums are found to be correct, example: ZDB-ID) 
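
A minimal sketch of how the checksumming step might look for a ZDB-ID, not the actual code of the software described here. It assumes the commonly described check scheme (digits before the hyphen weighted 2, 3, 4, ... from the right, summed modulo 11, with 10 written as "X"); the function names are made up for illustration.

```python
import re

# Assumed shape of a ZDB-ID: up to 10 digits, optional hyphen, one check character.
ZDB_PATTERN = re.compile(r"^([0-9]{1,10})-?([0-9Xx])$")


def zdb_check_char(body: str) -> str:
    """Compute the check character for the digits before the hyphen."""
    # Weights start at 2 for the rightmost digit and grow to the left.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=2))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def looks_like_zdb_id(raw: str) -> bool:
    """Return True if the opaque string is shaped like a ZDB-ID and its checksum fits."""
    match = ZDB_PATTERN.match(raw.strip())
    if not match:
        return False
    body, check = match.groups()
    return zdb_check_char(body) == check.upper()


print(looks_like_zdb_id("120714-3"))
print(looks_like_zdb_id("120714-4"))
```

A passing checksum only makes the ZDB-ID type plausible; the import context still has to confirm it, since a string of another type can pass by chance.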

In the search engine index, I store at least three forms of each identifier, side by side:
- the original form
- the canonical form
- the display form

The purpose of the original form is preservation/archival, so that future processing routines can be applied to the form that was once entered or imported into the catalog.

The purpose of the canonical form is programming identifier equivalence, search/faceting, match / merge, and linking to other resources.

The purpose of the display form(s) is to show the user the expected, context-dependent output (e.g. displaying ISBN-13 or ISBN-10 or both, adding a prefix for the type, adding ISIL as prefix, showing URIs etc).
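
The three forms can be illustrated with an ISBN. This is only a sketch under assumptions: the class and function names are hypothetical, ISBN-13 digits without hyphens are chosen as the canonical form, and the display form is a plain "ISBN " prefix (proper hyphenation would need the ISBN range tables).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierForms:
    original: str   # exactly as entered or imported, kept for archival
    canonical: str  # used for equivalence, search, match/merge, linking
    display: str    # what the user sees


def isbn13_check_digit(first12: str) -> str:
    """Check digit for the first 12 digits of an ISBN-13 (weights 1, 3, 1, 3, ...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn_forms(raw: str) -> IdentifierForms:
    """Build original, canonical and display form from a raw ISBN string."""
    compact = raw.replace("-", "").replace(" ", "").upper()
    if len(compact) == 10 and compact[:9].isdigit():
        # ISBN-10 to ISBN-13: prefix 978 and recompute the check digit.
        # The ISBN-10 check digit itself is not verified in this sketch.
        first12 = "978" + compact[:9]
        compact = first12 + isbn13_check_digit(first12)
    if len(compact) != 13 or not compact.isdigit():
        raise ValueError(f"not an ISBN: {raw!r}")
    if isbn13_check_digit(compact[:12]) != compact[12]:
        raise ValueError(f"ISBN checksum mismatch: {raw!r}")
    return IdentifierForms(original=raw, canonical=compact, display="ISBN " + compact)


print(isbn_forms("3-16-148410-X"))
```

Because both the ISBN-10 and the ISBN-13 spelling end up with the same canonical value, they compare as equal in the index while the original string stays untouched.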

When search terms are entered, I analyze them to see whether identifier types can be recognized automatically (e.g. ISBN, ISSN). Internally, a recognized identifier phrase is converted to the canonical form ("normalizing the input"). This may also use contextual information from the environment (e.g. whether there is a library ISIL for the user who is searching). This means that only one hit is expected in the search result (if there are more hits, there is surely something wrong with the indexed catalog data).
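
A sketch of this input normalization for one identifier type, the ISSN, whose check character uses weights 8 down to 2 modulo 11. The function names, the returned type labels, and the rule that a bare alphanumeric token is namespaced as "(ISIL)number" when a user ISIL is known are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

ISSN_PATTERN = re.compile(r"^([0-9]{4})-?([0-9]{3})([0-9Xx])$")


def issn_check_char(first7: str) -> str:
    """Check character for the first seven ISSN digits (weights 8..2, modulo 11)."""
    total = sum(int(d) * w for d, w in zip(first7, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def normalize_search_term(term: str, user_isil: Optional[str] = None) -> Tuple[str, str]:
    """Return (identifier type, canonical form) for a single search term."""
    cleaned = term.strip()
    match = ISSN_PATTERN.match(cleaned)
    if match:
        first7 = match.group(1) + match.group(2)
        check = match.group(3).upper()
        if issn_check_char(first7) == check:
            return ("issn", f"{first7[:4]}-{first7[4:]}{check}")
    if user_isil and cleaned.isalnum():
        # Hypothetical context rule: namespace a bare token with the user's ISIL.
        return ("local", f"({user_isil}){cleaned}")
    # Nothing recognized: treat the term as ordinary text.
    return ("text", cleaned)


print(normalize_search_term("00280836"))
print(normalize_search_term("HT012345678", user_isil="DE-605"))
```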

URIs, created from the canonical form, are a preferred format for creating HTML links in the display of a Linked Open Data environment, so I can link to other web resources. For example, I link to the Zeitschriftendatenbank (ZDB) site ld.zdb-services.de when I deal with a ZDB-ID. I could even copy the resources from ZDB to a local site and redirect my ZDB-ID links to that local mirror.
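
A small sketch of minting link URIs from canonical forms with a template table, so that switching to a local mirror only means swapping the table. The "/resource/" path on ld.zdb-services.de and the mirror host are assumptions, and the names are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import quote

# Assumed URI pattern for ZDB; check against the live service before relying on it.
URI_TEMPLATES: Dict[str, str] = {
    "zdb": "http://ld.zdb-services.de/resource/{id}",
}

# Hypothetical local mirror using the same identifier in its own namespace.
MIRROR_TEMPLATES: Dict[str, str] = {
    "zdb": "http://mirror.example.org/zdb/{id}",
}


def mint_uri(id_type: str, canonical: str,
             templates: Dict[str, str] = URI_TEMPLATES) -> Optional[str]:
    """Return a link URI for a canonical identifier, or None if the type has no template."""
    template = templates.get(id_type)
    if template is None:
        return None
    # Percent-encode anything that is not safe inside a URI path segment.
    return template.format(id=quote(canonical, safe=""))


print(mint_uri("zdb", "120714-3"))
print(mint_uri("zdb", "120714-3", templates=MIRROR_TEMPLATES))
```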

I do everything to avoid URNs. Using them would mean I would have to implement resolving systems. I do not think that resolving systems can be a reliable solution. Historically, they have failed (there are reasons why e.g. Google never needed a URN resolver).

I also avoid persistent identifiers, as they are a constant source of trouble, e.g. the http://purl.org namespace, which is unreliable because the OCLC server may be down, or DOIs, where publisher servers may be down. This is not a matter of practice; it is built into the design.

To me, Linked Data/Semantic Web technology is superseding URN/persistent identifier concepts. This is finally a distributed and stable technology. No central authority, no bottlenecks, no more trouble with server reliability.

Jörg



On Sat, Jul 19, 2014 at 6:06 PM, Karen Coyle <[log in to unmask]> wrote:

On 7/18/14, 3:34 PM, Robert Sanderson wrote:

Or an HTTP space such as Jeff's suggested purl.org
This may be a question for Jeff ... must PURLs re-direct to a non-PURL URL? - If so, then in any case one will need a conformant non-PURL URL for the identifiers.

Taking Ray's example “info:bibframe\publisherNumber\ 256A090” - that could be expressed as "http://bibframe.org/publisherNumber/256A090". I rather doubt that it makes sense to create a PURL for every identifier value, although I like the idea that one could re-direct to a more authoritative URL when the relevant agency actually instantiates a URL form of the identifier scheme.

There's another issue, which is that the "identifiers" in the records today aren't normalized. As Thomas Berger points out, already the LCNA identifier has a different form when encoded in a URL:

MARC: $a n 96055058
URL: http://id.loc.gov/authorities/names/n96055058

I suspect that Ray's publisher number example has been normalized. Some of the schemes are quite awkward in form, using varying punctuation:

074 ##$a277-A-2 (MF)

 and sometimes being multi-part, such as:

017 ##$aEU781596$bU.S. Copyright Office
017 ##$aDL 80-0-1524$bBibliothèque nationale du Québec
017 ##$aPA1116341$bU.S. Copyright Office$d20020703

Some of us have the experience of developing search algorithms for these identifiers, but search is considerably different from minting a URI - to begin with, the usage of these in library systems does not require them to be unique; occasionally two normalize to the same string.

What I think we are forgetting here is how we use these various codes and numbers. Essentially they are searched and displayed. In the future we may be using them for linking. This means that if they are "converted" to URLs, they will still need human-readable labels, and some thought must be given to how (if?) they can be made searchable.

kc

-- 
Karen Coyle
[log in to unmask] http://kcoyle.net
m: 1-510-435-8234
skype: kcoylenet