In my software that I write for consolidating German library catalog data, I chose the following line of thought to deal with identifiers:
All identifier fields from legacy catalog records are considered "vague" / "opaque", i.e. without semantics. They were once recorded as plain strings and must be parsed into an internal representation (which is not a new task).
Finding the representation can be divided into subtasks for:
- identifiers imported into a catalog out of control of the cataloging library ("global" identifiers, regional/local identifiers, standard numbers)
- identifiers from other libraries (authority files, cross linkings)
- identifiers created for use only within the library's own catalog (i.e. "system numbers")
Some methods I use for disambiguation are:
- namespacing by Linked Data rules, i.e. minting URIs from opaque identifiers. This works well for "system numbers" and for identifiers under "my control"
- namespacing by rules made by librarians (e.g. prefixing with ISIL)
- considering the import context (for finding the sender of the received data, by file or by API)
- checksumming (an identifier can be assigned to a certain identifier type when its checksum turns out to be correct, example: ZDB-ID)
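As a sketch of the checksumming idea: the ZDB-ID uses a mod-11 check digit scheme much like the ISSN does. The ISSN variant (which is publicly specified: weights 8 down to 2 over the first seven digits, check digit "X" for a remainder of 10) can be validated like this; the function name is mine:

```python
def validate_issn(issn: str) -> bool:
    """Return True if the string is a well-formed ISSN with a correct
    mod-11 check digit (weights 8..2; check digit 'X' means 10)."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return compact[7] == ("X" if check == 10 else str(check))
```

If the check succeeds, the string can be tagged with the recognized identifier type; if it fails, it stays an opaque string.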
In the search engine index, I store at least three forms of each identifier, side by side:
- the original form
- the canonical form
- the display form
The purpose of the original form is preservation/archival, so that future processing routines can be applied to the form that was once entered into or imported into the catalog.
The purpose of the canonical form is programmatic identifier equivalence, search/faceting, match/merge, and linking to other resources.
The purpose of the display form(s) is to show the user the expected, context-dependent output (e.g. displaying ISBN-13 or ISBN-10 or both, adding a prefix for the type, adding the ISIL as a prefix, showing URIs, etc.).
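Storing the three forms side by side can be sketched roughly like this (the class and field names are mine; the ISBN-10-to-ISBN-13 conversion shown for the canonical form is the standard EAN-13 procedure):

```python
from dataclasses import dataclass

def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to ISBN-13: prefix '978', drop the old check
    digit, append the EAN-13 check digit (weights alternate 1 and 3)."""
    digits = "978" + isbn10.replace("-", "")[:9]
    check = (10 - sum(int(d) * (1 if i % 2 == 0 else 3)
                      for i, d in enumerate(digits)) % 10) % 10
    return digits + str(check)

@dataclass(frozen=True)
class IndexedIdentifier:
    original: str   # preserved exactly as entered or imported
    canonical: str  # normalized form for equivalence, match/merge, linking
    display: str    # context-dependent presentation for the user

isbn = "0-306-40615-2"
record = IndexedIdentifier(
    original=isbn,
    canonical=isbn10_to_isbn13(isbn),
    display=f"ISBN {isbn10_to_isbn13(isbn)}",
)
```

All three fields go into the search engine index together, so no form ever has to be reconstructed from another at query time.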
When search terms are entered, I analyze them to check whether identifier types can be recognized automatically (e.g. ISBN, ISSN). Internally, a recognized identifier phrase is converted to the canonical form ("normalizing the input"). This may also use contextual information from the environment (e.g. the ISIL of the library whose user is searching). As a result, exactly one hit is expected in the search result (if there are more hits, something is surely wrong with the indexed catalog data).
URIs created from the canonical form are the preferred format for creating HTML links in the display of a Linked Open Data environment, so I can link to other web resources, e.g. linking to the Zeitschriftendatenbank (ZDB) site ld.zdb-services.de when I deal with a ZDB-ID. I could even copy the resources from ZDB to a local site and redirect my ZDB-ID links to a local mirror.
I do everything to avoid URNs, because that would mean I would have to implement resolving systems. I do not think resolving systems can be the basis of anything reliable; historically, they have failed (there are reasons why, e.g., Google never needed a URN resolver).
I also avoid persistent identifiers, as they are a constant source of trouble: e.g. the http://purl.org namespace is unreliable (the OCLC server may be down), or DOIs, where publisher servers go down, etc. This is not a matter of practice; it is built into the design.
To me, Linked Data/Semantic Web technology supersedes URN/persistent-identifier concepts. It is finally a distributed and stable technology: no central authority, no bottlenecks, no more trouble with server reliability.