On 8 Aug 2005 in the thread "Topic 1, Representing Extended Unicode in
MARC-8," Geoff Mottram asked:
> Even in Unicode aware databases, how do you sort the "section symbol" or
> search for it?
As I said of the Unicode Collation Algorithm in my presentation "True
Scripts in Library Catalogs" (ALA Annual, 2004):
Consistent ordering of data is necessary not only for the presentation of
results in a meaningful order – more significantly, it is an essential part
of query matching that allows records to be retrieved.
What we have been calling a "MARC-8 system" has a specific character
repertoire limited by the platform software (for example, Latin-1 (ISO/IEC
8859-1); Windows Code Page 1252; Macintosh Roman) or by the manufacturer
(ASCII + ANSEL only).
I think it is reasonable to assume that any "MARC-8 system" that handles any
non-Roman script(s) has either been converted to Unicode or is being
converted. If there are systems that are limited to ASCII, this is a
constraint imposed by the manufacturer and has always had to be dealt with,
since 8-bit Latin was introduced long ago for MARC records.
In a MARC-8 system, characters that are significant for indexing can only be
the characters that are legal for the system itself. This excludes
extraneous characters in queries (a) from other systems via Z39.50, and (b)
generated by a device not sanctioned for use with the system, e.g., a
keyboard for an unsupported script.
So the "section symbol" is simply ignored for indexing and searching in such
systems.
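As a rough illustration of "ignored for indexing and searching," here is a
minimal Python sketch. The INDEXABLE set and the index_key function are
hypothetical names invented for this example; a real system would use its
own repertoire table (for example ASCII + ANSEL) and its own normalization
rules.

```python
import string

# Hypothetical set of index-significant characters. This is an assumption
# for illustration only, not the repertoire of any actual system.
INDEXABLE = set(string.ascii_letters + string.digits + " ")


def index_key(text: str) -> str:
    """Drop characters outside the indexable set, then fold case and spacing."""
    kept = "".join(ch for ch in text if ch in INDEXABLE)
    # Collapse the runs of spaces left behind by dropped characters.
    return " ".join(kept.lower().split())


# The section symbol (U+00A7) is outside the set, so a heading that
# contains it and a query that omits it produce the same key.
heading_key = index_key("Code \u00a7 12")
query_key = index_key("code 12")
```

Because the same function is applied to both the indexed heading and the
incoming query, an extraneous character arriving via Z39.50 or from an
unsupported keyboard simply drops out of the comparison.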
What do we do in a Unicode aware system? We have conventions for MARC-8
characters that we might want to replicate in a Unicode aware system. For
example, in an English language environment, we might want to treat the
Polish L as a regular L for searching and retrieval (but we wouldn't want to
do that in Poland).
What we need to ask ourselves are questions like: How precise does a match
have to be? Are we going to provide alternative matching options at
different levels of specificity? Unicode opens up lots of possibilities.
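To make the idea of matching at different levels of specificity concrete,
here is a small Python sketch. It is a simplified stand-in, not the Unicode
Collation Algorithm itself; the level names, the match_key function, and the
ENGLISH_FOLDS table are all assumptions made up for this example. The Polish
L needs an explicit table entry because it has no canonical decomposition,
so stripping combining marks alone does not reduce it to a plain L.

```python
import unicodedata

# Hypothetical extra folds for an English-language catalog. A catalog in
# Poland would pass no table, so the Polish L stays distinct from L.
ENGLISH_FOLDS = str.maketrans({"\u0141": "L", "\u0142": "l"})


def match_key(text, level, extra_folds=None):
    """Build a comparison key at one of three illustrative levels."""
    if level == "exact":
        # Only unify canonically equivalent spellings.
        return unicodedata.normalize("NFC", text)
    if level == "case":
        # Additionally ignore case differences.
        return unicodedata.normalize("NFC", text).casefold()
    if level == "base":
        # Decompose, drop combining marks, apply locale folds, ignore case.
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
        if extra_folds:
            stripped = stripped.translate(extra_folds)
        return stripped.casefold()
    raise ValueError(f"unknown level: {level!r}")


name = "\u0141\u00f3d\u017a"  # the city name written with Polish letters
english_match = match_key(name, "base", ENGLISH_FOLDS) == match_key("Lodz", "base", ENGLISH_FOLDS)
polish_match = match_key(name, "base") == match_key("Lodz", "base")
```

With the English table the two spellings match at the "base" level; without
it they do not, which is the behavior one would want in Poland. A production
system would more likely rely on a library that implements the Unicode
Collation Algorithm with locale tailoring than on hand-built tables like
this one.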
I hope that readers more knowledgeable about searching than I am will
contribute to this thread.
-- Joan Aliprand