UNICODE-MARC Archives

Subject: Searching and Unicode
From: Joan Aliprand
Reply-To: UNICODE-MARC Discussion List
Date: Tue, 9 Aug 2005 16:21:18 -0400
Content-Type: text/plain

On 8 Aug 2005 in the thread "Topic 1, Representing Extended Unicode in
MARC-8," Geoff Mottram asked:
> Even in Unicode aware databases, how do you sort the "section symbol" or
> search for it?

As I said about the Unicode Collation Algorithm in my presentation "True
Scripts in Library Catalogs" (ALA Annual, 2004):
Consistent ordering of data is necessary not only for the presentation of
results in a meaningful order – more significantly, it is an essential part
of query matching that allows records to be retrieved.

What we have been calling a "MARC-8 system" has a specific character
repertoire limited by the platform software (for example, Latin-1 (ISO/IEC
8859-1); Windows Code Page 1252; Macintosh Roman) or by the manufacturer
(ASCII + ANSEL only).

I think it is reasonable to assume that any "MARC-8 system" that handles any
non-Roman script(s) has either been converted to Unicode or is being
converted. If there are systems that are limited to ASCII, this is a
constraint imposed by the manufacturer and has always had to be dealt with,
since 8-bit Latin was introduced long ago for MARC records.

In a MARC-8 system, characters that are significant for indexing can only be
the characters that are legal for the system itself. This excludes
extraneous characters in queries (a) from other systems via Z39.50, and (b)
generated by a device not sanctioned for use with the system, e.g., a
keyboard for an unsupported script.

So the "section symbol" is simply ignored for indexing and searching in such
systems.
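
A minimal sketch of that behavior in Python, assuming a hypothetical
system whose legal repertoire is ASCII letters, digits, and the space. The
set LEGAL and the function index_key are illustrations only, not the
repertoire or indexing rules of any real MARC-8 system:

```python
import string

# Hypothetical repertoire for illustration: ASCII letters, digits, and space.
# A real system would derive this from ASCII + ANSEL or its platform code page.
LEGAL = set(string.ascii_letters + string.digits + " ")

def index_key(text: str) -> str:
    """Drop characters outside the legal repertoire, then tidy spacing and case."""
    kept = "".join(ch for ch in text if ch in LEGAL)
    # Collapse the gaps left by dropped characters and lowercase the result.
    return " ".join(kept.split()).lower()

print(index_key("\u00a7 12 of the Act"))  # the section symbol U+00A7 is dropped
```

Because the same function is applied to both the stored heading and the
incoming query, a query containing the section symbol retrieves the same
records as one without it.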

What do we do in a Unicode-aware system? We have conventions for MARC-8
characters that we might want to replicate in a Unicode-aware system. For
example, in an English language environment, we might want to treat the
Polish L as a regular L for searching and retrieval (but we wouldn't want to
do that in Poland).
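
One way to sketch this locale-dependent treatment in Python is an
explicit folding table per environment. The table FOLD, the locale tags
"en" and "pl", and the function search_key are assumptions made for
illustration. Note that the Polish L (U+0141, U+0142) has no canonical
decomposition in Unicode, so normalization alone will not reduce it to a
plain L; the mapping has to be stated explicitly:

```python
# Hypothetical per-environment folding tables (illustration only).
FOLD = {
    # English-language environment: treat Polish L as a regular L.
    "en": str.maketrans({"\u0141": "L", "\u0142": "l"}),
    # Polish environment: the letter is distinct, so nothing is folded.
    "pl": {},
}

def search_key(text: str, locale: str) -> str:
    """Apply the environment's folding table, then ignore case."""
    return text.translate(FOLD.get(locale, {})).casefold()

print(search_key("\u0141ukasz", "en") == search_key("Lukasz", "en"))
print(search_key("\u0141ukasz", "pl") == search_key("Lukasz", "pl"))
```

Under these assumptions the two spellings match in the English
environment and stay distinct in the Polish one.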

What we need to ask ourselves are questions like: How precise does a match
have to be? Are we going to provide alternative matching options at
different levels of specificity? Unicode opens up lots of possibilities.
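
As a rough illustration of matching at different levels of specificity,
the Python sketch below builds a comparison key at one of three levels.
The level numbers and the functions match_key and matches are
hypothetical. The levels are only loosely modelled on the idea of
comparison strengths and this is not an implementation of the Unicode
Collation Algorithm:

```python
import unicodedata

def match_key(text: str, level: int) -> str:
    """Build a comparison key.

    level 3: exact match after NFC normalization
    level 2: ignore case
    level 1: ignore case and combining marks (accents)
    """
    if level >= 3:
        return unicodedata.normalize("NFC", text)
    # Fold case, then decompose so accents become separate combining marks.
    folded = unicodedata.normalize("NFD", text.casefold())
    if level == 2:
        return unicodedata.normalize("NFC", folded)
    # Level 1: drop every combining mark.
    stripped = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def matches(query: str, heading: str, level: int) -> bool:
    """Compare a query and a stored heading at the chosen level."""
    return match_key(query, level) == match_key(heading, level)

for level in (1, 2, 3):
    print(level, matches("resume", "R\u00e9sum\u00e9", level))
```

A system could offer the loosest level by default and let the searcher
ask for a stricter one when too many records come back.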

I hope that readers more knowledgeable about searching than I am will
contribute to this thread.

-- Joan Aliprand
