Dear group,
I have a question about whether certain Unicode characters--in my case,
certain Hebrew ones--are represented by double bytes in UTF-8, and if so,
whether this would explain the following situation:
While testing the Unicode release of Endeavor Voyager, a member of my team,
Jerry Anne Dickel, found that Hebrew-script titles beginning with definite
(and for Yiddish, also indefinite) articles no longer indexed properly.
The titles were failing to show up in browse displays. Jerry Anne was able
to implicate the second indicator of the 245 field (= number of non-filing
characters) in this: Ordinarily, with the Hebrew article "ha" [a
one-character prefix], the second indicator of the 245 would be 1, but it was
only when Jerry Anne changed it to a 2 that the title once again indexed
correctly. The same thing happened with the Yiddish definite article "der"
(3 letters plus a space as in the Latin script), where the numeral 4
(representing the three letters plus space) would normally be used in the
second indicator; in Voyager Unicode, however, the title would only index
if the 4 were replaced by a 7 (i.e., counting each of the three Hebrew
characters twice (3 x 2 = 6) and the space once).
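To check the arithmetic behind my guess, here is a minimal Python sketch. It
assumes the article is stored as precomposed Hebrew letters with no vowel
points, and that the system is counting UTF-8 bytes rather than characters
(that is only my hypothesis, not something I know about Voyager). The function
name is made up for illustration.

```python
def utf8_byte_count(prefix: str) -> int:
    """Return the number of bytes the prefix occupies when encoded as UTF-8."""
    return len(prefix.encode("utf-8"))


# Hebrew article "ha": the single letter HE (U+05D4).
ha = "\u05d4"

# Yiddish article "der": DALET (U+05D3), AYIN (U+05E2), RESH (U+05E8),
# followed by an ordinary ASCII space.
der = "\u05d3\u05e2\u05e8 "

# Character count (the usual indicator value) vs. UTF-8 byte count.
print(len(ha), utf8_byte_count(ha))    # 1 2
print(len(der), utf8_byte_count(der))  # 4 7
```

The byte counts (2 and 7) match the indicator values that made the titles
index again, which is at least consistent with the non-filing count being
applied to bytes instead of characters.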
We replicated the problem in LC's Unicode-compliant Voyager and in OCLC
WorldCat.
Interestingly, there did not seem to be a problem in RLIN21.
Did RLG anticipate (what I'm assuming is) the doubled bytes and apply a
fix? Alternatively, do you think something other than the byte count
might be causing the problem?
Thank you very much for your help.
Daniel
------------------------------------
Daniel Lovins
Hebraica Team Leader
Catalog Department
Sterling Memorial Library
Yale University
PO Box 208240
New Haven, CT 06520
tel: 203/432-1707
fax: 203/432-7231