Jim,
How is the problem you identify here for Cyrillic Russian different from
that for Romanized Russian? Or for other highly inflected
languages? We've been using wild cards and other stratagems for a long
time to cope with such cases. This is not to say that the sort of things
Sherman reports wouldn't be better.
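To make the wild card approach concrete, here is a rough sketch in Python. It is my own illustration, not anything taken from the article or from a particular library system: the function name truncation_search and the tiny index list are hypothetical, and Python's standard fnmatch module stands in for whatever truncation syntax a real catalog offers.

```python
from fnmatch import fnmatchcase

def truncation_search(pattern, index_terms):
    """Return the index terms that match a shell-style wildcard pattern."""
    return [term for term in index_terms if fnmatchcase(term, pattern)]

# Hypothetical keyword index: some forms of "koshka" (cat), plus
# "koshelyok" (purse) and "sobaka" (dog).
index_terms = ["кошка", "кошки", "кошку", "кошек", "кошелёк", "собака"]

# Truncating after the full stem misses the genitive plural "кошек".
print(truncation_search("кошк*", index_terms))

# Truncating one letter earlier picks it up, but also retrieves "кошелёк".
print(truncation_search("кош*", index_terms))
```

The same trade-off between missed forms and false hits shows up whether the index holds Cyrillic or Romanized strings.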
An interesting variation on the inflection problem occurs in languages like
Arabic and Hebrew when not only can the end of a word vary grammatically,
but, in the original scripts, particles (prepositions, the definite
article) can be attached at the beginning of the word.
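A small Python sketch may help show why right-hand truncation alone does not cope with this. It is only an illustration under simplifying assumptions: the list of Hebrew one-letter prefixes is deliberately crude, the names candidate_forms and lookup are hypothetical, and real morphological analysis is considerably more involved.

```python
# Assumed, simplified set of Hebrew one-letter prefix particles
# (roughly: and, the, in, to, from, like, that).
PREFIX_LETTERS = set("והבלמכש")

def candidate_forms(word, min_len=2):
    """Return the word plus the forms left after peeling leading prefix letters."""
    forms = [word]
    while len(word) > min_len and word[0] in PREFIX_LETTERS:
        word = word[1:]
        forms.append(word)
    return forms

def lookup(word, index):
    """Return the candidate forms of the word that occur in the index."""
    return [form for form in candidate_forms(word) if form in index]

# Hypothetical index holding "bayit" (house) and "sefer" (book).
index = {"בית", "ספר"}

print(lookup("הבית", index))     # "the house"
print(lookup("ובבית", index))    # "and in the house"
print(candidate_forms("הבית"))   # also yields a spurious two-letter form
```

Because the first letter of the word for "house" is itself one of the prefix letters, blind stripping produces a spurious candidate, which is why the candidates are checked against the index rather than trusted outright.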
Charles
At 11:38 AM 8/30/2006 -0400, you wrote:
>
>Wednesday, August 30, 2006
> The Cyrillic script has long been part of the MARC character
> repertoire; RLG automated it long ago. Which MARC software vendors have
> also done so I do not know. Vendors who plan to do so and libraries who
> would like a vendor to do so may find the following article of interest:
>Yang, Haiyang: "Complexities in Russian information retrieval". It is in
>the September 2006 issue of Multilingual (formerly Multilingual
>computing), #82, v. 17, no. 6, on pages 44-46, 48-49. Among other topics it
>briefly describes the problems posed for effective keyword indexing by
>Russian being highly inflected. Nouns have: 1. three genders: masculine,
>feminine or neuter; 2. two numbers: singular or plural; and 3. six cases:
>nominative, accusative, genitive, prepositional, dative and instrumental.
>For the Russian word for cat it says there are 21 possibilities.
>Apparently Russian cats aren't neutral and a few cases have the same form.
> I do not know if Google indexes Cyrillic script, nor, if it does,
> how it does so. This might be of interest to libraries participating in
> Google's library project.
> Regards,
> Jim Agenbroad ( [log in to unmask] )
-- Charles Husbands
-- Harvard University Library Office for Information Systems
-- 90 Mount Auburn Street, Cambridge, MA 02138
-- 617-495-3724 fax: 617-496-5600