Toshihiro Takasu wrote:
> Please take a little time to read through my question and hopefully to give
> me
> an answer.
>
> I have a technical question about Z39.50 and SRW/U.
Please forgive me for my first comment: this has nothing to do with
Z39.50 or SRU. Still, even if it is off-topic, it is an interesting
question, and I'd like to offer a few comments on it.
> The question is specifically about how you handle the search requested by
> an user.
>
> I am from Japan and we have several styles of character(kanji, hiragana,
> katakana) in Japanese language
> as you might have known. Therefore when we use Z39.50 or whatever to build
> a search system,
> it comes to a critical issue that a book could be stored with a title
> consisted of different styles of character in
> each Z39.50 database. Once it happened, the search system reads the title
> of the same book from each database,
> and recognize them as different books because their titles don't match.
>
> I assume that the same problem could arise in English as well.
> The search system has to handle the capital letters, lower case letters,
> space, /, -, commas, periods, so on so forth,
This is the question of character set normalization.
Some thoughts: as far as I remember (please correct me if I am wrong),
hiragana and katakana represent essentially the same information, or to
put it differently, they are two representations of the same character
set (structurally speaking).
So it might be an idea to make a one-to-one map from hiragana to
katakana and index them the same way, that is, to treat each
hiragana/katakana pair as one equivalence class. This is character
normalization, which every indexing system does.
My knowledge of the Japanese use of kanji is so limited that I will
refrain from offering ideas there.
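To make the one-to-one map concrete, here is a minimal sketch in Python. It is not how our indexer is implemented; the function name normalize_kana is made up for illustration. It relies on the fact that in Unicode the hiragana block U+3041..U+3096 sits exactly 0x60 below the matching katakana, and it additionally assumes that NFKC folding (half-width katakana, full-width Latin letters) is wanted before the mapping.

```python
import unicodedata

# Hiragana U+3041..U+3096 are exactly 0x60 below their katakana counterparts.
_HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}
# The hiragana iteration marks U+309D/U+309E map to U+30FD/U+30FE the same way.
_HIRA_TO_KATA.update({0x309D: 0x30FD, 0x309E: 0x30FE})


def normalize_kana(text: str) -> str:
    """Return an index form in which hiragana and katakana compare equal.

    Hypothetical helper for illustration only. Kanji are left untouched.
    """
    # NFKC folds compatibility variants (half-width kana, full-width Latin).
    text = unicodedata.normalize("NFKC", text)
    # Map every hiragana code point to its katakana counterpart,
    # then fold case for any Latin letters in the string.
    return text.translate(_HIRA_TO_KATA).casefold()


if __name__ == "__main__":
    for sample in ("ひらがな", "ヒラガナ", "ﾋﾗｶﾞﾅ"):
        print(sample, "->", normalize_kana(sample))
```

Both the search term and the indexed title would have to pass through the same function, so that the comparison happens on the normalized form only.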
> and a title of a book could be represented in different way. (Title
> metadata could be slightly different depending on the database, right?)
>
This is the problem of semantic normalization.
The usual approach is to figure out what characterizes
equivalence classes of book titles.
This is in fact the hard part.
There are attempts to solve this problem in the Western world; the most
notable I can think of is the FRBR set of ideas.
http://www.ifla.org/VII/s13/frbr/frbr.htm
This might or might not be useful for Asian books.
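As a rough illustration of what an equivalence class of titles could look like in practice, here is a small Python sketch. It is not FRBR and not what any particular system does: it only builds a crude match key by dropping case, spacing and punctuation, and groups records that share the key. The names title_key and group_by_title, and the (source, title) record layout, are assumptions made for this example.

```python
import unicodedata
from collections import defaultdict


def title_key(title: str) -> str:
    """Build a crude match key: fold case and width, keep letters and digits.

    Hypothetical helper; real title matching needs far more than this.
    """
    folded = unicodedata.normalize("NFKC", title).casefold()
    # Unicode general categories starting with L (letter) or N (number).
    return "".join(ch for ch in folded if unicodedata.category(ch)[0] in "LN")


def group_by_title(records):
    """Group (source, title) pairs whose titles share the same match key."""
    groups = defaultdict(list)
    for source, title in records:
        groups[title_key(title)].append((source, title))
    return dict(groups)


if __name__ == "__main__":
    records = [
        ("db1", "The Tale of Genji"),
        ("db2", "THE TALE OF GENJI /"),
        ("db3", "the tale-of-genji."),
        ("db4", "Kokoro"),
    ]
    for key, members in group_by_title(records).items():
        print(key, members)
```

A key like this only catches surface variation. Deciding that two differently worded titles denote the same work is the semantic part, and that needs more than string folding.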
> The thing is that a search system with ability to recognize a same books as
> a same book even though the
> titles don't perfectly match, is ideal system because it can output exactly
> one result per a book.
>
> So how does your system handle this problem?
>
In our indexer we handle only the character normalization part of the
problem.
Cheers, Marc Cromme, Index Data, http://www.indexdata.com
> I apologize that I wrote more than enough amount, but if you could have
> some time to
> answer my question, that will be greatly appreciated.
>
> best regards,
>
> Toshihiro Takasu
--
Marc Cromme, cand. polyt, Ph.D
Senior Developer, Project Manager
Index Data Aps
Købmagergade 43, 2
1150 Copenhagen K.
Denmark
tel: +45 3341 0100
fax: +45 3341 0101
http://www.indexdata.com
INDEX DATA Means Business
for Open Source and Open Standards