Yep, hard problems. We all live with them. I've said for a long time now
that each advance in technology that we make just clarifies the real
problems.
All I can say is that my experience leads me to believe that the users
recognize that the terms they enter can be problematic and seem to cope with
it. I wish content providers were as good at trying to avoid problematic
content.
Ralph
-----Original Message-----
From: Alan Kent [mailto:[log in to unmask]]
Sent: Sunday, May 19, 2002 8:34 PM
To: [log in to unmask]
Subject: Re: cql index definitions
On Fri, May 17, 2002 at 02:11:30PM +0200, Janifer Gatenby wrote:
> Tasmania? - right truncation
> Tasmania? tiger? - right truncation
> Tasmania? tiger - truncation 104
> Tasmanian tiger? - truncation 104
Overall, I agree with the basic semantics brought up in this thread.
But we hit a few issues in the past that I thought I would raise.
It's only important *if* we decide that we should (as a part of the
CQL spec) mandate how CQL should be turned into RPN. If we leave it
up to an implementor, then there is nothing to be done (my preference).
In our CCL->RPN parser we certainly try to be smart about the above
sorts of things. But it can be tricky to get right. For example,
we allow TITLE to be indexed as words and as a single complete string
(that is, the full title as one term). To do Bath first-in-field,
character-based right truncation, we use the TITLE as a single term
and do right truncation on it. Piece of cake. When searching the TITLE
as words, you can right truncate each word.
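
To make the two views of TITLE concrete, here is a minimal sketch in Python.
It is not our parser or our engine; the function names are made up for
illustration, and splitting words on whitespace is an assumption that real
systems will not all share.

```python
def string_terms(title):
    # Whole field as a single term (spaces are kept inside the term).
    return [title.lower()]


def word_terms(title):
    # One term per whitespace-separated word (an assumption; systems vary).
    return title.lower().split()


def right_truncation_match(terms, prefix):
    # Right truncation: a term matches when it starts with the given prefix.
    prefix = prefix.lower()
    return any(t.startswith(prefix) for t in terms)


title = "Tasmanian tiger sightings"

# First-in-field: the prefix is matched against the whole title as one term.
first_in_field = right_truncation_match(string_terms(title), "Tasmanian ti")

# Word index: each word can be right truncated on its own.
per_word = right_truncation_match(word_terms(title), "tig")

# A prefix containing a space cannot match any single word term.
spans_words = right_truncation_match(word_terms(title), "Tasmanian ti")
```

The same prefix ("Tasmanian ti") succeeds against the single-term index and
fails against the word index, which is why the parser has to know which kind
of index it is targeting.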
Now this separation of 'terms' from 'words' may be our implementation
choice, and not part of Z39.50 as such. But to us, a term in an index
can be any sequence of bytes. The database designer chooses how to
turn content into a set of terms to be indexed.
The problem is how does the CQL parser know if it sees
Tasmanian Tiger
whether it's a single term or multiple terms? (Our system allows spaces
in terms.) How can the CQL parser safely break the input text into terms
in the same way as the database engine?
You might say "well, if the attribute indicates 'words' then
use spaces to separate terms", but I don't think this is good enough
because different systems treat punctuation differently. Is '-' part
of a word, or a word separator? How about '.'? How about '/'? etc.
I think it's *valid* for different systems to have different word parsing
rules. As such, how can a CQL parser work out how many terms are in
the string
a/b?c.d#e!f%g^h(i)j
This is important in order to get all the truncation examples above
correct.
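
As a rough illustration of the point, the Python sketch below counts the
terms in that string under three word-parsing rules. The separator sets are
invented for the example and do not describe any particular system; the
helper name count_terms is likewise hypothetical.

```python
import re


def count_terms(text, separators):
    # Split on any run of the given separator characters; drop empty pieces.
    pattern = "[" + re.escape(separators) + "]+"
    return len([t for t in re.split(pattern, text) if t])


query = "a/b?c.d#e!f%g^h(i)j"

# Rule 1 (assumed): only spaces separate words.
spaces_only = count_terms(query, " ")

# Rule 2 (assumed): '/' and '.' separate words, other punctuation does not.
slash_and_dot = count_terms(query, "/.")

# Rule 3 (assumed): every punctuation character in the string is a separator.
all_punct = count_terms(query, "/?.#!%^()")
```

Under these assumed rules the same input is 1 term, 3 terms, or 10 terms. A
CQL parser that guesses the wrong rule will attach truncation to the wrong
term boundaries, and the '?' in the input makes it worse, since it could be
either a separator or a truncation character.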
I raise all this just to point out sometimes life is not as easy as
you expect. We could just ignore the problem, but this problem is
one of the reasons that type 104 truncation is so important to us.
We try our best to use right truncation etc, but at times it's just
too hard (unless you assume index terms are always separated by
spaces and nothing else).
Alan
ps: I am not proposing anything in this mail - just raising some issues.
I guess it depends on how fine grained we want to get when describing
how to turn CQL into RPN.
--
Alan Kent (mailto:[log in to unmask], http://www.mds.rmit.edu.au/~ajk/)
Project: TeraText Technical Director, InQuirion Pty Ltd (www.inquirion.com)
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC
Australia.
Phone: +61 3 9925 4114 Reception: +61 3 9925 4099 Fax: +61 3 9925 4098