In the implementations I have been involved with, punctuation is converted
to either a space or null for indexing purposes. The parser still has the
problem of determining whether the punctuation is part of the instruction or
data. If it is certain that it is data, then it converts it to null or blank
with the same rules as it uses for indexing. If it thinks that it is an
instruction, then it converts it to a null and acts on the instruction.
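As a rough illustration of the conversion described above, here is a minimal Python sketch. The function name `normalize_punctuation` and its `mode` parameter are hypothetical, and treating every character in `string.punctuation` as punctuation is an assumption; a real system would apply whatever rules its indexer uses.

```python
import string

# Assumption: "punctuation" means ASCII punctuation; real indexers define their own set.
PUNCTUATION = set(string.punctuation)


def normalize_punctuation(text, mode="space"):
    """Convert each punctuation character to a space or to nothing (null)."""
    if mode not in ("space", "null"):
        raise ValueError("mode must be 'space' or 'null'")
    replacement = " " if mode == "space" else ""
    converted = "".join(replacement if ch in PUNCTUATION else ch for ch in text)
    # Collapse the runs of whitespace that space conversion can leave behind.
    return " ".join(converted.split())
```

Under these assumptions, `normalize_punctuation("O'Brien, T.", "space")` gives `O Brien T`, while the `"null"` mode gives `OBrien T`. The two choices produce different terms, which is why the query side has to follow the same rule as the indexing side.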
From: Alan Kent [mailto:[log in to unmask]]
Sent: Monday, 20 May 2002 02:34
To: [log in to unmask]
Subject: Re: cql index definitions
On Fri, May 17, 2002 at 02:11:30PM +0200, Janifer Gatenby wrote:
> Tasmania? - right truncation
> Tasmania? tiger? - right truncation
> Tasmania? tiger - truncation 104
> Tasmanian tiger? - truncation 104
Overall, I agree with the basic semantics brought up in this thread.
But we hit a few issues in the past that I thought I would raise.
It's only important *if* we decide that we should (as a part of the
CQL spec) mandate how CQL should be turned into RPN. If we leave it
up to an implementor, then there is nothing to be done (my preference).
In our CCL->RPN parser we certainly try to be smart about the above
sorts of things. But it can be tricky to get right. For example,
we allow TITLE to be indexed as words and as a single complete string
(that is, the full title as one term). To do Bath first-in-field
right truncation character based, we use the TITLE as a single term
and do right truncation on it. Piece of cake. When searching the TITLE
as words, you can right truncate each word.
Now this separation of 'terms' from 'words' may be our implementation
choice, and not part of Z39.50 as such. But to us, a term in an index
can be any sequence of bytes. The database designer chooses how to
turn content into a set of terms to be indexed.
The problem is: how does the CQL parser know whether what it sees
is a single term or multiple terms? (Our system allows spaces
in terms.) How can the CQL parser safely break the input text into terms
in the same way as the database engine?
You might say "well, if the attribute indicates 'words' then
use spaces to separate terms", but I don't think this is good enough
because different systems treat punctuation differently. Is '-' part
of a word, or a word separator? How about '.'? How about '/'? etc.
I think it's *valid* for different systems to have different word parsing
rules. As such, how can a CQL parser work out how many terms are in
This is important in order to get all the truncation examples above
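To make the word-parsing point concrete, here is a small Python sketch of two equally plausible word-breaking rules. Both function names and the sample query are made up for illustration; neither is meant to represent any particular system.

```python
import re


def split_on_spaces(text):
    """Treat only whitespace as a word separator."""
    return text.split()


def split_on_punctuation(text):
    """Treat whitespace and '-', '.', '/' as word separators."""
    return [word for word in re.split(r"[\s\-./]+", text) if word]


# Hypothetical query text containing all three punctuation characters.
query = "multi-user TCP/IP v2.0"
```

The first rule sees three terms in `query` (`multi-user`, `TCP/IP`, `v2.0`); the second sees six (`multi`, `user`, `TCP`, `IP`, `v2`, `0`). A CQL parser that guesses one rule while the database engine uses the other will generate the wrong number of terms.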
I raise all this just to point out sometimes life is not as easy as
you expect. We could just ignore the problem, but this problem is
one of the reasons that type 104 truncation is so important to us.
We try our best to use right truncation etc, but at times it's just
too hard (unless you assume index terms are always separated by
spaces and nothing else).
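The following Python sketch shows one way such a fallback could look. It is not our parser: the function name `choose_truncation` and the `single_term` flag are hypothetical, the rule is inferred from the four quoted examples, it assumes '?' is the only masking character, and the constants are the Bib-1 truncation attribute values as I understand them.

```python
RIGHT_TRUNCATION = 1    # assumed Bib-1 value: right truncation
DO_NOT_TRUNCATE = 100   # assumed Bib-1 value: do not truncate
Z3958_TRUNCATION = 104  # assumed Bib-1 value: Z39.58-style masking inside the term


def choose_truncation(text, single_term=False):
    """Pick a truncation attribute for query text that may contain '?'."""
    # Assumption: words are separated by spaces only, the very thing in doubt.
    terms = [text] if single_term else text.split()
    if "?" not in text:
        return DO_NOT_TRUNCATE
    # Right truncation only fits if every term ends in exactly one '?'.
    if all(len(t) > 1 and t.endswith("?") and t.count("?") == 1 for t in terms):
        return RIGHT_TRUNCATION
    # Anything else is too hard to express, so fall back to type 104.
    return Z3958_TRUNCATION
```

With space-separated words, `Tasmania?` and `Tasmania? tiger?` come out as right truncation, and `Tasmania? tiger` and `Tasmanian tiger?` come out as 104, matching the quoted list. Calling `choose_truncation("Tasmanian tiger?", single_term=True)` returns right truncation instead, so the answer depends on knowing how the index was built.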
ps: I am not proposing anything in this mail - just raising some issues.
I guess it depends on how fine grained we want to get when describing
how to turn CQL into RPN.
Alan Kent (mailto:[log in to unmask], http://www.mds.rmit.edu.au/~ajk/)
Project: TeraText Technical Director, InQuirion Pty Ltd (www.inquirion.com)
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC
Phone: +61 3 9925 4114 Reception: +61 3 9925 4099 Fax: +61 3 9925 4098