On Fri, Sep 20, 2002 at 07:33:01AM -0400, LeVan,Ralph wrote:
> Okay, let's start all over.
>
> When I extract keywords from records to build indexes, I do it two ways. I
> either take the entire contents of a field and use that as an index term, or
> I take the individual words from the field, remember their relative
> positions and use them as index terms. I have always called the first type
> of indexing "phrase" indexes and the second kind of indexing "word" indexes.
We support both of the above of course, but we also support another slight
variation - which is for your 'phrase' index to normalize the text based
on knowledge that it is a series of words. For example, we might strip
all the punctuation, map to upper case, compress multiple spaces, trim
leading and trailing spaces.
This index allows very efficient 'first word in field' type operators.
It's faster than just using word positions for queries where you have
several words at the front in the query. We also have done things
occasionally like remove leading 'A' and 'THE' etc.
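
To make the normalized 'phrase' index concrete, here is a rough Python
sketch of the idea. It is only an illustration, not our actual code: the
function names, the article list and the sorted-list-plus-bisect lookup are
all assumptions made for the example, and deleting punctuation outright
(rather than replacing it with a space) is just one possible choice.

```python
import bisect
import re

# Hypothetical list of leading articles to drop; a real system would
# likely make this configurable per language.
LEADING_ARTICLES = ("A ", "AN ", "THE ")


def normalize_phrase(value):
    """Normalize a whole field value for a 'phrase' style index."""
    text = re.sub(r"[^\w\s]", "", value)  # strip punctuation
    text = text.upper()                   # map to upper case
    text = " ".join(text.split())         # compress spaces, trim both ends
    for article in LEADING_ARTICLES:      # optionally drop a leading article
        if text.startswith(article):
            text = text[len(article):]
            break
    return text


def build_phrase_index(values):
    """Return a sorted list of (normalized key, original value) pairs."""
    return sorted((normalize_phrase(v), v) for v in values)


def starts_with_words(index, query):
    """'First words in field' search: match keys that begin with the query words."""
    prefix = normalize_phrase(query)
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, prefix)  # jump to the first candidate
    hits = []
    for key, original in index[start:]:
        if key == prefix or key.startswith(prefix + " "):
            hits.append(original)             # match ends on a word boundary
        elif not key.startswith(prefix):
            break                             # left the run of candidates
    return hits


index = build_phrase_index([
    "The  Hobbit, or There and Back Again",
    "War and Peace!",
    "Ward No. 6",
    "A War of Words",
])
print(starts_with_words(index, "war"))
```

Because the keys are sorted, every field value starting with the query words
sits in one contiguous run, so a query with several leading words costs one
binary search plus a short scan rather than a join over word positions.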
My point is there is a bit of risk in assuming everyone is going to implement
things the same way. Z39.50 has this abstraction layer allowing
implementations to do what they want, as long as the semantics are
the same.
Oh, and what you call phrase/string above I think the Bath profile calls
'complete field'. That is, they effectively recommend the 'completeness'
attribute type to distinguish between the cases. (Actually, they
don't define this level of semantics - they just say "this combination
means this, that combination means that" - they don't try to attach
semantics to individual attribute types & values, they only define
semantics for complete combinations.) My analysis of the attribute
lists was that completeness was the best way, according to Bath, to
distinguish between word indexing and indexing the whole value.
Having typed up all of the above, I am not sure of my point exactly.
Probably more of a clarification. Or maybe pointing out that for what you
call phrase/string, Bath (as I understand it) recommends using completeness.
If nothing else, it highlights the confusion. I think SRW should stand
on its own two feet rather than using Z39.50/Bib-1 terminology.
I agree with your two types of indexes though (but point out that there
are other variations possible).
Alan