> If this makes sense, then we might as well just do PQN and specify all the
> attributes.
> Why are you so eager to confuse them?
Because I don't want to end up in a situation where I can't predict the
structure of the contents of an index because of different naming
policies.
Hopefully everyone will support at least the Dublin Core index set, but if
they don't specify it (for whatever reason) then they're at least more
likely to call a title index 'foo.title'. However, when it comes to
naming an exact title index, it could be titleString, exactTitle, XTitle,
title, etc. The same goes for titleWord, titleKeyWord, titleWords and so forth.
This is why I'd like either one index, or a limited set of names to
distinguish structure to append to the name given to the index.
Even with Explain, there's currently no way of knowing what is in the
index without human intervention, even to the point of keyword vs string.
This means that if I did a search for index="word" and got zero hits, I
wouldn't know whether that was because there were no matches for the
keyword 'word' or because it was doing an exact match.
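
To make the ambiguity concrete, here is a small Python sketch. The sample
titles and the two helper functions (keyword_hits, exact_hits) are made up
for illustration and are not part of any spec or server; they just stand in
for a word index and a string index sitting behind the same index name.

```python
# Hypothetical sample data: the contents of a "title" index on some server.
titles = ["A Word About Indexes", "Words and Phrases"]

def keyword_hits(term, fields):
    """Word-index behaviour: the term must equal one whole word in the field."""
    term = term.lower()
    return [f for f in fields if term in f.lower().split()]

def exact_hits(term, fields):
    """String-index behaviour: the term must equal the entire field."""
    term = term.lower()
    return [f for f in fields if f.lower() == term]

# The same query against the same data gives different answers depending on
# which structure the server happens to have chosen for the index.
print(len(keyword_hits("word", titles)))  # hits if it is a word index
print(len(exact_hits("word", titles)))    # hits if it is a string index
```

A client that only sees the hit count from the second case has no way to
tell it apart from a word index that really contains no 'word'.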
I would be okay with a 'very strongly suggested' list of structure
identifiers, but it should be in the spec itself, not an accompanying
document.
Secondly, there is the desire not to have multiple ways of doing the same
thing, while still allowing the possibility of 'first words in field'.
I think that FWiW should not be proximity on a word index, as this would
create Very Long and unwieldy searches for a relatively easy concept, and
would require field anchoring in proximity.
So the only other option is a string based search.
This calls into question the rationale behind saying that the index is
string based. If we accept that we can use word masks on a string index,
then the concept of structure in indexes is already pretty half-hearted.
It should either be a string or a word index, but not both.
One solution, IMO, is to use my initial proposal of 'word boundary
character' rather than a word masking character. Thus | would stand for
one or more white space characters or the beginning or end of the field,
not zero or more words. This also clears up any confusion about the use
of *| -- this would mean zero or more characters, followed by at least one
of: beginning of field, end of field, white space character. Hence:
(.*?)(^|$| |\n|\t)+
(assuming that punctuation has already been stripped out of the field)
So to do a first words in field search in a string index would be:
title="keyword search|*"
This makes it still a string operation, not a word operation, and hence we
can use it without getting string and word structures all intertwined.
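
As a rough sketch of how a server might evaluate such a pattern against a
string index, the Python below translates the two mask characters into a
regular expression and matches it against the whole field. The function
names (mask_to_regex, matches_field) are hypothetical, and the translation
assumes what is described above: | is one or more white space characters or
the beginning or end of the field, * is zero or more characters, and
punctuation has already been stripped from the field.

```python
import re

def mask_to_regex(pattern):
    """Translate a masked string pattern into a compiled regular expression.

    Assumed semantics (from the proposal above, not from any spec):
      |  one or more white space characters, or beginning/end of field
      *  zero or more characters
    """
    parts = []
    for ch in pattern:
        if ch == "|":
            parts.append(r"(?:\s+|\A|\Z)")
        elif ch == "*":
            parts.append(r".*")
        else:
            parts.append(re.escape(ch))
    # DOTALL so that * also spans newlines inside a field.
    return re.compile("".join(parts), re.DOTALL)

def matches_field(pattern, field):
    """String-index match: the pattern must account for the entire field."""
    return mask_to_regex(pattern).fullmatch(field) is not None

# First words in field: the field must start with "keyword search" and the
# phrase must end on a word boundary.
for field in ["keyword search in databases", "keyword search",
              "keyword searching", "a keyword search"]:
    print(field, matches_field("keyword search|*", field))
```

Because the whole field is matched as one string, no word list or
proximity operator is involved at any point.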
I think my 2 cents are now up to around $5. The only resolution required
is to get a working first words in field search that is consistent with
the rest of the protocol. I think the above solves the problems which
Ralph recognised (string/word confusion) and promotes interoperability.
Rob
--
,'/:. Rob Sanderson
,'-/::::. http://www.o-r-g.org/~azaroth/
,'--/::(@)::. Special Collections and Archives, extension 3142
,'---/::::::::::. Twin Cathedrals: telnet: liverpool.o-r-g.org 7777
____/:::::::::::::. WWW: http://liverpool.o-r-g.org:8000/
I L L U M I N A T I