As I unfortunately predicted, an extremely thorny problem was discovered
yesterday. Although this mail is very long, I wanted to cover all of the
ground that we did yesterday so as to hopefully not go over it again.
The end result is a more functional CQL specification and a more
interoperable SRW.
The current situation:
Record schemas and index sets are referred to by simple name, as described
in the Explain record. The client should retrieve and parse the record,
then use this information in further queries to interact with the server.
The issue:
There is no session for which the Explain record can be declared valid.
Even by the time that the client reconnects to perform its first
query, the Explain may be invalid. While this particular situation
may be unlikely, it is significantly more likely that the Explain
information will change over the course of a single user's interaction
with the database. The only resolution using the current system is to
fetch the Explain record repeatedly -- the more often it is fetched
the more likely it is to be valid. This is simply not supportable.
'Solutions'
The following solutions were discussed:
* Have a guaranteed valid-until time for Explain records.
BUT this is just more 'session' support, which has already been rejected
(a la authenticationToken). Also, servers with constant connections would
never be able to change.
* Retrieve the last modified time for the Explain record.
BUT this doesn't solve the underlying issue. It just reduces the amount
of data transferred, making it easier to ask repeatedly, but you still
need to do it.
* Maintain a registry of names for all schemas and indexsets.
BUT it's a lot of work and people will ignore it. First in gets the best
short names. So we end up with SRW-Rob-IndexSets-myIndexSet. Or
'SRW cybersquatters'.
* Send the URI for the indexSet/Schema in place.
BUT ugly for Schemas and intolerable for IndexSets. Bad for SRU.
* Send the mapping between URI and simple name in a separate parameter
BUT CQL needs to be able to stand alone. This ties it to SRW again and
already that is looking to be an unacceptable solution. Mike reports
that people are asking for CQL support in non SRW focused products /already/.
The query simply -has- to somehow stand by itself and not rely on other
information in the request.
Final Solution:
The only way to be sure is to send the URIs, not simple names. This has
been rejected in the past for SRU because of URL length, but that is
very unlikely ever to be an issue in practice, and far less likely than
the problem it solves.
Schema URIs can be sent directly and should always be used, making the
simple names for schemas redundant. This applies to the recordSchema
request parameter, the schema parameter of sort and the schema field of
the returned record.
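To give a feel for the URL length involved, here is a minimal sketch that builds an SRU-style URL with a full schema URI in recordSchema. The server address and the schema URI are made-up placeholders, and the other request parameters a real SRU request needs are left out.

```python
from urllib.parse import urlencode

# Both values are hypothetical placeholders, not real endpoints or schema URIs.
BASE_URL = "http://sru.example.org/db"
SCHEMA_URI = "http://www.example.org/schemas/dc/"


def sru_url(query, record_schema):
    """Return a URL carrying the query and the schema URI as parameters."""
    params = {"query": query, "recordSchema": record_schema}
    # urlencode percent-encodes the ':' and '/' characters in the URI.
    return BASE_URL + "?" + urlencode(params)


url = sru_url('dc.title = "fish"', SCHEMA_URI)
print(len(url), url)
```

Percent-encoding lengthens the URI somewhat, but the result is still a short URL.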
Indexsets on the other hand need to be typable. Our solution for this is
to send the mapping used to the server, rather than hoping that the
server's mapping hasn't changed since it was last fetched. This cannot be
done in a separate parameter, so we need to change X/CQL.
This is a relatively simple change -- we remove the simple names from the
Explain record for record schemas and everywhere that they were used we
now use the full URI identifier. The change for CQL was designed (with
much agonising) to be completely backwards compatible. All currently
valid CQL queries will remain valid after this change.
CQL Specifics:
The change for CQL is to allow a mapping to be sent before any cql-query
or searchClause. The mapping applies to anything contained within that
searchClause or boolean triple.
After trying many possibilities, we arrived at the following syntax, which
we believe to be unambiguous and not to require multiple-token lookahead.
'>' [identifier '='] term
identifier is the simple name and term is the URI to which it is assigned.
If "identifier =" is omitted, then the term gives a default index set URI.
The construct can be repeated to give multiple name definitions.
For example:
> dc="http://www.dublincore.org/" > b="http://www.loc.gov/.../bath/"
(dc.title = "fish" and b.author = "^Smith, J*")
which is equivalent to
( > dc="http://www.dublincore.org/" dc.title = "fish" and
> b = "http://...bath/" b.author="^Smith, J*" )
Other examples:
( > "http://www.dublincore.org" title = "fish" )
( > b="http://.../bath/" > "http://www.dublincore.org"
(b.author = "smith" and title = "fish")
)
These index set definitions are optional. If you're sending the search to
a database that you are confident has not changed its configuration, then
you can still use the current method.
We tried MANY other variants, but this was the neatest with the least
impact on the current specification.
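As a rough illustration of how little lookahead the prefix syntax needs, here is a minimal Python sketch that reads the leading definitions off a query. It is not a full CQL parser: the tokenizer is simplified, it does not handle relations such as ">=", and the names tokenize and read_prefixes are invented for this example. It only handles definitions at the start of the string, not ones nested inside parentheses.

```python
import re

# Simplified tokens: a double-quoted string, one of > = ( ), or a bare word.
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[>=()]|[^\s>=()"]+)')


def tokenize(text):
    """Split a query string into tokens (simplified, see the note above)."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def unquote(term):
    """Strip one pair of surrounding double quotes, if present."""
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return term[1:-1]
    return term


def read_prefixes(tokens):
    """Consume leading  '>' [identifier '='] term  definitions.

    Returns (mapping, remaining_tokens). The default index set URI,
    if one is given, is stored under the key None.
    """
    mapping, i = {}, 0
    while i < len(tokens) and tokens[i] == ">":
        if i + 1 >= len(tokens):
            raise ValueError("'>' must be followed by a term")
        # Peek one token past the first: '=' means the first was a name.
        if i + 2 < len(tokens) and tokens[i + 2] == "=":
            if i + 3 >= len(tokens):
                raise ValueError("missing URI after '='")
            mapping[tokens[i + 1]] = unquote(tokens[i + 3])
            i += 4
        else:
            mapping[None] = unquote(tokens[i + 1])
            i += 2
    return mapping, tokens[i:]


query = ('> dc="http://www.dublincore.org/" > b="http://www.loc.gov/.../bath/" '
         '(dc.title = "fish" and b.author = "^Smith, J*")')
mapping, rest = read_prefixes(tokenize(query))
print(mapping)
print(rest)
```

After reading the token that follows '>', the sketch looks at the next token: an '=' means a named definition, anything else means a default index set.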
In XCQL this translates as an optional element 'prefixes' at the beginning
of either searchClause or triple, which contains a sequence of 1 or more
'prefix' elements, each of which contains a name/identifier map.
<triple>
<prefixes>
<prefix>
<name>dc</name>
<identifier>http://www.dublincore.org/</identifier>
</prefix>
<prefix>
<name>bath</name>
<identifier>http://www.loc.gov/.../bath/</identifier>
</prefix>
</prefixes>
<boolean><value>and</value></boolean>
<searchClause>
...
</searchClause>
<searchClause>
...
</searchClause>
</triple>
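For anyone generating XCQL programmatically, a minimal sketch of building the structure above with Python's standard xml.etree.ElementTree might look like this. The function names are invented for the example, the searchClause contents are left empty because they are elided above, and the case of a default index set with no name is not covered.

```python
import xml.etree.ElementTree as ET


def prefixes_element(mapping):
    """Build a <prefixes> element from a {name: identifier} dict."""
    prefixes = ET.Element("prefixes")
    for name, identifier in mapping.items():
        prefix = ET.SubElement(prefixes, "prefix")
        ET.SubElement(prefix, "name").text = name
        ET.SubElement(prefix, "identifier").text = identifier
    return prefixes


def triple_element(mapping, boolean, left, right):
    """Build a <triple>, with the optional <prefixes> as its first child."""
    triple = ET.Element("triple")
    if mapping:
        triple.append(prefixes_element(mapping))
    boolean_el = ET.SubElement(triple, "boolean")
    ET.SubElement(boolean_el, "value").text = boolean
    triple.append(left)
    triple.append(right)
    return triple


# Empty placeholders stand in for the elided searchClause contents.
triple = triple_element(
    {"dc": "http://www.dublincore.org/"},
    "and",
    ET.Element("searchClause"),
    ET.Element("searchClause"),
)
print(ET.tostring(triple, encoding="unicode"))
```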
Other side effects:
This makes broadcast searches possible. You can send the same query to all
servers and ask for DC records back.
With no centrally maintained lists, the uptake will likely be greater as
it doesn't rely on a single point. Communities can define their own
record schemas and indexsets and not have trouble when the bibliographic
community starts asking the homewares community for bath.author as opposed
to bath.manufacturer. This is also less work for Ray and index set/record
schema authors. A centrally maintained list of record schemas would be
impossible.
Rob
--
,'/:. Rob Sanderson
,'-/::::. http://www.o-r-g.org/~azaroth/
,'--/::(@)::. Special Collections and Archives, extension 3142
,'---/::::::::::. Twin Cathedrals: telnet: liverpool.o-r-g.org 7777
____/:::::::::::::. WWW: http://liverpool.o-r-g.org:8000/
I L L U M I N A T I