Theo van Veen wrote:
>Let me give some examples of possible responses:
>
>
Thanks for the clarifying examples.
>Suppose I have the following query:
>http://host/sru?query=some_index=theo
>
>Then server A may respond with:
>...
><numberOfRecords>0</numberOfRecords>
>...
>server B may respond with
>...
><numberOfRecords>0</numberOfRecords>
><fuzzyMatches>
> <term>thea</term>
> <term>then</term>
> <term>toe</term>
></fuzzyMatches>
>...
>
>
This is more or less a scan, as far as I can see. The SRW scan is
bound to an index, but then again, all valid indexes are known through
the explain functionality.
So why should a client not just use a regular scan when it does not get
sufficient hits? This way, the decision is on the side of the client,
which is the part of the system which is (hopefully) brain-powered in
the end, and can make good decisions.
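
To make the "client decides" idea concrete, here is a minimal sketch of a client
that falls back to a scan when a search returns too few hits. The host and index
names come from the example query above; the function names and the min_hits
threshold are my own assumptions, and the parameter names (operation, version,
scanClause, maximumTerms) are the ones I understand SRU 1.1 to use for scan.

```python
from urllib.parse import urlencode


def build_scan_url(base_url, index, term, maximum_terms=10, version="1.1"):
    """Build an SRU scan URL listing terms near `term` in `index`.

    Hypothetical helper: only the parameter names are taken from SRU,
    everything else is illustrative.
    """
    params = {
        "operation": "scan",
        "version": version,
        "scanClause": f"{index}={term}",  # urlencode escapes the '='
        "maximumTerms": maximum_terms,
    }
    return f"{base_url}?{urlencode(params)}"


def next_request(base_url, index, term, number_of_records, min_hits=1):
    """Return a scan URL if the search gave too few hits, otherwise None."""
    if number_of_records < min_hits:
        return build_scan_url(base_url, index, term)
    return None


# Server A answered the search with numberOfRecords = 0, so the client scans.
fallback_url = next_request("http://host/sru", "some_index", "theo", 0)
```

The point of the sketch is only that the fallback policy lives in the client; the
server needs nothing beyond the regular scan operation.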
>server C may respond with
>...
><numberOfRecords>10000</numberOfRecords>
><hitsPerIndex> (for query=theo)
> <index>
>  <name>any</name>
>  <numberOfRecords>99900</numberOfRecords>
> </index>
> <index>
>  <name>dc.subject</name>
>  <numberOfRecords>5</numberOfRecords>
> </index>
> <index>
>  <name>dc.creator</name>
>  <numberOfRecords>50</numberOfRecords>
> </index>
> <index>
>  <name>composer</name>
>  <numberOfRecords>10</numberOfRecords>
> </index>
></hitsPerIndex>
>...
>
>
>
This is exactly what I refer to when using the term "indexes spread over
clients".
If a client wants to do anything intelligent with this information, it
has to cache it - not just for a few seconds, but over many, many
sessions. Essentially, a client has to build an inverted index
(hash table) over all the atomic parts of a query (in this example
"theo"), and has to record which server had it, and how many times in
which index field. Otherwise, it has no chance to adaptively guide the
user to better queries.
For example, your response would add the following data to the hash table:
theo[server C][any] = 99900
theo[server C][dc.subject] = 5
theo[server C][dc.creator] = 50
theo[server C][composer] = 10
This definitely _is_ an inverted index: it tells where queries for the
word "theo" will be lucky, and how lucky queries must be formulated.
It is distributed in the sense that a given client only has partial
knowledge of all the inverted indexes - namely only those it has asked for.
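
As a rough sketch of what such a client-side table amounts to, the entries for
"theo" can be kept in a nested dictionary keyed by term, server and index. The
names hit_table, record_hits and best_indexes are hypothetical, and this is only
one possible layout, not something SRW prescribes.

```python
from collections import defaultdict

# term -> server -> index name -> number of records reported by that server
hit_table = defaultdict(lambda: defaultdict(dict))


def record_hits(table, term, server, hits_per_index):
    """Merge one server's per-index hit counts for `term` into the table."""
    table[term][server].update(hits_per_index)


def best_indexes(table, term):
    """Return (server, index, hits) tuples for `term`, most hits first.

    Only covers what this one client has asked for; other clients'
    tables are invisible to it.
    """
    rows = [
        (server, index, hits)
        for server, indexes in table.get(term, {}).items()
        for index, hits in indexes.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# The response from server C quoted above, recorded by the client.
record_hits(
    hit_table,
    "theo",
    "server C",
    {"any": 99900, "dc.subject": 5, "dc.creator": 50, "composer": 10},
)
ranked = best_indexes(hit_table, "theo")
```

Note that nothing in this structure knows when server C changes its records, so
every entry is only as fresh as the last response the client happened to see.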
But it lacks a fundamental property for being really useful: other
clients, with their partial knowledge of other parts of the distributed
hash table, do not know how to use their neighboring clients' parts of
the whole hash table, so they cannot take advantage of other clients'
adaptive learning.
This is because clients do not know clients, just servers. And even if
they did, SRW does not define a client-client protocol, so they can't talk
to each other to exchange information.
Enter the need for a grid or peer-to-peer protocol.
So the only result of this - keeping the SRW client-server
philosophy - is that a client has to build indexes over the servers it
knows, and can't exchange them with others. But really - who's best at
creating indexing structures? The server.
So my question is: why should a client remember hash tables when a
server is much better at it, and when we do not want to leave the
client-server philosophy with SRW?
How does the client update its hash tables when records are invalidated
on the server?
IMHO this is not really going to work before updating issues are dealt
with, and the benefit for the client might not be as large as expected.
Marc Cromme, Index Data