On Wed, May 15, 2002 at 02:21:38AM +0200, Theo van Veen wrote:
> First I have to say that I appreciate the work being done on CQL and
> Explain. Nevertheless I think that we should make use of some new
> opportunities now we are defining a new query language.
I certainly agree it's worth arguing through issues - it is the best
way to solve problems early on rather than later.
> First as a reaction on Ralph:
>
> >So, if I support both dc.title and bath.title and you send me
> >unqualified title, what do you expect me to do? It just so happens
> >that I specified in my explain record that the default index set was
> >bath, but what if you were expecting it to be dc?
>
> What should the client do in this case? Explain to the user that
> there are different sorts of titles? Or just make an arbitrary choice
> for the user?
As I understand it, you (Theo) want a single concept of "title" in CQL.
(Please correct me if I am wrong!) If there is a single concept of
"title", then you don't need qualifiers.
The problem that others (including me) have expressed is that there is
not a single definition of "title". In Dublin Core there are defined
semantics for "title" (the name of a book etc). However, in another
application "title" might mean "Mr/Mrs/Ms/Dr/Sir" etc. Using prefixes
is therefore proposed to qualify "title" (such as "dc.title") to
disambiguate the meaning of "title".
There *are* multiple solutions to the problem (and using a prefix is
only one of them).
(1) Come up with a global namespace for *all* concepts (without prefixes)
and the first person to come up with a meaning of "title" gets to use
that name, and any later meaning that comes along needs to use a
different name (e.g. "formal_title").
(2) Use explain on a server to work out what a particular server means
by "title". That is, don't use the name "title" to work out the meaning.
Instead have some other way of identifying the concept (such as a URI)
then use explain to work out which index name a particular server
uses for that concept (so I would look for "http://dc.org/title..."
and find it was mapped to "title", but "http://human-name.org/title"
was mapped to "formal_title").
(3) Introduce a prefix so a prefix is allocated to a semantic area where
names must be unique in that area. Such as "dc" for Dublin Core.
This could be viewed as a variation of (1) above - the names have just
got longer "dc.title" instead of just "title". But there is a formalism
to it - you must be allocated a unique prefix, then your group can
define names under that group. (This is in effect what XML namespaces
do by the way - but we are using a short prefix instead of a long
URI to identify the namespace).
There are lots more variations I am sure.
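Alternatives (2) and (3) above can be sketched roughly in Python. This is only an illustration: the explain table, the example.org URIs, the "names" prefix and the function names are all made up here and are not part of CQL or Explain.

```python
# Hypothetical data throughout; nothing here comes from a real Explain record.

# Alternative (2): a server publishes, via explain, which local index name
# it uses for a concept identified by a URI.
SERVER_EXPLAIN = {
    "http://example.org/concepts/work-title": "title",
    "http://example.org/concepts/personal-title": "formal_title",
}


def index_for_concept(explain_table, concept_uri):
    """Alternative (2): look up this server's index name for a concept URI."""
    return explain_table[concept_uri]


# Alternative (3): each allocated prefix owns a set of names that only need
# to be unique within that prefix.
PREFIX_REGISTRY = {
    "dc": {"title", "creator"},
    "names": {"title"},  # assumed prefix for personal-name semantics
}


def split_qualified(name):
    """Split 'dc.title' into ('dc', 'title'); unqualified names get prefix None."""
    prefix, sep, local = name.partition(".")
    if not sep:
        return (None, name)
    return (prefix, local)


def is_registered(name):
    """True only if the name carries a known prefix that defines that name."""
    prefix, local = split_qualified(name)
    return prefix in PREFIX_REGISTRY and local in PREFIX_REGISTRY[prefix]
```

With (2) the client has to do a lookup per server before it can write the query; with (3) "dc.title" and "names.title" can coexist and the same query text can be sent anywhere the prefix is known.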
Now there are also different usages of CQL.
(U1) A person has a single server they talk to all the time, and want to
express queries using the full capabilities of that server.
(U2) A person wants to write a single query and send it to multiple servers.
I want to support both nicely.
Theo, question 1: do you think I have captured the different alternatives
correctly (with no comment on which is best - I just want to make sure I
am understanding the conceptual model that you want, and the models that
you do not want).
Of the above, I am against proposal (2) because I want to write a
single query that has a chance to work against multiple servers. If I
have to use explain, then I have to rewrite the CQL query per server.
I dislike (1) (a global namespace) because that is against the trend
of what Dublin Core etc are doing. I think it's important to be able
to segregate the namespace of indexes.
However, to support usage U1, using qualifiers all the time *is* a pain.
I like being able to define local index names and not have to define
a public table and register a prefix etc. These names are frequently
not intended for cross collection searching. So I like the mix of
being able to define indexes with standard prefix names (with standard
semantics) and local unqualified names for which I can define my own
semantics as best suits the database I am building.
> dc is defined for description and not for searching.
I am sorry, I don't follow your point here. I would have thought that
describing/categorizing data is directly relevant to searching.
> But if it is supposed that a user will have a general
> understanding that dc.author means author, because he has an
> understanding of author, the prefix is not relevant and even
> misleading.
I think what people (including me) are saying is that "author" is
ambiguous unless you come up with a single definition of what "author"
means. Going back to the "title" example above, I think it's clear there
is not an intuitive single definition of what "title" means to all
people. It would be a matter of specifying for CQL what "title" or "author"
means. So I disagree with the assertion that a simple index name
such as "title" or "author" is a clear definition of what the semantics
of the index are. I think the Dublin Core activities have demonstrated
this well. They started with 15 core elements, but soon realised that
life is not that simple, and the simple names they first came up with
were not enough. So they introduced "qualified Dublin Core" with more
names.
> > > In my point of view not supporting Ralph's premises means
> > > not supporting prefixes. Or did I misunderstood previous
> > > discussions and
> > > is everyone already on this track?
> > Yes, I think you misunderstood. I believe the consensus is
> > this:
> > 1.There will be some well-know prefixes, e.g., bath and dc, and
> > you won't have to use Explain to discover a server-specific
> > definition for these.
>
> In this case a client has to know the prefix exactly. Searching for
> "dc.title:abc or bath.title:abc" will return an error message if one of
> both is not supported.
Exactly. I think it's better to report an error if a query has specified
something that a server does not know than return an incorrect result
because the server has misinterpreted the query due to different semantics.
> > 2.A server is free to define server-specific prefixes (as
> > long as they don't clash with the well-known prefixes) and you
> > might have to use explain to discover those.
>
> In distributed searching I do not think any client will search for
> prefixes or indexes that it doesn't know.
Of course. If it does not know the prefix, it can't use it by definition!
But Explain gives a mechanism of learning about prefixes and indexes
the client did not know before. The simplest illustration is a client
that does an Explain query on a server then displays all returned values
to the user in a drop down list. Each index name has a human readable
description along with it. The client application does not "understand"
the different index names in this situation - the human does though.
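A minimal sketch of that drop-down idea, in Python. The shape of the explain result (a dict of index name to a "description" field) is an assumption made for the example; a real Explain or ZeeRex record would need to be parsed into something like it first.

```python
def dropdown_entries(explain_indexes):
    """Build (label, index_name) pairs for a drop-down list.

    The client never interprets the index names; it only shows the
    human-readable description and remembers which name goes with it.
    """
    entries = [
        (info["description"], name)
        for name, info in explain_indexes.items()
    ]
    return sorted(entries)  # sort by label so the list is stable for the user


# Hypothetical explain result for one server.
EXAMPLE_EXPLAIN = {
    "dc.title": {"description": "Title of a work"},
    "formal_title": {"description": "Form of address (Mr/Mrs/Ms/Dr)"},
}
```

Whatever label the user picks, the client just puts the matching index name into the query it sends.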
> > 3.You can send an index name
> > without a prefix, but in that case the server applies the default
> > prefix, and you'll need to use explain to find out what that is for
> > a given server (there won't be any global-default).
>
> This is all I want: reasonable defaults. But I am not able to write
> clients that are intelligent enough to find out whether the servers
> default corresponds to the users expectations.
Ahhhhh! Does this mean then that you are not opposed to prefixes, but
rather all you want to ensure is that a database can be defined without
them? That is, not all index names *must* be qualified? I certainly
agree with this. I think a database should be able to support a set
of qualified index names (with standard prefixes) AND a set of unqualified
names.
Is the challenge therefore in your eyes working out what these unqualified
names mean? (Eg: does "title" mean title of a book versus Mr/Mrs/Dr etc).
Is a human readable description enough? Or a URI? Or the Z39.50 attribute
list it binds on to? Or put another way, what unambiguous way can you
think of that defines what a user expectation is? This is an important
question to answer.
> > 4.Distributed searching is theoretically possible, but all indexes
> > should have well-known prefixes. (Or, you could send non-
> > prefixed indexes to different servers but you cannot assume
> > that they mean the same thing to different servers.)
> > --Ray
>
> What (default) prefixes should be used in distributed searching?
I think Ray's point above is that if you want to write a query, have
it sent to multiple servers, and guarantee those servers use the same
meaning as you intend, then there is no default prefix that can be
used.
If a database can support both qualified (formal, standard definitions)
and unqualified (locally defined) index names, a distributed query using
only unqualified names can still work *if* the query is being sent off
to multiple servers that are known to support the same locally defined
names. I think the argument is that in the case where you want to send
a query off to lots of servers where they do not share the same locally
defined names (because they are locally defined), then using prefixes
avoids a server misinterpreting a query.
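To illustrate the condition, here is a rough Python sketch of a client deciding which servers a query can safely go to. The "local_profile" label (a way of saying "these servers are known to share the same locally defined names") is invented for the example; nothing like it exists in the current proposals.

```python
def safe_servers(used_indexes, servers, profile=None):
    """Return the names of servers this query can safely be sent to.

    used_indexes: index names appearing in the query.
    servers: server name -> {"indexes": set of names, "local_profile": label}.
    profile: the shared local definitions the query was written against
             (hypothetical), or None if it was not written against any.
    """
    uses_local_names = any("." not in name for name in used_indexes)
    safe = []
    for server, info in sorted(servers.items()):
        if not set(used_indexes) <= info["indexes"]:
            continue  # the server does not know one of the names at all
        if uses_local_names and (
            profile is None or info["local_profile"] != profile
        ):
            continue  # same spelling, but no guarantee of the same meaning
        safe.append(server)
    return safe


# Hypothetical servers: "a" and "b" both have an unqualified "title",
# but only "a" is known to share the definitions the query assumes.
SERVERS = {
    "a": {"indexes": {"dc.title", "title"}, "local_profile": "lib-consortium"},
    "b": {"indexes": {"dc.title", "title"}, "local_profile": "hr-system"},
    "c": {"indexes": {"dc.title"}, "local_profile": None},
}
```

A query using only "dc.title" can go to all three servers, while one using plain "title" is only safe for the servers sharing the profile it was written for.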
> Ralph will return an error message if I try "dc.title:abc or
> bath.title:abc".
Me too. I would never write a query using both though. I would write
a query using only one of them.
But an alternative here is to add a flag when a CQL query is submitted
saying "report error on unknown index names" versus "ignore unknown
index names". (By ignore, I mean return zero matches for that term - sort
of like NULL in relational databases.) I can see the merit in this.
Or even introduce a new symbol or something in CQL indicating for
an index name the behaviour to take (zero matches or error) so the
person writing the query has control - but I think a boolean flag
being sent along with the query is better.
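The flag idea could look something like the following Python sketch. The parameter name on_unknown, its two values, and the toy in-memory index structure are assumptions for illustration; no such flag exists in SRW or CQL today. For simplicity the query is just a list of terms ORed together.

```python
class UnknownIndexError(Exception):
    """Raised when a query names an index the server does not support."""


def search(terms, indexes, on_unknown="error"):
    """Evaluate an OR of (index_name, value) terms.

    indexes: index name -> {value: set of record ids} (toy structure).
    on_unknown: "error" to reject the query, "ignore" to treat a term on
                an unknown index as matching zero records.
    """
    if on_unknown not in ("error", "ignore"):
        raise ValueError("on_unknown must be 'error' or 'ignore'")
    hits = set()
    for index_name, value in terms:
        if index_name not in indexes:
            if on_unknown == "error":
                raise UnknownIndexError(index_name)
            continue  # "ignore": the term contributes no matches
        hits |= indexes[index_name].get(value, set())
    return hits


# Hypothetical server that supports dc.title but not bath.title.
INDEXES = {"dc.title": {"abc": {1, 2}}}
QUERY = [("dc.title", "abc"), ("bath.title", "abc")]
```

With on_unknown="ignore" the query above returns the dc.title matches; with the default it is rejected, which is the behaviour I would choose if only one were allowed.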
> I have the strong feeling that we are currently on the wrong track.
> We are mixing up Z39.50 attribute sets with dc name spaces, while
> the solution is quite simple: use user understandable names for
> search indexs. It is possible in Dublin Core for description, why is it
> not possible in CQL for searching?
Dublin Core only gives one semantics of "title". I think if you asked
them Dublin Core would agree that their semantics *is not* the only
definition, or even the best. It's just a definition they have agreed
on. This is why they use XML namespaces to qualify their elements
in XML encodings. They do not, for example, claim their interpretation
of "title" is the best one so their one does not need qualification.
So I think we should support qualifiers in part *because* Dublin Core
do it too.
> The abstract Z39.50 attributes were usefull in case of MARC
> descriptions, but in line with Dublin Core I think we should map the
> Z39.50 search attributes to user understandable names instead of
> sticking to the attributes.
I think we are all in agreement here. We want textual names in queries.
The open question is how to agree unambiguously on what a textual name
means.
> Theo
I think there are some very interesting issues to have come out of this.
In summary:
* I think a database should support both prefix qualified index names
(with globally defined and agreed to semantics) and unqualified
index names (locally defined semantics).
* For a locally defined index name, how to unambiguously define its
semantics? Human description? URI? Z39.50 attribute list?
ZeeRex records I think would allow a human description and an
attribute list.
* Should SRW have a flag to be sent in a query to define the behaviour
for unknown index names? (Ignore versus report error versus server
can do whatever it feels like etc.) I can see the logic in this for
distributed queries. If SRW picks a single semantic however, I think
it should be to report an error.
Alan
--
Alan Kent (http://www.mds.rmit.edu.au/~ajk/)
Project: TeraText Technical Director, InQuirion Pty Ltd (www.inquirion.com)
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC Australia.
Phone: +61 3 9925 4114 Reception: +61 3 9925 4099 Fax: +61 3 9925 4098