---------- Forwarded Message -----------
From: "Edward C. Zimmermann" <[log in to unmask]>
To: "SRU (Search and Retrieve Via URL) Implementors" <[log in to unmask]>,
[log in to unmask]
Sent: Thu, 28 Aug 2008 15:28:12 +0200
Subject: Re: SRU/CQL 2.0: Invitation to participate in OASIS SWS TC Development

>  Among the suggested 2.0 features are:
> 
> 1. Allow Non-XML Record Representations
> Many formats do not map easily into XML, for example multimedia, 
> images, and even complex text formats. The suggestion is to allow 
> non-xml serialized data in the response, as well as value by 
> reference. These would be signaled by additional values for the 
> recordPacking parameter. For example recordPacking="base64" or
> recordPacking="uri".

Makes sense even for text. We are, after all, often indexing objects that
don't fit into XML (such as plain-text words, lines, sentences, paragraphs,
pages and their overlaps).

> 
> 2. Proximity
>  deprecate the PROX BOOLEAN operator and instead represent proximity 
> by two methods:

Can't agree more since I think the very concept of "proximity"--- and
I've voiced this quite a few times--- is WRONG here.

We should NEVER, I'd say, speak of proximity other than, at best, a linear
metric of octets in the original data as stored.

http://ibu.de/node/52
(look starting at "Query Model")

"In XML we not only have a parent/child ancestry of nodes but we also have
within nodes a linear ordered relationship. One letter follows the next and
one word follows the other in a container. In the above example "Yet" precedes
"here's" and "a" follows after and finishing with "spot". We have order and
at least a qualitative (intuitive) notion of distance.

In XML we do not, however, have any well-defined order among the siblings
(different LINEs). The XML 1.0 well-formedness definition specifically states
that attributes are unordered and says also nothing about elements. Document
order (how they are marked-up) and the order a conforming XML parser might
decide to report the child elements of SPEECH might not be the same. Most
systems handling XML from a disk and using popular parsers typically deliver
it in the same order but the standard DOES NOT specify that it need be--- and
for good reason."

Even if we restrict the current model of proximity to the ordering within
a single container we have, beyond a metric of bytes as stored, problems when
we start to speak of words. What is a word? It's up to the server, after all,
to decide that, and there is often little way for the user to know what
it might be. For example, in my own engine, depending upon configuration,
any of a number of non-alphanumeric characters may belong to a word depending
upon what is before and after that character. What's 6 words over? Depends..
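
To make the point concrete, here is a minimal Python sketch with two
hypothetical word models. The function names, the regular expressions and the
sample sentence are all assumptions made up for illustration (they are not how
any particular engine tokenizes); the only point is that the same two terms end
up a different number of "words" apart under each model.

```python
import re

def tokenize_alnum(text):
    # Hypothetical word model A: a word is any run of letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_inner_punct(text):
    # Hypothetical word model B: an apostrophe, dot or hyphen belongs to the
    # word when it sits between two alphanumeric runs.
    return re.findall(r"[A-Za-z0-9]+(?:['.\-][A-Za-z0-9]+)*", text)

def word_distance(tokens, first, second):
    # Positions between the first occurrence of each term.
    return tokens.index(second) - tokens.index(first)

text = "Yet here's a spot-on, state-of-the-art U.S. example"
model_a = tokenize_alnum(text)
model_b = tokenize_inner_punct(text)

print(word_distance(model_a, "Yet", "example"))
print(word_distance(model_b, "Yet", "example"))
```

Under model A "example" is 12 words after "Yet"; under model B it is 6. A
client that asks for "within 6 words" cannot know which answer it will get
unless it knows the server's word model.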

What are words in XML marked-up text? Are the different ingredients below
each 1 word over from the next?

<ingredients>
    <item>Chocolate</item>
    <item>Flour</item>
    <item>Butter</item>
    <item>eggs</item>
</ingredients>

Are "eggs" and "flour" within 3 words of each other?

You need to kick the habit of units and think instead of structure..

Instead of proximity we should (well, actually need to) talk about
something being in an element within some structure.

Words, sentences, etc., which you have defined as units, are really nothing
other than structure. The above ingredients and items are structure...

What you have called proximity with unit as words and distance of less
than 5 is really nothing other than:
- a map of a record into words.
<word>this</word><word>is</word><word>a</word><word>word</word>
together with a linear order which yields a count..
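
A rough Python sketch of that reading, using the standard
xml.etree.ElementTree module: the record is first mapped into <word> elements
and the "distance" is then just a difference of positions among those
elements. The helper names (as_word_structure, within) and the whitespace
split used as the word model are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def as_word_structure(text):
    # Map a record into <word> elements; splitting on whitespace is just one
    # possible (assumed) word model.
    record = ET.Element("record")
    for token in text.split():
        ET.SubElement(record, "word").text = token
    return record

def within(record, term_a, term_b, max_steps):
    # "distance < max_steps" becomes a count over the linear order of the
    # <word> children.
    words = [w.text for w in record.findall("word")]
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) < max_steps for i in pos_a for j in pos_b)

record = as_word_structure("this is a word")
print(ET.tostring(record, encoding="unicode"))
print(within(record, "this", "word", 5))
print(within(record, "this", "word", 3))
```

Swap in a different word model in as_word_structure and the same query gives
different answers, which is why the structure belongs to the record as indexed.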

We should NOT assume or demand that all records have word, line, paragraph
etc. structure or even that we can agree upon the application of word, line
etc. My word model and your word model may be different.. Searching for a
word model means to search for the word model as defined by the record as
indexed. It's just like searching for title..

>  -- Adding a relation: 'window'.
> examples:
>  * dc.title window/distance<5/unit=word "fries salt vinegar"

See above for why that's still "wrong".

>   (fries, salt, and vinegar all within a span of 5 words)
>  *dc.title window/distance<5/unit=word ((fish and fries) and (salt or
> vinegar))
>  (fish and chips and one of salt or vinegar, in a 5 word window)
>  * dc.title window/distance=2/unit=word/ordered "fries salt "
>  (fries followed by salt with 2 words between)
> 
> -- Adding a boolean modifier 'prox' which acts the same as the 
> current boolean, however can be attached to either AND (the current 
> style of proximity) or  NOT for negative proximity. Example: * "fish 
> and" not/prox chips
>    ("fish and" followed by anything other than chips)
>

What is more interesting (and YES, I have implemented it and it works
very well so it's not just "theory") are the following booleans:

- In the same container (field) instance.

A container (field) is not a unit but a field, that is, a tag or even a path.

This models the desire to have things in the same instance of a named field:

"fries" AND:title "salt" to have them in the same title instance.

- I also have operators to handle anonymous (unnamed by query) fields and
all kinds of other variations..
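
As a minimal sketch of the "same container instance" idea (not the actual
implementation mentioned above), the following Python uses the standard
xml.etree.ElementTree module. The sample record, the helper names
in_same_instance and anywhere, and the \w+ word model are all assumptions made
for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up record with two separate title instances.
RECORD = """<record>
  <title>Fries with salt</title>
  <title>Vinegar and chips</title>
  <abstract>fries, salt and vinegar</abstract>
</record>"""

def words_of(node):
    # Assumed word model: runs of word characters, case-folded.
    return set(re.findall(r"\w+", "".join(node.itertext()).lower()))

def in_same_instance(record, field, *terms):
    # True only if one single instance of `field` holds every term.
    return any(all(t.lower() in words_of(node) for t in terms)
               for node in record.iter(field))

def anywhere(record, *terms):
    # Plain record-level AND, for comparison.
    return all(t.lower() in words_of(record) for t in terms)

root = ET.fromstring(RECORD)
print(in_same_instance(root, "title", "fries", "salt"))
print(in_same_instance(root, "title", "fries", "vinegar"))
print(anywhere(root, "fries", "vinegar"))
```

"fries" and "vinegar" both occur in the record, and both occur in some title,
but never in the same title instance, so only the record-level AND matches.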

In talking about indexing XML we sometimes have mark-up such as

<TITLE SCHEMA="Foo bar">Zinging it for fun and profit</TITLE>.

Foo and bar are in the same SCHEMA, a complex attribute of TITLE, and
fun and profit are in the same container instance of TITLE. But we also
want to search for foo and fun in the same abstract TITLE: what if we only
want the fun in those TITLEs whose SCHEMA contains foo?

It's doable, and so is the anonymous (unnamed) case.

It's all just logical reasoning, one step after the next...

We can walk down a tree and also say within X steps in a tree. I did not
implement that on search but could (just have not seen its utility as yet).
In designing a generic model this would ultimately make sense.

<identity>
  <number>1234</number>
  <person>
     <name>
        <last>zimmermann</last>
...</identity>
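
One possible reading of "within X steps in a tree", sketched in Python with
the standard xml.etree.ElementTree module. It assumes a step is one
parent/child edge and counts the edges between two nodes via their nearest
common ancestor. The function name tree_steps is hypothetical, and the small
document below is an invented, complete stand-in (the fragment above is
truncated and is not reproduced here).

```python
import xml.etree.ElementTree as ET

# Invented stand-in document; its closing structure is an assumption.
DOC = ("<identity><number>1234</number>"
       "<person><name><last>zimmermann</last></name></person>"
       "</identity>")

def tree_steps(root, a, b):
    # Map every child element to its parent.
    parent = {child: node for node in root.iter() for child in node}

    def chain(node):
        # The node followed by all of its ancestors up to the root.
        out = [node]
        while node in parent:
            node = parent[node]
            out.append(node)
        return out

    up_a, up_b = chain(a), chain(b)
    steps_up_b = {node: i for i, node in enumerate(up_b)}
    for i, node in enumerate(up_a):
        if node in steps_up_b:
            # Edges up from a plus edges up from b to the common ancestor.
            return i + steps_up_b[node]
    raise ValueError("nodes are not in the same tree")

root = ET.fromstring(DOC)
number = root.find("number")
last = root.find("person/name/last")
print(tree_steps(root, number, last))
```

Here the number and the last name are 4 steps apart: one up to identity and
three down through person and name to last.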

Now if we're going to finally start to think in a more abstract/structural
manner I'd suggest we also consider rethinking our unit of retrieval away
from the monolithic "record" or at least consider granularity: that the
objects of retrieval from a query may be fragments that have either been
explicitly defined or derived from the query.

Explicit queries: Who said what?

<SPEECH>
  <SPEAKER>LADY MACBETH</SPEAKER>
  <LINE>Yet here's a spot.</LINE>
</SPEECH>

Give me the content of the SPEAKER of the SPEECH where a LINE contains
"here's a spot" ... The record (play) can contain loads of speeches...
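
A minimal Python sketch of such an explicit query, using the standard
xml.etree.ElementTree module. The unit of retrieval is the SPEAKER fragment,
not the play. The function name and the second, placeholder speech are
assumptions added for illustration; plain substring matching stands in for
whatever the engine's phrase search would do.

```python
import xml.etree.ElementTree as ET

# One record (play) holding several speeches; the first one is a made-up
# placeholder so that the query has something to exclude.
PLAY = """<PLAY>
  <SPEECH>
    <SPEAKER>SOMEONE ELSE</SPEAKER>
    <LINE>A placeholder line.</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>LADY MACBETH</SPEAKER>
    <LINE>Yet here's a spot.</LINE>
  </SPEECH>
</PLAY>"""

def speakers_where_line_contains(record, phrase):
    # Return the SPEAKER content of every SPEECH with a matching LINE.
    hits = []
    for speech in record.iter("SPEECH"):
        if any(phrase in (line.text or "") for line in speech.findall("LINE")):
            hits.extend(s.text for s in speech.findall("SPEAKER"))
    return hits

play = ET.fromstring(PLAY)
print(speakers_where_line_contains(play, "here's a spot"))
```

The result is the list of matching speakers (here just LADY MACBETH) rather
than the whole record.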

Implicit: In designing our S/R systems we have turned to structure and 
granularity away from records...

Searching, for example, for "war" is not the same as searching for
"war and peace" as the title of a book.. War might be war as in Warhammer..
it might be the pop band (Eric Burdon and War).. it might be conflict
(as in the Dictionary of War).. it might be the German "war" (English "was").

SRU/W should not just reduce some of the complexity of ISO 23950 but also
finally liberate it from the card catalogue model. CQL needs to become
something suitable to abstract structure search (beyond XQuery and friends)..

--

Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
   Leopoldstrasse 53-55, D-80802 Munich,
   Federal Republic of Germany
http://www.nonmonotonic.net
------- End of Forwarded Message -------