ZNG  August 2008

Subject:

Fw: Re: SRU/CQL 2.0: Invitation to participate in OASIS SWS TC Development

From:

"Edward C. Zimmermann" <[log in to unmask]>

Reply-To:

SRU (Search and Retrieve Via URL) Implementors

Date:

Thu, 28 Aug 2008 20:48:49 +0200

Content-Type:

text/plain

---------- Forwarded Message -----------
From: "Edward C. Zimmermann" <[log in to unmask]>
To: "SRU (Search and Retrieve Via URL) Implementors" <[log in to unmask]>,
[log in to unmask]
Sent: Thu, 28 Aug 2008 15:28:12 +0200
Subject: Re: SRU/CQL 2.0: Invitation to participate in OASIS SWS TC Development

>  Among the suggested 2.0 features are:
> 
> 1. Allow Non-XML Record Representations
> Many formats do not map easily into XML, for example multimedia, 
> images, and even complex text formats. The suggestion is to allow 
> non-xml serialized data in the response, as well as value by 
> reference. These would be signaled by additional values for the 
> recordPacking parameter. For example recordPacking="base64" or
> recordPacking="uri".

Makes sense even for text.. We are, after all, often indexing objects that
don't fit into XML (such as plain-text words, lines, sentences, paragraphs,
pages and their overlaps).
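
As a rough illustration of the two proposed values, here is a minimal Python sketch. The helper `pack_record`, its parameters, and the example.org address are hypothetical and not part of any SRU draft; only the values "base64" and "uri" come from the proposal quoted above.

```python
import base64

def pack_record(data: bytes, packing: str, location: str = "") -> str:
    """Return the string a response might carry for one record (hypothetical helper)."""
    if packing == "base64":
        # Binary-safe: any octet stream becomes plain ASCII.
        return base64.b64encode(data).decode("ascii")
    if packing == "uri":
        # Value by reference: send where the record lives, not the record itself.
        return location
    raise ValueError("unsupported recordPacking: " + packing)

raw = b"Yet here's a spot.\n"  # plain text, not XML
packed = pack_record(raw, "base64")
by_ref = pack_record(raw, "uri", location="http://example.org/records/1")
print(packed)
print(by_ref)
```

Decoding `packed` with `base64.b64decode` gives back the original octets unchanged, which is the point for data that has no natural XML form.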

> 
> 2. Proximity
>  deprecate the PROX BOOLEAN operator and instead represent proximity 
> by two methods:

Can't agree more since I think the very concept of "proximity"--- and
I've voiced this quite a few times--- is WRONG here.

We should NEVER, I'd say, speak of proximity other than, at best, a linear
metric of octets in the original data as stored.

http://ibu.de/node/52
(look starting at "Query Model")

"In XML we not only have a parent/child ancestry of nodes but we also have
within nodes a linear ordered relationship. One letter follows the next and
one word follows the other in a container. In the above example "Yet" precedes
"here's", "a" follows after, and the line finishes with "spot". We have order and
at least a qualitative (intuitive) notion of distance.

In XML we do not, however, have any well-defined order among the siblings
(different LINEs). The XML 1.0 well-formedness definition specifically states
that attributes are unordered and says also nothing about elements. Document
order (how they are marked-up) and the order a conforming XML parser might
decide to report the child elements of SPEECH might not be the same. Most
systems handling XML from a disk and using popular parsers typically deliver
it in the same order but the standard DOES NOT specify that it need be--- and
for good reason."

Even if we restrict the current model of proximity to the ordering within
a single container we have, beyond a metric of bytes as stored, problems when
we start to speak of words. What is a word? It's up to the server, after all,
to decide that and there is often little way of having the user know what
it might be.. For example in my own engine... depending upon configuration
any of a number of non-alphanumeric characters may belong to a word depending
upon what is before and after that character. What's 6 words over? Depends..
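
To make the point concrete, here is a small Python sketch with two word models applied to the same text. Both models and the sample sentence are assumptions chosen for illustration; neither is the tokenizer of any particular engine.

```python
import re

def words_plain(text):
    # Model A: a word is a maximal run of ASCII letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def words_contextual(text):
    # Model B: '.', '-' and "'" belong to a word when an alphanumeric
    # character stands on both sides of them.
    return re.findall(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*", text)

def words_apart(words, a, b):
    # Number of positions between the first occurrence of a and of b.
    return abs(words.index(b) - words.index(a))

text = "fish e.g. cod with salt"
print(words_plain(text))       # six words under model A
print(words_contextual(text))  # five words under model B
print(words_apart(words_plain(text), "fish", "salt"),
      words_apart(words_contextual(text), "fish", "salt"))
```

Under model A "salt" is 5 words after "fish"; under model B it is 4. A client that asks for "within 5 words" cannot know which answer it will get unless it knows the server's word model.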

What are words in XML marked-up text? Are the different ingredients below
each 1 word over from the next?

<ingredients>
    <item>Chocolate</item>
    <item>Flour</item>
    <item>Butter</item>
    <item>eggs</item>
</ingredients>

Are "eggs" and "flour" within 3 words of each other?
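
A short Python sketch shows how the answer depends on the reading one picks. It uses the standard xml.etree.ElementTree module, which happens to report children in document order; the two "readings" and the function name `same_item` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<ingredients>
    <item>Chocolate</item>
    <item>Flour</item>
    <item>Butter</item>
    <item>eggs</item>
</ingredients>""")

# Reading 1: flatten all text and count words in the order the parser reports.
flat = [w.lower() for w in "".join(doc.itertext()).split()]
flat_distance = abs(flat.index("eggs") - flat.index("flour"))

# Reading 2: word order is only defined inside one <item> instance.
def same_item(root, a, b):
    for item in root.iter("item"):
        words = [w.lower() for w in (item.text or "").split()]
        if a in words and b in words:
            return True
    return False

print(flat_distance)                    # 2 under the flattened reading
print(same_item(doc, "eggs", "flour"))  # False: they never share an item
```

Under the first reading the two terms are 2 words apart; under the second the question has no numeric answer at all, only a structural one.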

You need to kick the habit of units and think instead of structure..

Instead of proximity we should (well, actually need to) talk about
something being in an element within some structure.

Words, sentences, etc., which you have defined as units, are really nothing other
than structure.. The above ingredients and items are structure...

What you have called proximity with unit as words and distance of less
than 5 is really nothing other than:
- a map of a record into words.
<word>this</word><word>is</word><word>a</word><word>word</word>
together with a linear order which yields a count..
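
Read that way, a "window" test is just a scan over the word map. The following Python sketch assumes the record has already been mapped into a list of words; the function name `within_window` and the rule that all terms must fall inside `size` consecutive words are illustrative assumptions, not the semantics of the proposed CQL relation.

```python
def within_window(words, terms, size):
    """True if every term occurs inside some run of `size` consecutive words."""
    wanted = set(terms)
    for start in range(len(words)):
        # Slice out one window and test whether it covers all wanted terms.
        if wanted <= set(words[start:start + size]):
            return True
    return False

# The word map of a record, in its linear order.
words = "fish and fries with salt and malt vinegar".split()

print(within_window(words, ["fries", "salt", "vinegar"], 5))  # False
print(within_window(words, ["fries", "salt", "vinegar"], 6))  # True
```

Everything the test needs (the elements and their order) comes from the structure; the "unit" never appears as a separate concept.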

We should NOT assume or demand that all records have word, line, paragraph
etc. structure or even that we can agree upon the application of word, line
etc. My word model and your word model may be different.. Searching for a
word model means searching for the word model as defined by the record as
indexed. It's just like searching for title..

>  -- Adding a relation: 'window'.
> examples:
>  * dc.title window/distance<5/unit=word "fries salt vinegar"

See above for why that's still "wrong".

>   (fries, salt, and vinegar all within a span of 5 words)
>  *dc.title window/distance<5/unit=word ((fish and fries) and (salt or
> vinegar))
>  (fish and chips and one of salt or vinegar, in a 5 word window)
>  * dc.title window/distance=2/unit=word/ordered "fries salt "
>  (fries followed by salt with 2 words between)
> 
> -- Adding a boolean modifier 'prox' which acts the same as the 
> current boolean, however can be attached to either AND (the current 
> style of proximity) or  NOT for negative proximity. Example: * "fish 
> and" not/prox chips
>    ("fish and" followed by anything other than chips)
>

What is more interesting (and YES, I have implemented it and it works
very well so it's not just "theory") are the following booleans:

- In the same container (field) instance.

A container (field) is not a unit but a field, resp. tag or even path...

This models the desire to have things in the same instance of a named field:

"fries" AND:title "salt" to have them in the same title instance.

- I also have operators to handle anonymous (unnamed by query) fields and
all kinds of other variations..

In talking about indexing XML we sometimes have mark-up such as

<TITLE SCHEMA="Foo bar">Zinging it for fun and profit</TITLE>.

Foo and bar are in the same SCHEMA, a complex attribute of TITLE, and
fun and profit are in the same container instance of TITLE.. But we also
want to search for foo and fun in the same abstract TITLE.. What if we only
want the fun in those TITLEs whose SCHEMA contains foo?

It's doable.. and also the anonymous case.. (unnamed)..

It's all just logical reasoning, one step after the next...

We can walk down a tree and also say within X steps in a tree.. I did not
implement that on search but could (just have not seen its utility as yet)..
In designing a generic model this would ultimately make sense.

<identity>
  <number>1234</number>
  <person>
     <name>
        <last>zimmermann</last>
...</identity>
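
A "within X steps in a tree" test could look roughly like the Python sketch below. It counts the edges on the path between two elements through their nearest common ancestor. The function name `tree_steps` and the small sample tree are hypothetical; the sample deliberately does not try to complete the elided fragment above.

```python
import xml.etree.ElementTree as ET

def tree_steps(root, a, b):
    """Number of edges on the path between elements a and b."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}

    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    up_a, up_b = ancestors(a), ancestors(b)
    for i, node in enumerate(up_a):
        if node in up_b:
            # Steps up from a plus steps down to b.
            return i + up_b.index(node)
    raise ValueError("elements are not in the same tree")

root = ET.fromstring("<a><b>x</b><c><d><e>y</e></d></c></a>")
b = root.find("b")
e = root.find("c/d/e")
print(tree_steps(root, b, e))  # 4: one step up from b, three down to e
```

A query such as "within 2 steps" would then simply compare this count against the limit.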

Now if we're going to finally start to think in a more abstract/structural
manner I'd suggest we also consider rethinking our unit of retrieval away
from the monolithic "record" or at least consider granularity: that the
objects of retrieval from a query may be fragments that have either been
explicitly defined or derived from the query.

Explicit queries: Who said what?

<SPEECH>
  <SPEAKER>LADY MACBETH</SPEAKER>
  <LINE>Yet here's a spot.</LINE>
</SPEECH>

Give me the content of the SPEAKER of the SPEECH where a LINE contains
"here's a spot" ... The record (play) can contain loads of speeches...

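The explicit query above can be sketched in Python with xml.etree.ElementTree. The answer is a fragment (the SPEAKER content), not the whole record. The function name `speakers_of`, the PLAY wrapper and the second speech are made up for illustration.

```python
import xml.etree.ElementTree as ET

play = ET.fromstring("""<PLAY>
  <SPEECH>
    <SPEAKER>SOMEONE ELSE</SPEAKER>
    <LINE>A placeholder line.</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>LADY MACBETH</SPEAKER>
    <LINE>Yet here's a spot.</LINE>
  </SPEECH>
</PLAY>""")

def speakers_of(record, phrase):
    """SPEAKER content of every SPEECH that has a LINE containing phrase."""
    hits = []
    for speech in record.iter("SPEECH"):
        if any(phrase in (line.text or "") for line in speech.iter("LINE")):
            hits.extend(s.text for s in speech.iter("SPEAKER"))
    return hits

print(speakers_of(play, "here's a spot"))  # ['LADY MACBETH']
```

The record (the play) is only the place where the search runs; what comes back is the piece of structure the query asked for.
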
Implicit: In designing our S/R systems we have turned to structure and 
granularity away from records...

Searching, for example, for "war" is not the same as searching for
"war and peace" as the title of a book.. War might be war as in Warhammer..
it might be the pop band (Eric Burdon and War).. it might be conflict
(as in the Dictionary of War).. it might be "war" (German for "was").

SRU/W should not just reduce some of the complexity of ISO 23950 but also
finally liberate it from the card catalogue model. CQL needs to become
something suitable to abstract structure search (beyond XQuery and friends)..

--

Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
   Leopoldstrasse 53-55, D-80802 Munich,
   Federal Republic of Germany
http://www.nonmonotonic.net
------- End of Forwarded Message -------
