ZNG Archives

Subject:

Re: Models of proximity and where I'd like to take ZING.

From:

Mark Hinnebusch <[log in to unmask]>

Reply-To:

SRU (Search and Retrieve Via URL) Implementors

Date:

Thu, 7 Dec 2006 10:48:44 -0500

Content-Type:

text/plain


Edward,

  The whole issue of proximity has always been confused with issues of
representation and structure.  But, we tended to try to finesse the issue in
a couple of ways:
    (1) How the query is interpreted is a "local issue", and you get what
        the server says you meant.
    (2) Explicit proximity rules only make sense within the construct of
        a defined record structure.  In the old Z world, this meant
        (a) unstructured text, where we all thought we understood what
            proximity meant, used an implicit ordering, and treated
            representation and structure as the same, and
        (b) GRS, where we defined proximity in terms of its defined
            structure.  However, GRS did have specific ordering so
            proximity made sense.

If I understand your email, you are trying to grapple with what proximity 
means when there is no usable implicit ordering nor is there an explicit 
ordering.  I would argue that in this case, proximity is meaningless.  If 
you want to use the byte position within the XML, then that is an implicit 
ordering and could be used, but seems to violate the spirit of the XML 
standard.

Of course, we could say that the XML standard is deficient in not defining 
explicit ordering.  A standard-defined optional order attribute would have 
been nice.  In my opinion, the support of explicit ordering, along with the 
ability to encapsulate native binary data, were advantages of GRS over XML. 
(Not that it matters any more).

-mark


----- Original Message ----- 
From: "Edward C. Zimmermann" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Thursday, December 07, 2006 5:32 AM
Subject: Models of proximity and where I'd like to take ZING.


> WARNING: The stuff ahead is not for the faint of technology and only
> suitable to reading by those wading deep in search.
>
> Models of proximity and where I'd like to take ZING.
>
> Traditionally one looked at proximity as if the world was unstructured
> and one just viewed the text as stored. In the last SRU/W meeting the
> issue of extending proximity to structured documents was brought up and
> I argued that it's not proximity.
>
> Let's look a bit closer by example of XML fragments (from SGML/XML markup
> of Shakespeare's works by Jon Bosak):
>
> <SPEECH>
> <SPEAKER>LADY MACBETH</SPEAKER>
> <LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
> <LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
> <LINE>lord, fie! a soldier, and afeard? What need we</LINE>
> <LINE>fear who knows it, when none can call our power to</LINE>
> <LINE>account?--Yet who would have thought the old man</LINE>
> <LINE>to have had so much blood in him.</LINE>
> </SPEECH>
>
> First off I think we have an idea of "nearness": being in the same leaf.
>
> The words "out" and "spot" are in the same node (with path ...\SPEECH\LINE).
> Its named SPEECH ancestor is the above speech--- the only speech in all of
> Shakespeare's works to have the words "out" and "spot" in the same LINE.
> The SPEAKER descendant of that SPEECH is "LADY MACBETH".
>
> That's the view of the LINE metric, and the distance is 0, viz. in the same
> line. This is not the same as "near": a view of PLAY would include anything
> in that play and that's hardly near. What does a distance other than 0 mean?
> I've argued it means nothing.
>
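
The distance=0 reading above (both words in the same LINE leaf, then report the SPEAKER of the enclosing SPEECH) can be sketched in a few lines of Python. This is only an illustration and not the engine described in the message: the SCENE wrapper element, the function names and the crude tokenizer are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Two speeches from the message, wrapped in a hypothetical SCENE element
# so that the fragment is a single well-formed tree.
FRAGMENT = """<SCENE>
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Yet here's a spot.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
</SPEECH>
</SCENE>"""


def words(text):
    """Crude tokenizer (an assumption): lower-cased runs of letters and apostrophes."""
    return set(re.findall(r"[a-z']+", text.lower()))


def same_leaf_hits(root, terms, leaf_tag="LINE", ancestor_tag="SPEECH",
                   report_tag="SPEAKER"):
    """Return (report text, leaf text) for every leaf containing all terms."""
    wanted = {t.lower() for t in terms}
    hits = []
    for ancestor in root.iter(ancestor_tag):
        for leaf in ancestor.iter(leaf_tag):
            if wanted <= words(leaf.text or ""):
                hits.append((ancestor.findtext(report_tag), leaf.text))
    return hits


root = ET.fromstring(FRAGMENT)
print(same_leaf_hits(root, ["out", "spot"]))   # one hit, in the second speech
print(same_leaf_hits(root, ["why", "then"]))   # no hit: the words sit in different LINEs
```

Note that the sketch never asks how far apart two LINEs are; it only tests containment in one leaf and then walks to the named ancestor.
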
> The word "spot" is said within the works, by contrast, in many other speeches
> by speakers in addition to Lady Macbeth: SALISBURY in `The Life and Death of
> King John', BRUTUS as well as ANTONY in `The Tragedy of Julius Caesar',
> MISTRESS QUICKLY in `The Merry Wives of Windsor', VALERIA in `The Tragedy of
> Coriolanus', ROSALIND in `As You Like It' and MARK ANTONY in `The Tragedy of
> Antony and Cleopatra'.
>
> Lady Macbeth says "spot" in another speech too:
> <SPEECH>
>  <SPEAKER>LADY MACBETH</SPEAKER>
>  <LINE>Yet here's a spot.</LINE>
> </SPEECH>
>
> These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE"
>
> The words 'spot' and 'out', I'd argue, are near (a quality) but what about
> the words 'why' and 'then'?
>
> In XML we not only have a parent/child ancestry of nodes but we also have
> within nodes a linear ordered relationship. One letter follows the next and
> one word follows the other in a container.  In the above example "Yet"
> precedes "here's", "a" follows after, and it finishes with "spot". We have
> order and at least a qualitative (intuitive) notion of distance.
>
> In XML we do not, however, have any well-defined order among the siblings
> (different LINEs). The XML 1.0 well-formedness definition specifically states
> that attributes are unordered and also says nothing about elements. Document
> order (how they are marked-up) and the order a conforming XML parser might
> decide to report the child elements of SPEECH might not be the same. Most
> systems handling XML from a disk and using popular parsers typically deliver
> it in the same order but the standard DOES NOT specify that it need be---
> and for good reason. Note: not all XML is so stored.
>
> One could then specify an inclusion (within the same unnamed or named field
> or path), an order and even a character (octet) metric.
>
> I have not attempted to implement a word metric as the concept of word is
> more complicated than commonly held. Is [log in to unmask] a single word? Two
> words? One word? Maybe even 3? What about a URL? Hyphenation as in
> "auto-mobile"? Two words? On the other hand what does such a distance mean?
>
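
To make the ambiguity concrete, here is a small Python sketch that counts "words" under two different tokenization rules and then measures a word distance inside one of the LINEs quoted earlier. Both rules and all function names are assumptions chosen for illustration; neither is claimed to be what any SRW/U server does.

```python
import re


def split_on_whitespace(text):
    """Rule A (assumed): a word is anything between runs of whitespace."""
    return text.split()


def split_on_alnum_runs(text):
    """Rule B (assumed): a word is a maximal run of ASCII letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def word_distance(tokens, a, b):
    """Distance in tokens between the first occurrences of a and b."""
    return abs(tokens.index(a) - tokens.index(b))


# The same strings yield different word counts under the two rules.
for sample in ["auto-mobile", "http://www.nonmonotonic.net"]:
    print(sample, len(split_on_whitespace(sample)), len(split_on_alnum_runs(sample)))

# The same pair of words is a different distance apart under the two rules.
LINE = "then, 'tis time to do't.--Hell is murky!--Fie, my"
print(word_distance(split_on_whitespace(LINE), "time", "is"))
print(word_distance(split_on_alnum_runs(LINE), "time", "is"))
```

Under rule A "do't.--Hell" is one token, under rule B it is three, so the distance between "time" and "is" changes even though the text has not.
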
> What's the distance in the above example between 'spot' and 'time'?
> Do we count the tag markup (<LINE></LINE>) or only content?  Worse still
> the order is (unless we specify document order) not well defined.
>
> In SRW/U we have the default metric as words. Does this make sense? Does the
> semantics of one platform, one language, one representation lift from one
> system to the next? Or is it just arbitrary like alphanumeric sorting of
> titles (where each does their own thing)?
>
> Is it rendered level (where the tag elements don't exist to "get in the
> way")? Makes things even worse.  In a three column newspaper what's the
> distance between the first word in the second column and the last word of the
> first sentence in the first column?  Different devices, different distances?
>
> Words in an unstructured world make sense as an entire document can be
> segmented into its words. The set of all words more or less would be the set
> of the whole document viewed as a serial object. In more abstract documents
> using mark-up this is not the case. The mark-up does not belong to the
> content but describes the content--- at another layer (search) we even go
> ahead and start to associate a semantics (title, author etc.).
>
> Trying to extend this to an arbitrary field (tag, attribute) is not a good
> idea. What does a distance of 2 with respect to LINE mean?  How about 100?
> 1000? LINEs maybe in the same speech, but what about in some other speech?
> I think this path would take one deeper and deeper in the wrong direction.
> It's also, I'd argue, not even needed to be able to express the kind of
> ordered queries (appealing to document order) that one might want to express.
>
> In my system I've kept my metric of proximity to the distance defined as the
> file offsets (octets) as the record is stored on the file system. The
> renderings of Überzeugung and &Uuml;berzeugung are equivalent but their
> lengths are different.  The characters 'L' 'I' 'N' 'E' are no different from
> 's' 'p' 'o' 't'. Different mark-ups for the same content have different
> distances. That's document order. What we, I think, really want! (and maybe
> the only proximity that makes sense in a generic model)
>
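
A minimal Python sketch of an octet metric, assuming the record is available as the bytes stored on disk. The function name octet_distance and the two sample byte strings are made up for the example; the point is only that markup octets count exactly like content octets.

```python
def octet_distance(stored: bytes, a: bytes, b: bytes) -> int:
    """Distance in octets between the first occurrences of a and b in stored."""
    pos_a = stored.find(a)
    pos_b = stored.find(b)
    if pos_a < 0 or pos_b < 0:
        raise ValueError("term not found in stored record")
    return abs(pos_b - pos_a)


# The same content stored with and without LINE markup (illustrative samples).
with_markup = (b"<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>\n"
               b"<LINE>then, 'tis time to do't.</LINE>")
without_markup = (b"Out, damned spot! out, I say!--One: two: why,\n"
                  b"then, 'tis time to do't.")

d_markup = octet_distance(with_markup, b"spot", b"time")
d_plain = octet_distance(without_markup, b"spot", b"time")
# The difference is exactly the markup that sits between the two words.
print(d_markup, d_plain, d_markup - d_plain)

# Equivalent renderings, different stored lengths.
as_latin1 = "\u00dcberzeugung".encode("latin-1")   # one octet for the U-umlaut
as_utf8 = "\u00dcberzeugung".encode("utf-8")       # two octets for the U-umlaut
as_entity = b"&Uuml;berzeugung"                    # six octets for the entity
print(len(as_latin1), len(as_utf8), len(as_entity))
```

The extra distance in the marked-up version is the length of "</LINE>" plus "<LINE>", which is the sense in which different mark-ups of the same content give different distances.
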
> The advantage is that I have now a metric of the document order as byte
> offsets and may combine it with order in the tree (path as <LINE> follows
> <SPEECH> follows <ACT> etc.) to also specify in queries a search that does
> respect EXPLICITLY and with full intent the "document order".
>
> In our search models we have also the idea of record. But what's a record?
> Should not our model of "record" too be defined by our queries?  No XPath
> stuff applied to a document but as per the query--- recall also that we can
> and might have information that is more abstract than can be represented in
> XML.
>
> If I want to know who spoke the lines 'out' and 'spot' in their speeches I
> want for each hit the SPEAKER sub-element of the SPEECH associated with my
> hit, right?
>
> A single play might then have multiple hits. Here a result record is a
> document fragment (here an XML fragment) and not the whole document. As one
> can see, the views as to what a hit is vary. I do a lot of RSS/CAP indexing.
> A single RSS document may contain multiple items. When I search from the view
> of looking for stories each item is probably what we'd consider a hit and not
> the channel and hardly the whole feed. At the same time perhaps I may indeed
> want to search for feeds. Should this not be expressible in our language?
> The model of Ancestor and Descendant does solve this. Layer in the byte
> metric of the storage level and I think we have the whole Magilla.
>
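
The idea that the query decides what counts as a record can be sketched in Python against a tiny RSS document. The feed below is invented for the example, and hits_at_level with its naive substring match is a hypothetical helper, not part of SRU/CQL or of the engine described in the message.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with three items.
FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Storm warning</title><description>Heavy rain expected.</description></item>
<item><title>Road closure</title><description>Bridge closed after storm.</description></item>
<item><title>Market news</title><description>Prices steady.</description></item>
</channel></rss>"""


def hits_at_level(root, term, unit_tag):
    """Return every unit_tag element whose descendant text contains term.

    The caller picks the ancestor that plays the role of "record":
    "item" to search for stories, "channel" to search for feeds.
    """
    term = term.lower()
    hits = []
    for unit in root.iter(unit_tag):
        text = " ".join(unit.itertext()).lower()
        if term in text:   # naive substring match, enough for a sketch
            hits.append(unit)
    return hits


root = ET.fromstring(FEED)
stories = hits_at_level(root, "storm", "item")
feeds = hits_at_level(root, "storm", "channel")
print([s.findtext("title") for s in stories])  # two stories match
print(len(feeds))                              # the one channel matches once
```

The same term against the same document gives two hits or one, depending only on which ancestor the query names as the result unit.
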
> The advantage of this view, I think, is that it explicitly rips apart the
> difference between structural and contextual view of information in the
> documents and the representation as marked-up and stored--- equivalent
> documents will deliver the same tree but might well have quite different
> markup on the document as storage object level. In fact, it might not even
> be stored as a serial document.
>
> Please note. The above is not just theory of how one can search but I've
> fully implemented it in my engine-- I think I am probably the only one among
> us that has even bothered to implement the distance=0 case for arbitrary
> elements. It does work and is general enough to let me index and search
> diverse collections using very different models, mark-up etc.
>
> My suggestion is that we overhaul the model in 2.0 and break a bit with
> some of our past thinking that maybe made sense in Z39.50 in the 1980s or
> 1990s but...
>
> Comments?
>
> -- 
> Edward C. Zimmermann, Basis Systeme netzwerk, Munich
> Office Leo (R&D):
>   Leopoldstrasse 53-55, D-80802 Munich,
>   Federal Republic of Germany
> Telephone:   Voice:=  +49 (89) 385-47074  Corp.Fax:= +49 (89)  692-8150
> Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55  Alt.Mobile:= +49 (179) 205-0539
> http://www.nonmonotonic.net
> 
