----- Original Message -----
From: "Edward C. Zimmermann" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Thursday, December 07, 2006 5:21 PM
Subject: Re: Models of proximity and where I'd like to take ZING.
> Quoting Mark Hinnebusch <[log in to unmask]>:
>
>> Edward,
>>
>> The whole issue of proximity has always been confused with issues of
>> representation and structure. But, we tended to try to finesse the issue
>> in
>
> Agree
>
>> a couple of ways:
>> (1) how the query is interpreted is a "local issue" and you get
>> what the server says you meant.
>
> Which can be fine when the chances are that the two are not that far off
> from one another. It needs to be consistent, and the more arbitrary it
> becomes the less satisfying the whole mechanism becomes.
>
>>
>> If I understand your email, you are trying to grapple with what proximity
>> means when there is no usable implicit ordering nor is there an explicit
>> ordering. I would argue that in this case, proximity is meaningless. If
>
> It goes back to last year--- and again in the Hague--- when I argued that
> proximity of an element is not really proximity since
> any measure other than distance=0 does not make sense.
Absent explicit ordering of the elements within the structure, I would
agree.
>
>> you want to use the byte position within the XML, then that is an
>> implicit
>
> Not the byte position within XML but the byte position as the information
> was stored as a serial object in storage. It could be binary, PDF, XML,
> might have been some GRS fastload.. who knows.. but there is an order.
> There may even be an extrasystem semantic for the order. In the XML markup
> of Shakespeare, for example, it's the order of one act following the next,
> one speech following the next and one line following the next. This
> association is not demanded but can be specified by the searcher as part
> of the query expression.
>
>> ordering and could be used, but seems to violate the spirit of the XML
>> standard.
>
> It's a layer: additional information beyond just the XML. It may, in fact,
> be undefined. It is not placing anything upon the XML standard but
> enabling a set of query expressions to search collections that may have
> been marked-up in XML (or SGML or GRS or MARC or ..).
So you are making the ordering explicit, which means adding semantic value
to the original XML representation that does not have any order. Even using
the order in which the bytes are stored is adding semantic value. If you do
this, then proximity makes sense again. But you have transformed the
problem space.
>
> Example:
>
> The lines where "love" and "king" are in the same line.
> Among those lines, the ones where the word "homage" occurs within
> 100 bytes (as it's stored on the disk)?
If the line is stored as a single element, we have no problem, right?
Except I would not make an explicit demand that the distance be as defined
on the disk; that is an implementation decision. What if I use multiple
bytes for some strange reason, or store a line in some weird tree structure
for obscure reasons? The distance should be interpreted as distance in the
original document and the implementation would need to be able to calculate
that from the actual storage mechanism.
>
> In my own "internal" language I use the binary operator AND:path to mean
> in the same path or tag. NEAR:nn to mean within nn bytes of storage so as
> an RPN query:
>
> love king AND:line homage NEAR:100
>
> (NEAR without the :100 would mean in the same unnamed node which would
> just happen to be LINE)
>
> (requesting the SPEECH ancestor of the line hit elements, see below)
>
> `The Two Gentlemen of Verona'
> ** 'speech' Fragment:
> <SPEAKER>Third Outlaw</SPEAKER>
> <LINE>What say'st thou? wilt thou be of our consort?</LINE>
> <LINE>Say ay, and be the captain of us all:</LINE>
> <LINE>We'll do thee homage and be ruled by thee,</LINE>
> <LINE>Love thee as our commander and our king.</LINE>
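A minimal sketch of how an RPN query such as the one above might be evaluated. It is not the implementation described in this thread: the `Hit` record, the `index`, `merge`, `gap` and `evaluate` helpers, and the choice to count offsets over the line text alone (ignoring markup bytes) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    start: int  # offset of the first byte of the match
    end: int    # offset one past the last byte of the match
    line: int   # index of the enclosing LINE element


# The four LINE elements of the speech quoted above (ASCII only, so
# character offsets and byte offsets coincide).
LINES = [
    "What say'st thou? wilt thou be of our consort?",
    "Say ay, and be the captain of us all:",
    "We'll do thee homage and be ruled by thee,",
    "Love thee as our commander and our king.",
]


def index(lines):
    """Build word -> set of Hit, with offsets counted as if the lines
    were stored back to back, separated by one newline byte."""
    postings, offset = {}, 0
    for n, text in enumerate(lines):
        for m in re.finditer(r"[a-z']+", text.lower()):
            hit = Hit(offset + m.start(), offset + m.end(), n)
            postings.setdefault(m.group(), set()).add(hit)
        offset += len(text) + 1
    return postings


def merge(a, b):
    """Combine two hits into one span; the left operand's line is kept."""
    return Hit(min(a.start, b.start), max(a.end, b.end), a.line)


def gap(a, b):
    """Number of bytes between two spans (0 if they touch or overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def evaluate(query, postings):
    """Evaluate an RPN query made of terms, AND:line and NEAR:nn."""
    stack = []
    for token in query.split():
        name, _, arg = token.partition(":")
        if name == "AND" and arg.lower() == "line":
            right, left = stack.pop(), stack.pop()
            stack.append({merge(a, b) for a in left for b in right
                          if a.line == b.line})
        elif name == "NEAR" and arg.isdigit():
            right, left = stack.pop(), stack.pop()
            stack.append({merge(a, b) for a in left for b in right
                          if gap(a, b) <= int(arg)})
        else:
            # Anything else is treated as a search term.
            stack.append(postings.get(token.lower(), set()))
    return stack.pop()


postings = index(LINES)
for hit in evaluate("love king AND:line homage NEAR:100", postings):
    print(LINES[hit.line])
```

With these four lines the query reports the "Love thee as our commander and our king." line. Tightening the operator to NEAR:10 returns nothing, because in this layout "homage" sits more than 10 bytes before the start of that line.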
>
> NOTE: Since within a container (field) we have an order we can talk, to
> keep to my nomenclature, of BEFORE:path and AFTER:path
If you do, in fact, have the order. Isn't that the crux of the matter? In
the example, we would clearly be able to intuit an order. But what if the
data were:
<OBSERVATION>
<LOCALE> location where the observations were taken </LOCALE>
<DATUM> n </DATUM>
<DATUM> n </DATUM>
<DATUM> n </DATUM>
<DATUM> n </DATUM>
<DATUM> n </DATUM>
</OBSERVATION>
then, without knowing, ex cathedra, the meaning of the data and the implicit
order, you can only depend on the physical ordering, yet the XML standard
tells you that you can't. And I don't agree with Ralph that you can fault
the XML tools. They meet the requirements of the standard and that is all
you should expect of them. Otherwise, you can complain that they don't give
a good back-rub. The problem is in the standard or in the data represented
failing to provide explicit ordering as data. So, I think in this case it
goes back to the "server knows all" solution. If you have a server that
somehow "knows" the ordering, then it can offer proximity across the
elements. If it doesn't, then a well-behaved server should refuse to
imagine it out of thin air, or at least give a good back-rub in the process.
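The "well-behaved server" rule can be sketched in a few lines. This is only an illustration under stated assumptions: `ORDERED_CONTAINERS`, `NoOrderingError` and `within` are hypothetical names, and positions are taken to be child indexes inside the container.

```python
class NoOrderingError(Exception):
    """Raised when proximity is requested over elements with no known order."""


# Hypothetical server-side knowledge: containers whose children are ordered.
ORDERED_CONTAINERS = {"SPEECH"}


def within(container, positions_a, positions_b, max_distance):
    """Return pairs of child positions at most max_distance apart.

    Positions are indexes of child elements inside `container`. The call is
    refused when the server has no declared ordering for that container.
    """
    if container not in ORDERED_CONTAINERS:
        raise NoOrderingError(f"no known ordering for <{container}> children")
    return [(a, b) for a in positions_a for b in positions_b
            if abs(a - b) <= max_distance]
```

Under this sketch, a proximity request over the LINE children of a SPEECH is answered, while the same request over the DATUM children of an OBSERVATION raises an error instead of guessing an order.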
>
> The line fragment of damned spot AND:line
>
> ** 'LINE' Fragment:
> Out, damned spot! out, I say!--One: two: why,
>
> The same with damned spot BEFORE:line, but damned spot AFTER:line finds none.
>
>
> Quoting "LeVan,Ralph" <[log in to unmask]>:
>
>>
>> Then there's the issue of unit of retrieval. I've never had a good
>> answer for that one. When they ask for line="out damned", did they want
>> the line, the scene, the act or the play? Typically, I make that
>
> Right.. Or a specific element (path) of that unit.
>
> My model I've thought of as Ancestor/Descendant of hits.
>
> If I look for "out" and "spot" in the same line, I may want the SPEECH.
>
> We have for LINE the path "PLAY\ACT\SCENE\SPEECH\LINE".
>
> I let people specify either PLAY\ACT\SCENE\SPEECH or SPEECH.
> (or also partial paths)
>
> I get:
>
> `The Tragedy of Macbeth'
> ** 'speech' Fragment:
> <SPEAKER>LADY MACBETH</SPEAKER>
> <LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
> <LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
> <LINE>lord, fie! a soldier, and afeard? What need we</LINE>
> <LINE>fear who knows it, when none can call our power to</LINE>
> <LINE>account?--Yet who would have thought the old man</LINE>
> <LINE>to have had so much blood in him.</LINE>
>
> We could now have specified the SPEAKER:
> SPEECH/SPEAKER (SPEECH as Ancestor of the hit and SPEAKER as a
> descendant of the SPEECH).
>
> `The Tragedy of Macbeth'
> ** 'speech/speaker' Fragment:
> LADY MACBETH
>
> The path can make a difference..
>
> play/play\title
>
> is
> `The Tragedy of Macbeth'
> ** 'play/play\title' Fragment:
> The Tragedy of Macbeth
>
> But looking at title we see there are multiple titles, including those of
> acts etc.
> `The Tragedy of Macbeth'
> ** 'play/title' Fragment:
> The Tragedy of Macbeth
> ** 'play/title' Fragment:
> Dramatis Personae
> ** 'play/title' Fragment:
> ACT I
> ** 'play/title' Fragment:
> SCENE I. A desert place.
> ** 'play/title' Fragment:
> SCENE II. A camp near Forres.
> ** 'play/title' Fragment:
> SCENE III. A heath near Forres.
> ** 'play/title' Fragment:
> SCENE IV. Forres. The palace.
> ** 'play/title' Fragment:
> SCENE V. Inverness. Macbeth's castle.
> ** 'play/title' Fragment:
> SCENE VI. Before Macbeth's castle.
> ** 'play/title' Fragment:
> SCENE VII. Macbeth's castle.
> ** 'play/title' Fragment:
> ACT II
> ** 'play/title' Fragment:
> SCENE I. Court of Macbeth's castle.
> ** 'play/title' Fragment:
>
> etc etc etc
>
> It's pretty simple to express and quite powerful.
>
> PLAY\ACT\SCENE\SPEECH/SPEAKER is the speaker of a speech..
> PLAY\ACT\SCENE/SPEECH/SPEAKER are the speakers of all the speeches that
> are in the scene.. etc.
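A rough sketch of the ancestor/descendant addressing, using Python's standard `xml.etree.ElementTree`. It reflects one reading of the notation, assumed here for illustration: the part before the first "/" is a (possibly partial) backslash-separated path that selects an ancestor of the hit, and each "/"-separated name after it selects descendants at any depth. The `PARENT` map and the `path_to` and `resolve` helpers are hypothetical, and descendant steps containing "\" are not handled.

```python
import xml.etree.ElementTree as ET

# A cut-down record built only from the fragment quoted above.
DOC = ET.fromstring(
    "<PLAY><TITLE>The Tragedy of Macbeth</TITLE><ACT><SCENE><SPEECH>"
    "<SPEAKER>LADY MACBETH</SPEAKER>"
    "<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>"
    "<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>"
    "</SPEECH></SCENE></ACT></PLAY>"
)

# ElementTree has no parent pointers, so build a child -> parent map once.
PARENT = {child: parent for parent in DOC.iter() for child in parent}


def path_to(node):
    """Tags from the root down to node, e.g. ['PLAY', ..., 'LINE']."""
    tags = []
    while node is not None:
        tags.append(node.tag)
        node = PARENT.get(node)
    return tags[::-1]


def resolve(hit, expr):
    """Return the elements addressed by expr, relative to the hit element."""
    ancestor_path, *steps = expr.upper().split("/")
    wanted = ancestor_path.split("\\")
    # Walk up from the hit until the path to the node ends with `wanted`.
    node = hit
    while node is not None and path_to(node)[-len(wanted):] != wanted:
        node = PARENT.get(node)
    if node is None:
        return []
    # Each remaining step selects descendants (any depth) with that tag.
    nodes = [node]
    for tag in steps:
        nodes = [d for n in nodes for d in n.iter(tag) if d is not n]
    return nodes


hit = DOC.find(".//LINE")  # the LINE element containing the hit
print([e.text for e in resolve(hit, "SPEECH/SPEAKER")])
print([e.text for e in resolve(hit, "PLAY\\ACT\\SCENE\\SPEECH/SPEAKER")])
```

Both calls address the SPEAKER of the speech that contains the hit. The collection is not re-indexed per unit of retrieval; only the coordinates of the hit in the tree are needed.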
>
> I think you get the idea.
>
>
>> decision statically and build a database where the play was decomposed
>> into a reasonable unit of retrieval with navigation information added to
>> support moving up and down. If it wasn't clear what unit of retrieval
>> was desired, I'll make versions of the database with records for each
>> unit of retrieval.
>
> With this model of addressing the elements of retrieval we let the
> searcher define their own unit of retrieval!
>
> I don't have to re-index my collection of Shakespeare's works to ask and
> get answers to questions like: Who said this and that? In what speech,
> what act.. etc.
>
> I can demonstrate the same on the 806791 Reuters test collection or
> whatever..
>
> I can even apply this to information that can't be marked-up in XML but
> is represented in abstract trees with overlap.
>
> The key is the concept of "hit" and knowing where the coordinates of the
> hit are within the document/record tree.
>
> RDF (and RSS) are real world problems--- and I'm already applying this to
> many 100s of feeds (continuously indexed) in http://www.ibu.de
>
>
>
> --
> Edward C. Zimmermann, Basis Systeme netzwerk, Munich
> Office Leo (R&D):
> Leopoldstrasse 53-55, D-80802 Munich,
> Federal Republic of Germany
> Telephone: Voice:= +49 (89) 385-47074 Corp.Fax:= +49 (89) 692-8150
> Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55 Alt.Mobile:= +49 (179)
> 205-0539
> http://www.nonmonotonic.net
>