Quoting Mark Hinnebusch:

> Edward,
>
> The whole issue of proximity has always been confused with issues of
> representation and structure. But, we tended to try to finesse the issue in

Agree

> a couple of ways:
> (1) how the query is interpreted is a "local issue" and you get what
> the server says you meant.

Which can be fine when the chances are that the two are not that far off
from one another. It needs to be consistent, and the more arbitrary it
becomes the less satisfying the whole mechanism becomes.

> If I understand your email, you are trying to grapple with what proximity
> means when there is no usable implicit ordering nor is there an explicit
> ordering. I would argue that in this case, proximity is meaningless. If

It goes back to last year, and again in The Hague, when I argued that
proximity of an element is not really proximity, since any measure other
than distance=0 does not make sense.

> you want to use the byte position within the XML, then that is an implicit

Not the byte position within XML but the byte position as the information
was stored as a serial object in storage. It could be binary, PDF, XML,
might have been some GRS fastload.. who knows.. but there is an order.
There may even be an extrasystem semantic for the order. In the XML markup
of Shakespeare, for example, it's the order of one act following the next,
one speech following the next and one line following the next. This
association is not demanded but can be specified by the searcher as part of
the query expression.

> ordering and could be used, but seems to violate the spirit of the XML
> standard.

It's a layer: additional information beyond just the XML. It may, in fact,
be undefined. It is not placing anything upon the XML standard but enabling
a set of query expressions to search collections that may have been marked
up in XML (or SGML or GRS or MARC or ..).

Example: The lines where "love" and "king" are in the same line.
Among those lines, which have the word "homage" within 100 bytes (as it is
stored on the disk)?

In my own "internal" language I use the binary operator AND:path to mean
in the same path or tag, and NEAR:nn to mean within nn bytes of storage.
So as an RPN query:

    love king AND:line homage NEAR:100

(NEAR without the :100 would mean in the same unnamed node, which would
just happen to be LINE.)

(Requesting the SPEECH ancestor of the line hit elements, see below.)

`The Two Gentlemen of Verona'
** 'speech' Fragment:
<SPEAKER>Third Outlaw</SPEAKER>
<LINE>What say'st thou? wilt thou be of our consort?</LINE>
<LINE>Say ay, and be the captain of us all:</LINE>
<LINE>We'll do thee homage and be ruled by thee,</LINE>
<LINE>Love thee as our commander and our king.</LINE>

NOTE: Since within a container (field) we have an order, we can talk, to
keep to my nomenclature, of BEFORE:path and AFTER:path.

The line fragment of

    damned spot AND:line

** 'LINE' Fragment: Out, damned spot! out, I say!--One: two: why,

or

    damned spot BEFORE:line

but

    damned spot AFTER:line

finding none.

Quoting "LeVan,Ralph":

> Then there's the issue of unit of retrieval. I've never had a good
> answer for that one. When they ask for line="out damned", did they want
> the line, the scene, the act or the play? Typically, I make that

Right.. Or a specific element (path) of that unit. My model I've thought of
as Ancestor/Descendant of hits. If I look for "out" and "spot" in the same
line, I may want the SPEECH. We have for LINE the path
"PLAY\ACT\SCENE\SPEECH\LINE". I let people specify either
PLAY\ACT\SCENE\SPEECH or SPEECH (or also partial paths). I get:

`The Tragedy of Macbeth'
** 'speech' Fragment:
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
<LINE>lord, fie! a soldier, and afeard? What need we</LINE>
<LINE>fear who knows it, when none can call our power to</LINE>
<LINE>account?--Yet who would have thought the old man</LINE>
<LINE>to have had so much blood in him.</LINE>

We could now have specified the SPEAKER: SPEECH/SPEAKER (SPEECH as
Ancestor of the hit and SPEAKER as a descendant of the SPEECH).

`The Tragedy of Macbeth'
** 'speech/speaker' Fragment: LADY MACBETH

The path can make a difference.. play/play\\title is

`The Tragedy of Macbeth'
** 'play/play\title' Fragment: The Tragedy of Macbeth

But looking at title we see there are multiple titles.. including those of
the acts etc.

`The Tragedy of Macbeth'
** 'play/title' Fragment: The Tragedy of Macbeth
** 'play/title' Fragment: Dramatis Personae
** 'play/title' Fragment: ACT I
** 'play/title' Fragment: SCENE I. A desert place.
** 'play/title' Fragment: SCENE II. A camp near Forres.
** 'play/title' Fragment: SCENE III. A heath near Forres.
** 'play/title' Fragment: SCENE IV. Forres. The palace.
** 'play/title' Fragment: SCENE V. Inverness. Macbeth's castle.
** 'play/title' Fragment: SCENE VI. Before Macbeth's castle.
** 'play/title' Fragment: SCENE VII. Macbeth's castle.
** 'play/title' Fragment: ACT II
** 'play/title' Fragment: SCENE I. Court of Macbeth's castle.
** 'play/title' Fragment: etc etc etc

It's pretty simple to express and quite powerful.
PLAY\ACT\SCENE\SPEECH/SPEAKER is the speaker of a speech..
PLAY\ACT\SCENE/SPEECH/SPEAKER is the speakers of all the speeches that are
in the scene.. etc. I think you get the idea.

> decision statically and build a database where the play was decomposed
> into a reasonable unit of retrieval with navigation information added to
> support moving up and down. If it wasn't clear what unit of retrieval
> was desired, I'll make versions of the database with records for each
> unit of retrieval.

With this model of addressing the elements of retrieval we let the searcher
define their own unit of retrieval!
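
To make the Ancestor/Descendant idea concrete, here is a rough sketch in
Python using only the standard xml.etree.ElementTree module. It is not the
engine described in this message, just an illustration under assumptions:
the function names (words, hits, ancestor, resolve) and the COLLECTION
wrapper element are made up for the example, only bare tag names such as
SPEECH or SPEECH/SPEAKER are handled (not full backslash paths), and the
byte-offset NEAR:nn operator is left out entirely.

```python
import re
import xml.etree.ElementTree as ET

# Fragments quoted in this thread; COLLECTION, ACT and SCENE are bare
# wrappers added only so the sample is a single well-formed tree.
SAMPLE = """<COLLECTION>
<PLAY><TITLE>The Two Gentlemen of Verona</TITLE><ACT><SCENE><SPEECH>
<SPEAKER>Third Outlaw</SPEAKER>
<LINE>We'll do thee homage and be ruled by thee,</LINE>
<LINE>Love thee as our commander and our king.</LINE>
</SPEECH></SCENE></ACT></PLAY>
<PLAY><TITLE>The Tragedy of Macbeth</TITLE><ACT><SCENE><SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
</SPEECH></SCENE></ACT></PLAY>
</COLLECTION>"""


def words(text):
    """Set of lower-cased words in text (letters and apostrophes only)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def hits(root, tag, terms):
    """Elements named `tag` containing every term, roughly AND:tag."""
    wanted = {t.lower() for t in terms}
    return [el for el in root.iter(tag)
            if wanted <= words("".join(el.itertext()))]


def ancestor(parents, el, tag):
    """Nearest enclosing element named `tag` (or el itself), else None."""
    while el is not None and el.tag != tag:
        el = parents.get(el)
    return el


def resolve(root, hit_tag, terms, spec):
    """Map hits to 'ANCESTOR' or 'ANCESTOR/DESCENDANT' text fragments."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    anc_tag, _, desc_tag = spec.partition("/")
    seen, out = [], []
    for hit in hits(root, hit_tag, terms):
        anc = ancestor(parents, hit, anc_tag)
        if anc is None or anc in seen:
            continue  # no such ancestor, or already reported
        seen.append(anc)
        targets = anc.iter(desc_tag) if desc_tag else [anc]
        # Collapse whitespace so each fragment prints on one line.
        out.extend(" ".join("".join(t.itertext()).split()) for t in targets)
    return out


root = ET.fromstring(SAMPLE)
print(resolve(root, "LINE", ["out", "spot"], "SPEECH/SPEAKER"))
print(resolve(root, "LINE", ["love", "king"], "SPEECH/SPEAKER"))
print(resolve(root, "LINE", ["out", "spot"], "PLAY/TITLE"))
```

The point of the sketch is that the index only has to know where each hit
sits in the tree; the unit returned (SPEECH, SPEAKER, PLAY title) is chosen
at query time by walking up to the ancestor and back down to the
descendant.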
I don't have to re-index my collection of Shakespeare's works to ask and
get answers to questions like: Who said this and that? In what speech, what
act.. etc. I can demonstrate the same on the 806791-document Reuters test
collection or whatever.. I can even apply this to information that can't be
marked up in XML but is represented in abstract trees with overlap. The key
is the concept of "hit" and knowing where the coordinates of the hit are
within the document/record tree.

RDF (and RSS) are real-world problems, and I'm already applying this to
many 100s of feeds (continuously indexed) in http://www.ibu.de

--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
  Leopoldstrasse 53-55, D-80802 Munich,
  Federal Republic of Germany
Telephone: Voice:= +49 (89) 385-47074 Corp.Fax:= +49 (89) 692-8150
Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55 Alt.Mobile:= +49 (179) 205-0539
http://www.nonmonotonic.net