There are lots of issues here, practically none of them easy.
Simple word proximity isn't easy. I count words as I index them, but I
don't guarantee the order in which I index the fields, so proximity across
fields is a crapshoot. Mostly, I figure that false hits from adjacency
across fields are low probability, and that the wider the proximity window
you ask for, the more likely you are to get junk. The only way to
guarantee the accuracy of word proximity is to combine it with field
proximity as well.
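A minimal sketch of the point above, assuming a toy in-memory index rather than any real engine. The names build_index and near, and the sample record, are hypothetical. Word positions are counted across the whole record, so a proximity test can match across a field boundary unless a same-field check is added.

```python
from collections import defaultdict

def build_index(record):
    """Index a record given as an ordered list of (field, text) pairs.

    Positions are counted across the whole record, in whatever order the
    fields happen to arrive, so they are only trustworthy within one field.
    """
    postings = defaultdict(list)  # word -> [(position, field), ...]
    position = 0
    for field, text in record:
        for word in text.lower().split():
            postings[word].append((position, field))
            position += 1
    return postings

def near(postings, a, b, distance, same_field=False):
    """True if words a and b occur within `distance` positions of each other."""
    for pos_a, field_a in postings.get(a, []):
        for pos_b, field_b in postings.get(b, []):
            if abs(pos_a - pos_b) <= distance:
                if not same_field or field_a == field_b:
                    return True
    return False

# Hypothetical record: the last word of one field sits next to the first
# word of the following field.
record = [("speaker", "lady macbeth"), ("line", "out damned spot")]
index = build_index(record)
cross_field_hit = near(index, "macbeth", "out", 1)
field_checked_hit = near(index, "macbeth", "out", 1, same_field=True)
```

Without the field check, "macbeth" and "out" count as adjacent even though they belong to different fields. With it, the false hit goes away.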
I don't usually count field tags (element names) as words. For some databases,
I do count field tags so I can do proximity between them. (Find records
where speaker="lady macbeth" followed speaker="banquo".) But that
doesn't happen often.
Then there's the issue of unit of retrieval. I've never had a good
answer for that one. When they ask for line="out damned", do they want
the line, the scene, the act or the play? Typically, I make that
decision statically and build a database where the play is decomposed
into a reasonable unit of retrieval, with navigation information added to
support moving up and down. If it isn't clear what unit of retrieval
is desired, I'll make versions of the database with records for each
unit of retrieval.
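The static decomposition described above might look roughly like the following sketch. It assumes Python's standard xml.etree.ElementTree. The tiny PLAY string is an illustrative stand-in for a full play, and decompose and the "up" pointer are hypothetical names, not the actual database layout.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a marked-up play (assumed structure).
PLAY = """<PLAY><ACT><SCENE><SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
</SPEECH></SCENE></ACT></PLAY>"""

def decompose(root, unit):
    """Return one record per `unit` element, with a pointer to its parent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Stable ids: the position of each element in a document-order walk.
    ids = {elem: n for n, elem in enumerate(root.iter())}
    records = []
    for elem in root.iter(unit):
        parent = parent_of.get(elem)
        records.append({
            "id": ids[elem],
            "unit": unit,
            "text": " ".join("".join(elem.itertext()).split()),
            "up": ids[parent] if parent is not None else None,
        })
    return records

root = ET.fromstring(PLAY)
line_records = decompose(root, "LINE")      # one database version per unit
speech_records = decompose(root, "SPEECH")
```

Each version of the database holds records at one granularity, and the "up" id lets a client move from a LINE record to its SPEECH record.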
I played with RDF for a while and one of my biggest problems was trying
to figure out what the unit of retrieval should be. That's one of many
reasons why I gave up on RDF.
(I don't care about XML's modelling problems. Bytes on the wire or on
the disk provide an absolute ordering. If your XML tools don't preserve
that ordering, then shame on them.)
So, Ed, what's your proposal? I agree that there's nothing easy about
proximity. Can you make it easy?
From: SRU (Search and Retrieve Via URL) Implementors
[mailto:[log in to unmask]] On Behalf Of Edward C. Zimmermann
Sent: Thursday, December 07, 2006 5:32 AM
To: [log in to unmask]
Subject: Models of proximity and where I'd like to take ZING.
WARNING: The stuff ahead is not for the faint of technology and is
suitable only for reading by those wading deep in search.
Models of proximity and where I'd like to take ZING.
Traditionally one looked at proximity as if the world was unstructured
and one just viewed the text as stored. In the last SRU/W meeting the
issue of extending proximity to structured documents was brought up and
I argued that it's not proximity.
Let's look a bit closer by example of XML fragments (from SGML/XML markup
of Shakespeare's works by Jon Bosak):
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
<LINE>lord, fie! a soldier, and afeard? What need we</LINE>
<LINE>fear who knows it, when none can call our power to</LINE>
<LINE>account?--Yet who would have thought the old man</LINE>
<LINE>to have had so much blood in
First off I think we have an idea of "nearness": being in the same leaf.
The words "out" and "spot" are in the same node (with path
Its named SPEECH ancestor is the above speech--- the only speech in all
of Shakespeare's works to have the words "out" and "spot" in the same
LINE. The SPEAKER descendant of that SPEECH is "LADY MACBETH".
That's the view under a LINE metric, and the distance is 0, viz. in the same
line. This is not the same as "near": a view of PLAY would include
anything in that play, and that's hardly near. What does a distance other
than 0 mean? I've argued it means nothing.
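The distance=0 case can be sketched as "both words in the same leaf, then walk up to a named ancestor and report one of its descendants". This is only an illustration using Python's xml.etree.ElementTree, not the engine described here. The function names are hypothetical, and the DOC string places two speeches side by side purely for the demonstration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed miniature document; the real markup nests speeches in full plays.
DOC = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
</SPEECH>
<SPEECH><SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Yet here's a spot.</LINE>
</SPEECH>
</SCENE></ACT></PLAY>"""

def words(text):
    """Crude word set: runs of letters and apostrophes, lower-cased."""
    return set(re.findall(r"[a-z']+", text.lower()))

def same_leaf_hits(root, leaf, terms, ancestor, report):
    """For each `leaf` holding all terms, report a child of its `ancestor`."""
    parent_of = {c: p for p in root.iter() for c in p}
    hits = []
    for node in root.iter(leaf):
        if set(terms) <= words(node.text or ""):
            up = node
            while up is not None and up.tag != ancestor:
                up = parent_of.get(up)      # climb towards the root
            if up is not None:
                hits.append(up.findtext(report))
    return hits

root = ET.fromstring(DOC)
both = same_leaf_hits(root, "LINE", ["out", "spot"], "SPEECH", "SPEAKER")
spot_only = same_leaf_hits(root, "LINE", ["spot"], "SPEECH", "SPEAKER")
```

Here "same LINE" is a yes/no containment test. No numeric distance is involved at any point.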
The word "spot" is said within the works, by contrast, in many other
speeches by speakers in addition to Lady Macbeth: SALISBURY in `The Life
and Death of King John', BRUTUS as well as ANTONY in `The Tragedy of
Julius Caesar', MISTRESS QUICKLY in `The Merry Wives of Windsor',
VALERIA in `The Tragedy of Coriolanus', ROSALIND in `As You Like It' and
MARK ANTONY in `The Tragedy of Antony and Cleopatra'.
Lady Macbeth says "spot" in another speech too:
<LINE>Yet here's a spot.</LINE>
These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE"
The words 'spot' and 'out', I'd argue, are near (a quality), but what
about the words 'why' and 'then'?
In XML we not only have a parent/child ancestry of nodes but we also
have within nodes a linear ordered relationship. One letter follows the
next and one word follows the other in a container. In the above
example "Yet" precedes "here's", "a" follows after, and the line
finishes with "spot". We have order and at least a qualitative
(intuitive) notion of distance.
In XML we do not, however, have any well-defined order among the
siblings (different LINEs). The XML 1.0 well-formedness definition
specifically states that attributes are unordered and also says nothing
about elements. Document order (how they are marked up) and the order a
conforming XML parser might decide to report the child elements of
SPEECH might not be the same. Most systems handling XML from a disk and
using popular parsers typically deliver it in the same order but the
standard DOES NOT specify that it need be--- and for good reason. Note:
not all XML is so stored.
One could then specify an inclusion (within the same unnamed or named
field or path), an order and even a character (octet) metric.
I have not attempted to implement a word metric as the concept of word
is more complicated than commonly held. Is [log in to unmask] a single word?
Two words? One word? Maybe even 3? What about a URL? Hyphenation as in
"auto-mobile"? Two words? On the other hand what does such a distance
What's the distance in the above example between 'spot' and 'time'?
Do we count the tag markup (<LINE></LINE>) or only content? Worse still
the order is (unless we specify document order) not well defined.
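To make the ambiguity concrete, here is a small sketch comparing two plausible word segmenters, with and without the tag markup. Both tokenizers, the helper names and the choice to measure from the first token containing each word are assumptions made for illustration. None of them is what SRW/U prescribes.

```python
import re

RAW = ("<LINE>Out, damned spot! out, I say!--One: two: why,</LINE> "
       "<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>")

def split_on_whitespace(text):
    """One token per whitespace-delimited chunk (punctuation stays attached)."""
    return text.split()

def split_on_letters(text):
    """One token per run of ASCII letters (hyphens and apostrophes split)."""
    return re.findall(r"[A-Za-z]+", text)

def strip_tags(text):
    """Replace each tag with a space so only content remains."""
    return re.sub(r"<[^>]+>", " ", text)

def word_distance(tokens, a, b):
    """Token steps from the first token containing a to the first containing b."""
    lowered = [t.lower() for t in tokens]
    first_a = next(i for i, t in enumerate(lowered) if a in t)
    first_b = next(i for i, t in enumerate(lowered) if b in t)
    return first_b - first_a

content = strip_tags(RAW)
d_whitespace = word_distance(split_on_whitespace(content), "spot", "time")
d_letters = word_distance(split_on_letters(content), "spot", "time")
d_letters_with_tags = word_distance(split_on_letters(RAW), "spot", "time")
hyphen_counts = (len(split_on_whitespace("auto-mobile")),
                 len(split_on_letters("auto-mobile")))
```

The three distances all differ, and "auto-mobile" is one word to the first segmenter and two to the second, so a "distance in words" depends on which system answers the query.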
In SRW/U we have the default metric as words. Does this make sense? Do
the semantics of one platform, one language, one representation lift
from one system to the next? Or is it just arbitrary like alphanumeric
sorting of titles (where each does their own thing)?
Is it rendered level (where the tag elements don't exist to "get in the
Makes things even worse. In a three column newspaper what's the
distance between the first word in the second column and the last word
of the first sentence in the first column? Different devices, different
Words in an unstructured world make sense, as an entire document can be
segmented into its words. The set of all words more or less would be the
set of the whole document viewed as a serial object. In more abstract
documents using mark-up this is not the case. The mark-up does not
belong to the content but describes the content--- at another layer
(search) we even go ahead and start to associate a semantics (title,
Trying to extend this to an arbitrary field (tag, attribute) is not a
What does a distance of 2 with respect to LINE mean? How about 100?
LINEs maybe in the same speech, but maybe in some other speech? I think this
path would take one deeper and deeper in the wrong direction. It's also,
I'd argue, not even needed to be able to express the kind of ordered
queries (appealing to document order) that one might want to express.
In my system I've kept my metric of proximity to the distance defined as
the file offsets (octets) as the record is stored on the file system.
The renderings of \xdcberzeugung and Überzeugung are equivalent but
their lengths are different. The characters 'L' 'I' 'N' 'E' are no
different from 's' 'p' 'o' 't'. Different mark-ups for the same content
have different distances.
That's document order. What we, I think, really want! (and maybe the
only proximity that makes sense in a generic model)
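A short sketch of the octet metric, under stated assumptions: the three encodings of "Überzeugung" below (UTF-8, Latin-1, and an HTML-style entity) are guesses at the kind of difference meant, and octet_distance is a hypothetical helper, not the engine's own code.

```python
word = "Überzeugung"
utf8_len = len(word.encode("utf-8"))        # U-umlaut takes two octets
latin1_len = len(word.encode("latin-1"))    # U-umlaut takes one octet
entity_len = len("&Uuml;berzeugung".encode("ascii"))  # entity form

def octet_distance(stored: bytes, a: bytes, b: bytes) -> int:
    """Octets from the start of the first `a` to the start of the next `b`."""
    start_a = stored.index(a)
    start_b = stored.index(b, start_a + len(a))
    return start_b - start_a

# The same content stored with and without LINE markup.
marked_up = (
    "<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>\n"
    "<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>"
).encode("utf-8")
plain = (
    "Out, damned spot! out, I say!--One: two: why,\n"
    "then, 'tis time to do't.--Hell is murky!--Fie, my"
).encode("utf-8")

d_marked_up = octet_distance(marked_up, b"spot", b"time")
d_plain = octet_distance(plain, b"spot", b"time")
```

The marked-up distance exceeds the plain one by exactly the octets of the intervening </LINE> and <LINE> tags, which is the sense in which markup characters are no different from content characters under this metric.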
The advantage is that I have now a metric of the document order as byte
offsets and may combine it with order in the tree (path as <LINE>
follows <SPEECH> follows <ACT> etc.) to also specify in queries a search
that does respect EXPLICITLY and with full intent the "document order".
In our search models we have also the idea of record. But what's a
Should not our model of "record" too be defined by our queries? No
XPath stuff applied to a document but as per the query--- recall also
that we can and might have information that is more abstract than can be
represented in XML.
If I want to know who spoke lines with 'out' and 'spot' in their speeches
I want for each hit the SPEAKER sub-element of the SPEECH associated
with my hit, right?
A single play might then have multiple hits. Here a result record is a
document fragment (here an XML fragment) and not the whole document. As
one can see the views as to what a hit is .. I do a lot of RSS/CAP
indexing. A single RSS document may contain multiple items. When I
search from the view of looking for stories each item is probably what
we'd consider a hit and not the channel and hardly the whole feed. At
the same time perhaps I may indeed want to search for feeds. Should this
not be expressible in our language? The model of Ancestor and
Descendant does solve this. Layer in the byte metric of the storage
level and I think we have the whole Magilla.
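The RSS case can be sketched the same way: the query names the ancestor that should count as the record. The feed content below is invented for illustration, and search is a hypothetical helper built on Python's xml.etree.ElementTree. Only the element names (rss, channel, item, title) come from RSS itself.

```python
import xml.etree.ElementTree as ET

# Invented feed, for illustration only.
FEED = """<rss version="2.0"><channel><title>Weather alerts</title>
<item><title>Storm warning for Munich</title></item>
<item><title>Flood watch lifted</title></item>
<item><title>Storm damage report</title></item>
</channel></rss>"""

def search(root, term, unit):
    """Return each distinct `unit` ancestor-or-self of a node containing term."""
    parent_of = {c: p for p in root.iter() for c in p}
    results = []
    for node in root.iter():
        if term.lower() in (node.text or "").lower():
            up = node
            while up is not None and up.tag != unit:
                up = parent_of.get(up)      # climb until the unit is found
            if up is not None and up not in results:
                results.append(up)
    return results

root = ET.fromstring(FEED)
stories = search(root, "storm", "item")     # hits as individual stories
feeds = search(root, "storm", "channel")    # hits as whole feeds
```

The same matches produce two story records or one feed record, depending only on the unit named in the query.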
The advantage of this view, I think, is that it explicitly rips apart
the difference between structural and contextual view of information in
the documents and the representation as marked-up and stored---
equivalent documents will deliver the same tree but might well have
quite different markup on the document as storage object level. In
fact, it might not even be stored as a serial document.
Please note. The above is not just theory of how one can search but I've
fully implemented it in my engine-- I think I am probably the only one
among us that has even bothered to implement the distance=0 case for
arbitrary elements. It does work and is general enough to let me index
and search diverse collections using very different models, mark-up
My suggestion is that we overhaul the model in 2.0 and break a bit with
some of our past thinking that maybe made sense in Z39.50 in the 1980s
or 1990s but...
Edward C. Zimmermann, Basis Systeme netzwerk, Munich Office Leo (R&D):
Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
Telephone: Voice:= +49 (89) 385-47074 Corp.Fax:= +49 (89) 692-8150
Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55 Alt.Mobile:= +49 (179)