WARNING: The stuff ahead is not for the faint of technology and is
suitable reading only for those wading deep in search.
Models of proximity and where I'd like to take ZING.
Traditionally one looked at proximity as if the world was unstructured
and one just viewed the text as stored. In the last SRU/W meeting the
issue of extending proximity to structured documents was brought up and
I argued that it's not proximity.
Let's look a bit closer, using an example of XML fragments (from the SGML/XML markup
of Shakespeare's works by Jon Bosak):
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
<LINE>lord, fie! a soldier, and afeard? What need we</LINE>
<LINE>fear who knows it, when none can call our power to</LINE>
<LINE>account?--Yet who would have thought the old man</LINE>
<LINE>to have had so much blood in him.</LINE>
</SPEECH>
First off I think we have an idea of "nearness": being in the same leaf.
The words "out" and "spot" are in the same node (with path ...\SPEECH\LINE ).
Its named SPEECH ancestor is the above speech--- the only speech in all of
Shakespeare's works to have the words "out" and "spot" in the same LINE. The
SPEAKER descendant of that SPEECH is "LADY MACBETH".
That is the view under a LINE metric, where the distance is 0, viz. the words are in
the same line. This is not the same as "near": a view of PLAY would include anything
in that play, and that's hardly near. What does a distance other than 0 mean? I've
argued it means nothing.
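
The distance-0 case can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine described in this post: the SCENE wrapper, the crude tokenizer and the function names are all hypothetical, and the fragment is cut down from the speeches quoted in this message.

```python
import re
import xml.etree.ElementTree as ET

# Two speeches quoted in this post, wrapped in a hypothetical SCENE element.
FRAGMENT = """<SCENE>
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Yet here's a spot.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
<LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
</SPEECH>
</SCENE>"""


def words(text):
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return set(re.findall(r"[a-z']+", (text or "").lower()))


def same_leaf_hits(root, container, leaf, terms):
    """Return (speaker, leaf text) for every leaf that holds all the terms."""
    wanted = {t.lower() for t in terms}
    found = []
    for unit in root.iter(container):      # each named ancestor, e.g. SPEECH
        for node in unit.findall(leaf):    # its direct leaf children, e.g. LINE
            if wanted <= words(node.text):
                found.append((unit.findtext("SPEAKER"), node.text))
    return found


root = ET.fromstring(FRAGMENT)
print(same_leaf_hits(root, "SPEECH", "LINE", ["out", "spot"]))
print(same_leaf_hits(root, "SPEECH", "LINE", ["why", "then"]))
```

The first query finds the one LINE holding both words and reports its SPEAKER; the second finds nothing, because 'why' and 'then' sit in different LINEs even though they are adjacent in the stored text.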
The word "spot" is said within the works, by contrast, in many other speeches
by speakers in addition to Lady Macbeth: SALISBURY in `The Life and Death of
King John', BRUTUS as well as ANTONY in `The Tragedy of Julius Caesar',
MISTRESS QUICKLY in `The Merry Wives of Windsor', VALERIA in `The Tragedy of
Coriolanus', ROSALIND in `As You Like It' and MARK ANTONY in `The Tragedy of
Antony and Cleopatra'.
Lady Macbeth says "spot" in another speech too:
<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Yet here's a spot.</LINE>
</SPEECH>
These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE"
The words 'spot' and 'out', I'd argue, are near (a quality) but what about the
words 'why' and 'then'?
In XML we not only have a parent/child ancestry of nodes but we also have within
nodes a linear ordered relationship. One letter follows the next and one word
follows the other in a container. In the above example "Yet" precedes "here's",
"a" follows, and "spot" finishes the line. We have order and at least
a qualitative (intuitive) notion of distance.
In XML we do not, however, have any well-defined order among the siblings
(different LINEs). The XML 1.0 well-formedness definition specifically states
that attributes are unordered and also says nothing about elements. Document
order (how they are marked-up) and the order a conforming XML parser might
decide to report the child elements of SPEECH might not be the same. Most
systems handling XML from a disk and using popular parsers typically deliver
it in the same order but the standard DOES NOT specify that it need be--- and
for good reason. Note: not all XML is so stored.
One could then specify an inclusion (within the same unnamed or named field or
path), an order and even a character (octet) metric.
I have not attempted to implement a word metric as the concept of word is more
complicated than commonly held. Is an e-mail address a single word? Two words?
Maybe even 3? What about a URL? Hyphenation as in "auto-mobile"? Two
words? On the other hand what does such a distance mean?
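
To make the ambiguity concrete, here is a small sketch that counts "words" under three plausible tokenization rules. The three rules, their names and the sample e-mail address are assumptions made up for illustration; none of them is what SRW/U or any particular engine prescribes.

```python
import re

# Three hypothetical definitions of "word"; each is defensible, none is standard.
TOKENIZERS = {
    "whitespace": lambda s: s.split(),
    "keep dots": lambda s: re.findall(r"[A-Za-z0-9.]+", s),
    "alphanumeric runs": lambda s: re.findall(r"[A-Za-z0-9]+", s),
}


def word_counts(text):
    """Count the words in text under each tokenization rule."""
    return {name: len(tokenize(text)) for name, tokenize in TOKENIZERS.items()}


# The address is a placeholder; the URL is the one in the signature below.
for sample in ["user@example.org", "http://www.nonmonotonic.net", "auto-mobile"]:
    print(sample, word_counts(sample))
```

The same string comes out as one, two or more words depending on the rule, so a word distance computed on one system need not mean anything on another.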
What's the distance in the above example between 'spot' and 'time'?
Do we count the tag markup (<LINE></LINE>) or only content? Worse still the
order is (unless we specify document order) not well defined.
In SRW/U we have the default metric as words. Does this make sense? Do the
semantics of one platform, one language, one representation carry over from one
system to the next? Or is it just arbitrary like alphanumeric sorting of titles
(where each does their own thing)?
Is it measured at the rendered level (where the tag elements don't exist to "get in
the way")? That makes things even worse. In a three column newspaper what's the distance
between the first word in the second column and the last word of the first
sentence in the first column? Different devices, different distances?
Words in an unstructured world make sense as an entire document can be
segmented into its words. The set of all words more or less would be the set
of the whole document viewed as a serial object. In more abstract documents
using mark-up this is not the case. The mark-up does not belong to the content
but describes the content--- at another layer (search) we even go ahead and
start to associate a semantics (title, author etc.).
Trying to extend this to an arbitrary field (tag, attribute) is not a good idea.
What does a distance of 2 with respect to LINE mean? How about 100? 1000?
The LINEs may be in the same speech, but what if they are in some other speech? I
think this path would take one deeper and deeper in the wrong direction. It's also,
I'd argue, not even needed to be able to express the kind of ordered queries
(appealing to document order) that one might want to express.
In my system I've kept my metric of proximity to the distance defined as the
file offsets (octets) as the record is stored on the file system. The renderings
of \xdcberzeugung and Überzeugung are equivalent but their lengths are
different. The characters 'L' 'I' 'N' 'E' are no different from 's' 'p' 'o'
't'. Different mark-ups for the same content have different distances. That's
document order, which is what I think we really want (and maybe the only proximity
that makes sense in a generic model).
The advantage is that I have now a metric of the document order as byte offsets
and may combine it with order in the tree (path as <LINE> follows <SPEECH>
follows <ACT> etc.) to also specify in queries a search that does respect
EXPLICITLY and with full intent the "document order".
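
A minimal sketch of the octet metric follows, assuming the record is simply a byte string as stored. The function name and the two sample records are hypothetical; the point is only that the same content yields different distances under different mark-up and encodings.

```python
def octet_distance(stored: bytes, first: bytes, second: bytes) -> int:
    """Bytes between the end of `first` and the start of the next `second`."""
    end_of_first = stored.index(first) + len(first)
    return stored.index(second, end_of_first) - end_of_first


# The same two lines, once stored with LINE mark-up and once as bare text.
marked_up = (
    "<LINE>Out, damned spot! out, I say!--One: two: why,</LINE>\n"
    "<LINE>then, 'tis time to do't.</LINE>"
).encode("utf-8")
plain = (
    "Out, damned spot! out, I say!--One: two: why,\n"
    "then, 'tis time to do't."
).encode("utf-8")

# Tag octets count like any others: 'L' 'I' 'N' 'E' are no different from content.
print(octet_distance(marked_up, b"spot", b"time"))  # 54
print(octet_distance(plain, b"spot", b"time"))      # 41

# Equivalent renderings, different stored lengths.
print(len("Überzeugung".encode("utf-8")))    # 12
print(len("Überzeugung".encode("latin-1")))  # 11
```

The distance is a property of the stored object, not of the abstract content, which is exactly why it can stand in for document order.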
In our search models we have also the idea of record. But what's a record?
Should not our model of "record" too be defined by our queries? No XPath
stuff applied to a document but as per the query--- recall also that we can
and might have information that is more abstract than can be represented in
XML.
If I want to know who spoke the words 'out' and 'spot' in their speeches I
want for each hit the SPEAKER sub-element of the SPEECH associated with my
hit, right?
A single play might then have multiple hits. Here a result record is a document
fragment (here an XML fragment) and not the whole document. As one can see,
views differ as to what a hit is. I do a lot of RSS/CAP indexing. A single RSS
document may contain multiple items. When I search from the view of looking for
stories each item is probably what we'd consider a hit and not the channel and
hardly the whole feed. At the same time perhaps I may indeed want to search for
feeds. Should this not be expressible in our language? The model of Ancestor
and Descendant does solve this. Layer in the byte metric of the storage level
and I think we have the whole Magilla.
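
The idea that the query, not the stored document, defines the record can be sketched as follows. The feed content, the `hits` function and its parameters are invented for illustration; the sketch only shows one way an ancestor named in the query could become the unit of a hit.

```python
import xml.etree.ElementTree as ET

# A made-up RSS-like feed with three items.
FEED = """<rss><channel><title>Example feed</title>
<item><title>Storm warning</title><description>Heavy rain expected</description></item>
<item><title>Road report</title><description>Rain closes the pass</description></item>
<item><title>Sports</title><description>Local team wins</description></item>
</channel></rss>"""


def hits(root, term, unit):
    """Return each distinct `unit` element that is, or contains, a matching node."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    found = []
    for node in root.iter():
        if term.lower() in (node.text or "").lower():
            current = node
            # Climb the ancestry until the element named by the query is reached.
            while current is not None and current.tag != unit:
                current = parent.get(current)
            if current is not None and current not in found:
                found.append(current)
    return found


root = ET.fromstring(FEED)
print([item.findtext("title") for item in hits(root, "rain", "item")])
print(len(hits(root, "rain", "channel")))
```

Asking for `item` gives one hit per matching story, while asking for `channel` gives a single hit for the feed, from the same index and the same term.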
The advantage of this view, I think, is that it explicitly rips apart the
difference between structural and contextual view of information in the
documents and the representation as marked-up and stored--- equivalent
documents will deliver the same tree but might well have quite different
markup on the document as storage object level. In fact, it might not even be
stored as a serial document.
Please note. The above is not just theory of how one can search but I've
fully implemented it in my engine-- I think I am probably the only one among
us that has even bothered to implement the distance=0 case for arbitrary
elements. It does work and is general enough to let me index and search
diverse collections using very different models, mark-up etc.
My suggestion is that we overhaul the model in 2.0 and break a bit with
some of our past thinking that maybe made sense in Z39.50 in the 1980s or
1990s but...
Comments?
--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
Telephone: Voice:= +49 (89) 385-47074 Corp.Fax:= +49 (89) 692-8150
Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55 Alt.Mobile:= +49 (179) 205-0539
http://www.nonmonotonic.net