---------- Forwarded Message -----------
From: "Edward C. Zimmermann" <[log in to unmask]>
To: "SRU (Search and Retrieve Via URL) Implementors" <[log in to unmask]>, [log in to unmask]
Sent: Thu, 28 Aug 2008 15:28:12 +0200
Subject: Re: SRU/CQL 2.0: Invitation to participate in OASIS SWS TC Development

> Among the suggested 2.0 features are:
>
> 1. Allow Non-XML Record Representations
> Many formats do not map easily into XML, for example multimedia,
> images, and even complex text formats. The suggestion is to allow
> non-xml serialized data in the response, as well as value by
> reference. These would be signaled by additional values for the
> recordPacking parameter. For example recordPacking="base64" or
> recordPacking="uri".

Makes sense even for text. We are, after all, often indexing objects that don't fit into XML (such as plain text words, lines, sentences, paragraphs, pages and their overlaps).

>
> 2. Proximity
> deprecate the PROX BOOLEAN operator and instead represent proximity
> by two methods:

Can't agree more, since I think the very concept of "proximity" (and I've voiced this quite a few times) is WRONG here. We should NEVER, I'd say, speak of proximity other than, at best, as a linear metric of octets in the original data as stored.

http://ibu.de/node/52 (start reading at "Query Model")

"In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other in a container. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at least a qualitative (intuitive) notion of distance. In XML we do not, however, have any well-defined order among the siblings (different LINEs). The XML 1.0 well-formedness definition specifically states that attributes are unordered and says also nothing about elements.
Document order (how they are marked-up) and the order a conforming XML parser might decide to report the child elements of SPEECH might not be the same. Most systems handling XML from a disk and using popular parsers typically deliver it in the same order but the standard DOES NOT specify that it need be, and for good reason."

Even if we restrict the current model of proximity to the ordering within a single container we have, beyond a metric of bytes as stored, problems when we start to speak of words. What is a word? It's up to the server, after all, to decide that, and there is often little way of letting the user know what it might be. For example, in my own engine, depending upon configuration, any of a number of non-alphanumeric characters may belong to a word depending upon what is before and after that character. What's 6 words over? Depends. What are words in XML marked-up text? Are the different ingredients below each 1 word over from the next?

<ingredients>
<item>Chocolate</item>
<item>Flour</item>
<item>Butter</item>
<item>eggs</item>
</ingredients>

Are "eggs" and "flour" within 3 words of each other?

You need to kick the habit of units and think instead of structure. Instead of proximity we should (well, actually need to) talk about something being in an element within some structure. Words, sentences, etc., which you have defined as units, are really nothing other than a structure. The above ingredients and items are structure...

What you have called proximity with unit as words and distance of less than 5 is really nothing other than:
- a map of a record into words:
  <word>this</word><word>is</word><word>a</word><word>word</word>
together with a linear order which yields a count.

We should NOT assume or demand that all records have word, line, paragraph etc. structure or even that we can agree upon the application of word, line etc. My word model and your word model may be different.
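
To make the point about differing word models concrete, here is a minimal sketch in Python, assuming the standard xml.etree.ElementTree module. The two "word models" and the helper names (words_flat, words_per_container, within) are hypothetical illustrations, not taken from any SRU/CQL server. It shows the same "within 3 words" question getting two different answers for the ingredients record above, depending on how the server maps the record into words.

```python
import re
import xml.etree.ElementTree as ET

RECORD = """<ingredients>
<item>Chocolate</item>
<item>Flour</item>
<item>Butter</item>
<item>eggs</item>
</ingredients>"""


def words_flat(xml_text):
    """Word model A (assumed): ignore mark-up, one linear word sequence per record."""
    root = ET.fromstring(xml_text)
    return [w.lower() for w in re.findall(r"\w+", "".join(root.itertext()))]


def words_per_container(xml_text, tag):
    """Word model B (assumed): each instance of `tag` has its own word sequence."""
    root = ET.fromstring(xml_text)
    return [
        [w.lower() for w in re.findall(r"\w+", "".join(el.itertext()))]
        for el in root.iter(tag)
    ]


def within(words, a, b, distance):
    """True if a and b both occur in `words` fewer than `distance` positions apart."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) < distance for i in pos_a for j in pos_b)


flat = words_flat(RECORD)
per_item = words_per_container(RECORD, "item")

# Same query, two word models, two answers.
answer_flat = within(flat, "eggs", "flour", 3)
answer_per_item = any(within(ws, "eggs", "flour", 3) for ws in per_item)
print(answer_flat, answer_per_item)
```

Under model A the record is one run of four words and the query matches; under model B each item is its own one-word sequence and it does not. A client has no way to tell which model a given server applies unless the structure is named in the query.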
Searching for a word model means to search for the word model as defined by the record as indexed. It's just like searching for title.

> -- Adding a relation: 'window'.
> examples:
> * dc.title window/distance<5/unit=word "fries salt vinegar"

See above why that's still "wrong".

> (fries, salt, and vinegar all within a span of 5 words)
> *dc.title window/distance<5/unit=word ((fish and fries) and (salt or
> vinegar))
> (fish and chips and one of salt or vinegar, in a 5 word window)
> * dc.title window/distance=2/unit=word/ordered "fries salt "
> (fries followed by salt with 2 words between)
>
> -- Adding a boolean modifier 'prox' which acts the same as the
> current boolean, however can be attached to either AND (the current
> style of proximity) or NOT for negative proximity. Example: * "fish
> and" not/prox chips
> ("fish and" followed by anything other than chips)
>

What is more interesting (and YES, I have implemented it and it works very well, so it's not just "theory") are the following booleans:

- In the same container (field) instance. A container (field) is not a unit but a field, resp. tag or even path... This models the desire to have things in the same instance of a named field:
  "fries" AND:title "salt"
  to have them in the same title instance.
- I also have operators to handle anonymous (unnamed by query) fields and all kinds of other variations.

In talking about indexing XML we sometimes have mark-up such as
<TITLE SCHEMA="Foo bar">Zinging it for fun and profit</TITLE>.
Foo and bar are in the same SCHEMA, as a complex attribute of TITLE, and fun and profit are in the same container instance of TITLE. But we also want to search for foo and fun in the same abstract TITLE. We only want the fun in those TITLEs whose schema contains foo? It's doable, and so is the anonymous (unnamed) case. It's all just logical reasoning, one step after the next... We can walk down a tree and also say within X steps in a tree.
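
The container-instance booleans and the "within X steps in a tree" idea can be sketched as follows. This is only an illustration in Python using the standard xml.etree.ElementTree module; the function names, the sample record with three TITLE instances, and the whitespace tokenizer are all assumptions made for the example and say nothing about how any existing engine implements these operators.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the first TITLE is the one from the text above,
# the other two are invented to separate "fries" and "salt".
RECORD = """<RECORD>
<TITLE SCHEMA="Foo bar">Zinging it for fun and profit</TITLE>
<TITLE SCHEMA="Baz">Fries with everything</TITLE>
<TITLE SCHEMA="Baz">Salt of the earth</TITLE>
</RECORD>"""


def words(text):
    """Assumed word model: lower-cased whitespace tokens."""
    return set((text or "").lower().split())


def and_plain(root, tag, a, b):
    """Ordinary AND: both terms occur somewhere in `tag`, in any instance."""
    seen = set()
    for el in root.iter(tag):
        seen |= words("".join(el.itertext()))
    return a in seen and b in seen


def and_same_instance(root, tag, a, b):
    """Like "a" AND:tag "b": both terms in one and the same instance of `tag`."""
    return any(
        a in w and b in w
        for w in (words("".join(el.itertext())) for el in root.iter(tag))
    )


def in_instance_where_attr(root, tag, attr, attr_term, term):
    """`term` in an instance of `tag` whose attribute `attr` contains `attr_term`."""
    return any(
        attr_term in words(el.get(attr)) and term in words("".join(el.itertext()))
        for el in root.iter(tag)
    )


def tree_steps(root, a, b):
    """Number of parent/child edges on the path between elements a and b."""
    parent = {child: node for node in root.iter() for child in node}

    def path_to_root(el):
        path = [el]
        while el in parent:
            el = parent[el]
            path.append(el)
        return path

    depth_in_b = {id(el): i for i, el in enumerate(path_to_root(b))}
    for i, el in enumerate(path_to_root(a)):
        if id(el) in depth_in_b:  # first common ancestor
            return i + depth_in_b[id(el)]
    return None


root = ET.fromstring(RECORD)
titles = root.findall("TITLE")
print(and_plain(root, "TITLE", "fries", "salt"))
print(and_same_instance(root, "TITLE", "fries", "salt"))
print(in_instance_where_attr(root, "TITLE", "SCHEMA", "foo", "fun"))
print(tree_steps(root, titles[0], titles[1]))
```

Here "fries" and "salt" both occur in TITLE fields of the record but never in the same instance, which is exactly the distinction the AND:title form is meant to express. The tree_steps helper counts edges through the nearest common ancestor, so two sibling TITLEs are 2 steps apart.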
I did not implement that on search but could (I just have not seen its utility as yet). In designing a generic model this would ultimately make sense.

<identity>
<number>1234</number>
<person>
<name>
<last>zimmermann</last>
...</identity>

Now if we're going to finally start to think in a more abstract/structural manner, I'd suggest we also consider rethinking our unit of retrieval away from the monolithic "record", or at least consider granularity: that the objects of retrieval from a query may be fragments that have either been explicitly defined or derived from the query.

Explicit queries: Who said what?

<SPEECH>
<SPEAKER>LADY MACBETH</SPEAKER>
<LINE>Yet here's a spot.</LINE>
</SPEECH>

Give me the content of the SPEAKER of the SPEECH where a LINE contains "here's a spot"... The record (play) can contain loads of speeches...

Implicit: In designing our S/R systems we have turned to structure and granularity, away from records... Searching, for example, for "war" is not the same as searching for "war and peace" as the title of a book. War might be war as in Warhammer; it might be the pop band (Eric Burdon and War); it might be conflict (as in the Dictionary of War); it might be war (German "was").

SRU/W should not just reduce some of the complexity of ISO 23950 but also finally liberate it from the card catalogue model. CQL needs to become something suitable for abstract structure search (beyond XQuery and friends).

--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D): Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
http://www.nonmonotonic.net
------- End of Forwarded Message -------