> Date: Mon, 28 Jun 2004 18:09:45 +0200
> From: Adam Dickmeiss <[log in to unmask]>
>
> http://www.w3.org/TR/2004/REC-xml-20040204/#charsets
>
> The XML spec guys really did exclude most chars in the 0-0x01f
> range. I wonder why.

Un-bee-LEEV-able. Here's us, all this time saying that XML is pretty
much a generic record syntax analogous to GRS-1, and now it turns out
that it's no such thing. More fool me for not having checked this out
properly before, but -- What CAN they have been thinking? How in the
name of all that is rational can it be any of XML's business what kind
of data we choose to embed in it?

Regarding Rob's actual question, the correct answer is and must be
that XML is just a broken transport. If a server wants to have terms
that contain control characters, wants to return them in scan
responses and accept them in queries, then that is the server's
prerogative, and it is ABSOLUTELY not the place of the transport layer
to say "you're not allowed to have those characters in your database".

So the only way to fix it is to route around XML's damage. This means
that we need to wrap scan-response terms in an additional layer of
encoding <sigh>. The obvious one these days is base64. Clearly we
need to continue to allow the existing version to work, too, so we
need to engineer backwards compatibility by adding an additional,
optional attribute onto scan-term elements to indicate that
base64-encoding is in operation. So:

<terms>
  <term>fish</term>
  <term base64Encoded="1">ZmlzaGluZw==</term>
  <term>fishy</term>
</terms>
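
A minimal sketch of the sending side, in Python, assuming the attribute name and the "1" value from the example above, and assuming terms are UTF-8 encoded before base64 is applied. The helper names (is_xml_safe, make_term) are hypothetical and not part of any SRW toolkit. It encodes a term only when it contains a character outside the XML 1.0 Char production, so ordinary terms still go out in the existing form:

```python
import base64
import re
import xml.etree.ElementTree as ET

# The XML 1.0 "Char" production: tab, LF, CR, and everything from
# 0x20 upwards apart from the surrogate block, 0xFFFE and 0xFFFF.
_XML_CHARS = re.compile(
    r"[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_xml_safe(text: str) -> bool:
    """Return True if every character of text is legal in XML 1.0."""
    return _XML_CHARS.fullmatch(text) is not None


def make_term(term: str) -> ET.Element:
    """Build a <term> element, base64-encoding only when necessary."""
    elem = ET.Element("term")
    if is_xml_safe(term):
        elem.text = term
    else:
        # Assumption: the raw term is UTF-8 encoded before base64.
        elem.set("base64Encoded", "1")
        elem.text = base64.b64encode(term.encode("utf-8")).decode("ascii")
    return elem


terms = ET.Element("terms")
for t in ["fish", "fishing\x01", "fishy"]:
    terms.append(make_term(t))

xml_text = ET.tostring(terms, encoding="unicode")
print(xml_text)
```

Encoding only the offending terms keeps responses readable for clients that know nothing about the attribute, which is the backwards-compatibility point made above.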

HOWEVER, we clearly also need to be able to send base64-encoded CQL
queries, since they may be built out of scan-response terms, so we
also need an optional boolean base64Encoded attribute on the query
element. And since the same problem could rear its foul stinking head
anywhere else in the protocol, the best thing is probably just to say
that ANY element in an SRW message may carry this attribute, to mean
that its content is base64-encoded and that the toolkit (or
application) needs to decode it before continuing.
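
On the receiving side, the "any element may carry this attribute" rule could be handled by one generic pass over the parsed message. This is only a sketch under assumptions: the function name decode_base64_elements is hypothetical, the attribute is taken to be the unqualified base64Encoded from the example above, both "1" and "true" are treated as set (as for an XML Schema boolean), the payload is assumed to be UTF-8, and only elements with plain text content are considered:

```python
import base64
import xml.etree.ElementTree as ET


def decode_base64_elements(root: ET.Element, attr: str = "base64Encoded") -> None:
    """Decode, in place, the text of every element flagged with attr."""
    for elem in root.iter():
        if elem.get(attr) in ("1", "true"):
            # Assumption: the decoded bytes are UTF-8 text.
            elem.text = base64.b64decode(elem.text or "").decode("utf-8")
            # Drop the flag so the element is not decoded twice.
            del elem.attrib[attr]


doc = ET.fromstring(
    "<terms>"
    "<term>fish</term>"
    '<term base64Encoded="1">ZmlzaGluZw==</term>'
    "<term>fishy</term>"
    "</terms>"
)
decode_base64_elements(doc)
decoded = [t.text for t in doc.iter("term")]
print(decoded)
```

Doing this once in the toolkit, straight after parsing, would mean the application never sees the extra layer of encoding.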

Holy moley. All that to fix a bug that was deliberately written into
the XML specification. Unbelievable. Unbelievable.

Regarding surrogate-diagnostic terms and suchlike: I think we should
avoid getting sidetracked by such considerations. Rob's issue is not
a Scan problem. It's a transport problem. Let's address the root
issue.

_/|_ _______________________________________________________________
/o ) \/ Mike Taylor <[log in to unmask]> http://www.miketaylor.org.uk
)_v__/\ "Network: Any thing reticulated or decussated, at equal
distances, with interstices between the intersections." --
Samuel Johnson's "Dictionary of the English Language"
--
Listen to free demos of soundtrack music for film, TV and radio
http://www.pipedreaming.org.uk/soundtrack/