> Date: Mon, 20 Dec 2004 22:37:02 +0000
> From: Dr Robert Sanderson <[log in to unmask]>
> > I've seen assertions that POST must be used wherever the form data
> > has non-ASCII characters. I gather this means
> I can demonstrate that to be incorrect.
> Just throw UTF-8 in a query at any SRU server.
Just because we can get away with it on some platforms certainly does
not mean that it is safe or reliable.
The most recent RFC describing URI syntax seems to be RFC 2396,
but on this matter it is full of the increasingly common weasel words:
2.1 URI and non-ASCII characters
The relationship between URI and characters has been a
source of confusion for characters that are not part
of US-ASCII. To describe the relationship, it is
useful to distinguish between a "character" (as a
distinguishable semantic entity) and an "octet" (an
8-bit byte). There are two mappings, one from URI
characters to octets, and a second from octets to
original characters:
   URI character sequence -> octet sequence -> original character sequence
A URI is represented as a sequence of characters, not
as a sequence of octets. That is because URI might be
"transported" by means that are not through a computer
network, e.g., printed on paper, read over the radio, etc.
A URI scheme may define a mapping from URI characters
to octets; whether this is done depends on the
scheme. Commonly, within a delimited component of a
URI, a sequence of characters may be used to represent
a sequence of octets. For example, the character "a"
represents the octet 97 (decimal), while the character
sequence "%", "0", "a" represents the octet 10
There is a second translation for some resources: the
sequence of octets defined by a component of the URI
is subsequently used to represent a sequence of
characters. A 'charset' defines this mapping. There
are many charsets in use in Internet protocols. For
example, UTF-8 [UTF-8] defines a mapping from
sequences of octets to sequences of characters in the
repertoire of ISO 10646.
In the simplest case, the original character sequence
contains only characters that are defined in US-ASCII,
and the two levels of mapping are simple and easily
invertible: each 'original character' is represented
as the octet for the US-ASCII code for it, which is,
in turn, represented as either the US-ASCII character,
or else the "%" escape sequence for that octet.
For original character sequences that contain
non-ASCII characters, however, the situation is more
difficult. Internet protocols that transmit octet
sequences intended to represent character sequences
are expected to provide some way of identifying the
charset used, if there might be more than one
[RFC2277]. However, there is currently no provision
within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a
single charset, define a default charset, or provide a
way to indicate the charset used.
It is expected that a systematic treatment of
character encoding within URI will be developed as a
future modification of this specification.
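The two-level mapping the RFC describes (original character -> octet -> URI character or "%" escape) is easy to see in practice. A short sketch of the simple US-ASCII case, using Python's standard urllib.parse functions:

```python
from urllib.parse import quote, unquote

# the character "a" maps to octet 97 (US-ASCII), which is carried
# in the URI as itself
assert quote("a") == "a"

# octet 10 (a line feed) cannot appear literally in a URI, so it is
# represented by the escape sequence "%0A"
assert quote("\n") == "%0A"

# the two levels of mapping invert cleanly for ASCII: "%0a" decodes
# back to octet 10, i.e. the original line feed character
assert unquote("%0a") == "\n"
```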
I am not 100% confident what all this means, but so far as I can make
it out, the conclusion is that if you use a URI that includes
characters from outside the universal seven-bit repertoire, then there
is no general way to state what character encoding is in use, so that
(for example) the octet 0xe6 might represent the Danish "ae" ligature
character (in Latin-1) or part of a UTF-8 sequence.
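A few lines of Python make the ambiguity concrete (the charset names are the standard IANA ones):

```python
from urllib.parse import quote

# the single octet 0xE6 is the "ae" ligature under Latin-1 ...
assert bytes([0xE6]).decode("latin-1") == "\u00e6"

# ... but under UTF-8 that same octet is only the lead byte of a
# three-byte sequence, and is invalid on its own
try:
    bytes([0xE6]).decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8

# conversely, the same character percent-encodes to two different URI
# strings depending on which charset the sender happened to use --
# and the generic URI syntax gives the receiver no way to tell which
assert quote("\u00e6", encoding="latin-1") == "%E6"
assert quote("\u00e6", encoding="utf-8") == "%C3%A6"
```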
By contrast, when data is POSTed, an accompanying Content-Type header
can explicitly state the character-set.
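A minimal sketch of such a POST using Python's standard library (the endpoint URL and the "query" parameter name are made up for illustration -- they are not from any real SRU server):

```python
import urllib.request
from urllib.parse import urlencode

# encode the form data as UTF-8, and say so in the Content-Type header
body = urlencode({"query": "\u00e6ble"}, encoding="utf-8").encode("ascii")
req = urllib.request.Request(
    "http://sru.example.org/search",   # hypothetical endpoint
    data=body,                         # a Request with a body is a POST
    headers={"Content-Type":
             "application/x-www-form-urlencoded; charset=UTF-8"},
)
# urllib.request.urlopen(req) would now send it; the receiver can decode
# the octets unambiguously because the charset is stated explicitly
```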
In conclusion, sending non-ASCII characters seems to be unambiguous
when using POST but not when using GET. Which is another reason to
allow SRU/POST, especially for Europeans. (Of course, for us Brits
and you Yanks, it doesn't make any difference :-)
/o ) \/ Mike Taylor <[log in to unmask]> http://www.miketaylor.org.uk
)_v__/\ "It took me fifteen years to discover I had no talent for
writing, but I couldn't give it up because by that time I
was too famous" -- Robert Benchley.
Listen to free demos of soundtrack music for film, TV and radio