On 8/17/06, Mike Taylor <[log in to unmask]> wrote:
> Ray Denenberg writes:
> > I'm trying to track down why we changed it from length to position.
> > The discussion was among the Ed. Board, and as I recall there was a
> > good reason but I haven't tracked it down yet (in the Ed. Board
> > archive). If I can't find it we'll go with position. (Thoug that
> > does destabalize the proposal.) The zero-base issue is more
> > difficult, as it seems there is fairly overwhelming support for
> > 1-based as far as I can see.
>
> I only agreed to 1-based because someone on list list asserted that
> it's what most substring operations use. But if Mike Rylander would
> like to repeat his helpful survey of other environments' substring
> operations, I'd support whichever index-base is most widely used
> elsewhere.
Now look what I've gotten myself into ... ;)
I'll go through the environments I use most, in no particular order whatsoever.
There is some amount of discordance between different SQL
implementations regarding the index base of the start position.
Postgresql uses 1-based indexes in order to provide consistency with
ordinal tuple position[1]. It does not allow negative length, but
using negative (or zero) start position will place the start logically
before the beginning of the string, thusly:
poop=# select substr('123456',-2,5);
substr
--------
12
(1 row)
While that does have a brand of internal consistency, I wouldn't
promote it here.
Oracle[2], while being mostly 1-based also, allows a start position of
zero which means exactly the same thing as 1. I think this is worse
than Pg's plan, since there's less consistency.
Just to add more inconsistency, MySQL[3] has decided to allow negative
positions and uses that to mean "from the end of the string", however,
the second parameter is still always length. They use 1-based start
positions, and don't say anything about zero (and I don't use MySQL),
so I can't comment on that piece.
So that covers SQL implementations, but it doesn't really give much
direction since there is so much disagreement over details. Arguably,
Pg is the closest to the SQL spec, but the spec does not constrain
extensions in most cases, so it's hard to point to anything solid,
other than nominally 1-based with length.
On to programming languages!
We've seen Perl, PHP, C/C++ (0-based, length) and Python (no substr,
just slice) ... another currently popular language, Ruby[4], uses
0-based start position and length for substr, and /also/ uses 0-based
start position for slices when the '[]' method is handed a range (
x..y ) instead of two numbers. Java[5], trying to be a rebel, uses
the second param in its .substring() method for end-position (that's a
/slice/, Sun!), but it is also 0-based. XPath[6] uses 1-based start
position and length.
So here are my ill considered, biased conclusions:
Traditionally, string indexing is 0-based. This comes from looking at
strings as arrays of characters, and for the purpose of extraction
(with the goal of matching), this feels "right" to me. Those that
buck real-programming convention and use 1-based position are doing so
for local consistency (SQL -- 1-based ordinal position) or just
because (Java -- any idea?) or because they are following one (or
both) of the 1-based crowd (XPath -- following Java, and attempting to
be a query language like SQL).
Being a programmer, I'm more comfortable with 0-based string indexing,
so that's what my heart would prefer. However, we are designing a
query language here, so it makes sense to follow the convention of
other query languages. Both SQL and XPath/XSLT have 1-based substring
implementations, so I'm not going to put up too much of a fight to get
0-base substrings. I would, however, urge everyone to discard the
idea of "logical" starting positions -- no zero or negatives, please!
There's just too much confusion over what they should mean, and there
are other methods that can be employed for providing "match the end of
the string" stuff.
As to lenghth vs end-position, every single implementation of substr
uses length, with the exception of slice methods that happen to be
spelled "substr". IMHO, if an end-position version of this method is
desired/desirable, it should be called "slice". Note, however, that
you can implement slice via substr by saying
slice( string, substr-start, substr-start + substr-len - 1 )
and substr via slice with
substr( string, slice-start, slice-end - slice-start + 1 )
So, if that's how you want to do it under the covers, or your language
of choice doesn't have a proper substr implementation, you can
transform to slice in the backend.
1) http://www.postgresql.org/docs/8.1/interactive/functions-string.html
2) http://www.adp-gmbh.ch/ora/sql/substr.html
3) http://mysql.com/doc/refman/5.0/en/string-functions.html
4) http://www.rubycentral.com/book/ref_c_string.html#String._ob_cb
5) http://mindprod.com/jgloss/substring.html
6) http://www.topxml.com/xsl/funcsubstring.asp
>
> (And for the same reason I prefer length to end.)
>
> _/|_ ___________________________________________________________________
> /o ) \/ Mike Taylor <[log in to unmask]> http://www.miketaylor.org.uk
> )_v__/\ "He looks around; he sees angels in the architecture, spinning
> in infinity" -- Paul Simon, "Call Me Al"
>
--
Mike Rylander
[log in to unmask]
GPLS -- PINES Development
Database Developer
http://open-ils.org
|