Print

Print


Great work there from Mike -- thanks.  That's enough to convince me
that length is definitely preferable to end-position.  I maintain my
preference for zero-based indexing, but I don't feel strongly about
it.  As I said before (but the immediately contradicted by including
poorly edited old text :-) I'm quite happy to simplify by saying the
negative numbers are Just Pain Not Allowed.

 _/|_	 ___________________________________________________________________
/o ) \/  Mike Taylor  <[log in to unmask]>  http://www.miketaylor.org.uk
)_v__/\  "Writing is like driving a car at night.  You can only see as
	 far as the headlights, but you make the whole trip that way" --
	 E. L. Doctorow.




Mike Rylander writes:
 > On 8/17/06, Mike Taylor <[log in to unmask]> wrote:
 > > Ray Denenberg writes:
 > >  > I'm trying to track down why we changed it from length to position.
 > >  > The discussion was among the Ed. Board, and as I recall there was a
 > >  > good reason but I haven't tracked it down yet (in the Ed. Board
 > >  > archive). If I can't find it we'll go with position. (Thoug that
 > >  > does destabalize the proposal.)  The zero-base issue is more
 > >  > difficult, as it seems there is fairly overwhelming support for
 > >  > 1-based as far as I can see.
 > >
 > > I only agreed to 1-based because someone on list list asserted that
 > > it's what most substring operations use.  But if Mike Rylander would
 > > like to repeat his helpful survey of other environments' substring
 > > operations, I'd support whichever index-base is most widely used
 > > elsewhere.
 > 
 > Now look what I've gotten myself into ... ;)
 > 
 > I'll go through the environments I use most, in no particular order whatsoever.
 > 
 > There is some amount of discordance between different SQL
 > implementations regarding the index base of the start position.
 > 
 > Postgresql uses 1-based indexes in order to provide consistency with
 > ordinal tuple position[1].  It does not allow negative length, but
 > using negative (or zero) start position will place the start logically
 > before the beginning of the string, thusly:
 > 
 > poop=# select substr('123456',-2,5);
 >   substr
 > --------
 >  12
 > (1 row)
 > 
 > While that does have a brand of internal consistency, I wouldn't
 > promote it here.
 > 
 > Oracle[2], while being mostly 1-based also, allows a start position of
 > zero which means exactly the same thing as 1.  I think this is worse
 > than Pg's plan, since there's less consistency.
 > 
 > Just to add more inconsistency, MySQL[3] has decided to allow negative
 > positions and uses that to mean "from the end of the string", however,
 > the second parameter is still always length.  They use 1-based start
 > positions, and don't say anything about zero (and I don't use MySQL),
 > so I can't comment on that piece.
 > 
 > So that covers SQL implementations, but it doesn't really give much
 > direction since there is so much disagreement over details.  Arguably,
 > Pg is the closest to the SQL spec, but the spec does not constrain
 > extensions in most cases, so it's hard to point to anything solid,
 > other than nominally 1-based with length.
 > 
 > On to programming languages!
 > 
 > We've seen Perl, PHP, C/C++ (0-based, length) and Python (no substr,
 > just slice) ... another currently popular language, Ruby[4], uses
 > 0-based start position and length for substr, and /also/ uses 0-based
 > start position for slices when the '[]' method is handed a range (
 > x..y ) instead of two numbers.  Java[5], trying to be a rebel, uses
 > the second param in its .substring() method for end-position (that's a
 > /slice/, Sun!), but it is also 0-based.  XPath[6] uses 1-based start
 > position and length.
 > 
 > So here are my ill considered, biased conclusions:
 > 
 > Traditionally, string indexing is 0-based.  This comes from looking at
 > strings as arrays of characters, and for the purpose of extraction
 > (with the goal of matching), this feels "right" to me.  Those that
 > buck real-programming convention and use 1-based position are doing so
 > for local consistency (SQL -- 1-based ordinal position) or just
 > because (Java -- any idea?) or because they are following one (or
 > both) of the 1-based crowd (XPath -- following Java, and attempting to
 > be a query language like SQL).
 > 
 > Being a programmer, I'm more comfortable with 0-based string indexing,
 > so that's what my heart would prefer.  However, we are designing a
 > query language here, so it makes sense to follow the convention of
 > other query languages.  Both SQL and XPath/XSLT have 1-based substring
 > implementations, so I'm not going to put up too much of a fight to get
 > 0-base substrings.  I would, however, urge everyone to discard the
 > idea of "logical" starting positions -- no zero or negatives, please!
 > There's just too much confusion over what they should mean, and there
 > are other methods that can be employed for providing "match the end of
 > the string" stuff.
 > 
 > As to lenghth vs end-position, every single implementation of substr
 > uses length, with the exception of slice methods that happen to be
 > spelled "substr".  IMHO, if an end-position version of this method is
 > desired/desirable, it should be called "slice".  Note, however, that
 > you can implement slice via substr by saying
 > 
 >   slice( string, substr-start, substr-start + substr-len - 1 )
 > 
 > and substr via slice with
 > 
 >   substr( string, slice-start, slice-end - slice-start + 1 )
 > 
 > So, if that's how you want to do it under the covers, or your language
 > of choice doesn't have a proper substr implementation, you can
 > transform to slice in the backend.
 > 
 > 
 > 1) http://www.postgresql.org/docs/8.1/interactive/functions-string.html
 > 2) http://www.adp-gmbh.ch/ora/sql/substr.html
 > 3) http://mysql.com/doc/refman/5.0/en/string-functions.html
 > 4) http://www.rubycentral.com/book/ref_c_string.html#String._ob_cb
 > 5) http://mindprod.com/jgloss/substring.html
 > 6) http://www.topxml.com/xsl/funcsubstring.asp
 > 
 > 
 > >
 > > (And for the same reason I prefer length to end.)
 > >
 > >  _/|_    ___________________________________________________________________
 > > /o ) \/  Mike Taylor  <[log in to unmask]>  http://www.miketaylor.org.uk
 > > )_v__/\  "He looks around; he sees angels in the architecture, spinning
 > >          in infinity" -- Paul Simon, "Call Me Al"
 > >
 > 
 > 
 > -- 
 > Mike Rylander
 > [log in to unmask]
 > GPLS -- PINES Development
 > Database Developer
 > http://open-ils.org