Idea for case-insensitive string comparisons

**Robert Hodge** · 17-06-2013, 01:50

It's very common to require case-insensitive comparisons of two strings.

For example,

DIM ONE AS STRING, TWO AS STRING
' ...
IF UCASE$(ONE) = UCASE$(TWO) THEN ...

This is such a common requirement that a shortcut syntax would help make code clearer.

Assume that the comparison operators could be written with ! as a suffix. Then, you would have,

=!    means    case-insensitive  =
<>!   means    case-insensitive  <>
<!    means    case-insensitive  <
<=!   means    case-insensitive  <=
>!    means    case-insensitive  >
>=!   means    case-insensitive  >=

So, the example above could be rewritten as,

IF ONE =! TWO THEN ...

What is interesting about using ! as a suffix is that it also becomes possible to use it on the compound assignment operators.

Example:

ONE =! TWO     ' same as ONE =  UCASE$(TWO)
ONE &=! TWO    ' same as ONE &= UCASE$(TWO)

etc.

**ReneMiner** · 17-06-2013, 08:54

The direction sounds reasonable, but I don't like those hieroglyphic annotations.

What would make sense and read clearly:

Dim a, b as String 

a= "Hello123"
b= "hello123"

If SameText(a, b [,Collate Ucase]) Then 
 '...
Endif

Even a tB-noob can read and understand this without checking the help file, while I don't know if I could remember what "&>=!" means after a week of not using it.
The exclamation mark bothers me because the scripting languages of a few games I know use it as NOT,
so "!=" means (translated to tB) "<>" (not equal), "!<" means not smaller, "!>=<" not within, "!><" not between etc.
So "One &=! Two" makes me read something else, in words: "Let One Be One And Add Not Two", excluding one thing from another. The result of "One" = "One" And Add Not "Two" would logically be "neTw"...
Since it's about a string plus some wildcard anyway, how about "$$" or "$*" as a shortcut for the already existing Collate Ucase?

Perhaps one could use/improve already available methods instead of inventing new ones -

I would suggest adding a "Collate Ucase" switch, as known from the "Array Scan" method, to the existing memory comparison methods where it makes sense, which would be Memory_Equals, Memory_Differs and Memory_Compare. Then one could compare not just string content stored in a dynamic string but content anywhere, like this:

' regular String-Example:
If StrPtrLen(StrPtr(a)) = StrPtrLen(StrPtr(b)) Then
  If Memory_IsEqual(StrPtr(a), StrPtr(b), StrPtrLen(StrPtr(a)), Collate Ucase) Then
    '...
  Endif
Else
  ' lengths differ: no match anyway...
Endif

' "wild" String-comparison
Long lLen = Iif( Heap_Size(pA) > Heap_Size(pB), Heap_Size(pB), Heap_Size(pA)) ' just to make the next line look shorter....

If Memory_IsEqual(pA, pB,  lLen, Collate Ucase) Then '...

' Mid$ + Left$ + Right$-comparison-substitute
If Memory_IsEqual(pA + whatEverStart, pB + whatEverStart,  whatEverLen, Collate Ucase) Then...

' alternative wildcard annotation
 If Memory_Differs(pA, pB,  lLen, $*) Then '...not the same text, case insensitive...

This way one could do case-insensitive text comparison even in "wild string space" (heap), or within a certain section of a string, without first peeking it into a local variable. It would be like Mid$ with built-in Ucase$ on both parameters, but a few times faster, I guess.
It could even compare a mix of strings, heap, file-line content, dictionary content etc. without converting anything or storing it locally first.

**Robert Hodge** · 17-06-2013, 15:53

I can't argue the point about what is done in other languages. I am a C coder from way back, so I know all about != meaning NOT EQUAL.

A better comparison is Rexx, which does a similar (but not exactly the same) thing. They have = to mean Equal, and == to mean Exactly Equal. The "exactly" part for strings means that if leading or trailing spaces are present, they are treated as normal data and not ignored.

So, in Rexx, "ABC" = " ABC " is true, but "ABC" == " ABC " is false. In TB terms, the Rexx = is much like Trim$("ABC") = Trim$(" ABC "), while == compares the strings exactly as they are.

They use ==, >>, <<, >>= and <<=. For Not Equal, they have various notations, but the one that best fits this is /==.

I am open to other punctuation, but adding alternative function names goes against the main idea, which is to keep this stuff short.

A possible character might be a suffix of $. That would give operators like =$, <>$, <$, >$, <=$, >=$ and so on. Compound assignments could be done by ONE &=$ TWO or ONE +=$ TWO, etc.

Another possible syntax is to use the ^ character. This has the nice quality that it "points up" and so it might help people remember that there is a transformation to upper case going on. This would give operators like =^, <>^, <^, >^, <=^, >=^ and so on, with compound operators ONE &=^ TWO or ONE +=^ TWO, etc. For operators like =, =^ doesn't look too bad, but <^ is just plain ugly, and so I couldn't recommend it.

Similar remarks could be made about combining operators with @ or #. The main problem is trying to find something that is still available, won't break TB's lexical scanner, and is readable without looking confusing or 'gross'.

One that *might* work is colon. This would give operators like =:, <>:, <:, >:, <=:, >=: and so on, with compound operators ONE &=: TWO or ONE +=: TWO, etc.

This is readable, doesn't use the "NOT-like" character !, isn't ugly and is nice and concise. No, you couldn't understand it without reading the manual. However, this notation is not for idle passers-by who read TB code once a year; it's for people who pound out lots and lots of code and need it to be more concise.

As for making syntax that is understood by someone who never read the manual, that is true enough as far as it goes, but at some point, you just have to read the manual. Making code readable is a fine goal, but there are limits to how far you can take that. When you simplify the syntax, that makes code more readable, too.

**ReneMiner** · 17-06-2013, 16:12

yeah, $^ might be better than $*...

I think this is about exchanging ideas, so they can grow from other people's "but what about this or that". If you already know "!=" from C, you can imagine the confusion with "NOT" when reading "!" (the games I'm talking about are TES part 3 and whatever has come out since, which have a C-like scripting language; that's where I know "Elsewhile" from).

Maybe for common strings it's faster to develop a solution if Ucase or 'CaseDoesNotMatter' is done by string methods only. But they are always slower in execution. If you compare memory numerically, bitwise, ignoring the lower-case bit (32, when 64 is set), I think it'll still be faster.
But I fear you are more after a shorter way to write it than after improved functionality or execution speed?
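
For illustration only, here is a rough sketch of that bit-masking idea, written in Python rather than thinBasic. The function names are made up for this example, and it assumes single-byte ASCII text. It folds each byte by clearing bit 32 whenever bit 64 is set, then compares the folded bytes.

```python
def fold_bit(byte: int) -> int:
    """Clear bit 32 when bit 64 is set (the rule described above)."""
    return byte & ~0x20 if byte & 0x40 else byte


def equal_ignoring_case_bit(a: bytes, b: bytes) -> bool:
    """Compare two byte strings after folding each byte with fold_bit."""
    if len(a) != len(b):
        return False
    return all(fold_bit(x) == fold_bit(y) for x, y in zip(a, b))


print(equal_ignoring_case_bit(b"Hello123", b"hello123"))  # True
print(equal_ignoring_case_bit(b"Hello123", b"Hello124"))  # False
print(equal_ignoring_case_bit(b"[", b"{"))                # True (see note)
```

One caveat with a pure bit mask: it also treats a few punctuation pairs such as [ and {, or @ and `, as equal, because they differ only in bit 32. A range check on A-Z/a-z would avoid that at the cost of an extra comparison.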

Perhaps another person can see... another way?

**Robert Hodge** · 18-06-2013, 05:21

One that *might* work is colon. This would give operators like =:, <>:, <:, >:, <=:, >=: and so on, with compound operators ONE &=: TWO or ONE +=: TWO, etc.

I think in terms of syntax, this one here is about as good as it's going to get.

As far as runtime efficiency, it has an advantage as well - at least, a potential one.

Consider. If I have

DIM ONE AS STRING
DIM TWO AS STRING
' assign ONE and TWO
IF UCASE$(ONE) = UCASE$(TWO) THEN ...

what does TB have to do to implement this? To my eyes, it would have to create a temporary copy of ONE and upper-case it, then make a temporary copy of TWO and upper-case it, then compare the two temporaries. But, what happens if we allow this ...

IF ONE =: TWO THEN ...

Now, we can have a case-insensitive comparison operator at a low level, so that it could translate to upper case and do the comparison a byte at a time, and so no temporaries would need to be created.

So, you get two advantages with this syntax: (a) much less code, and (b) faster execution of case-insensitive comparisons.
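
As a sketch of what such a low-level operator might do internally (in Python purely for illustration; this is not how TB is implemented, and `fold` and `compare_no_case` are hypothetical names), the comparison can fold one byte from each side at a time and stop at the first difference, without building upper-cased copies of either string. It assumes ASCII text.

```python
def fold(byte: int) -> int:
    """Map ASCII a-z to A-Z; leave every other byte unchanged."""
    return byte - 32 if 0x61 <= byte <= 0x7A else byte


def compare_no_case(a: bytes, b: bytes) -> int:
    """Return -1, 0 or 1 (less, equal, greater), ignoring ASCII case."""
    for x, y in zip(a, b):
        fx, fy = fold(x), fold(y)
        if fx != fy:
            # First differing position decides the result.
            return -1 if fx < fy else 1
    # All shared positions match, so the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


print(compare_no_case(b"abc", b"ABC"))   # 0
print(compare_no_case(b"abc", b"ABD"))   # -1
print(compare_no_case(b"abcd", b"ABC"))  # 1
```

A single three-way routine like this could back all six proposed operators: =: tests for 0, <: tests for -1, >=: tests for anything but -1, and so on.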

**ReneMiner** · 18-06-2013, 07:44

The colon I don't like because it has meaning: "Expression ends here" since 1977 or longer - but the chars used in the end won't matter...

I wouldn't know how to make it apply Ucase to both sides of the expression without any parentheses. Perhaps Eros can, but I have no idea about all this; I'm only a basic end user. To process strings case-insensitively more than once, I would use a state switch, as known from OpenGL:

Enable Collate Ucase 
'do case-insensitive string-stuff here using normal syntax as 

a += b  
'whatever

Disable Collate Ucase
' now case sensitive again

You could enable that at the very start of the script, never disable it, and use ordinary syntax throughout without any additional annotation.

Or, for a one-off expression, use a short form in parentheses like

If Collate Ucase(a = b) Then PrintL "a and b have same text"
' equals Ucase$(a) = Ucase$(b) 

' shorter syntax could read
If $^(a = b) Then PrintL "a and b have same text"

' other meaning "$^" as simple shortcut/replacement for Ucase$
$^(a) += $^(b) ' etc...

**Michael Clease** · 18-06-2013, 10:41

Originally Posted by John Spikowski

Seems like a good enhancement to me.

one = "a"
two = "b"
three = "A"
four = "B"

IF UCASE(ONE) = UCASE(TWO) AND  LCASE(THREE) = LCASE(FOUR) THEN
  PRINT "You need a new computer\n"
ELSE
  PRINT "Try again with like values\n"
END IF

I like this version best. It stays true to the name of the language, BASIC, and not "C", as some people seem to want it to become. Let's not reinvent the wheel; just how much hand-holding do you think people need?

The above would still be the clearest version to me: its intention is quite clear, and it even has greater flexibility.

Mike C.

**Robert Hodge** · 18-06-2013, 17:20

This concept doesn't really demand the use of a colon, but that is really not that big a deal in terms of implementing it. The issue about colons ending a statement is really not that hard to solve.

The way it works is that, for any compiler or interpreter, they have a 'lexical' phase and a 'parsing' phase. The lexical phase 'grabs tokens' and categorizes them into known types, such as keyword, number, closures (like parens), punctuation, etc. The parser assigns meaning to these tokens.

If we had a token like =: then the lexical scanner would see '=' and then it would 'look ahead' to see if there was a colon following it. If so, it would 'grab' the two characters and treat them as a composite token of "=:" rather than an = sign followed by an end-of-statement : colon delimiter. This kind of stuff is very standard in compiler-like software, and certainly Eros knows all about this.
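
To make the look-ahead step concrete, here is a minimal tokenizer sketch in Python. It is an assumption-laden toy, not TB's actual scanner: the operator list and the `tokenize` function are invented for this example. It always tries the longest operator first, so "=:" is grabbed as one token while a lone ":" still comes out as a statement separator.

```python
def tokenize(source: str) -> list[str]:
    """Split a statement into tokens, preferring the longest operator."""
    # Hypothetical operator set: the proposed ':' forms plus the plain ones,
    # ordered longest first so the look-ahead wins over the short form.
    operators = ["<>:", "<=:", ">=:", "=:", "<:", ">:",
                 "<>", "<=", ">=", "=", "<", ">", ":"]
    tokens = []
    i = 0
    while i < len(source):
        if source[i].isspace():
            i += 1
            continue
        for op in operators:
            if source.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            # Not an operator: read a run of identifier or number characters.
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] in "_$"):
                j += 1
            if j == i:
                j = i + 1  # unknown character: keep it as its own token
            tokens.append(source[i:j])
            i = j
    return tokens


print(tokenize("IF ONE =: TWO THEN"))  # ['IF', 'ONE', '=:', 'TWO', 'THEN']
print(tokenize("A = 1 : B = 2"))       # ['A', '=', '1', ':', 'B', '=', '2']
```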

Some of the suggestions like ENABLE COLLATE UCASE, IF COLLATE UCASE, etc. are certainly possible. But, the goal of this exercise was to make case insensitive comparisons shorter, not longer. If I wanted a long expression for this, then

IF UCASE$(ONE) = UCASE$(TWO) THEN ...

' is just as wordy as

IF COLLATE UCASE (ONE = TWO) THEN ...

' and neither of them is nearly as nice to type or as easy to read as

IF ONE =: TWO THEN ...

**ReneMiner** · 18-06-2013, 20:37

Anyway, there's already the IsLike() method, which allows string comparison with whatever fancy stuff you want around it, from $DQ to Chr$(34) or """ or even spaces, or, I dunno, whatever you can type in 5 seconds. It also has just 6 letters and two parens to type, allows wildcards and leading/trailing truncation, and has a case-insensitive switch if desired. Maybe think about using this method to move forward with developing that Rexx module. I'm nosy.

John, are you sniffing out a new victim?

**Robert Hodge** · 18-06-2013, 23:27

The IsLike function is certainly a possibility. You would have to treat the right-hand side of the comparison as the "pattern" string, so it would be:

DIM ONE AS STRING
DIM TWO AS STRING

' instead of ...

IF UCASE$(ONE) = UCASE$(TWO) THEN ...

' you'd use ...

IF IsLike(ONE, TWO, %FALSE) THEN ...

The main drawback to IsLike is that if the second argument contained any pattern characters like * ? or # the comparison would not work right.
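
The same pitfall can be shown with any wildcard matcher. The sketch below uses Python's `fnmatch` module as a stand-in; it is not thinBasic's IsLike, its pattern characters are * ? and [ ] rather than * ? and #, and `like_no_case` is a name made up for this example. The point is only that data on the right-hand side gets interpreted as a pattern.

```python
from fnmatch import fnmatchcase


def like_no_case(text: str, pattern: str) -> bool:
    """Case-insensitive wildcard match; a stand-in, not thinBasic's IsLike."""
    return fnmatchcase(text.upper(), pattern.upper())


print(like_no_case("hello", "HELLO"))  # True: plain text behaves like equality
print(like_no_case("abc", "a*c"))      # True: different strings "compare equal"
print(like_no_case("a[b]", "a[b]"))    # False: identical strings fail to match
```

So a wildcard function only works as a comparison when the second argument is known to contain no pattern characters, or when they are escaped first.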

Now, if you really wanted functional notation rather than my clever new operators, you could use EQ, NE, GT, LT, GE and LE, since these names don't appear to be taken by anything else. That would render the comparison as,

IF EQ(ONE, TWO) THEN ...

which isn't too bad. I still like my way better:

IF ONE =: TWO THEN ...

but hey - "you takes what you's can gets".

If we wanted to have case-insensitive comparisons that were also trimmed, we could use EQ_T, NE_T, GT_T, LT_T, GE_T and LE_T, or something like that. Implementing "trimmed" versions of these functions would have a lower priority though.