Thread: Idea for case-insensitive string comparisons

  1. #1

    Idea for case-insensitive string comparisons

    It's very common to require case-insensitive comparisons of two strings.

    For example,

    DIM ONE AS STRING, TWO AS STRING
    ' ...
    IF UCASE$(ONE) = UCASE$(TWO) THEN ...
    
    This is such a common requirement that a shortcut syntax would be helpful in making code more clear.

    Assume that the comparison operators could be written with ! as a suffix. Then, you would have,

    =!    means    case-insensitive  =
    <>!   means    case-insensitive  <>
    <!    means    case-insensitive  <
    <=!   means    case-insensitive  <=
    >!    means    case-insensitive  >
    >=!   means    case-insensitive  >=
    
    So, the example above could be rewritten as,

    IF ONE =! TWO THEN ...
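
To make the intended semantics concrete, here is a minimal Python sketch (not thinBasic) of the six proposed comparison operators. It assumes each one simply upper-cases both operands before applying the ordinary comparison, as the UCASE$ example implies; the name `compare_nocase` and the lookup table are hypothetical.

```python
import operator

# Proposed operator spellings mapped to the ordinary comparison they wrap.
# These spellings come from the proposal above; none of them exists in thinBasic.
BASE_COMPARISON = {
    "=!": operator.eq,
    "<>!": operator.ne,
    "<!": operator.lt,
    "<=!": operator.le,
    ">!": operator.gt,
    ">=!": operator.ge,
}


def compare_nocase(left: str, op: str, right: str) -> bool:
    """Apply a proposed operator by upper-casing both operands first."""
    return BASE_COMPARISON[op](left.upper(), right.upper())


print(compare_nocase("Hello123", "=!", "hello123"))  # True
print(compare_nocase("apple", "<!", "BANANA"))       # True
print("apple" < "BANANA")                            # False: 'a' sorts after 'B'
```

The last line shows why the ordering operators matter as much as equality: a plain comparison sorts every upper-case letter before every lower-case one.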
    

    What is interesting about using ! as a suffix is that it can also be applied to the compound assignment operators.

    Example:

    ONE =! TWO     ' same as ONE =  UCASE$(TWO)
    ONE &=! TWO    ' same as ONE &= UCASE$(TWO)
    
    etc.
    Last edited by Robert Hodge; 17-06-2013 at 01:51.

  2. #2
    thinBasic MVPs
    Join Date
    Oct 2012
    Location
    Germany
    Age
    54
    Posts
    1,533
    Rep Power
    171
    The direction sounds reasonable, but I don't like those hieroglyphic annotations.

    What would make sense and read clearly:

    Dim a, b as String 
    
    a= "Hello123"
    b= "hello123"
    
    If SameText(a, b [,Collate Ucase]) Then 
     '...
    Endif
    
    Even a tB noob can read and understand this without checking the help file, whereas I don't know if I could remember what "&>=!" means after a week of not using it.
    The exclamation mark bothers me because some game scripting languages I know use it as NOT,
    so " != " means (translated to tB) " <> " (not equal), "!<" means not smaller, " !>=< " not within, " !>< " not between, etc.
    So when I read "One &=! Two", I read something else. In words: "Let One Be One And Add Not Two", that is, excluding one thing from another. The logical result of "One" = "One" And Add Not "Two" would be "neTw"...
    Since this is about strings plus some wildcard anyway, how about "$$" or "$*" as a shortcut for the already existing Collate Ucase?

    Perhaps one could use/improve already available methods instead of inventing new ones -

    I would suggest adding a "Collate Ucase" switch, as known from the "Array Scan" method, to the existing memory comparison methods where it makes sense (Memory_Equals, Memory_Differs and Memory_Compare). Then you could compare not just string content stored in a dynamic string but content stored anywhere, like this:

    ' regular String-Example:
    If StrPtrLen(StrPtr(a)) = StrPtrLen(StrPtr(b)) Then
      If Memory_IsEqual(StrPtr(a), StrPtr(b), StrPtrLen(StrPtr(a)), Collate Ucase) Then '...
    Else
      ' no match anyway...
    Endif
    
    ' "wild" String-comparison
    Long lLen = Iif( Heap_Size(pA) > Heap_Size(pB), Heap_Size(pB), Heap_Size(pA)) ' just to make the next line look shorter....
    
    If Memory_IsEqual(pA, pB,  lLen, Collate Ucase) Then '...
    
    ' Mid$ + Left$ + Right$-comparison-substitute
    If Memory_IsEqual(pA + whatEverStart, pB + whatEverStart,  whatEverLen, Collate Ucase) Then...
    
    ' alternative wildcard annotation
     If Memory_Differs(pA, pB,  lLen, $*) Then '...not the same text, case insensitive...
    
    This way one could do case-insensitive text comparison even in "wild string space" (heap), or within just a certain section of a string, without first peeking it into a local storage variable. It would be like Mid$ with built-in Ucase$ on both parameters, but a few times faster, I guess.
    It could even compare a mix of strings, heap, file-line content, dictionary content, etc. without converting anything or storing it locally first.
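
The idea of comparing a section of one buffer against a section of another, without first copying either into a temporary string, can be sketched in Python (not thinBasic). This is only an illustration under assumptions: `ranges_equal_nocase` is a hypothetical name, `memoryview` slices stand in for raw pointers plus a length, and only ASCII letters are folded.

```python
def _upper(byte: int) -> int:
    """Fold an ASCII lower-case letter to upper case; leave other bytes alone."""
    return byte - 32 if 97 <= byte <= 122 else byte


def ranges_equal_nocase(buf_a, start_a, buf_b, start_b, length) -> bool:
    """Compare `length` bytes of two buffers, ignoring ASCII case, without copying."""
    # Slicing a memoryview creates a window onto the buffer, not a copy of it.
    view_a = memoryview(buf_a)[start_a:start_a + length]
    view_b = memoryview(buf_b)[start_b:start_b + length]
    if len(view_a) != length or len(view_b) != length:
        return False  # a range runs past the end of its buffer
    for x, y in zip(view_a, view_b):
        if x != y and _upper(x) != _upper(y):
            return False
    return True


line = b"key=Hello World"
other = b"xxHELLO"
print(ranges_equal_nocase(line, 4, other, 2, 5))  # True: "Hello" vs "HELLO"
print(ranges_equal_nocase(line, 4, other, 2, 6))  # False: second range is too short
```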
    Last edited by ReneMiner; 17-06-2013 at 14:51.
    I think there are missing some Forum-sections as beta-testing and support

  3. #3
    I can't argue the point about what is done in other languages. I am a C coder from way back, so I know all about != meaning NOT EQUAL.

    A better comparison is Rexx, which does a similar (but not exactly the same) thing. They have = to mean Equal, and == to mean Exactly Equal. The "exactly" part for strings means that if leading or trailing spaces are present, they are treated as normal data and not ignored.

    So, in Rexx, "ABC" = " ABC " is true, but "ABC" == " ABC " is false. In TB terms, the Rexx = is much like Trim$("ABC") = Trim$(" ABC "), while == is an exact comparison with no trimming.

    They use ==, >>, <<, >>= and <<=. For Not Equal, they have various notations, but the one that best fits this is /==.
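
The difference between the two Rexx equality tests can be approximated in a few lines of Python. This is a rough sketch that assumes non-numeric string operands only (Rexx compares numbers differently); the function names are hypothetical.

```python
def rexx_equal(a: str, b: str) -> bool:
    """Approximates Rexx '=' for non-numeric strings: outer blanks are ignored."""
    return a.strip(" ") == b.strip(" ")


def rexx_strictly_equal(a: str, b: str) -> bool:
    """Approximates Rexx '==': every character counts, including blanks."""
    return a == b


print(rexx_equal("ABC", " ABC "))           # True
print(rexx_strictly_equal("ABC", " ABC "))  # False
```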

    I am open to other punctuation, but adding alternative function names goes against the main idea, which is to keep this stuff short.

    A possible character might be a suffix of $. That would give operators like =$, <>$, <$, >$, <=$, >=$ and so on. Compound assignments could be done by ONE &=$ TWO or ONE +=$ TWO, etc.

    Another possible syntax is to use the ^ character. This has the nice quality that it "points up" and so it might help people remember that there is a transformation to upper case going on. This would give operators like =^, <>^, <^, >^, <=^, >=^ and so on, with compound operators ONE &=^ TWO or ONE +=^ TWO, etc. For operators like =, =^ doesn't look too bad, but <^ is just plain ugly, and so I couldn't recommend it.

    Similar remarks could be made about combining operators with @ or #. The main problem is trying to find something that is still available, won't break TB's lexical scanner, and is readable without looking confusing or 'gross'.

    One that *might* work is colon. This would give operators like =:, <>:, <:, >:, <=:, >=: and so on, with compound operators ONE &=: TWO or ONE +=: TWO, etc.

    This is readable, doesn't use the "NOT-like" character !, isn't ugly and is nice and concise. No, you couldn't understand it without reading the manual. However, this notation is not for idle passers-by who read TB code once a year; it's for people who are pounding out lots and lots of code and need it to be more concise.

    As for making syntax that is understood by someone who never read the manual, that is true enough as far as it goes, but at some point, you just have to read the manual. Making code readable is a fine goal, but there are limits to how far you can take that. When you simplify the syntax, that makes code more readable, too.

  4. #4
    yeah, $^ might be better than $*...

    I think this is about the exchange of ideas, so they can grow from other people's "but what about this or that". If you already know "!=" from C, you can imagine the confusion about "NOT" when reading "!" (the games I'm talking about are TES Part 3 and its successors, which have a C-like script language; that's where I know "Elsewhile" from).

    Maybe for common strings a solution is faster to develop if Ucase or 'CaseDoesNotMatter' is done with string methods only. But those are always slower in execution. If you compare memory numerically, bitwise, and ignore the lower-case bit (32, if 64 is set), I think it will still be faster.
    But I fear you are more after a shorter way to write it than improved functionality or execution speed?
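
The bit trick can be sketched in Python as follows. This is one possible reading of "ignore bit 32 if bit 64 is set", not a confirmed algorithm, and both function names are hypothetical. It also shows a pitfall: bit 64 is set for some punctuation as well as letters, so the literal version treats "[" and "{" as the same character.

```python
CASE_BIT = 0x20   # distinguishes 'a' (0x61) from 'A' (0x41) in ASCII
ALPHA_BIT = 0x40  # set for letters, but also for @ [ \ ] ^ _ ` { | } ~


def naive_match(x: int, y: int) -> bool:
    """Literal reading of the idea: drop bit 32 whenever bit 64 is set."""
    if x & ALPHA_BIT:
        x &= ~CASE_BIT
    if y & ALPHA_BIT:
        y &= ~CASE_BIT
    return x == y


def letter_match(x: int, y: int) -> bool:
    """Same trick, but only accepted when the bytes really are ASCII letters."""
    if x == y:
        return True
    if (x ^ y) != CASE_BIT:
        return False  # the bytes differ in something other than the case bit
    return ord("a") <= (x | CASE_BIT) <= ord("z")


print(naive_match(ord("A"), ord("a")))   # True
print(naive_match(ord("["), ord("{")))   # True, a false positive
print(letter_match(ord("["), ord("{")))  # False
```

Only 7-bit ASCII is handled here; accented letters would need a proper collation table.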


    Perhaps another person can see... another way?
    Last edited by ReneMiner; 17-06-2013 at 16:32.

  5. #5
    One that *might* work is colon. This would give operators like =:, <>:, <:, >:, <=:, >=: and so on, with compound operators ONE &=: TWO or ONE +=: TWO, etc.
    I think in terms of syntax, this one here is about as good as it's going to get.

    As far as runtime efficiency, it has an advantage as well - at least, a potential one.

    Consider. If I have

    DIM ONE AS STRING
    DIM TWO AS STRING
    ' assign ONE and TWO
    IF UCASE$(ONE) = UCASE$(TWO) THEN ...
    
    what does TB have to do to implement this? To my eyes, it would have to create a temporary copy of ONE and upper-case it, then make a temporary copy of TWO and upper-case it, then compare the two temporaries. But, what happens if we allow this ...

    IF ONE =: TWO THEN ...
    
    Now, we can have a case-insensitive comparison operator at a low level, so that it could translate to upper case and do the comparison a byte at a time, and so no temporaries would need to be created.

    So, you get two advantages with this syntax: (a) much less code, and (b) faster execution of case-insensitive comparisons.
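
A byte-at-a-time comparison of the kind described could look like the Python sketch below (not thinBasic). It assumes ASCII-only folding and returns -1, 0 or 1 so that the same routine could back all six operators; `compare_bytes_nocase` is a hypothetical name, and whether thinBasic really creates temporaries for UCASE$ is only the guess made in this post.

```python
def compare_bytes_nocase(a: bytes, b: bytes) -> int:
    """Return -1, 0 or 1, folding ASCII letters to upper case one byte at a time."""
    for x, y in zip(a, b):
        if 97 <= x <= 122:  # 'a'..'z'
            x -= 32
        if 97 <= y <= 122:
            y -= 32
        if x != y:
            # Stop at the first difference; the rest is never examined.
            return -1 if x < y else 1
    # All shared positions match, so the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


print(compare_bytes_nocase(b"Hello", b"HELLO"))   # 0
print(compare_bytes_nocase(b"apple", b"BANANA"))  # -1
print(compare_bytes_nocase(b"abc", b"AB"))        # 1
```

No upper-cased copy of either operand is built, and the loop exits at the first mismatch instead of converting both strings in full.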

  6. #6
    I don't like the colon because it has had a meaning, "expression ends here", since 1977 or longer. But in the end, the characters used won't matter...

    I would not know how to apply the Ucase to both sides of the expression without any parentheses. Perhaps Eros can, but I have no idea about all this; I'm only a basic end user. To process strings case-insensitively more than once, I would use a state, as known from OpenGL:

    Enable Collate Ucase 
    'do case-insensitive string-stuff here using normal syntax as 
    
    a += b  
    'whatever
    
    Disable Collate Ucase
    ' now case sensitive again
    
    You could enable that at the very front of the script, never disable it, and use ordinary syntax for the whole script without any additional annotation.
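
A state switch like this can be mimicked in Python with a module-level flag and a context manager. This is only a sketch of the idea; `collate_ucase` and `str_equal` are hypothetical names, and a real interpreter would keep such a flag internally.

```python
from contextlib import contextmanager

_ignore_case = False  # hypothetical interpreter-wide comparison mode


@contextmanager
def collate_ucase():
    """Turn case-insensitive comparison on, then restore the previous mode."""
    global _ignore_case
    previous = _ignore_case
    _ignore_case = True
    try:
        yield
    finally:
        _ignore_case = previous


def str_equal(a: str, b: str) -> bool:
    """Stand-in for the ordinary '=' operator, which consults the current mode."""
    if _ignore_case:
        return a.upper() == b.upper()
    return a == b


print(str_equal("Hello", "hello"))      # False
with collate_ucase():
    print(str_equal("Hello", "hello"))  # True
print(str_equal("Hello", "hello"))      # False again
```

One design cost of a mode switch is that the meaning of a comparison depends on code that ran earlier, so a reader has to know the current state to understand any single line.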

    Or, in short form, for a one-off expression in parentheses:

    If Collate Ucase(a = b) Then PrintL "a and b have same text"
    ' equals Ucase$(a) = Ucase$(b) 
    
    ' shorter syntax could read
    If $^(a = b) Then PrintL "a and b have same text"
    
    ' other meaning "$^" as simple shortcut/replacement for Ucase$
    $^(a) += $^(b) ' etc...
    

  7. #7
    thinBasic MVPs
    Join Date
    May 2007
    Location
    UK
    Posts
    1,427
    Rep Power
    159

    K.i.s.s.

    Quote: Originally posted by John Spikowski
    Seems like a good enhancement to me.

    one = "a"
    two = "b"
    three = "A"
    four = "B"
    
    IF UCASE(ONE) = UCASE(TWO) AND  LCASE(THREE) = LCASE(FOUR) THEN
      PRINT "You need a new computer\n"
    ELSE
      PRINT "Try again with like values\n"
    END IF
    
    I like this version best. It stays true to the name of the language, BASIC, rather than turning it into "C", as some people seem to want. Let's not reinvent the wheel. Just how much hand-holding do you think people need?

    The above would still be the clearest version to me; its intention is obvious, and it even has greater flexibility.

    Mike C.
    Home Desktop : Windows 7 - Intel Pentium (D) - 3.0 Ghz - 2GB - Geforce 6800GS
    Home Laptop : WinXP Pro SP3 - Intel Centrino Duo - 1.73 Ghz - 2 GB - Intel GMA 950
    Home Laptop : Windows 10 - Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz, 2401 Mhz, 2 Core(s), 4 Logical Processor(s) - 4 GB - Intel HD 4400
    Work Desktop : Windows 10 - Intel I7 - 4 Ghz - 8GB - Quadro Fx 370

  8. #8
    This concept doesn't demand a colon, but using one would not be a big deal to implement. The issue of colons ending a statement is not hard to solve.

    The way it works is that, for any compiler or interpreter, they have a 'lexical' phase and a 'parsing' phase. The lexical phase 'grabs tokens' and categorizes them into known types, such as keyword, number, closures (like parens), punctuation, etc. The parser assigns meaning to these tokens.

    If we had a token like =: then the lexical scanner would see '=' and then it would 'look ahead' to see if there was a colon following it. If so, it would 'grab' the two characters and treat them as a composite token of "=:" rather than an = sign followed by an end-of-statement : colon delimiter. This kind of stuff is very standard in compiler-like software, and certainly Eros knows all about this.
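
The look-ahead described here is often called "longest match" or "maximal munch". A toy scanner in Python illustrates it; this is a sketch only, the token set is an assumption, and nothing here reflects how thinBasic's real scanner is written.

```python
# Longest operators first, so "<=:" wins over "<=" and "<", and "=:" over "=".
OPERATORS = sorted(
    ["=", "<>", "<", "<=", ">", ">=", "=:", "<>:", "<:", "<=:", ">:", ">=:", ":"],
    key=len,
    reverse=True,
)


def tokenize(source: str) -> list:
    """Split a line into identifier/number tokens and operator tokens."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
            continue
        if ch.isalnum() or ch in "_$":
            # Identifier, keyword or number: consume the whole run.
            j = i
            while j < len(source) and (source[j].isalnum() or source[j] in "_$"):
                j += 1
            tokens.append(source[i:j])
            i = j
            continue
        for op in OPERATORS:
            if source.startswith(op, i):  # look ahead from position i
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens


print(tokenize("IF ONE =: TWO THEN"))  # ['IF', 'ONE', '=:', 'TWO', 'THEN']
print(tokenize("A = 1 : B = 2"))       # ['A', '=', '1', ':', 'B', '=', '2']
```

In the second line the colon is still read as a statement separator, because it does not directly follow a comparison character.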

    Some of the suggestions like ENABLE COLLATE UCASE, IF COLLATE UCASE, etc. are certainly possible. But, the goal of this exercise was to make case insensitive comparisons shorter, not longer. If I wanted a long expression for this, then

    IF UCASE$(ONE) = UCASE$(TWO) THEN ...
    
    ' is just as wordy as
    
    IF COLLATE UCASE (ONE = TWO) THEN ...
    
    ' and neither is nearly as nice to type or as easy to read as
    
    IF ONE =: TWO THEN ...
    
    Last edited by Robert Hodge; 18-06-2013 at 17:22.

  9. #9
    Anyway, there's already the IsLike() method, which allows string comparison with some fancy stuff around it, whatever you want: from $DQ to Chr$(34) or """, or even spaces, or, I dunno, whatever you can type in 5 seconds. It also has only 6 letters and two parens to type, allows wildcards and leading/trailing truncation, and has a case-insensitive switch if desired. Maybe think about using this method to move forward with developing that Rexx module. I'm nosy.
    John, you're sniffing out a new victim?
    Last edited by ReneMiner; 18-06-2013 at 20:46.

  10. #10
    The IsLike function is certainly a possibility. You would have to treat the right-hand side of the comparison as the "pattern" string, so it would be:

    DIM ONE AS STRING
    DIM TWO AS STRING
    
    ' instead of ...
    
    IF UCASE$(ONE) = UCASE$(TWO) THEN ...
    
    ' you'd use ...
    
    IF IsLike(ONE, TWO, %FALSE) THEN ...
    
    The main drawback to IsLike is that if the second argument contained any pattern characters like * ? or # the comparison would not work right.
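
The drawback is easy to reproduce with any wildcard matcher. The Python sketch below uses the standard `fnmatch` module purely as a stand-in for IsLike; its pattern syntax (`*`, `?`, `[...]`) is not identical to thinBasic's, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def like_nocase(text: str, pattern: str) -> bool:
    """Wildcard match after upper-casing both sides (stand-in for IsLike)."""
    return fnmatchcase(text.upper(), pattern.upper())


def equal_nocase(text: str, other: str) -> bool:
    """Plain case-insensitive equality, with no pattern characters involved."""
    return text.upper() == other.upper()


# A '*' in the right-hand string makes different strings "match".
print(like_nocase("report2013.txt", "REPORT*.TXT"))   # True
print(equal_nocase("report2013.txt", "REPORT*.TXT"))  # False

# Brackets in the right-hand string make identical strings fail to match.
print(like_nocase("a[1]", "A[1]"))   # False
print(equal_nocase("a[1]", "A[1]"))  # True
```

So a pattern matcher can be wrong in both directions when the second argument is arbitrary data rather than a deliberately written pattern.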

    Now, if you really wanted functional notation rather than my clever new operators, you could use EQ, NE, GT, LT, GE and LE, since these names don't appear to be taken by anything else. That would render the comparison as,

    IF EQ(ONE, TWO) THEN ...
    
    which isn't too bad. I still like my way better:

    IF ONE =: TWO THEN ...
    
    but hey - "you takes what you's can gets".

    If we wanted to have case-insensitive comparisons that were also trimmed, we could use EQ_T, NE_T, GT_T, LT_T, GE_T and LE_T, or something like that. Implementing "trimmed" versions of these functions would have a lower priority though.
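
The plain and trimmed function families differ only in how each operand is normalized before comparing, which a small factory makes explicit. This is a Python sketch of the idea, not thinBasic; the factory `_make` is hypothetical, and trimming is assumed to remove spaces only.

```python
import operator


def _make(compare, key):
    """Build a two-argument comparison that normalizes both operands with `key`."""
    def wrapped(left: str, right: str) -> bool:
        return compare(key(left), key(right))
    return wrapped


def fold(s: str) -> str:
    return s.upper()


def fold_trim(s: str) -> str:
    return s.strip(" ").upper()  # assumption: only spaces are trimmed


EQ, NE = _make(operator.eq, fold), _make(operator.ne, fold)
LT, LE = _make(operator.lt, fold), _make(operator.le, fold)
GT, GE = _make(operator.gt, fold), _make(operator.ge, fold)

EQ_T, NE_T = _make(operator.eq, fold_trim), _make(operator.ne, fold_trim)
LT_T, LE_T = _make(operator.lt, fold_trim), _make(operator.le, fold_trim)
GT_T, GE_T = _make(operator.gt, fold_trim), _make(operator.ge, fold_trim)

print(EQ("One", "ONE"))       # True
print(EQ(" One ", "ONE"))     # False: the spaces still count
print(EQ_T(" One ", "ONE"))   # True
```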
    Last edited by Robert Hodge; 18-06-2013 at 23:33.
