This is about Unicode: why almost no one really uses it in code for desktop Windows, and why, in fact, almost no one can, short of using UTF-8, which is a very programmer-unfriendly way of encoding chars. Yet it is the only one that actually makes sense to use.
As programmers prefer chars of a fixed size, most of them run into WIDE, because the 32-bit version (would it be DWIDE?) is known as a waste of more than 30% of its memory: Unicode requires only 21 of the 32 bits for all UniCodePoints in the range U+0000..U+10FFFF, leaving 11 of 32 bits (about 34%) idle.
Now start simple: what have we had since computers hit the market? Asc-2. Don't look puzzled: after ASC (which is still a command in many BASIC dialects) came Asc-2, and since the roman numeral II sits much more centrally on keyboards than a 2 that requires shifting on a few layouts, it became ASCII over time - and almost no one knows that it is the second release, a makeover of ASC.
Can we agree that STRINGS ARE NOT ARRAYS, but by default dynamic sequences of different possible kinds of elements? I can have a sequence of any datatype that is made of homogeneous components. A sequence allows adding, appending, inserting and removing elements anywhere in the default occurrence, with random access.
A sequence does not need to be dimensioned. That is a difference from an array.
Easy to tell apart: a sequence is always 1-based and has no dimensions besides the single one it has. And the numbers wrapped in brackets "[" & "]" are, unlike the "(" & ")"-wrapped ones, NOT INDEXES but simply POSITIONS.
That explains, for a start, why a sequence's members are 1-based. The first one is [1], and with [5..7,11..16,(...)] you can pick arbitrary sub-sequences this way. Who needs LEFT$, RIGHT$ or MID$ when sequences can store chars as well as numbers?
Now, however you mix, shuffle or re-sort: for the elements of an array the indexes stay attached to their values; only the order of display might change.
What else?
Oh yes, I mentioned the random-access model. As is not hard to see, a sequence is a very linear storage model; it should always keep some excess memory preallocated and should at least expose how many elements are currently stored as information for the user. How much is actually allocated probably depends on the need for dynamics and on how often the element count changes. o2basic comes with something like this, but somehow it was reverse-developed into a kind of array.
I mean, that is OK and useful, since sequences are portable to arrays at any time. The reverse operation, casting an array to a sequence, is free of any problem if the array has only 1 dimension; otherwise you would have to iterate through all dimensions and can transfer only the lowest one in blocks matching that dimension's element count. The dimensions will be lost, but hey: if it was previously a sequence that was interpreted as an array to take advantage of the provided array functions - fine - it is easy to cast back.
We can do the same with the functions that exist for strings. And here you see why strings are not arrays but sequences: if you DIM A$(3,3,3), do you now have any string that contains 3 chars?
So far, so good. Very well-known sequential storage models that you have all heard of, or even developed yourself: the stack (push & pop), where what goes in last comes out first, and its uneven twin the queue (push & pull), where what goes in last comes out after all the others that reached the other end before it. Use it to store notifications and process them in the order received, when there is time for it.
OK, is the sequence clarified? Understood? I guess so. Of course it is possible to have a sequence of numeric types - bytes - does "blob" ring a bell? - and of chars anyway.
For mixed types as collections, sequences work only with pointers: two for each object in the collection, data* for the location of the data and structure* for how to group the bytes into chunks, of what sizes, and what they are.
Good.
Now we have datatypes that are CHARS - not numbers. They have a numeric code, yes, for the computer to find the right "drawer" where the char-drawing instructions are kept, and to serve as the comparison object it needs to identify a char. But to note those numbers we require digits. And digits are? Chars.
OK, we have the 7-bit Asc-2, or ASCII, as you like. Do you want to enumerate the ASCII codes? I recommend - for this char-type only and for no other - to use signed bytes. Why? Usually charcodes are unsigned.
Yes.
Because the unsigned range covered by a signed byte makes any value exceeding 0..127 an impossible char: it falls outside the 7-bit ASCII limits and must become another chartype, or the sequence containing it autocasts to binary (blob).
ASCII, by the way, is 100% interchangeable with utf7. You know what we do now? We make real chars for real Unicode - unions, of course:
Union codepointcomponent_1
    bUtf7 as uByte
    bAsc2 as sByte
    cpc1 as String * 1
End Union
OK, it has 8 bits, but that is the clue: it passes as 7-bit because the MSB is unused. This is a char-component, and from these, chars will be made whose components have a size of 1 byte. We also have the others, but not using the word type, because...
Union codepointcomponent_2
    wByte(2) as uByte
    cpc2 as String * 2
End Union
Why wByte(2)? Assume 1-based: it needs 2 bytes for 16-bit char-components. Why not use the word type?
For heaven's sake, didn't you pay attention? Should I repeat the most stupid fault in the history of electronics and use a numeric type for characters? That is why we have that BYTE-ORDER MESS on Windows, and already 3 different encodings that no one is satisfied with.
I guess that was the same scientist who thought to squeeze all languages into 65535 bitty-bytes, not aware that this is not enough for the Japanese scripts alone. Who was so smart as to suggest using the 32 bits because it is computer-conform, and - since we already have the numbers present - using them to put captions on the drawers where each char is stored? And the Nobel Prize in the category backward-development for backward thinkers goes to the one who painted the charcodes in human-readable order onto the drawers the computer will take the chars from. The backside of this medal starts glowing red when the smart approach must, for the second time, swap byte 1 with 4 and byte 2 with 3. And that for every char. Twice. Each time.
Saddle the other horse Joel, we will ride the code nah'
· an utf8-char is a dynamic sequence of 2, 3 or 4 components the size of a byte - it does not have that byte-order mess because it is not based on a numeric type.
Marked as an idea to learn from.
· an utf8-char made of 1 component does not exist - what you have in mind is an utf7-char
· utf7-chars expand, according to their value, to one of 3 utf8-chartypes
· utf7-chars and utf7-sequences ("strings") are 1:1 interchangeable with Asc2-chars or Asc2-sequences
· a sequence of Asc2-chars (II is a roman numeral!) can - if no CHR(0) is contained - autocast to an ANSI-sequence
· if a char-component exceeds the 7th bit and has a value > 127, it cannot stay utf7/Asc2, and a check begins
· if the char-component-code is 194..223, the following char-component must be in the range 128..191, and the component after that must NOT be in that range; if that matches, the utf7 or Asc2-char expands to an utf8-char with 2 components, and the sequence that was formerly Asc2 or utf7 auto-casts to an utf8-sequence
· is the lead component in the range 224..239, exactly two components 128..191 have to follow - and no third in that range - to be valid utf8 made of 3 bytes; and is the first in 240..244, exactly 3 follower bytes 128..191 must come after it
· that leaves 192..193, 245..255 and any lone 128..191 in lead position to change the sequence immediately to ANSI, as these can never be utf8 - just as when a probable utf8-candidate fails the test
· for ANSI-sequences there is one restriction: no CHR(0) is permitted. If there is a byte-value of 0, the sequence becomes a binary byte-sequence, a so-called BLOB.
As you see, the "ancient" Asc2-chartype covers only a very small range of the available UniCodePoints, which run from U+0000..U+10FFFF -
that is far more than a million codepoints: 1,114,112 of them.
Now UTF16 - often wrongly confused with WIDE strings. Let me visualize it: the maximum count of displayable different values in 16 bits, which WIDE is, makes 65536 including 0.
That is just under 6% of the Unicode range. If someone tells you his app is Unicode-aware, or shows Unicode text because he uses wide strings, he is a hilarious liar.
By that measure I could call my ASCII-7-bit-chars-displaying app Unicode as well - the gap from Asc2 to WIDE is not that much of a difference in comparison to UNICODE.
WIDE, like ASCII, is only a small part of the Unicode range, and WIDE uses the fixed size of 2 bytes per component. And it uses the fixed count of 1 component per char.
All fixed - the programmer's darling.
Now, UTF16 covers all codepoints, still with the fixed size of 2 bytes per char-component, but about 94% of the codepoints - everything above 0xFFFF - need 2 components of 2 bytes each. These were gained by truncating 2048 positions out of the WIDE range: 0x400 (1024) high surrogates in the range 0xD800..0xDBFF and 1024 low surrogates in 0xDC00..0xDFFF.
These 1024*1024 = 1,048,576 combinations give the missing million codepoints that let WIDE catch up and calm the long faces of the developers who dared to call WIDE Unicode.
Actually the encoding saves nothing in comparison to UTF32 - across a surrogate pair it wastes practically the same 11 of 32 bits, and the need to check every char's value, whether it remains in the WIDE range or expands to 2 components per char, makes it a performance-killer. UTF32 is only used by crazy people on Unix-like systems, where UTF32_BE - lacking the need to swap bytes 1 & 4 and 2 & 3 as required on Windows - is nearly identical to the Unicode codepoint enumeration.
Thats where we are now.
I suggest we take an example from ASCII - that uneven count of 7 bits - which, times 3, makes the 21 that Unicode requires exactly. The datatype to use: a chartype; we have no numeric types providing 21 bits anyway. That leaves 1 bit spare per byte, just as in 7-bit ASCII, and the idea is to use:
union uchar21
    ucc as string * 3
    b(3) as uByte
end union
The byte values are noted exactly as codepoints. The 3 highest bits are not required - use them to enumerate what? To find out if chars are missing when text is displayed on the web? Why? Don't you see where chars a e mi ing?
No byte order required: the first (left) byte takes values from 0x00 to 0x10 - the plane number - so this byte carries Unicode without any encoding to anything else but UNICODE. Call it maybe UnicodePoints, UCP21 or utf21/24 - no one cares, but anyone will be happy to open the Unicode book, take a codepoint and type it in without needing conversion tables.
Byte by byte. The same on every computer that can create a char as string * 3.
For now it requires a library such as freetype (very aged) to fetch the glyphs from a font's cmap, pushing the value virtually through a 32-bit slot - but the speed is the same, the required memory is 25% less than UTF32, and with zero conversion time it beats UTF8 out of its socks.
If it is Unicode: think char - not number. Think ONE anywhere, and the same everywhere.
There is no other way to get there.