- DelphiTools - https://www.delphitools.info -

Unicode Leftover Bug From Hell

In other words, before getting to the gory details: DWScript [2] now works when compiled with {$HIGHCHARUNICODE ON} on a machine with Cyrillic code page 1251 [3].

DWScript was converted to Unicode years ago, and has been working just fine since.

But there was a leftover bug from that crossing of the Styx [4].

Failing in an unexpected place

Last week Alexey Kazantsev reported a bug, where the DWScript tokenizer was failing on very trivial code. This is a portion of the engine that is heavily trodden, trampled and pounded upon by the unit tests, so it was very surprising.

Even more surprising, I couldn’t reproduce it. We checked Delphi versions, settings, even the source ZIP to exclude any SCM quirk. Yet the issue was still very “reliably” there in his case when HIGHCHARUNICODE [5] was ON, and very reliably not there in mine, regardless of settings.

After some more digging, down to map files and executable binary comparison, it came down to two different constant values and a single line of code in the tokenizer, where sets are used to define character ranges. For instance,

cANYCHAR - [#13, #10]

is used to describe any character but CR and LF, and cANYCHAR is declared as

const cANYCHAR = [#0..#255];

In practice, since the Unicode conversion, the tokenizer only uses those ranges for ASCII characters (so #0 to #127). The extra #128..#255 range of cANYCHAR was thus unused, and as long as the range ended at #127 or anything above it, everything worked.

Except when the code is compiled with HIGHCHARUNICODE ON on a machine running with code page 1251 (Cyrillic)…

Hidden AnsiChar

Even though there is no AnsiString or AnsiChar in sight, the compiler hard-codes character sets as sets of AnsiChar (one of the problematic choices made when Unicode strings were introduced in Delphi).

With HIGHCHARUNICODE ON, the compiler thus reads #255 as Unicode character U+00FF (‘ÿ’, aka “Latin small letter y with diaeresis”), and then silently converts it to Ansi using the current system code page. In this particular case that yields a ‘?’, as ‘ÿ’ is not part of code page 1251.

So the constant declaration was then a silent equivalent of

const cANYCHAR = [#0..'?'];

Which obviously is not equivalent to “any char” anymore.
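
To make the effect concrete, here is a minimal sketch. The program and constant names are made up for illustration and are not taken from DWScript. It spells out the truncated range explicitly, so it should behave the same regardless of code page or HIGHCHARUNICODE setting. Since ‘?’ is #63, every ASCII letter falls outside the range while digits stay inside, which would explain a tokenizer failing on very trivial code.

```delphi
program TruncatedSetDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils; // for CharInSet

const
  // Hypothetical names. cTRUNCATED spells out what cANYCHAR silently
  // became on the code page 1251 machine; cASCII is the ASCII-only range.
  cTRUNCATED = [#0..'?'];   // '?' is #63
  cASCII     = [#0..#127];

begin
  // Letters start at #65 ('A'), so they fall outside the truncated set
  Writeln(CharInSet('a', cTRUNCATED - [#13, #10])); // FALSE
  // Digits sit below #63 and are still matched
  Writeln(CharInSet('1', cTRUNCATED - [#13, #10])); // TRUE
  // With the ASCII range, letters are matched again
  Writeln(CharInSet('a', cASCII - [#13, #10]));     // TRUE
end.
```

CharInSet is used here rather than the in operator because the tested value is a Unicode Char while the set holds AnsiChar elements.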

Bottom Line

Once that’s known, the fix is simple: just change the declaration to

const cANYCHAR = [#0..#127];
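
As a hedged aside, one way to guard against this class of problem is to keep character literals above #127 out of set constants altogether, and to check the expected membership once at startup. The sketch below is an assumption about how that could look, not what DWScript does. cANY_ANSICHAR and CheckCharSets are hypothetical names.

```delphi
program CharSetGuardDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils; // for CharInSet and Exception

const
  cANYCHAR = [#0..#127]; // the corrected declaration from the article
  // Hypothetical alternative: bounds taken from the AnsiChar type itself,
  // so no character literal above #127 is involved
  cANY_ANSICHAR = [Low(AnsiChar)..High(AnsiChar)];

// Hypothetical self-check: fails loudly if a set was silently truncated
procedure CheckCharSets;
begin
  if not CharInSet('z', cANYCHAR) then
    raise Exception.Create('cANYCHAR does not cover ASCII letters');
  if not (High(AnsiChar) in cANY_ANSICHAR) then
    raise Exception.Create('cANY_ANSICHAR does not cover the full range');
end;

begin
  CheckCharSets;
  Writeln('character sets OK');
end.
```

A check like this only helps if it runs in a binary built on the affected machine, since the truncation happens at compile time.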

The bottom line is that even after years of use, there can still be bugs lurking in code that was converted to Unicode…