Recent commits to the DWScript repository doubled compiler performance when compiling many small scripts, as happens in the unit test suites.
It all started from a first profiling run, in which the memory allocations around the UnicodeLowerCase function came out as the top bottleneck.
Thing is, Pascal being a case-insensitive language, there are lots of case-insensitive comparisons, lookups, searches and hashes, and it turns out a key hash code was computed with code like
hashCode := SimpleStringHash(UnicodeLowerCase(string))
So there was a memory allocation for each hash code computation. This was quickly solved with an optimistic implementation, taking advantage of (1) most identifiers actually being ASCII and (2) most identifiers being short.
The result is a new SimpleStringLowerCaseHash function.
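The optimistic approach can be sketched as follows (in C for illustration; the function name, the fallback convention and the FNV-1a mixing here are hypothetical, not DWScript's actual code): hash and lowercase ASCII characters inline with no allocation, and bail out to the full Unicode path only when a non-ASCII character shows up.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch of an allocation-free case-insensitive hash.
// Lowercases ASCII letters inline; returns 0 to signal "non-ASCII
// found, caller should fall back to the Unicode-aware path".
uint32_t simple_lowercase_hash(const uint16_t *s, size_t len)
{
    uint32_t h = 2166136261u;            // FNV-1a offset basis
    for (size_t i = 0; i < len; i++) {
        uint32_t c = s[i];
        if (c > 127)
            return 0;                    // optimistic bet lost: fall back
        if (c >= 'A' && c <= 'Z')
            c += 32;                     // ASCII lowercase
        h = (h ^ c) * 16777619u;         // FNV-1a prime
    }
    return h;
}
```

Since Pascal identifiers are overwhelmingly short and pure ASCII, the fallback is almost never taken, so the allocation disappears from the hot path.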
But benchmarking yielded an oddity: the Win32 version was now much faster than the Win64 one!
A round of profiling pointed fingers at the xxHash implementation, which benefitted from an asm implementation in Win32, but none for Win64.
Now, the Delphi Win64 compiler usually works quite well, but xxHash relies on bitwise rotations, which are trivial in asm but have to be emulated with shifts at the language level. So the fix was to bring the asm to Win64.
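For illustration, here is what that emulation looks like in C (which likewise has no rotation operator): two shifts and an or, where a single x86 rol instruction would do.

```c
#include <stdint.h>

// Shift-based emulation of a 32-bit left rotation, as generated when
// the language has no rotation operator or intrinsic. A single x86
// `rol` instruction performs the same operation.
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31;                              // keep both shift counts well-defined
    return (x << r) | (x >> ((32 - r) & 31));
}
```

Modern C compilers usually pattern-match this idiom back into a rol instruction, which is exactly the recognition the Delphi Win64 compiler was not doing.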
Another benchmark demonstrated a marked improvement, but… Win32 was still ahead! Another round of profiling pointed to a specific line, which looked innocuous enough:
if c in [ Ord('A')..Ord('Z') ] then
Looking at what it’s compiled to, in Win32 we have
mov ecx, edx
add ecx, -$41
sub ecx, $1a
jnb $0124e097
which is good: it converts the range comparison to a subtraction (aptly disguised as an addition) and an unsigned comparison to zero (aptly disguised as a subtraction).
But on the Win64 side, it’s compiled to something a little bit more arcane
mov ecx, edx
sub ecx, $40
cmp ecx, $1f
jnbe SimpleStringLowerCaseHash + $9D
mov edi, $00000001
shl edi, cl
mov ecx, [rel $00000046]
test edi, ecx
setnz cl
jmp SimpleStringLowerCaseHash + $9F
xor ecx, ecx
test cl, cl
jz SimpleStringLowerCaseHash + $AE
Even without understanding what’s going on, it’s obvious the compiler is doing something more complicated… The first lines (sub & cmp) are doing the same thing the Win32 compiler was doing, which is to convert the range comparison to a subtraction (the sub) and an unsigned comparison (the cmp).
That’s when things take a dark turn, because instead of just proceeding like the Win32 compiler did, the Win64 compiler is creating a temporary boolean, setting it, and then jumping on it. It’s a bit like doing:
var bool : Boolean;
if (c in [ Ord('A')..Ord('Z') ]) = True then
   bool := True
else bool := False;
if bool then
   ...
No wonder it’s slower: we’re in the inner loop here, and doing that for all characters in a string.
The workaround is to make it simpler for the compiler:
// if Cardinal(c - Ord('A')) <= Cardinal(Ord('Z') - Ord('A')) then
mov esi, eax
sub esi, $41
cmp esi, $19
jnbe SimpleStringLowerCaseHash + $8C
and now the Win64 compiler generates code similar to Win32’s, just without aptly disguising the subtraction and comparison.
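The same unsigned-subtraction trick is idiomatic in C too; a minimal sketch:

```c
#include <stdint.h>

// Range test 'A' <= c <= 'Z' collapsed into one unsigned comparison:
// if c < 'A', the unsigned subtraction wraps around to a huge value,
// so a single compare covers both bounds of the range.
static int is_ascii_upper(uint32_t c)
{
    return (c - 'A') <= (uint32_t)('Z' - 'A');
}
```

This is exactly the transformation the Win32 codegen applied automatically; spelling it out by hand just makes the Win64 compiler emit the same thing.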
If you want to improve the performance of SimpleStringLowerCaseHash even further, you might want to look at 2 Chars at once and do the uppercasing without a branch. For some inspiration, take a look at GetHashCode_UString_OrdinalCaseInsensitive in Spring.Comparers.pas.
Thanks. Interesting approach; the “or $20” adds collisions for a few non-letter characters, but those wouldn’t matter for identifiers (it collides @ with backtick, and _ with DELETE).
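Those collisions are easy to check (a C sketch): or-ing $20 maps the whole $40..$5F column onto $60..$7F, letters and non-letters alike.

```c
// "or $20" folds 'A'..'Z' onto 'a'..'z', but also folds a few
// non-letter pairs together -- harmless when hashing identifiers,
// which cannot contain @, backtick or DELETE anyway.
static unsigned fold20(unsigned c)
{
    return c | 0x20;
}
```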
You can use a lookup table like array[AnsiChar], though it needs 8-bit input, not WideChar.
Another possibility is to use a branchless comparison.
Here is an AnsiChar version:
https://github.com/synopse/mORMot2/blob/272ebfb63fa2e30668163315ce1b2146dee953ba/src/core/mormot.core.unicode.pas#L6849
But you can easily adapt it to WideChar and handle 4 WideChars per iteration, with NO branch.
@Stefan committed an update with SimpleStringCaseInsensitiveHash; it’s about twice as fast as the older one. Unfortunately that’s not noticeable in the global benchmark, as it wasn’t anywhere near a bottleneck anymore.
@Arnaud, sounds like the branchless version is something that could be adapted to a CompareText, which is still a bottleneck (as a comparison is required in the hashtable to confirm a hit)
@Eric
You can make a branchless uppercase conversion + hash compute with this pattern, one NativeInt at a time.
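A sketch of that pattern in C, one 64-bit word (8 ASCII AnsiChars) at a time — a SWAR range check that clears the $20 bit only on lowercase letters, with no branch; a WideChar variant would use the same masks widened to 16-bit lanes (this sketch assumes all bytes are ASCII, i.e. below $80):

```c
#include <stdint.h>

// Branchless SWAR uppercase for 8 ASCII bytes packed in a uint64_t.
// For each byte b (assumed < 0x80):
//   ge_a has bit 7 set where b >= 'a'
//   gt_z has bit 7 set where b >  'z'
// so (ge_a & ~gt_z) marks lowercase letters; shifting that 0x80 flag
// right by 2 yields a 0x20 mask, which XOR clears to uppercase.
static uint64_t swar_to_upper_ascii(uint64_t x)
{
    const uint64_t ones = 0x0101010101010101ULL;
    uint64_t ge_a = x + (0x80 - 'a') * ones;
    uint64_t gt_z = x + (0x80 - 'z' - 1) * ones;
    uint64_t is_lower = (ge_a & ~gt_z) & (0x80 * ones);
    return x ^ (is_lower >> 2);
}
```

The per-byte additions cannot carry between lanes as long as the input is ASCII, which is why the whole word can be processed in a handful of instructions.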
@Arnaud, but for a hash the approximated ‘or $20’ lowercase approach is even faster, and the approximation is fine for identifier strings (if not for general string use). For CompareText, an exact version like yours is required.
I’ll run some benchmarks, because statistically, case matches more often than not when compiling, and the overhead of going branchless may be detrimental vs a branching comparison that is well predicted.