On this StackOverflow question David Heffernan asked about a hack I’m using in DWScript’s UnifyAssignString.
TStringListCracker
The code of the function looks like:
procedure UnifyAssignString(const fromStr : UnicodeString; var toStr : UnicodeString);
var
i : Integer;
sl : TUnifierStringList;
begin
if fromStr = '' then
toStr := ''
else begin
i := Ord(fromStr[1]) and High(vCharStrings);
sl := vCharStrings[i];
sl.FLock.Enter;
i := sl.AddObject(fromStr, nil);
toStr := TStringListCracker(sl).FList[i].FString; // HACK HERE
sl.FLock.Leave;
end;
end;
The non-hacky variant would be to have that line just be
toStr := sl[i];
but if you use that simpler form, you could see a performance drop by up to 45% (Delphi XE and XE2 32bit) or 38% (Delphi XE2 64bit).
Ouch! Why is that?
The hacky version
The hacky version compiles to a single UStrAsg, the internal function for string assignment which takes care of the reference-counting. So it’s just:
dwsUtils.pas.487: toStr:=TStringListCracker(sl).FList[i].FString; 00511594 8B0424 mov eax,[esp] 00511597 8B532C mov edx,[ebx+$2c] 0051159A 8B14F2 mov edx,[edx+esi*8] 0051159D E8A659EFFF call @UStrAsg
i having been returned by AddObject, no range-check is required.
(note that TStrings.AddObject is called directly because TStrings.Add is just an indirection to AddObject)
What about the “toStr := sl[i]” version?
The property is just syntax sugar on a getter method, so what is actually compiled is “toStr := sl.Get(i)“, which itself, being a virtual call, looks like
toStr := sl.VMT[ TStrings_Get ]( sl, i );
However the virtual call to TStringList.Get on its own costs only a tiny bitsy of extra CPU cycles, though it also prevents inlining (at least, until the compiler gets de-virtualisation optimizations).
But there are other factors at work, the virtual call on its own can’t come close to explaining the performance differential, as after all, the rest of the code UnifyAsstringString is far from trivial (with a critical section and a binary search…)
First, it’s a function returning a string, which is a managed type, so it’s actually compiled to something like:
try temp := sl.getValue( i ); UStrAsg( toStr, temp ); finally UStrClr( temp ); end;
Second, TStringList.GetValue itself is compiled to
TStringList.Get: 00448204 53 push ebx 00448205 56 push esi 00448206 57 push edi 00448207 8BF9 mov edi,ecx 00448209 8BF2 mov esi,edx 0044820B 8BD8 mov ebx,eax 0044820D 3B7330 cmp esi,[ebx+$30] 00448210 720F jb $00448221 00448212 8B1544EA5100 mov edx,[$0051ea44] 00448218 8BCE mov ecx,esi 0044821A 8BC3 mov eax,ebx 0044821C E8C3E3FFFF call TStrings.Error 00448221 8BC7 mov eax,edi 00448223 8B532C mov edx,[ebx+$2c] 00448226 8B14F2 mov edx,[edx+esi*8] 00448229 E81AEDFBFF call @UStrAsg 0044822E 5F pop edi 0044822F 5E pop esi 00448230 5B pop ebx 00448231 C3 ret
That’s quite a lot of code, a fair share of it is spent juggling registers because TStringList.Get is implemented as
function TStringList.Get(Index: Integer): string; begin if Cardinal(Index) >= Cardinal(FCount) then Error(@SListIndexError, Index); Result := FList^[Index].FString; end;
If the range check fails, Error triggers an exception and the last line won’t ever be reached, but the compiler can’t know that, so it compiles everything assuming the last line can be reached even after an Error, thus has to save the parameters, hence the stack juggling.
(the fix to the juggling? use an “else”)
Apart from that issue, there remains in TStringList.Get a range check and an extra UStrAsg call.
And UStrAsg itself includes an atomic incrementation instruction (bus locking)
lock inc dword ptr [edx-$08]
Even though modern CPUs handle them far more efficiently than their ancestors, atomic instructions still cost quite a bit more than their non-atomic counterparts, and the more multi-threaded you are, the more expensive they get.
Summary
So when all is said and done, the hacky version has:
- 1 atomic instruction (1 UStrAsg)
- 1 call
While the non-hacky version has
- 3 atomic instruction (2 UStrAsg, 1 UStrClr)
- 4 calls (2 UStrAsg, 1 UStrClr, 1 TStringList.Get)
- 1 exception frame (free in 64bit, not free in 32bit)
- higher register pressure in UnifyAssignString
- 1 range check
- a lot of register juggling (especially in 32bit)
So while “cracking” TStringList isn’t safe, depending on the circumstances, you can use it to achieve a boost for very little extra complexity.
The same remark works for TList and TObjectList. For TStringList, you have to “crack” it, but for TList and TObjectList, you can use its List: PPointerList property.
In our mORMot framework code, we use seldom the TList.List[] array of pointers instead of TList.Items[], when you are sure the index won’t be out of range:
for i := 0 to fSessions.Count-1 do with TAuthSession(fSessions.List[i]) do if (fIDCardinal=aSessionID) and (fUser.LogonName=aUserName) then …
Such code compiles very well, do not make any call nor run any complex inline statement. Sometimes, the “with … do” statement will save execution time when checking for the parameter values.
In all cases, you’ll better know a bit about assembler, and use Alt-F2 to ensure that your code is better than the initial version. Using a real profiler, as you do, is mandatory to make real speed improvements.
Please remove my previous two comments, for some reason, wordpress isn’t working properly… ):
I have looked at DWScript code and all I can say is that for the most part, a lot of hardcore optimizations are made, this post emphasises that even more.
“toStr&nbp;:= sl.Get(i)“ <- missed a "s" from " " "secon, TStringList.GetValue itself..." I think you meant "second" P.S. I still can't believe the speed deferences of over 35% => mind blowing
@A. Bouchez
TList only has some of the issues, as TList.Get isn’t virtual, and the returned value isn’t managed. So TList/TObjectList don’t trigger exception frames nor the extra bus locks that hurt TStringList.Get so bad.
But yes, .List is accessible without having to crack the classes and can be quite useful. Even better, it’s a pointer to an array, so rather than accessing for each item, you can store a copy in a local variable, which allows the compiler to allocate it to a register for even better performance (that’s done in a few places in DWScript).
Hi Eric, can you keep and document not optimised (hacked) code in source for future porting to FreePascal, ARM compiler, etc.
David did not ask this question – opc0de did.
@Leonardo Herrera – and of course you refer to a comment to the original question. Nevermind.
For the record, using older Delphi compilers, the hacked access to items in TStringList is only 10% faster then the original version (tested with D6, D7 and BDS 2006).
I still can’t install DWSScript on Delphi XE 2 i don’t know either is my version broken . Did anyone encountered this problem before ?
@opc0de
Make sure you’ve got the latest SVN (and aren’t stuck on a 2.1 branch in SVN) or a get recent ZIP from the downloads section.