As first noticed by dewresearch, Delphi XE6 introduced a new optimization for inlined functions that return a floating-point value.
Here is an exploration of what was improved… and what was not improved.
When inlining was introduced in Delphi, one limitation was that functions returning a floating-point value would incur an unnecessary round-trip to the stack, which for short/simple math functions could sometimes not just negate the benefits of inlining, but make the performance worse.
With XE6, that round-trip seems to have been optimized away in some cases.
A look at the most trivial case
Here are test cases for the usual conventions for functions returning a floating point value:
function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;
The procedure variants traditionally offered higher performance than the function variants, by eliminating all round-trips to the stack.
Here is the Delphi XE6 compiler output for calls to those functions. The inlined variants are nice and tight:
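To make the two conventions concrete for something slightly less trivial than a constant, here is a minimal sketch. `Half` and `HalfVar` are hypothetical helpers invented for this illustration, they are not part of the test cases above, and the comments describe the Win32 behavior seen in the listings of this article.

```pascal
program FloatConventions;

{$APPTYPE CONSOLE}

// Hypothetical helper, function form: on Win32 the result comes back
// on the FPU stack and the caller then stores it to its destination.
function Half(const x : Double) : Double;
begin
   Result := x * 0.5;
end;

// Same hypothetical helper, procedure form: the caller passes the address
// of the destination and the value is written there directly.
procedure HalfVar(const x : Double; var Result : Double);
begin
   Result := x * 0.5;
end;

var
   a, b : Double;
begin
   a := Half(3.0);
   HalfVar(3.0, b);
   Writeln(a:0:2, ' ', b:0:2);
end.
```

The procedure form is less pleasant to use at the call site (it cannot appear inside an expression), which is the price traditionally paid for avoiding the extra store and load.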
Unit1.pas.48: f := GetFloat;
005D7357 E8DCFFFFFF call GetFloat
005D735C DD1C24 fstp qword ptr [esp]
005D735F 9B wait
Unit1.pas.49: f := GetFloatInline;
005D7360 33C0 xor eax,eax
005D7362 890424 mov [esp],eax
005D7365 89442404 mov [esp+$04],eax
Unit1.pas.50: GetFloatVar(f);
005D7369 8BC4 mov eax,esp
005D736B E8DCFFFFFF call GetFloatVar
Unit1.pas.51: GetFloatVarInline(f);
005D7370 33C0 xor eax,eax
005D7372 890424 mov [esp],eax
005D7375 89442404 mov [esp+$04],eax
By comparison, here is the Delphi XE compiler output for the GetFloatInline call. The output is unchanged for the other calls.
Unit1.pas.49: f := GetFloatInline;
004AB664 33C0 xor eax,eax
004AB666 89442408 mov [esp+$08],eax
004AB66A 8944240C mov [esp+$0c],eax
004AB66E 8B442408 mov eax,[esp+$08] // stack juggling
004AB672 890424 mov [esp],eax // stack juggling
004AB675 8B44240C mov eax,[esp+$0c] // stack juggling
004AB679 89442404 mov [esp+$04],eax // stack juggling
And that’s just for the call (you have other induced overhead in the pre-amble and post-amble), and just for a trivial function returning a constant.
So the Delphi XE6 compiler demonstrates a clear advantage.
What about the non-inlined functions?
Well, nothing changed, and the procedure variant still has the edge: a function returning a float will still exhibit the stack round-trip in XE6, in the same way as in Delphi XE:
Unit1.pas.24: function GetFloat : Double;
Unit1.pas.25: begin
005D7338 83C4F8 add esp,-$08
Unit1.pas.26: Result := 0;
005D733B 33C0 xor eax,eax
005D733D 890424 mov [esp],eax
005D7340 89442404 mov [esp+$04],eax
Unit1.pas.27: end;
005D7344 DD0424 fld qword ptr [esp]
005D7347 59 pop ecx
005D7348 5A pop edx
005D7349 C3 ret
Unit1.pas.34: procedure GetFloatVar(var Result : Double);
Unit1.pas.35: begin
Unit1.pas.36: Result := 0;
005D734C 33D2 xor edx,edx
005D734E 8910 mov [eax],edx
005D7350 895004 mov [eax+$04],edx
Unit1.pas.37: end;
005D7353 C3 ret
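If you want to check how the four variants behave with your own compiler version, a rough timing harness could look like the sketch below. It is only an assumption of how one might measure this, not the setup used for the listings above: the loop count `cIterations` and the procedure name `RunBench` are made up for the example, and with bodies this trivial the loop overhead will account for a large part of whatever you measure.

```pascal
program FloatInlineBench;

{$APPTYPE CONSOLE}

uses
   System.SysUtils, System.Diagnostics;

const
   cIterations = 100000000; // arbitrary count, adjust to your machine

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

// Locals are used on purpose, so that "f" lives on the stack
// like in the disassembly listings of the article.
procedure RunBench;
var
   i : Integer;
   f, sum : Double;
   sw : TStopwatch;
begin
   sum := 0;

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      f := GetFloat;
      sum := sum + f;
   end;
   Writeln('GetFloat          : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      f := GetFloatInline;
      sum := sum + f;
   end;
   Writeln('GetFloatInline    : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      GetFloatVar(f);
      sum := sum + f;
   end;
   Writeln('GetFloatVar       : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      GetFloatVarInline(f);
      sum := sum + f;
   end;
   Writeln('GetFloatVarInline : ', sw.ElapsedMilliseconds, ' ms');

   // Print the accumulated value so the results are visibly used.
   Writeln('checksum: ', sum:0:1);
end;

begin
   RunBench;
end.
```

Timings will vary with the CPU and compiler settings, so the generated code in the CPU view remains the more reliable way to see whether the round-trip is present.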
It looks to me like there are some FWAIT instructions that could be avoided – also, shouldn’t Delphi use SSE instead of the FPU for many floating-point ops?
Of course SSE/SSE2 would be best, after all they were only recently introduced 13 years ago.
But then I wouldn’t have been able to use a picture of the venerable 8087 co-processor to illustrate this article, and would have had to use Pentium 4, which would have been a shame 🙂
Beware of the Pentium processor: I’ve heard that some of them have a bug in the FDIV instruction.
Luckily Delphi has a compiler option to work around it. You know, just in case your application will be running on a 20-year-old CPU.
Any time you deal with floating point numbers in any context, hand coding the math will always have better results than any generic solution the compiler can come up with. If performance is so tight you are hunting for clock cycles, depending on the compiler is never the right way to go.
That said, the stack shuffle does seem kinda dumb, a clear sign that inlining a function is not the same as macro replacement, which obviously is what you are hoping for.
The whole WAIT thing is a totally different barrel of fish – it hasn’t been clear to me for a long, long time how necessary it really is. Heck, even back in the 386 days when the FPU was a discrete part, you could frequently leave out most of the waits that the compiler tossed in.
Thanks Eric for exploring this.
I sometimes try this site to compare with what C++ compilers generate:
http://gcc.godbolt.org/
This source:
double Add(const double a, const double b) {
return a+b;
}
Generates simply:
addsd %xmm1, %xmm0
ret
I just ran into a similar mess when using inlining, not only with floating-point values, but also with plain integers and strings.
Some code of mine was in fact 70% SLOWER with Delphi XE6 than with good old Delphi 7!
That was with some small methods declared as inline (in TBSONWriter.BSONWrite + TFileBufferWriter.Write1).
Generated code was awful, with a lot of intermediate value exchanges on the stack, and sometimes some temporary variables just prepared on the stack for nothing, with a corresponding try…finally hidden block, which was never used at all!
With only some well-identified methods marked “inline”, the XE6 compiler generated code that was 10% FASTER than Delphi 7…