Previous: A look at a trivial case.
A look at a marginally more complex case
What happens when the function is slightly more complex?
function Add(const a, b : Double) : Double;
begin
   Result := a+b;
end;
Well, the non-inlined form still compiles rather inefficiently in both XE and XE6:
Unit1.pas.45: begin
005D7354 55       push ebp
005D7355 8BEC     mov ebp,esp
005D7357 83C4F0   add esp,-$10
Unit1.pas.46: Result := a+b;
005D735A DD4510   fld qword ptr [ebp+$10]
005D735D DC4508   fadd qword ptr [ebp+$08]
005D7360 DD5DF0   fstp qword ptr [ebp-$10]
005D7363 9B       wait
005D7364 DD45F0   fld qword ptr [ebp-$10]
005D7367 DD5DF8   fstp qword ptr [ebp-$08]
005D736A 9B       wait
Unit1.pas.47: end;
005D736B DD45F8   fld qword ptr [ebp-$08]
005D736E 8BE5     mov esp,ebp
005D7370 5D       pop ebp
005D7371 C21000   ret $0010
For reference, the optimal form would involve just three instructions:
fld a
fadd b
ret
What about inlining? Well things changed, but not all for the best…
Here is the inefficient inlining in Delphi XE:
Unit1.pas.53: f := Add(a, b);
004AB65B DD442408 fld qword ptr [esp+$08]
004AB65F DC442410 fadd qword ptr [esp+$10]
004AB663 DD5C2418 fstp qword ptr [esp+$18]
004AB667 9B       wait
004AB668 8B442418 mov eax,[esp+$18]
004AB66C 890424   mov [esp],eax
004AB66F 8B44241C mov eax,[esp+$1c]
004AB673 89442404 mov [esp+$04],eax
And here is the inefficient inlining in Delphi XE6:
Unit1.pas.53: f := Add(a, b);
005D7357 DD442408 fld qword ptr [esp+$08]
005D735B DC442410 fadd qword ptr [esp+$10]
005D735F DD5C2418 fstp qword ptr [esp+$18]
005D7363 9B       wait
005D7364 DD442418 fld qword ptr [esp+$18]
005D7368 DD1C24   fstp qword ptr [esp]
005D736B 9B       wait
So the stack juggling is still there, except that instead of being handled by integer instructions, it’s now handled by FPU instructions, along with a pointless wait instruction.
If your code is already bottlenecked by the FPU, this just won’t help…
Conclusion
The new function inlining in XE6 can provide some improvements over XE, but it can also result in less efficient code in a floating-point heavy context.
It also means that the need for using procedure-with-var-for-result instead of functions has – alas – not been eliminated by XE6, and there may be just as many cases in which performance goes up as cases in which performance will go down.
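For readers unfamiliar with the procedure-with-var-for-result pattern mentioned above, here is a minimal sketch. The name AddTo and the test values are hypothetical, chosen only for illustration. Whether this form actually produces tighter code than the function form depends on the compiler version and the call site, so check the disassembly rather than assume it.

```pascal
program AddVarDemo;
{$APPTYPE CONSOLE}

// Function form, as in the article (here marked inline).
function Add(const a, b : Double) : Double; inline;
begin
   Result := a+b;
end;

// Hypothetical procedure form: the result is written through a var
// parameter instead of being returned as a function result.
procedure AddTo(const a, b : Double; var r : Double); inline;
begin
   r := a+b;
end;

var
   f, g : Double;
begin
   f := Add(1.5, 2.25);     // function call, result assigned to f
   AddTo(1.5, 2.25, g);     // procedure call, result stored directly in g
   Writeln(f:0:2, ' ', g:0:2);
end.
```

Both calls compute the same value. The only difference is how the result reaches the destination variable, which is exactly where the extra stack shuffling shows up in the listings above.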
It looks to me like there are some FWAIT instructions that could be avoided – also, shouldn’t Delphi use SSE instead of the FPU for many floating-point ops?
Of course SSE/SSE2 would be best, after all they were only recently introduced 13 years ago.
But then I wouldn’t have been able to use a picture of the venerable 8087 co-processor to illustrate this article, and would have had to use Pentium 4, which would have been a shame 🙂
Beware of the Pentium processor: I’ve heard that some of them have a bug in the FDIV instruction.
Luckily Delphi has a compiler option to work around it. You know, just in case your application will be running on a 20-year-old CPU.
Any time you deal with floating point numbers in any context, hand coding the math will always have better results than any generic solution the compiler can come up with. If performance is so tight you are hunting for clock cycles, depending on the compiler is never the right way to go.
That said, the stack shuffle does seem kinda dumb, a clear sign that inlining a function is not the same as macro replacement, which obviously is what you are hoping for.
The whole WAIT thing is a totally different barrel of fish – I haven’t been clear how required it really is in a long, long time. Heck, even back in the 386 days when the FPU was a discrete part, you could frequently leave out most of the waits that the compiler tossed in.
Thanks Eric for exploring this.
I sometimes try this site to compare with what C++ compilers generate:
http://gcc.godbolt.org/
This source:
double Add(const double a, const double b) {
return a+b;
}
Generates simply:
addsd %xmm1, %xmm0
ret
I just found a similar mess when using inlining, not only with floating-point values, but also with plain integers and strings.
Some code of mine was in fact 70% SLOWER with Delphi XE6 than good old Delphi 7!
With some small methods declared as inline (in TBSONWriter.BSONWrite + TFileBufferWriter.Write1).
Generated code was awful, with a lot of intermediate value exchanges on the stack, and sometimes some temporary variables just prepared on the stack for nothing, with a corresponding try…finally hidden block, which was never used at all!
With only some well identified methods marked “inline”, the XE6 compiler generated some code 10% FASTER than Delphi 7…