A Look at Improved Inlining in Delphi XE6

Previous: A look at a trivial case.

A look at a marginally more complex case

What happens when the function is slightly more complex?

function Add(const a, b : Double) : Double;
begin
   Result := a+b;
end;

Well, the non-inlined form still compiles rather inefficiently in both XE and XE6:

Unit1.pas.45: begin
005D7354 55               push ebp
005D7355 8BEC             mov ebp,esp
005D7357 83C4F0           add esp,-$10
Unit1.pas.46: Result := a+b;
005D735A DD4510           fld qword ptr [ebp+$10]
005D735D DC4508           fadd qword ptr [ebp+$08]
005D7360 DD5DF0           fstp qword ptr [ebp-$10]
005D7363 9B               wait 
005D7364 DD45F0           fld qword ptr [ebp-$10]
005D7367 DD5DF8           fstp qword ptr [ebp-$08]
005D736A 9B               wait 
Unit1.pas.47: end;
005D736B DD45F8           fld qword ptr [ebp-$08]
005D736E 8BE5             mov esp,ebp
005D7370 5D               pop ebp
005D7371 C21000           ret $0010

For reference, the optimal form would involve just three instructions:

fld a
fadd b
ret

What about inlining? Well, things changed, but not all for the best…
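For anyone wanting to reproduce the listings, here is a minimal sketch of a test program. It assumes the function is simply marked with the inline directive and called from a small routine; the program name and the Test wrapper are hypothetical, only Add and the call f := Add(a, b) come from the article. The exact code generated will depend on the compiler version and options.

```pascal
program InlineAddDemo;

{$APPTYPE CONSOLE}

// Same body as the Add function shown earlier; only the inline directive is added.
function Add(const a, b : Double) : Double; inline;
begin
   Result := a+b;
end;

// Hypothetical wrapper standing in for the code around the call site.
procedure Test;
var
   a, b, f : Double;
begin
   a := 1.5;
   b := 2.25;
   f := Add(a, b);   // the call whose generated code is worth inspecting
   WriteLn(f:0:2);   // uses f so the assignment is not dead code
end;

begin
   Test;
end.
```

Placing a breakpoint on the call and opening the CPU view in the debugger should show whether the call was inlined and how the result gets copied into f.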

Here is the inefficient inlining in Delphi XE:

Unit1.pas.53: f := Add(a, b);
004AB65B DD442408         fld qword ptr [esp+$08]
004AB65F DC442410         fadd qword ptr [esp+$10]
004AB663 DD5C2418         fstp qword ptr [esp+$18]
004AB667 9B               wait 
004AB668 8B442418         mov eax,[esp+$18]
004AB66C 890424           mov [esp],eax
004AB66F 8B44241C         mov eax,[esp+$1c]
004AB673 89442404         mov [esp+$04],eax

And here is the inefficient inlining in Delphi XE6:

Unit1.pas.53: f := Add(a, b);
005D7357 DD442408         fld qword ptr [esp+$08]
005D735B DC442410         fadd qword ptr [esp+$10]
005D735F DD5C2418         fstp qword ptr [esp+$18]
005D7363 9B               wait 
005D7364 DD442418         fld qword ptr [esp+$18]
005D7368 DD1C24           fstp qword ptr [esp]
005D736B 9B               wait

So the stack juggling is still there, except that instead of being handled by integer instructions, it’s now handled by FPU instructions, along with an extra pointless wait instruction.

If your code is already bottlenecked by the FPU, this just won’t help…

Conclusion

The new function inlining in XE6 can provide some improvements over XE, but it can also result in less efficient code in a floating-point heavy context.

It also means that the need for using a procedure with a var parameter for the result instead of a function has, alas, not been eliminated by XE6, and there may be just as many cases in which performance goes up as cases in which it goes down.
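As an illustration of the procedure-with-var-for-result form, here is a minimal sketch. The name AddTo and the demo program are hypothetical and not taken from the measurements above. The idea is that the callee writes directly into the caller's variable, so there is no Result temporary to copy back; whether that actually yields tighter code should be checked in the CPU view for the compiler version at hand.

```pascal
program AddVarDemo;

{$APPTYPE CONSOLE}

// Hypothetical name: the sum is written straight into r instead of
// being returned through a function Result.
procedure AddTo(const a, b : Double; var r : Double); inline;
begin
   r := a+b;
end;

var
   a, b, f : Double;
begin
   a := 1.5;
   b := 2.25;
   AddTo(a, b, f);   // replaces f := Add(a, b)
   WriteLn(f:0:2);   // prints 3.75
end.
```

The cost is readability: the call can no longer be used inside an expression, which is the trade-off regretted above.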

6 thoughts on “A Look at Improved Inlining in Delphi XE6”

  1. It looks to me there are some FWAIT instructions that could be avoided – also shouldn’t Delphi use SSE instead of the FPU for many floating point ops?

  2. Of course SSE/SSE2 would be best, after all they were only recently introduced 13 years ago.
    But then I wouldn’t have been able to use a picture of the venerable 8087 co-processor to illustrate this article, and would have had to use Pentium 4, which would have been a shame 🙂

  3. Beware of the Pentium processor: I’ve heard that some of them have a bug in the FDIV instruction.
    Luckily Delphi has a compiler option to work around it. You know, just in case your application will be running on a 20 year old CPU.

  4. Any time you deal with floating point numbers in any context, hand coding the math will always have better results than any generic solution the compiler can come up with. If performance is so tight you are hunting for clock cycles, depending on the compiler is never the right way to go.

    That said, the stack shuffle does seem kinda dumb, a clear sign that inlining functions is not the same as macro replacement, which obviously is what you are hoping for.

    The whole WAIT thing is a totally different barrel of fish – I haven’t been clear how required it really is in a long, long time. Heck, even back in the 386 days when the FPU was a discrete part, you could frequently leave out most of the waits that the compiler tossed in.

  5. I just ran into a similar mess when using inlining, not only with floating-point values, but with plain integers and strings.

    Some code of mine was in fact 70% SLOWER with Delphi XE6 than good old Delphi 7!
    With some small methods declared as inline (in TBSONWriter.BSONWrite + TFileBufferWriter.Write1).

    Generated code was awful, with a lot of intermediate value exchanges on the stack, and sometimes some temporary variables just prepared on the stack for nothing, with a corresponding try…finally hidden block, which was never used at all!

    With only some well identified methods marked “inline”, the XE6 compiler generated some code 10% FASTER than Delphi 7…
