A Look at Improved Inlining in Delphi XE6

[Image: Venerable 8087 FPU co-processor]

As first noticed by dewresearch, Delphi XE6 introduced a new optimization for inlined functions that return a floating-point value.

Here is an exploration of what was improved… and what was not improved.

When inlining was introduced in Delphi, one limitation was that functions returning a floating-point value would incur an unnecessary round-trip to the stack. For short, simple math functions, this could not just negate the benefits of inlining, but sometimes make performance worse.

With XE6, that round-trip seems to have been optimized away in some cases.
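To picture the kind of short math function concerned, here is a minimal sketch. The Lerp helper and the program name are hypothetical and only serve as an illustration; they are not taken from the test cases in this article.

```pascal
program InlineLerpSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: a one-line math function returning a Double,
// the typical candidate for the "inline" directive.
function Lerp(const a, b, t : Double) : Double; inline;
begin
   Result := a + (b - a) * t;
end;

var
   v : Double;
begin
   // When inlined, the function body is expanded at the call site,
   // so how the result reaches "v" is up to the compiler.
   v := Lerp(10, 20, 0.25);
   Writeln(v:0:2); // prints 12.50
end.
```

With a body this small, any extra copy of the result through a temporary on the stack is a significant share of the total work, which is why the round-trip matters.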

A look at the most trivial case

Here are test cases for the usual conventions for functions returning a floating point value:

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

The procedure variants traditionally offered higher performance than the functions, by eliminating all round-trips to the stack.
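A rough way to check that claim on a given compiler version is to time the four variants in a loop. The following is only a sketch under assumptions: the program name, the iteration count and the "sum" accumulator are made up for illustration, and it relies on TStopwatch from the System.Diagnostics unit. Absolute timings will depend on the machine and the compiler settings.

```pascal
program InlineFloatBench;

{$APPTYPE CONSOLE}

uses
   System.SysUtils, System.Diagnostics; // unscoped SysUtils, Diagnostics before XE2

function GetFloat : Double;
begin
   Result := 0;
end;

function GetFloatInline : Double; inline;
begin
   Result := 0;
end;

procedure GetFloatVar(var Result : Double);
begin
   Result := 0;
end;

procedure GetFloatVarInline(var Result : Double); inline;
begin
   Result := 0;
end;

const
   cIterations = 100000000; // arbitrary count, an assumption of this sketch

procedure RunBench;
var
   i : Integer;
   f, sum : Double;
   sw : TStopwatch;
begin
   sum := 0; // accumulator so that "f" is actually read after each call

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      f := GetFloat;
      sum := sum + f;
   end;
   Writeln('GetFloat          : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      f := GetFloatInline;
      sum := sum + f;
   end;
   Writeln('GetFloatInline    : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      GetFloatVar(f);
      sum := sum + f;
   end;
   Writeln('GetFloatVar       : ', sw.ElapsedMilliseconds, ' ms');

   sw := TStopwatch.StartNew;
   for i := 1 to cIterations do begin
      GetFloatVarInline(f);
      sum := sum + f;
   end;
   Writeln('GetFloatVarInline : ', sw.ElapsedMilliseconds, ' ms');

   Writeln('checksum: ', sum:0:1);
end;

begin
   RunBench;
end.
```

Timings from such a loop are only indicative; looking at the generated code in the CPU view remains the more reliable way to see what the compiler actually did.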

Here is the Delphi XE6 compiler output for calls to those functions; the inlined variants are nice and tight:

Unit1.pas.48: f := GetFloat;
005D7357 E8DCFFFFFF       call GetFloat
005D735C DD1C24           fstp qword ptr [esp]
005D735F 9B               wait 
Unit1.pas.49: f := GetFloatInline;
005D7360 33C0             xor eax,eax
005D7362 890424           mov [esp],eax
005D7365 89442404         mov [esp+$04],eax
Unit1.pas.50: GetFloatVar(f);
005D7369 8BC4             mov eax,esp
005D736B E8DCFFFFFF       call GetFloatVar
Unit1.pas.51: GetFloatVarInline(f);
005D7370 33C0             xor eax,eax
005D7372 890424           mov [esp],eax
005D7375 89442404         mov [esp+$04],eax

By comparison, here is the Delphi XE compiler output for the GetFloatInline call. The output is unchanged for the other calls.

Unit1.pas.49: f := GetFloatInline;
004AB664 33C0             xor eax,eax
004AB666 89442408         mov [esp+$08],eax
004AB66A 8944240C         mov [esp+$0c],eax
004AB66E 8B442408         mov eax,[esp+$08]   // stack juggling
004AB672 890424           mov [esp],eax       // stack juggling 
004AB675 8B44240C         mov eax,[esp+$0c]   // stack juggling
004AB679 89442404         mov [esp+$04],eax   // stack juggling

And that’s just for the call (you have other induced overhead in the pre-amble and post-amble), and just for a trivial function returning a constant.

So the Delphi XE6 compiler demonstrates a clear advantage.

What about the non-inlined functions?

Well, nothing changed, and the procedure variant still has the edge: a function returning a float still exhibits the stack round-trip in XE6, in the same way as in Delphi XE:

Unit1.pas.24: function GetFloat : Double;
Unit1.pas.25: begin
005D7338 83C4F8           add esp,-$08
Unit1.pas.26: Result := 0;
005D733B 33C0             xor eax,eax
005D733D 890424           mov [esp],eax
005D7340 89442404         mov [esp+$04],eax
Unit1.pas.27: end;
005D7344 DD0424           fld qword ptr [esp]
005D7347 59               pop ecx
005D7348 5A               pop edx
005D7349 C3               ret 

Unit1.pas.34: procedure GetFloatVar(var Result : Double);
Unit1.pas.35: begin
Unit1.pas.36: Result := 0;
005D734C 33D2             xor edx,edx
005D734E 8910             mov [eax],edx
005D7350 895004           mov [eax+$04],edx
Unit1.pas.37: end;
005D7353 C3               ret 

Next: A marginally more complex case & Conclusion

6 thoughts on “A Look at Improved Inlining in Delphi XE6”

  1. It looks to me there are some FWAIT instructions that could be avoided – also shouldn’t Delphi use SSE instead of the FPU for many floating point ops?

  2. Of course SSE/SSE2 would be best, after all they were only recently introduced 13 years ago.
    But then I wouldn’t have been able to use a picture of the venerable 8087 co-processor to illustrate this article, and would have had to use Pentium 4, which would have been a shame 🙂

  3. Beware of the Pentium processor: I’ve heard that some of them have a bug in the FDIV instruction.
    Luckily Delphi has a compiler option to work around it. You know, just in case your application will be running on a 20 year old CPU.

  4. Any time you deal with floating point numbers in any context, hand coding the math will always have better results than any generic solution the compiler can come up with. If performance is so tight you are hunting for clock cycles, depending on the compiler is never the right way to go.

    That said, the stack shuffle does seem kinda dumb, a clear sign that inlining functions is not the same as macro replacement, which obviously is what you are hoping for.

    The whole WAIT thing is a totally different barrel of fish – I haven’t been clear how required it really is in a long, long time. Heck, even back in the 386 days when the FPU was a discrete part, you could frequently leave out most of the waits that the compiler tossed in.

  5. I just ran into a similar mess when using inlining, not only with floating-point values, but also with plain integers and strings.

    Some code of mine was in fact 70% SLOWER with Delphi XE6 than good old Delphi 7!
    With some small methods declared as inline (in TBSONWriter.BSONWrite + TFileBufferWriter.Write1).

    Generated code was awful, with a lot of intermediate value exchanges on the stack, and sometimes some temporary variables just prepared on the stack for nothing, with a corresponding try…finally hidden block, which was never used at all!

    With only some well identified methods marked “inline”, the XE6 compiler generated some code 10% FASTER than Delphi 7…
