Sometimes, the simplest-looking code can cause the Delphi compiler to stumble.
I bumped into such a case recently, and simplified it to a bare-bones version that still exhibits the issue:
type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;
Basically all you have are trivial functions that return the value of a floating-point field.
Given the above, for the TMyClass.Get method, the optimal codegen would look just like
fld qword ptr [eax+8]
ret
Simple enough, eh? Yet here is what the Delphi XE compiler generates:
Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.Get;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret
For those less fluent in asm, a direct pseudo-Pascal translation of the above would be:
var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:=@FRec.Field;
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;
And if TMyClass.Get is not virtual but a static method marked “inline”, you get the above with a third “temp3” Double (i.e. it will perform even worse).
These trips through temporaries aren’t innocuous: the temporaries live on the stack, and the CPU pipeline stalls while it waits for the round trips to the L1 cache. In practice, a single one of those stalls can take as much time as half a dozen floating-point operations.
To get rid of the temporaries, there are two options. You can manually inline everything (both RecGet and Get); of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.
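As a rough sketch of what manual inlining means here: Get reads the field directly instead of going through RecGet. The constructor and the ReadBack helper are hypothetical additions that only make the snippet runnable, and I have not checked the generated code on every compiler version, so treat it as an illustration rather than a guaranteed fix.

```pascal
program ManualInlineSketch;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
   TFloatRec = record
   private
      Field : Double;
   public
      function RecGet : Double; inline;
   end;

   TMyClass = class
   private
      FRec : TFloatRec;
   public
      constructor Create(const aValue : Double); // hypothetical, demo only
      function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
begin
   Result:=Field;
end;

constructor TMyClass.Create(const aValue : Double);
begin
   inherited Create;
   FRec.Field:=aValue;
end;

function TMyClass.Get : Double;
begin
   // Hand-inlined body of RecGet: read the field directly,
   // bypassing the accessor (and its encapsulation).
   Result:=FRec.Field;
end;

// Hypothetical helper: store a value, then read it back through Get.
function ReadBack(const aValue : Double) : Double;
var
   obj : TMyClass;
begin
   obj:=TMyClass.Create(aValue);
   try
      Result:=obj.Get;
   finally
      obj.Free;
   end;
end;

begin
   Writeln(ReadBack(1.5):0:2); // prints 1.50
end.
```

This only compiles because “private” members are visible to everything in the same unit; with “strict private” or a record declared in another unit, the direct field access would not be possible, which is exactly the encapsulation cost.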
Or you can use inline asm instead: a single asm instruction is enough, and even with calls between the functions, it will run circles around the Delphi compiler’s “inline” output.
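A minimal sketch of the asm route, assuming the 32-bit Delphi compiler and its default register calling convention (Self in EAX, Double results returned in ST(0)). The USE_X87_ASM define, the constructor and the ReadBack helper are hypothetical names introduced for this example; other targets fall back to the plain Pascal body.

```pascal
program AsmGetterSketch;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
// Hypothetical switch: only use the asm body on 32-bit Delphi.
{$IFDEF CPU386}{$IFNDEF FPC}{$DEFINE USE_X87_ASM}{$ENDIF}{$ENDIF}

type
   TFloatRec = record
   private
      Field : Double;
   public
      function RecGet : Double; // no "inline": asm routines are not inlined
   end;

   TMyClass = class
   private
      FRec : TFloatRec;
   public
      constructor Create(const aValue : Double); // hypothetical, demo only
      function Get : Double; virtual;
   end;

function TFloatRec.RecGet : Double;
{$IFDEF USE_X87_ASM}
asm
   // EAX holds the pointer to the record; load the field straight
   // onto the x87 stack, where the caller expects the result.
   fld qword ptr [eax].TFloatRec.Field
end;
{$ELSE}
begin
   Result:=Field; // portable fallback for other targets
end;
{$ENDIF}

constructor TMyClass.Create(const aValue : Double);
begin
   inherited Create;
   FRec.Field:=aValue;
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet; // a real call now, since RecGet cannot be inlined
end;

// Hypothetical helper: store a value, then read it back through Get.
function ReadBack(const aValue : Double) : Double;
var
   obj : TMyClass;
begin
   obj:=TMyClass.Create(aValue);
   try
      Result:=obj.Get;
   finally
      obj.Free;
   end;
end;

begin
   Writeln(ReadBack(1.5):0:2); // prints 1.50
end.
```

Encapsulation and the virtual call both survive in this variant; the price is a platform-specific body that has to be maintained by hand.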
Ouch! You ought to file a QC report for this…
If you get rid of the inline, you’ll get even worse code:
function TMyClass.Get : Double;
begin
add esp,-$08
Result:=FRec.RecGet;
add eax,$04
call TFloatRec.RecGet
fstp qword ptr [esp]
wait
fld qword ptr [esp]
end;
pop ecx
pop edx
ret
function TFloatRec.RecGet : Double;
begin
add esp,-$08
Result := Field; // here you could do a computation instead
mov edx,[eax]
mov [esp],edx
mov edx,[eax+$04]
mov [esp+$04],edx
fld qword ptr [esp]
end;
pop ecx
pop edx
ret
All those memory moves come from several problems in the way the Delphi compiler generates its FPU/x87 code:
– it uses plain memory moves to copy one floating-point value to another (as if a Double were an Int64);
– the x87 stack is kept separate from other data, with the x86 stack used as temporary storage for conversions;
– when a function is defined to return a floating-point value, the stack is used as temporary storage for the function body, then the value is loaded from the x86 stack into the x87 stack;
– the x87 code generator doesn’t share the optimization features of the x86 code generator;
– Delphi still generates “wait” no-op instructions (BC++ has not generated them for decades)…
That’s why most Delphi coders have relied on BASM for years when it comes to floating-point computation speed… or use an external optimized library written in C for fast calculation… remember how AggPas is very fast (faster than GDI+) but 4 times slower than the original Agg code in C…
I don’t know if the upcoming 64-bit compiler will be better at floating-point type handling. AFAIR we were told something about using SSE for handling double types. Could be a good idea.
That looks quite shocking. Do you know why the Delphi compiler would emit such poor code?