Archive

Posts Tagged ‘Optimization’

Delphi array constructors performance (or lack of)

February 18th, 2013

In Delphi you can initialize a dynamic array in two ways, either manually or via the Create magic constructor:

type
   TIntegerArray = array of Integer;

procedure Test;
var 
   a : TIntegerArray;
begin
   // magic constructor
   a := TIntegerArray.Create(1, 2);

   // manual creation
   SetLength(a, 2);
   a[0] := 1;
   a[1] := 2;
end;

The outcome in both cases is the same, are all things equal?

Some array initializations are more equal than others

The first method is less verbose in code, but quite a bit less efficient, if you check the CPU view, that becomes obvious

TestUnit.pas.32: a := TIntegerArray.Create(1, 2);
00511335 8D45F8           lea eax,[ebp-$08]
00511338 8B15F0125100     mov edx,[$005112f0]
0051133E E89576EFFF       call @DynArrayClear   // anybody knows why?
00511343 6A02             push $02
00511345 8D45F8           lea eax,[ebp-$08]
00511348 B901000000       mov ecx,$00000001
0051134D 8B15F0125100     mov edx,[$005112f0]
00511353 E87476EFFF       call @DynArraySetLength
00511358 83C404           add esp,$04
0051135B 8B45F8           mov eax,[ebp-$08]
0051135E C70001000000     mov [eax],$00000001
00511364 8B45F8           mov eax,[ebp-$08]
00511367 C7400402000000   mov [eax+$04],$00000002
0051136E 8B55F8           mov edx,[ebp-$08]
00511371 8D45FC           lea eax,[ebp-$04]
00511374 8B0DF0125100     mov ecx,[$005112f0]
0051137A E89576EFFF       call @DynArrayAsg

// Manual initialization
TestUnit.pas.35: SetLength(a, 2);
0051137F 6A02             push $02
00511381 8D45FC           lea eax,[ebp-$04]
00511384 B901000000       mov ecx,$00000001
00511389 8B15F0125100     mov edx,[$005112f0]
0051138F E83876EFFF       call @DynArraySetLength
00511394 83C404           add esp,$04
TestUnit.pas.36: a[0] := 1;
00511397 8B45FC           mov eax,[ebp-$04]
0051139A C70001000000     mov [eax],$00000001
TestUnit.pas.37: a[1] := 2;
005113A0 8B45FC           mov eax,[ebp-$04]
005113A3 C7400402000000   mov [eax+$04],$00000002

Now before you complain on the compiler capability, you’ve got to realize that the two ways of initializing a dynamic arrays are not equivalent:

  • the magic constructor creates an array, then assigns it, so the array variable is always in a well-defined state
  • the manual initialization mutates the array in several steps, so the array during the intermediate state is in an unfinished step

Of course, in the limited Test procedure, the compiler could figure out the array isn’t visible from the outside, and thus use the shorter form, but that’s an optimization that would apply only to a local variables.

A more generic optimization would be to have the compiler waive the temporary array when the array that is initialized isn’t referenced anywhere else (so intermediate states don’t matter), that’s possible given that dynamic arrays are reference-counted.

Overhead in detail

The final outcome is that using the Create magic constructor can incur quite a bit of overhead:

  • a DynArrayClear call (not sure why it’s there), that will release the previously assigned block of memory for the temporary array
  • a DynArraySetLength, that will allocate a new block of memory and zero it
  • a DynArrayAssign, that will trigger the release of the memory for the existing array (if it wasn’t empty), along with a bus lock for the reference count overhead
  • extra initialization and finalization for the temporary array

In a multi-threaded applications, all that extra memory management and bus locking is going to have a disproportionate impact on performance. If you test the above snippets in a multi-threaded environment, you’ll notice that when using the array constructor, execution quickly becomes single threaded, bottle-necking on the memory manager and bus locks.

The manual initialization only has a single DynArraySetLength call, and if the array is not empty, this may not result in a new block being allocated (as the existing memory block could just be resized in place). So if you initialize the same array variable more than once, the manual form can be quite cheap.

A better array initializer?

Now that I showed you the magic array Create constructor is no good, what if you still want something convenient? Well open arrays can come to the rescue:

procedure InitializeArray(var a : TIntegerArray; const values : array of Integer);
begin
   SetLength(a, Length(values));
   if Length(values)>0 then
      Move(values[0], a[0], Length(values)*SizeOf(Integer));
end;
...
InitializeArray(a, [1, 2]);

The above function won’t be as efficient as manual initialization: there is an extra function call and the values will be copied twice. However it eliminates all the extra memory management and bus locking, so will scale quite better in multi-thread, while being compact syntax and code-wise.

Note that for a managed type (String, Interface…) then System.Move can’t be used, you’ll need to use either asm hackery or a for-to-do loop with item-by-item assignment, which will incur a performance hit, and often make it non-competitive with the manual version.

Need even more speed?

In the grand scheme of things however, all the above approaches suffer from the SetLength call, which is quite complex (have a look at DynArraySetLength in the System.pas unit… and weep), so if you know there is a chance the dynamic array wasn’t resized,  in the manual version, you can gain by doing

if Length(a)<>Length(value) then
   SetLength(a, Length(Values));

Which can when the SetLength is waived, net you more than a mind boggling 10x speedup (ten times).
Ouch! Why doesn’t the RTL do that?

Well, it doesn’t do that because it can’t, as Delphi’s dynamic arrays are not some kind of hybrid half-way between a value type and a reference type, and SetLength is the key stone where all the hybridization happens (for more on the subject, see Dynamic Arrays as Reference or Value Type).

And FWIW, in DWScript, arrays are first-class reference types, which means they can have more capability, and their initialization syntax is also more compact, the above initialization is just:

a := [1, 2];

And if you’re using Smart Pascal and running it in Chrome V8 or node.js, well, let’s just say you’ll need to use all the above tricks for Delphi to come ahead performance-wise.

Tips , , , , ,

Introducing TRefCountedObject

July 9th, 2012

DWScript source code recently introduced a newcomer: TRefCountedObject.

This base class takes the place of TObject in dwsUtils, and is now present throughout the DWScript code. What it adds is, well, a manually reference-counted class.

If you’re not interested in the DWScript internals, you just need to know it allowed to achieve a 5% to 25% reduction in compiled scripts memory usage (depending on what a script does), along with a minor speedup in script execution speed, at the cost of a minor compile slowdown (which should disappear once I improve the current hash table approach). It also allowed to catch a handful of memory management issues that had gone undetected even by FastMM before (as they had no side-effect).

Using TRefCountedObject

The class itself is a simple affair:

TRefCountedObject = class
   private
      function GetRefCount : Integer; inline;
      procedure SetRefCount(n : Integer); inline;
   public
      function IncRefCount : Integer; inline;
      function DecRefCount : Integer;
      property RefCount : Integer read GetRefCount write SetRefCount;
      procedure Free;
end;

But you’ll notice it introduces its own Free method, which is implemented as

procedure TRefCountedObject.Free;
begin
   if Self<>nil then
      DecRefCount;
end;

So it doesn’t directly release the object but decreases the reference count.

In practice, when you introduce a new reference to an existing instance, you invoke IncRefCount, and when you remove such a reference, you invoke DecRefCount, or Free. That’s about it. It’s manual, so it’s not fool-proof, but it is simple, and the overhead is much lower than when using interfaces, and unlike when using interfaces, you still benefit from faster calling conventions (for non-virtual methods) and direct read/write for the simple properties.

Note that this is a generalization of a mechanism that was previously present only for constant unifications (TUnifiedConstExpr), and it’s purposes was to allow the same instance to be referenced in multiple places. TUnifiedConstExpr used a very low level hack to enforce that and wasn’t as flexible, TRefCountedObject is much higher level hack, but it’s still a hack, as to behave correctly, it means a TRefCountedObject should never be freed as a TObject (or TObject.Free will be invoked).

Recovery of the hidden TMonitor field

TRefCountedObject also has another trick up its sleeves, it uses the TMonitor hidden field to store the reference counter. From D2009 up to XE2, TMonitor is still buggy and isn’t really suitable for real-world use, so not being able to use TMonitor on a TRefCOuntedObject is no big loss, and it allows to recover some bytes that were previously wasted in all class instances.

TInterfacedSelfObject was also re-based on TRefCountedObject, and benefits from the same “recovery” of the TMonitor field bytes.

With the introduction of TRefCountedObject, a whole subset TVarExpr is also now getting unified as the constants were, which in practice allows to achieve the above-mentioned reduction in compiled-scripts memory usage.

News ,

Casting an Interface to a Class, the efficient way

July 4th, 2012

Delphi 2010 added support for the “as” to cast an interface reference to its implementation class.

Cast interface as class

type
   IFoo = interface ... end;
   TFoo = class (TInterfacedObject, IFoo) ... end;
...
var intf : IFoo;
var foo : TFoo;
...
intf := TFoo.Create;
...
foo := intf as TFoo; // get back the implementation class

However, if “as” can be convenient in certain scenarios, it’s alas not implemented very efficiently: the compiler and RTL go through several hoops to perform it (cf. this article by Arnaud Bouchez). One of those hoops f.i. gets slower the more interfaces are implemented by the underlying class.

For instance in this benchmark, the “as” loop takes 5.9 ms when operating on a class implementing 2 interfaces, and 7.1 ms (20% more) when operating on a class implementing 8 interfaces (benchmark code adapted from this one in the comments by Chris Rolliston)
Not visible in the benchmark is also the poor cache efficiency of that scanning, should you be dealing with an interface that is implemented by many different classes.

Potion of Speed-Casting

A faster way to go at it (about 4 to 6 times faster, even when not under stress), which is incidentally compatible with older Delphi versions, is to use something like

type
   IGetSelf = interface
      function GetSelf : TObject;
   end;
   IFoo = interface (IGetSelf)
      ...
   end;
...
procedure TFoo.GetSelf : TObject;
begin
   Result := Self;   
end;
...
foo := intf.GetSelf as TFoo;

and parent your Delphi interfaces to some base interface that provides a GetSelf or similar method, and implement it in a root class (in DWScript f.i., it is introduced by a TInterfacedSelfObject).

With the above code, a similar loop completes in 1.23 ms, constant-time, and doesn’t increase when classes implement many interfaces or if you have many classes implementing the same interface. So unlike “as“, it won’t fail you the more you stress it (I first bumped on the issue in a pathological case for “as” when multi-threading, where it ended up on top of profiling results for no good reason).

The limitation is that intf.GetSelf will fail if intf is nil (while “as” would just return a nil), though IME, when you’re casting back to the implementation, you’re likely to have filtered against nil far earlier in the code.

Another option would be Arnaud Bouchez’s ObjectFromInterface, which is constant-time and faster than “as”, but slightly slower than using an IGetSelf (about 7%), and you would be dealing with internal structures magic.

Beyond performance

A last benefit of the IGetSelf approach, beyond any performance considerations, is that it makes the cast part of the design.

Casting interfaces to classes is relevant only for Delphi-implemented classes and Delphi-oriented interfaces, going through an IGetSelf focuses the purposes and scope of those interfaces that are susceptible to be cast back to to classes, while “as” is more of a death-match trip-mine weapon, since you can invoke it on any interface.

Let’s not forget that casting an interface back to a class isn’t exactly a benign implementation choice: interfaces are often intended to isolate the implementation from the interfaces, if that isolation can be broken, that has to be a conscious design choice imho, more than an implementor’s shortcut.

As a bonus, using GetSelf allows to easily find where those cast are made in the code: mark the GetSelf method as deprecated in the IGetSelf, and the compiler will give you a complete lists of places where it’s used. So IGetSelf usage is easily diagnosable, and thus easily refactor-able. Try doing that with “as“…

Tips ,

OptimalCode – “Delphi Optimization Guidelines”

April 26th, 2012

If you recognize the title of this article by Robert Lee, then chances are you’ve been around Delphi for a while! :-)

Alas the optimalcode.com website and Robert Lee disappeared years ago without a trace, but the “Delphi Optimization Guidelines” (dating back from 2002-3003) has been safeguarded and preserved. Recently someone pointed to me that the mirror I had in my Links section had disappeared too…

I guess it’s my turn to host the sacred relics, so I’ve updated the link and placed not one, but three copies on this site:

The guide is, at the time of this writing, nearly 10 years old, so read it with that in mind!
That said many tips still apply to the 32bit Delphi XE2 compiler, and quite a few tips are valid regardless of compiler and programming language.

Tips , , ,

Getting Rid of the Middleman

March 28th, 2012

On this StackOverflow question David Heffernan asked about a hack I’m using in DWScript’s UnifyAssignString.

TStringListCracker

The code of the function looks like:

procedure UnifyAssignString(const fromStr : UnicodeString; var toStr : UnicodeString);
var
   i : Integer;
   sl : TUnifierStringList;
begin
   if fromStr = '' then
      toStr := ''
   else begin
      i := Ord(fromStr[1]) and High(vCharStrings);
      sl := vCharStrings[i];
      sl.FLock.Enter;
      i := sl.AddObject(fromStr, nil);
      toStr := TStringListCracker(sl).FList[i].FString; // HACK HERE
      sl.FLock.Leave;
   end;
end;

The non-hacky variant would be to have that line just be

toStr := sl[i];

but if you use that simpler form, you could see a performance drop by up to 45% (Delphi XE and XE2 32bit) or 38% (Delphi XE2 64bit).

Ouch! Why is that?

The hacky version

The hacky version compiles to a single UStrAsg, the internal function for string assignment which takes care of the reference-counting. So it’s just:

dwsUtils.pas.487: toStr:=TStringListCracker(sl).FList[i].FString;
00511594 8B0424           mov eax,[esp]
00511597 8B532C           mov edx,[ebx+$2c]
0051159A 8B14F2           mov edx,[edx+esi*8]
0051159D E8A659EFFF       call @UStrAsg

i having been returned by AddObject, no range-check is required.

(note that TStrings.AddObject is called directly because TStrings.Add is just an indirection to AddObject)

What about the “toStr := sl[i]” version?

The property is just syntax sugar on a getter method, so what is actually compiled is  “toStr := sl.Get(i)“, which itself, being a virtual call, looks like

toStr := sl.VMT[ TStrings_Get ]( sl, i );

However the virtual call to TStringList.Get on its own costs only a tiny bitsy of extra CPU cycles, though it also prevents inlining (at least, until the compiler gets de-virtualisation optimizations).

But there are other factors at work, the virtual call on its own can’t come close to explaining the performance differential, as after all, the rest of the code UnifyAsstringString is far from trivial (with a critical section and a binary search…)

First, it’s a function returning a string, which is a managed type, so it’s actually compiled to something like:

try
   temp := sl.getValue( i );
   UStrAsg( toStr, temp );
finally
   UStrClr( temp );
end;

Second, TStringList.GetValue itself is compiled to

TStringList.Get:
00448204 53               push ebx
00448205 56               push esi
00448206 57               push edi
00448207 8BF9             mov edi,ecx
00448209 8BF2             mov esi,edx
0044820B 8BD8             mov ebx,eax
0044820D 3B7330           cmp esi,[ebx+$30]
00448210 720F             jb $00448221
00448212 8B1544EA5100     mov edx,[$0051ea44]
00448218 8BCE             mov ecx,esi
0044821A 8BC3             mov eax,ebx
0044821C E8C3E3FFFF       call TStrings.Error
00448221 8BC7             mov eax,edi
00448223 8B532C           mov edx,[ebx+$2c]
00448226 8B14F2           mov edx,[edx+esi*8]
00448229 E81AEDFBFF       call @UStrAsg
0044822E 5F               pop edi
0044822F 5E               pop esi
00448230 5B               pop ebx
00448231 C3               ret

That’s quite a lot of code, a fair share of it is spent juggling registers because TStringList.Get is implemented as

function TStringList.Get(Index: Integer): string;
begin
   if Cardinal(Index) >= Cardinal(FCount) then
      Error(@SListIndexError, Index);
   Result := FList^[Index].FString;
end;

If the range check fails, Error triggers an exception and the last line won’t ever be reached, but the compiler can’t know that, so it compiles everything assuming the last line can be reached even after an Error, thus has to save the parameters, hence the stack juggling.

(the fix to the juggling? use an “else”)

Apart from that issue, there remains in TStringList.Get a range check and an extra UStrAsg call.

And UStrAsg itself includes an atomic incrementation instruction (bus locking)

lock inc dword ptr [edx-$08]

Even though modern CPUs handle them far more efficiently than their ancestors, atomic instructions still cost quite a bit more than their non-atomic counterparts, and the more multi-threaded you are, the more expensive they get.

Summary

So when all is said and done, the hacky version has:

  • 1 atomic instruction (1 UStrAsg)
  • 1 call

While the non-hacky version has

  • 3 atomic instruction (2 UStrAsg, 1 UStrClr)
  • 4 calls (2 UStrAsg, 1 UStrClr, 1 TStringList.Get)
  • 1 exception frame (free in 64bit, not free in 32bit)
  • higher register pressure in UnifyAssignString
  • 1 range check
  • a lot of register juggling (especially in 32bit)

So while “cracking” TStringList isn’t safe, depending on the circumstances, you can use it to achieve a boost for very little extra complexity.

Tips , , , , ,

SynEdit performance bragging rights?

December 29th, 2011

SynEdit LogoCommitted another round of speedups in the SynEdit SVN, and AFAICT SynEdit is now amongst the fastest text and highlighting editors out there!

To benchmark it yourself, if you don’t have a large text file hanging around, you can make a “meaningful” one easily: just go to Delphi’s source\rtl\win directory and enter the following command-line:

type *.pas > c:\winall.pas

That should give you f.i. with  Delphi XE an 11 MB file, with 280k lines. Using the following file, here are my rough findings, for an executable compiled with Delphi XE, by order of decreasing performance:

Editor Time to load Time to reach last line Total
SynEdit << 1 sec 0 sec << 1 sec
Delphi XE < 1 sec << 1 sec < 1 sec
EmEditor (11.0.2) 0 sec 1* sec 1* sec
Scintilla SciTe 0 sec 2 sec 2 sec
Eclipse (Indigo) 3 sec 0 sec 3 sec
Notepad (Win7) 3 sec 0 sec 3 sec
Notepad++ (5.9.6.2) 0 sec 16 sec 16 sec

*: EmEditor  does its line indexing work in a background task, so the 1 sec delay to reach the last line is only visible if you try to reach it just after having opened the file.

Once you’ve reached the last line at least once, entering text at the top or the bottom of the file, selecting blocks, copy-pasting etc. is instantaneous enough in all above editors. Memory usage is roughly comparable, except for Eclipse, which is the worst by far.

Notepad++ is using Scintilla, but performs much worse than SciTe (Scintilla minimalistic text editor/demo), so there must be a glitch somewhere in Notepad++ (which I personally didn’t expect, must have been a regression of some sort in recent versions).

SynEdit also seems to scale better than  all the others above on even larger files (as tested on 100MB+ files).

For bragging rights, what other fast editors do you know of? ;-)

 

 

News ,

Fixing TCriticalSection

November 30th, 2011

TCriticalSection (along with TMonitor*) suffers from a severe design flaw in which entering/leaving different TCriticalSection instances can end up serializing your threads, and the whole can even end up performing worse than if your threads had been serialized.

This is because it’s a small, dynamically allocated object, so several TCriticalSection instances can end up in the same CPU cache line, and when that happens, you’ll have cache conflicts aplenty between the cores running the threads.

How severe can that be? Well, it depends on how many cores you have, but the more cores you have, the more severe it can get. On a quad core, a bad case of contention can easily result in a 200% slowdown on top of the serialization. And it won’t always be reproducible, since it’s related to dynamic memory allocation.

There is thankfully a simple fix for that, use TFixedCriticalSection:

type
   TFixedCriticalSection = class(TCriticalSection)
      private
         FDummy : array [0..95] of Byte;
   end;

That’s it folks. This makes sure the instance size larger than 96 bytes, which means that it’ll be larger than the cache line in all current CPUs, so no serialization anymore across distinct critical section instances.

As a bonus, it also ends up using one of the larger, more aligned, FastMM bucket, which seems to improve critical section code performance by about 7%. The downside is you use more RAM… but how many critical sections do you really have?

* (11-12-01): as noted by Allen Bauer in the comments, the issue is fixed for TMonitor in XE2.

Tips ,

Delphi XE2 VCL Styles and 3D

October 24th, 2011

…or when the old/new VCL mule shows it can still kick!

I was asked how hard it would be to do yet-another-Cover Flow-clone with VCL+GLScene, and how that would stand vs using FireMonkey on Windows.

GLFlow : VCL + OpenGL

The new Delphi XE2 Styles allow to get a nice looking UI, allowing to mix 3D graphics more smoothly without grayish controls getting in the way:

GLFlow - Powered by Delphi XE2 & GLScene

You can get the source and a pre-compiled executable from googlecode for a more interactive experience or a peek at the source code.

Note that in the video below, image load times have not been edited away: it’s actually near instantaneous (unsurprisingly so, the FireFlow sample pics are small ones)

The pictures above are those from FireMonkey’s “FireFlow”‘ sample, which you’ll already have if you have Delphi XE2 and have updated your samples directory.
You can select any folder with JPG pictures (and compare vs FireFlow), for instance on pictures from a digital camera (like these).

Now for some highlights:

  • High frame-rate
    • even if you shake the slider or click on pictures to the far right/left
    • no need for a gaming 3D card, even Netbook GPUs are enough
  • Fast loading of images in a background thread
    • almost instantaneous for the FireFlow samples
    • you can still interact while the loading takes place
    • works around the old TJPEGImage locking bug (QC43018, QC55871…)
  • Can handle large JPEG images
    • takes advantage of built-in TJPEGImage facilities for quicker loading  & downsizing.
  • Can animate with dozens of pictures without slowdown
    • benefits from built-in frustum culling (doesn’t draw what isn’t visible)
  • No pixel shimmering thanks to anisotropic filtering
    • GPU-accelerated mipmap generation

The amount of code used in GLFlow is actually lower than in the FireFlow sample, despite the extra features. Beyond the loading phase, everything is handled by the GPU, and this performs well even on low-end integrated 3D chipsets (gaming hardware not required, even mere Atom Netbooks can handle it just fine). FireFlow on the other hand seems to require a fast GPU and a fast CPU (it can maximizes one core when animating, maybe an issue in the animation classes?).

Interestingly enough, apart from VCL styles, there is nothing truly “new”: it could have been done almost the same back in Y2K with GLScene and Delphi 5 (if less pretty).

For the cross-platform fans, compiling it for the Mac (with LCL instead of VCL)  is left as an exercise ;-)

I’ve placed the source on google code, as CoverFlow is well, trendy and pretty, and can come in handy. There are also quite a few low-hanging improvements I may improve upon later on (having multiple background loaders, further texture optimizations, cross-platform through LCL, adding text below the pics, etc.).

News, Tips , ,

Memory Manager Investigations

October 13th, 2011

André Mussche on Google+ investigated the performance of several Memory Managers for Delphi, in single-threaded & multi-threaded situations, with detailed results and charts on performance and memory usage. Great work and interesting findings!

His conclusions (which I share)

For single threaded or low memory profile applications, the default Delphi memory manager (FastMM) is the fastest you can get. If you don’t realloc a lot (strings?), TCmalloc [from Google perftools] is fast too.

For multi threaded apps, it’s not easy to decided what to use. ScaleMM2 is the fastest but not stable. TCmalloc is a good one, but uses a lot of memory. MSVCRT [Microsoft allocator in msvcrt.dll] looks scalable in simple multi-threaded tests, but in extended test like FastCodeMMChallenge it is disappointing: slower and uses a lot of memory!
JeMalloc (used by the latest FireFox) is disappointing in multi-threaded areas, but uses the same low memory as FastMM: maybe FF can be made faster by using FastMM? :-)

Additionally, Hoard was tested, though it performed “off the charts” (in a literal and bad way).

You can check André’s charts for yourself:

All in all, for single-threaded applications, or when you have few threads or limited thread-based memory management, FastMM is still king of the Hill, and not just of the Delphi Hill, both in terms of performance, memory usage and robustness.
Pierre le Riche can be proud of his baby ;-)

As for multi-threaded applications, ScaleMM, once stabilized, could well become the next undisputed King of the Hill, and not just of the Delphi Hill again.

I don’t know if Embarcadero are aware of the technical lead this offers to Delphi, this is something worth some marketing buzz and MM authors support surely?

 

 

 

News, Tips , ,

Delphi XE2-64bit: bottleneck in trigonometric functions?

September 22nd, 2011

Taylor Series and Angle Reduction

In Delphi XE2 64bit, SSE2 is used to compute the trigonometric functions (cos, sin, etc.), and they are computed through what looks like Taylor series (with double-precision literals being coded in hexadecimal, likely to minimize compiler precision issues).

However Taylor series only work for small values, so when you have a large angle value, it has to be reduced in the 0 .. 2PI range, which typically involves a form of floating-point Euclidian division or exponent reduction. For typical SSE2 implementations, this means that computing a trigonometric function for a high angle value is slower, as this reduction has to be performed, typically, you’re looking at something like a 25% slowdown tops.

Bottleneck

That said, here comes iga2iga2 (in the comments) and Ville Krumlinde, which both noticed to a performance issue in Delphi XE2 64bit, especially when facing other compilers. In XE2 64 bit, the reduction is performed through a loop and a fixed-step reduction, which means that the greater the angle value, the slower it gets.

Here are some timings on a sin/cos benchmark (hundreds of thousandths of calls):

Angle value XE2-32 XE2-64
1.0 112 ms 86 ms
100 113 ms 125 ms
1e7 114 ms 3700 ms
1e14 128 ms 7600 ms

Choices, Choices, Choices

But… timing isn’t everything, when computing trigonometry for very large angles, you quickly run into numerical precision issues, and then, you basically have three options:

  • just give up, that’s actually what the FPU does in 32bits, f.i. look at the value of Sin(1e22) in Delphi 32bit, it’s… 1e22. Which is obviously not a valid sine value! And you’ve been living with that potential issue for all your 32bit life…
  • spit out something, anything, under the assumption that if the user went for such an angle, it was garbage, so garbage in, garbage out, no one will notice it, you didn’t see me do it… you can’t prove anything anyway!
  • try to be accurate, damn the timings, damn garbage in, damn the torpedoes, full precision ahead! That’s what XE2-64 is doing. I haven’t checked in details, but XE2 approach seem to be based on this approach: “argument reduction, for huge arguments: good to the last bit“, and it gets Sin(1e22) right.

Just try for Sin(1e22) in your favorite environment, the correct value is -0.8522, Delphi XE2 64bit Gets It Right, where other environments may just flash a bunch of random decimals to fool your eyes.

Update: as pointed by Daniel Bartlett in the comments, the AMD LibM library provides a much faster and similarly accurate implementation of sin/cos and other functions.

So, what gives?

If you’re after raw accuracy, you’ll have to pay for the extra execution cycles to avoid the garbage out. However, chances are, your code doesn’t have anywhere near the numerical accuracy to avoid garbage in, so no matter the precision in the reduction, you’ll still just get garbage out. And if your code was running in 32bit, chances are you had some huge garbage out already, due to the FPU giving up.

If you’re not after accuracy, f.i. if you’re just using sine/cosine for time-based animations, the extra computing precision may bite you, for no benefit, so you’re better off performing the reduction yourself, before calling sin/cos, using whatever low-precision implementation you wish.

In the long run, it might be preferable for Delphi to just adopt the GIGO approach, and keep the high precision implementations for a high precision maths library: in most situations, they won’t avoid GO because of GI, so it might be best to blend with the rest (in benchmarks).

Tips , ,