Archive

Posts Tagged ‘Optimization’

Optimistic Unicode case-insensitive CompareText

May 4th, 2011

In Unicode Delphi, post-Delphi 2009, there are two ways of making case-insensitive string comparisons, CompareText, which only does case-insensitivity in the ASCII range (non-accentuated characters), and the judiciously misnommed AnsiCompareText, which works on the whole Unicode range by calling into the Windows API.

Alas, AnsiCompareText is slow, very slow, as illustrated by TStringList in this post f.i., not just because Microsoft didn’t do a good job at implementing it, but also because the Unicode Consortium did its very best to scatter case-sensitive characters across the whole set… In addition you have special locale-based collation rules to deal with, which have to be applied independently from Unicode.

However, if you’re dealing with primarily latin-based strings, and looking at case-sensitive matching (rather than ordering), it’s possible to cut down significantly on the complexity of AnsiCompareText by replacing it with an “Optimistic UnicodeCompareText“, which will use an ASCII-based strategy like CompareText does, and fallback to the Windows API function when, and only when a non-ASCII character is encountered.

Of course, it won’t help with Cyrillic or Greek, but t will work well for most western-languages, where accentuated and special characters are infrequent (such as german or french).

You can find such a UnicodeCompareText function in DWScript‘s dwsUtils.pas unit, it’s being used to add support for Unicode identifiers in DWS, without incurring a performance penalty for all the pure ASCII identifiers.

Tips ,

The limitations of Delphi’s “inline”

February 8th, 2011

Sometimes, the most simple-looking code can cause the Delphi compiler to stumble.

I bumped on such a case recently, and simplified it to a bare-bones version that still exhibits the issue:

type
   TFloatRec = record
      private
         Field : Double;
      public
         function RecGet : Double; inline;
   end;

   TMyClass = class
      private
         FRec : TFloatRec;
      public
         function Get : Double; virtual;
   end;

function TFloatRec.Get : Double;
begin
   Result:=Field; // here you could do a computation instead
end;

function TMyClass.Get : Double;
begin
   Result:=FRec.RecGet;
end;

Basically all you have are trivial functions that return the value of a floating-point field.

Given the above, for the TMyClass.Get method, the optimal codegen would look just like

fld qword ptr [eax+8]
ret

Simple enough, eh? Yet here is what the Delphi XE compiler generates:

Unit1.pas.326: begin
0053A794 83C4F0           add esp,-$10
Unit1.pas.327: Result:=FRec.Get;
0053A797 83C008           add eax,$08
0053A79A 8B10             mov edx,[eax]
0053A79C 89542408         mov [esp+$08],edx
0053A7A0 8B5004           mov edx,[eax+$04]
0053A7A3 8954240C         mov [esp+$0c],edx
0053A7A7 8B442408         mov eax,[esp+$08]
0053A7AB 890424           mov [esp],eax
0053A7AE 8B44240C         mov eax,[esp+$0c]
0053A7B2 89442404         mov [esp+$04],eax
Unit1.pas.328: end;
0053A7B6 DD0424           fld qword ptr [esp]
0053A7B9 83C410           add esp,$10
0053A7BC C3               ret

for the less-asm fluent, a direct pseudo-pascal translation of the above would be

var
   p : PDouble;
   temp1, temp2 : Double;
begin
   p:=@FRec.Field;
   temp1:=p^;
   temp2:=temp1;
   Result:=temp2;
end;

And if TMyClass.Get is not virtual, but a static method with “inline”, you get the above with a third temp3” Double (ie. it will perform even worse).

The above trips to temporaries aren’t innocuous, because those temporaries are in the stack, and result in stalls as the CPU pipeline waits for the roundtrips to L1 memory cache to happen. In practice, a single of those stalls can take as much time as half a dozen floating operations.

To get rid of the temporaries, there are two options: you can manually inline everything (the RecGet & the Get) to get rid of the temporaries, of course, that doesn’t sit too well with encapsulation, or with virtual calls for that matter.

Or you can use inline-asm instead, a single instruction of asm being enough, and even with calls betweens the functions, it will be running circles around the Delphi compiler’s “inline” output.

Tips , ,

FieldByName, or why a Profiler is your friend

November 30th, 2010

I recently bumped on a post by François on FieldByName performance, and was bit surprised by the magnitude of speedups reported by Marco Cantu in a comment:

” I keep fixing similar code I see my clients use, and in some case the performance can increase 5 to 10 times, for large loops. Good you are raising this problem. ”

We have similar-looking code being used here in our datasets (which aren’t TDataset-related at all however), and yet, repeatedly looking up fields by name isn’t a performance issue (it hardly registers in the profiling results, even in worst case situations, like in-memory SQLite DBs).

By curiosity, I had a look at DB.pas… Suffice it to say that the VCL code make a good case study of why a Profiler is your friend, and what “out of thin air” optimization can lead to.

The FindField case of Unicode comparisons

The DB.pas code being copyrighted, I won’t post any excerpts here, but you can find it easily enough yourself, so I’ll just describe what happens.

FindField’s purpose is to find a TField by its name, in a case-insensitive fashion.
A naive implementation would thus look like  that:

for i := 0 to FFields.Count-1 do begin
   Result := FFields[i];
   if AnsiCompareText(Result.Name, FieldName) = 0 then
      Exit;
end;
Result := nil;

I then made a quick benchmark, consisting of a three cases:

  • “best” case consists in finding the first field
  • “worst” case consists in finding the 20th field
  • “all” case consists in finding 20 fields out of the 20 fields

Field names were like “MyFieldName1″, “MyFieldName2″, etc. up to “MyFieldName20″. You’ll note that the differencing characters are at the end of the string, so the situation is quite unfavorable in terms of string comparisons, but neutral if you were to hash those strings f.i.
Also keep in mind I’ve just got a recent CPU (at the date of writing), and the timings afterward are for 100k lookups. On a regular end-user machine, you could probably double or quadruple the figures.

With the naive implementation,  the “best” case performance is 19 ms, “worst” case 400 ms, and “all” is 210 ms.

This is quite lengthy, as with Unicode, case insensitive comparison (AnsiCompareText) is quite complex and expensive time-wise. There can be an obvious performance issue with FindField if used in a loop.

Case study of an optimization gone wrong

To cut down on that complexity, the VCL implementors chose to go for a hash. A risky choice to begin with: a hash has to be good enough to limit collisions, it has to be computed fast enough to actually bring a benefit, and last but not least, it results in sometimes complex extra code (and here you need a case insensitive hash, a plain old binary hash won’t do).

So the VCL code goes on to compute a hash for each of the fields, and alters the naive implementation above by checking the hash before performing the AnsiCompareText, doing the comparison only the hash matches. So far so good, eh? Well, here the trouble begins.

First, the VCL code is still facing one AnsiCompareText per FindField hit, plus an AnsiUpperCase (which is almost as expensive) to compute the hash.

Second, the TNamedItem.HashName implementation is a collection of “don’t”, look for yourself in the code:

  • it includes a custom GetStringLength inline, to cut down on the access to the string character count probably, access which happens twice in the implementation (and which a profiling would have revealed to be negligible, especially in light of the following)
  • since the introduction of FrankenStrings, a String can hold Ansi as well as UTF16, and the hash is computed on an UpperCase’d UTF-16, so you’ve got extra conversion code in there, in case it is Ansi, including dynamic allocation to serve as buffer for UnicodeFromLocaleChars, which is invoked no less than twice
  • in case the String is already UTF-16, a buffer is still dynamically allocated, and the AnsiUpperCase’d name copied to it, I guess that’s to preserve the “efficiency” of the loop that computes the actual hash…
  • then comes the hash computation loop, that was obviously where the optimization effort went, it works on a PChar, does a ROL and XOR, and it is certainly efficient, except that a mere profiling would have shown its efficiency didn’t matter, unless your field names are several hundred characters long…

The VCL implementation has a “best” case performance of 42 ms (2 times slower than naive), a “worst” case of 50 ms (8 times faster than naive), and “all” of 46ms (5 times faster than naive).

Thing is, you likely won’t often have 20 fields in your queries, and the VCL implementation needs at least 3 fields to pull ahead of the naive implementation. Given the size and complexity of the VCL code involved, I would say that’s quite an under-achievement.

Last but not least, if you profile the VCL code, you’ll see that HashName, a whole bunch of memory allocation and string management code from System.pas are quite stressed, given the above, that’s not too surprising, but that means performance in a multi-threaded situation will only get worse.

Doing it the efficient way

Let’s do it with the help of a profiler, and a bit of laziness.

Initially AnsiCompareText is the obvious, overwhelming culprit the profiler will point to in the naive implementation, there are two roads from that point:

  • optimizing AnsiCompareText, this is complex, involves quite a bit of code, and we’re lazy, remember?
  • the fastest AnsiCompareText is the one you don’t do, that’s the lazy road.

How to not do the AnsiCompareText?

One reason there are so many of them in the first place is that there is a loop on the fields. And when optimizing, loops are good, they mean you’ve got big O optimization potential, and big O optimization is how you achieve orders of magnitude speedups.

In this case, it’s a simple O(n) string search loop, for which one of the classic optimizations is a reduction to O(ln n) using binary search. That however requires an ordered list, and the Fields list isn’t sorted, and can’t be sorted.
So we need a good old fashioned index.

One such readily available index is good old TStringList, with Sorted set to True: place the field names in the Strings[], the TField object in the Objects[]. Use IndexOf() to find a field. That’s all. You have reduced the AnsiCompareText from O(n) to O(ln n).

// after filling up or altering the Fields
FIndex.Clear; // FIndex assumed created, set to sorted, case insensitive
for i := 0 to FFields.Count-1 do
   FIndex.AddObject( FFields[i].Name, FFields[i] );
...
// Find a field with
i := FIndex.IndexOf( fieldName );
if i >=0 then
   field := TField( FIndex.Objects[i] )
else field := nil;

With the above code, on a 20 fields situation, the best/worst/all cases benchmarks around 78 ms, which is coherent with the O(ln n) expectation (there are no best or worst case).

A quick look in the profile reveals AnsiCompareText is still the overwhelming bottleneck, instead of being the one in our code, it’s now the one in the TStringList.CompareStrings. The profiler tells us the AnsiCompareText is the key, no need to worry about optimizing Length() ;-)

How to go further from there?

We are O(ln n), could we go to O(1)? That would involve a hash list, which means a hash, a case-insensitive hash. We don’t have one handy, they’re complex, this is not lazy. Also a hash list can be costly memory-wise, we don’t want a huge setup for what could be a two-fields-dataset affair in practice.

The  fastest AnsiCompareText is the one you don’t do… so, do we really need AnsiCompareText? No.
Why? Because we only really need to index the names that are looked up, and FindField is troublesome when it’s invoked in a loop, looking up the very same name strings again and again, ad nauseum.

Rather than indexing the field names, we can index the field names that are actually looked up, but this time in a case sensitive TStringList, thus changing our index to a cache of sorts:

// after filling up or altering the Fields
FIndex.Clear; // FIndex assumed created, set to sorted, case sensitive
...
// Find a field with
k:=FIndex.IndexOf(FieldName);
if k>=0 then
   Result:=TField(FIndex.Objects[k])
else begin
   // not in index, find it with naive implementation
   Result:=nil;
   for i:=0 to FFields.Count-1 do begin
      Result:=FFields[i];
      if AnsiCompareText(Result.Name, FieldName) = 0 then
         Break;
   end;
   // add to index
   FIndex.AddObject(FieldName, Result);
end;

After benchmarking, however, there is no speedup… A look at the profiling results shows that TStringList.CompareStrings is still the bottleneck, this time because of AnsiCompareStr… is there no end to them slow Unicode functions???

Why is AnsiCompareStr slow? in part because in Unicode the same character can be encoded differently, and in part because the WinAPI implementation is just plain no good.

In our case however, the Unicode encodings details don’t matter, it’s a cache, the ordering is meaningless as long as it is consistent, so we can just subclass TStringList and override CompareStrings:

function TIndexStringList.CompareStrings(const S1, S2: string): Integer;
begin
   Result:=CompareStr(S1, S2);
end;

In Delphi XE, CompareStr() is fast, it is still based on Pierre le Riche’s FastCode version (but who knows what will happen to it in Delphi 64 where there is no BASM? but I digress…).

Wrapping it up

The new benchmark figures are now around 8 ms, in all cases. That’s five times faster than the VCL’s best case with a 20 fields dataset, and it scales nicely with field counts, as we’re still O(ln n). Just to illustrate:

  • with 3 fields: VCL takes 42 to 45 ms, our code 4 to 5.6 ms
  • with 100 fields: VCL takes 42 to 81 ms, our code 10 to 18 ms.

This is also achieved with a lot less code than the VCL, no asm or fancy tricks, and we achieved speedups of the same magnitude as what Marco Cantu reported, what François and all TDataset users have to labor for by optimizing on the spot every time they have a loop on a dataset.

Is there some more fat to be trimmed?

The final profiling of the benchmark look like that:

16.5% are lost to the overhead (a simple loop that calls FindField repeatedly), we can assume CompareStr is optimal enough,  the rest is spent in TStringList methods which are decently implemented.

As a bonus, you’ll notice there is no noteworthy stress on dynamic memory allocation, meaning things will scale all the better when multi-threading.

As an alternative to TStringList, you could wonder about the new generic TDictionary<>, if you’re feeling adventurous.
However, you wouldn’t be rewarded for your risk, as it’s slower than the solution exposed here (up to 5 times slower when there are few fields, and slightly slower when there are hundreds of fields). A look at the profiler shows it wastes most of its time… computing the hash. Its memory overhead is also quite higher and would probably bite in a multi-thread situation (I’ll let the astute reader figure out why the TStringList approach gets away with very little memory allocation).

The optimization is done.
You could of course go further, there are a couple low hanging percents to grab, but to really improve, it  would involve libraries or extra code that would go beyond the scope of an isolated optimization case such as this one, IMHO.

Tips , , ,

Optimizing for memory: a tighter TList

October 28th, 2010

One of the memory hogs when you have object trees or graphs can be good old TList and its many variants (TObjectList, TList<T>, etc.).

This is especially an issue when you have thousandths of lists that typically hold very few items, or are empty.
In the DWS expression tree f.i. there are quickly thousandths of such lists, for local symbol tables, parameter lists, etc.

How much does a TList cost in terms of memory?

A TList holding a single item already costs you:

  • 4 bytes for the field in the owner object
  • 20 bytes for the TList instance
    • 8 hidden bytes: Monitor + VMT pointer
    • 12 field bytes: data pointer + Count + Capacity
  • 4 bytes for the data

So that’s 28 bytes, with two dynamically allocated blocks which, and those dynamic allocations, depending on your allocation alignment settings, can cost you something like an extra dozen of bytes (even with FastMM).

What about the other TList variants?

  • TObjectList has an extra boolean field, which with alignment, costs an extra 4 bytes.
  • TList<T> has an instance size of 28 bytes, and a dynamic array storage with 8 hidden extra bytes (4 for length, 4 for the refcount).

Neither of these are better candidates for memory-efficient small lists.

Enter the TTightList

You can find TTightList in DWS’s dwsUtils unit.
For an empty list, the cost is 8 bytes, no dynamic memory, and for a list with a single item, 8 bytes still with no dynamic memory.
For a n-items list, the cost is 8 bytes plus one n*4 bytes dynamic block.

To achieve that, the TTightList makes use of two tricks:

  • it’s designed to be composed, and hosted as an object field
    • it’s a record-with-methods, not a class, but retains classic-looking use semantics (.Add, .IndexOf, .Clear etc.).
    • eliminates the need for a pointer to the instance in the host object
    • eliminates the dynamically allocated storage for the TTightList itself
  • if the Count is one, the array pointer itself points to the only item, rather than to a dynamically allocated block holding a pointer to the item.

The second trick is where we sacrifice a bit of performance, to save one dynamic allocation for lists holding a single item. Though if you benchmark the TTightList, you’ll see it holds its own fairly well against TList for the smaller item counts, which is what it was designed for.
That’s partly thanks to TList‘s own inefficiencies, and FastMM’s in-place reallocation (on which TTightLight relies, since it doesn’t maintain a capacity field).

Tips , , ,

Why no bytecode format?

October 26th, 2010

A compiled script, a TdwsProgram, can’t be saved to a file, and won’t ever be. Why is that?

This is a question that dates back to the first DWS, as it was re-asked recently, I’ll expose the rationale here.

  • DWS has a very fast compiler, it’s no performance problem to compile scripts instead of loading a binary representation that has to be de-serialized. How fast is it? See below.
  • DWS lets you define custom filters, that enable you to encrypt your scripts easily, if hiding the script source is what you were after with the bytecode.
  • DWS compiler/parser portion is quite light (currently less than 75kB), especially compared to the size of the Delphi libraries you’ll be using for the runtime. You probably won’t notice it in the EXE size once you expose more than a few trivial libraries.
  • Last but not least, when loading a binary representation of a script, you have to make sure all libraries are compiled into the application that loads and wants to execute the script, and that they are entirely backward-compatible with what was exposed to the script back when it was compiled. That is irrelevant when re-compiling.

How fast is the DWS compiler?

I did some quick benchmarking against PascalScript and Delphi itself.
I generated a script based on the following template:

    var myvar : Integer;
    begin
       myVar:=2*myvar-StrToInt(IntToStr(myvar));
    end;

The assignment line being there only once, 100 times, 1000 times, etc. The result was saved to a file, and the benchmark consisted in loading the file, compiling and then running it for DWS. For PascalScript, the times are broken down into compiling, loading the bytecode output from a file, and then running that bytecode. Disk size indicates the size of the generated bytecode.
All times are in milliseconds (and have been updated, see Post-Scriptum below):

For line counts expected for typical scripts (less than 1000), compared to PascalScript, the cost of not being able to save to a bytecode is a one-time hit in the sub-15 milliseconds range, on the first run.
It illustrates why it isn’t really worth the trouble maintaining a bytecode version for scripting purposes, and that’s also my practical experience.

For larger scripts, it is expected the execution complexity will dwarf the compile time: the benchmark code tested here doesn’t have any loops, anything more real-life will have loops, and will likely have a greater runtime/compiletime ratio.

What of Delphi?

For reference, I tried compiling the larger line counts versions with Delphi XE, from the IDE.

  • the 100k lines case took 3 minutes 27 seconds to compile (ouch!), obviously hitting some Delphi parser or compiler limitation. Runtime was 63 ms.
  • the 10k lines case in Delphi compiled in a more reasonable 2400 msec, and ran in 4 ms (50% faster than DWS).

What else? The DWS compiler has an initial setup cost higher than PascalScript, but as code size grows, it starts pulling ahead. That setup overhead will nevertheless bear some investigation ;-) .
Once compiled, the 10x execution speed ratio advantage of DWS vs PascalScript is consistent with other informal benchmarks.

Post-Scriptum

Gave a quick look at the setup overhead with SamplingProfiler, and found two bottlenecks/bugs. The outcome was the shaving off of 3 ms from the DWS compile times, ie. the compile times for the 1, 100 and 1000 lines cases are now 0.95 ms, 2.85 ms and 19.1 ms respectively.

Tips , , ,

Code Optimization: Go For the Jugular

May 6th, 2009

Code optimization can sometimes be experienced as a lengthy process, with disruptive effects on code readability and maintainability. For effective optimization, it is crucial to focus efforts on areas where minimal work and minimal changes will have to most impact, ie. go for the jugular


The Prey

I will illustrate this using SamplingProfiler in a small example, taken from a small library that deals with short vectors of varying length (but usually less than 10 dimensions), which I simplified, isolated & anonymized for the purpose of this article.

uses TypInfo;

type
   TDoWhat = (dwInc, dwDec);

procedure DoSomething1(var data : array of Integer; what : TDoWhat);
var
   i : Integer;
begin
   for i:=Low(data) to High(data) do
   begin
      case what of
         dwInc : Inc(data[i]);
         dwDec : Dec(data[i]);
      else
         raise Exception.Create('Unsupported: '+GetEnumName(TypeInfo(TDoWhat), Integer(what)));
      end;
   end;
end;


Get Meat into Belly

Before starting any kind of optimization, one has to define goals and limits, ie. figure out what “good enough” will be rather consider  “good enough” to be the state of the code one has grown tired of optimizing it!

The sample code above is quite straightforward and simple. It would of course be possible to blow this code to huge proportions for optimization’s sake. If you are after getting every last drop of CPU-cycle juice, and allow yourself to use every trick in the book, a fully optimized version could represent several thousandths of lines of code (I’m not exaggerating). If it’s your core business, it might be okay, but if it’s just a utility library, the increased maintainability issues could never be justified.

But since this article is intended more as an illustration than a discussion on the methodology, I’ll get straight to the buffalo (beef). For further reading on that subject, you can start from Big O Notation, Benchmarking and Software metrics articles in wikipedia, there are also whole books on the subject.


Stalking the Prey

Looking at the above code, the first obvious optimization that developers suggest seems to be taking the conditional out of the loop, resulting in several case-specific loops. On small vectors, this nets about a 30% speedup. For further speedups, the suggestions are typically to go for loop unrolling, asm, and other heavy-handed solutions that come with a significant development time and code complexity increase.

Of course, readers of this website will know better than to jump straight into the code and apply optimization recipes: they would run the code through a profiler first. And since we’re dealing with a single procedure, an instrumenting profiler would be of little help, so they would run Sampling Profiler instead, and would get to see something like this:

Going For The Jugular - Initial Profiling Results

In this run, only the dwInc case was stressed (line 37), and obviously the procedure spends less than 30% of its time doing what it was asked of, and most of its time (33%) on the “end“, ie. cleaning up, plus 8% setting up in “begin“. That’s 40%+ doing nothing but stack and setup/cleanup work!
The conditional in the loop that could have looked like the most worrying bit is eating a bit less than 20% of the time.

What is the source of all that begin/end work? Place a breakpoint on begin, run and hit Ctrl+Alt+C when your breakpoint is reached, go have a look at the CPU view, and you’ll see this:

Going For The Jugular - CPU view near "begin"

This is a fairly significant stack setup for such a small procedure, and those instructions with “fs:” at the bottom are the setting up of an (implicit) exception frame. An exception frame for what? if you haven’t guessed already, navigate your CPU view near the “end” line.

Going For The Jugular - CPU view near "end"

No wonder “end” was a bottleneck! The call to UStrArrayClr indicates that the exception frame is here to cleanup several strings… these strings are those of the raise Exception, one is the string returned by GetEnumName, the other is the result of the concatenation passed to Exception.Create.


Isolate and Kill

How to get rid of that exception frame? One typical way is to use “Exception.CreateFmt”, and pass only constant strings to it, but that is not possible here with the call to GetEnumName, which returns a string. The other way is to isolate the exception to its own (nested) procedure:

procedure RaiseUnsupported(what : TDoWhat);
begin
   raise Exception.Create('Unsupported: '+GetEnumName(TypeInfo(TDoWhat), Integer(what)));
end;

and call RaiseUnsupported in the “case else“. Doing so will move the exception frame to the new procedure, where it’s irrelevant in terms of performance.
This simple change nets us a 33% speedup, ie. we reclaimed most of the lost time in begin/end! We also gained a bit from the UStrArrayClr, which did essentially nothing since those strings it was used to clear weren’t defined (as long as we did not hit the exception).

Note that if you use a nested procedure for RaiseUnsupported, you can be tempted not to pass it the “what” parameter, but use directly the “what” from its parent procedure. However by doing so, you’ll have the compiler use a special stack setup (so that the nested procedure can access the parent procedure’s variables). This setup will be faster than the exception frame it replaces, but with it, begin/end would still be taking about 18% of the CPU time spent in the procedure.


Repeat Until Belly.Full;

Those first 33% were easily gained. Let’s go for another round of SamplingProfiler:

Going For The Jugular - Further Profiling Results

Things are more satisfying: the line performing the actual work is now taking up most of the CPU time. Second comes the case of line. For further speed improvements, we now need to move the conditional out of the loop:

procedure DoSomething3(var data : array of Integer; what : TDoWhat);

   procedure RaiseUnsupported(what : TDoWhat);
   begin
      raise Exception.Create('Unsupported: '+GetEnumName(TypeInfo(TDoWhat), Integer(what)));
   end;

var
   i : Integer;
begin
   case what of
      dwInc :
         for i:=Low(data) to High(data) do
            Inc(data[i]);
      dwDec :
         for i:=Low(data) to High(data) do
            Dec(data[i]);
   else
      RaiseUnsupported(what);
   end;
end;

We have increased the line count noticeably, but most of those extra lines are still cosmetic. What further makes it a reasonable trade-off is that the execution time has been reduced by 66% from the initial version, it now executes 3 times faster!

Are there any more easy gains to be had? Let’s run the last version through SamplingProfiler:

Going For The Jugular - Final Profiling Results

More than 92% of the execution time now goes to the loop and actual work. We got only a wee bit left for stack setup (line 96) and the case of (line 97). At this point, the above makes it clear that if you want to go faster you’ll have to increase the line count and code complexity significantly as you’ll need to replace the two-liner loops with something else, which is bound to be heavier (unrolling, SIMD, etc.)


Rest Under A Tree

Some quick final notes to conclude.

When moving an exception to a procedure, there are two things to keep in mind:

  • the exception will be triggered at another place in the code, to know where it was actually triggered, you’ll have to look up one step in your exception log stack trace… You do have an exception log stack trace in place, don’t you?
  • the compiler won’t “know” about the exception in the called procedure, so it will assume execution continues after your RaiseUnsupported, so you may want to place an Exit after it (which will never be reached), to avoid warnings and allow the occasional register optimization by the compiler.

In the final version, we gained more than the previous profiling run hinted at: the new code allowed the compiler to make better use of the registers. Ofttimes, getting the fat out of the way is all you need to see improvements.

If you check the CPU view, you’ll see everything is quite efficient now, but even then, using all the remaining tricks in the book could probably net noteworthy gains, just at a significant complexity increase. I didn’t try, but I would guess a 2x or 3x speed up should be about right.

If you were to need to go that route, SamplingProfiler could still help you there: on ASM code, you get profiling data down to the ASM instruction… but that’s food for another article.

Tips , , , , , ,

Knowing what and when to optimize…

April 20th, 2009
Comments Off

…is as important as knowing how to optimize.

In this thread on the Delphi forums Ante Bonic brought back to intention this excellent Delphi Optimization Guide in Delphi article by Robert Lee. The article has aged a bit, but many tips remain true with the Delphi 2009 compiler (sadly so).  Like many optimization articles, Robert’s focuses on mostly local optimization tips, which can draw in warnings like this one one by Anders Isaksson:

Optimization should be done after profiling, not before.

Which I couldn’t agree more with. But to be fair, Robert’s states so in his article, as do most authors of optimization articles. Recipes and local optimization tips are to be used after all algorithmic and data structures improvements have been taken advantage off.

If one can list tips and tricks for local optimization, do’s and don’ts that are true often enough to be good tips in many scenarios. However, it’s practically impossible to come up with a “reusable” list of tips for algorithms and data structures. Too many specifics can come together, even when the problems are similar, considerations of scale or reactivity can drastically influence architectural and algorithmic options.

Hence the most visible optimization recipes are often local optimization ones, but mostly because there are few global optimization recipes. You only have global optimization methodologies. But even these methodologies can usually be summarized with few words:

  1. Time, profile, analyze and confirm your bottlenecks.
  2. Improve algorithms & data structures.
  3. Exhaust 1 & 2 before looking at local optimizations, and then don’t forget 1.

To optimize efficiently, ie. not waste your time, you have to master the first point.
To optimize effectively, ie. not waste the machine time, you have to master the second.

And the third point you ask? It’s a razor’s edge, when applied effectively, it can be very efficient, with very few changes like in this case, but if not, it’s a good way to end up there. To be effective, local optimization has to be about taking care of hidden machinery, hidden shortcomings of the compiler, hidden algorithms and data-structures that get in the way.

I’ll close this post by quoting Robert Lee’s article on timing:

Timing code is generally called “profiling”. If you want to improve the performance of your code, you first need to know precisely what that performance is. Additionally, you need to re-measure with each change you apply to your code. Do not spend a single second twiddling code to improve performance until you have analytically determined exactly where the application is spending its time. I cannot emphasize this enough.

Tips , , , , , , ,

How familiar are you with code profiling?

March 30th, 2009
Comments Off

SamplingProfiler was initially released in the Delphi ASM newsgroup, and I’m curious about the audience of this website, so I’ve setup a small poll.

How familiar are you with code profiling and/or Delphi code optimization? Can you tell apart instrumenting and sampling profilers merely by their respective heisenbugs, or is that profiler business sounding like a TV series from the last century?

Poll - Familiarity with Profilers

News , , , , , ,