Style Guidelines
Coding Style versus efficiency
Optimization involves not only the speed of your code, but also the
speed with which you create and debug your code. This means that you
are not doing yourself any favors by creating fast but incomprehensible
code. Fortunately, creating optimal code in Delphi rarely requires
ugly code. In fact, optimal code is often elegant code as well. Additionally,
within a given application it is likely that the same sort of techniques
will be used frequently. Thus, you can essentially set the coding
style to what gives the best performance when it matters.
Keep it simple
When it comes to the Delphi optimizer, complexity kills. Keep routines
simple (no more than about 5 to 8 variables "in play").
Do not do too much in any single loop. Overloading a loop causes variable
addresses, array indices, etc. to be reloaded on each iteration. Loop
overhead is actually quite low so it is often advantageous to split
a complex loop into multiple loops or to move the innermost one or
two loops to a separate routine. Done properly, this will have the
added benefit of improving the readability of your code.
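For instance, here is a hypothetical sketch (the arrays and names are invented for illustration) of splitting an overloaded loop:

var
  i: Integer;
  Sum, Scale: Double;
  A, B: array[0..99] of Double;
...
// before: one loop doing two unrelated jobs keeps more variables "in play"
Sum := 0;
for i := 0 to 99 do
begin
  Sum := Sum + A[i];
  B[i] := A[i] * Scale;
end;

// after: two simple loops, each easily register-allocated
Sum := 0;
for i := 0 to 99 do
  Sum := Sum + A[i];
for i := 0 to 99 do
  B[i] := A[i] * Scale;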
Strongly favor local variables
Local variables are those declared within a routine in addition to
any parameters passed. Only local variables can be converted into
register variables, and register variables equal speed. Consequently,
it is often advantageous to copy data into a local variable prior
to using it. Typically this is most advantageous when the variable
is to be used within a loop. Thus, the overhead of copying is offset
by speedy reuse of the copied data. This optimization is particularly
useful if class members are used in a tight loop. Delphi tends to
load the pointer / class member just prior to its use within
the loop, adding a lot of unnecessary overhead.
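For example, here is a minimal sketch (the class, its fields, and the count are invented) of hoisting a field into a local before a loop:

procedure TSample.ScaleData;
var
  i: Integer;
  LScale: Double;
begin
  LScale := FScale;                 // copy the field into a local once
  for i := 0 to FCount - 1 do
    FData[i] := FData[i] * LScale;  // the loop reads a register, not Self.FScale
end;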
There is one exception to this rule: arrays with elements of a simple
type. If you have an array of constant size and constant data, making
it global will save a register during calculations. Since saving a
single register is not worth a lot, this should only be used for constant
structures (conversion or transformation tables) where having a global
structure makes some sense to begin with.
Keep the number of parameters low
Small but heavily used routines should not have more than three parameters,
as that is the maximum that can be passed by register. By following
this rule you maximize the use of registers and give the Delphi optimizer
a better chance to improve your code. Note that class methods have
a hidden parameter Self that is always passed implicitly,
so for these only two parameters are left.
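As a small illustration (the routine itself is invented), a function with three parameters fits entirely in the register calling convention; the first three eligible parameters travel in EAX, EDX, and ECX, so the call needs no stack traffic for its arguments:

function Clamp(Value, Lo, Hi: Integer): Integer;
begin
  if Value < Lo then
    Result := Lo
  else if Value > Hi then
    Result := Hi
  else
    Result := Value;
end;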
Do not use nested routines
Nested routines (routines within other routines; also known as "local
procedures") require some special stack manipulation so that
the variables of the outer routine can be seen by the inner routine.
This results in a good bit of overhead. Instead of nesting, move the
procedure to the unit scope and pass the necessary variables
- by reference (use the var keyword) if necessary - or
make the variables global at the unit scope.
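A hypothetical before/after sketch:

// before: the nested routine reaches Total through the outer stack frame
procedure Outer;
var
  Total: Integer;
  procedure Add(Value: Integer);
  begin
    Total := Total + Value;  // access to the outer frame costs extra stack work
  end;
begin
  Total := 0;
  Add(1);
end;

// after: moved to unit scope, with the shared variable passed by reference
procedure Add(var Total: Integer; Value: Integer);
begin
  Total := Total + Value;
end;

procedure Outer;
var
  Total: Integer;
begin
  Total := 0;
  Add(Total, 1);
end;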
Pointer variables
A valuable technique is to take advantage of pointers. A lot of programmers
shy away from pointers due to the potential for access violations,
memory leaks and other low-level problems. However, pointers are a
valuable tool in optimizing code in Delphi. Fortunately, this does
not mean you have to convert all your data references to pointers.
What it does mean is that you should take advantage of pointers as
temporary references into your data. These temporary variables will
typically be optimized into register variables. Consequently, you
are not really adding any machine code; you are only providing
a clue for the compiler that it needs to hold on to the intermediate
address. You can use pointers much the same way as you would use a
with statement. That is, use them to simplify complicated
or redundant addressing in complex data structures. In the case of
with statements, this is exactly what the compiler does
internally. For example:
with Structure1.Structure2[i] do
begin
  ...
end;

at the compiler level becomes:

InnerStructure := Structure1.Structure2[i];  // if it is a class
InnerStructure := @Structure1.Structure2[i]; // any other type
begin
  ...  // references to InnerStructure
end;
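Here is a hedged example of the same idea done with an explicit pointer; the record types are invented for illustration:

type
  TPoint3D = record
    X, Y, Z: Double;
  end;
  PPoint3D = ^TPoint3D;

procedure Translate(var Points: array of TPoint3D; d: Double);
var
  i: Integer;
  P: PPoint3D;
begin
  for i := 0 to High(Points) do
  begin
    P := @Points[i];     // temporary reference, typically a register variable
    P^.X := P^.X + d;    // the indexing work happens only once per element
    P^.Y := P^.Y + d;
    P^.Z := P^.Z + d;
  end;
end;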
Linked lists vs. Arrays
Finding the trade-off between linked lists and arrays is a classic
design problem. On older computers (Pentium and before) integer multiplication
was a slow operation. Since multiplication is the key to accessing
arrays this shifted the performance balance towards Linked lists in
some cases. Is it random access or sequential access? Obviously, if
you truly need random access then an array is the way to go for anything
more than about 5 elements (this is a rule of thumb based on experimentation).
For sequential access or quasi-sequential access the short answer
is that arrays are better for simple element types and linked lists
are better for larger types.
Multiplication on the Pentium II is much, much faster. Consequently,
array access is now essentially always faster.
Types of Arrays
In Delphi, arrays come in many flavors: Static, Dynamic, Pointer
and Open. Static arrays are the classic Pascal array type (A: array[0..100]
of Byte). Dynamic arrays are the new array type introduced with Delphi
4 (A: array of Byte). Pointer arrays are simply pointers to Static
arrays; however, the actual number of elements may not match the declared
array bounds. Finally, Open arrays look like dynamic arrays
but are exclusively used as parameters of routines. The underlying
implementation of all these arrays varies quite substantially. From
an efficiency viewpoint, Static and Pointer arrays are the best choice,
followed by Open, then Dynamic arrays. However, Static arrays are
often too inflexible, and Pointer arrays can create painful management
issues. Fortunately, the various types are convertible. For arrays
without a fixed size, the current best choice is to manage
them as Dynamic arrays, and convert them to Pointer arrays as needed.
Dynamic arrays are much like long strings (AnsiStrings) in that the
variable is actually a pointer to the array's first element. Thus,
converting a Dynamic array to a Pointer array is simply an assignment.
While the length (size) of the Dynamic array is stored just before
the first element, using High or Length
on a Dynamic array generates a function call rather than some compiler
"magic" to extract the length - in fact, High
calls Length . Consequently, do not repeatedly get the
size of the array via these functions. Get it once (in the routine)
and save it.
Both Dynamic and Open arrays incur a fair amount of overhead when
used as parameters unless you use the const or var
modifiers. Note that, just like class parameters, a const
modifier on a Dynamic array only prevents you from changing the entire
array, not from modifying its content.
Let's finish with an example of converting between the various array types:
type
  TDoubleArray = array of Double;
  TStaticDoubleArray = array[0..0] of Double;
  PDoubleArray = ^TStaticDoubleArray;

function Sum(const X: TDoubleArray): Double;
var
  P: PDoubleArray;
  i: Integer;
begin
  P := Pointer(X);
  Result := 0;
  // Note: indexing past element 0 of an array[0..0] type requires
  // range checking to be off ($R-).
  for i := 0 to Length(X) - 1 do
    Result := Result + P[i];
end;
Exceptions
Do not use exceptions just to jump out of a bit of code, or as a
catch-all for input errors. They add overhead with both the try..finally
block and with throwing the exception itself. Use the break, continue,
or exit statements to do unusual flow control, and validate inputs
(like pointers) as early as possible, but outside any loop.
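A minimal sketch of the idea (the summing routine is invented; PByteArray is declared in SysUtils):

function SumBytes(P: PByteArray; Count: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  if (P = nil) or (Count <= 0) then
    Exit;                  // cheap early-out instead of raising an exception
  for i := 0 to Count - 1 do
    Result := Result + P[i];
end;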
Use type-casting rather than absolute
A technique sometimes used to avoid typecasting is to "overlay"
a variable with another of a different type by using the absolute
keyword. However, this prevents the variable from becoming a "fast"
register variable. It is better to type-cast and save the original
variable into a new variable. For example:
procedure DoSomething(s: PChar);
var
  ByteArray: PByteArray absolute s;
begin
  ...

should be changed to or written as:

procedure DoSomething(s: PChar);
var
  ByteArray: PByteArray;
begin
  ByteArray := PByteArray(s);
  ...
Working with Sets
There are two compiler-magic procedures, Include and Exclude,
that are quite efficient for adding and removing single elements
from sets. Thus, you should use these instead of statements of the
form "s := s + [a];". In fact, they are efficient enough that a small
number of repeated calls to Include or Exclude can still be better
than the set notation.
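For example (a trivial illustration):

type
  TSuit = (Diamonds, Hearts, Clubs, Spades);
  TSuits = set of TSuit;
var
  Suits: TSuits;
...
Suits := [];
Include(Suits, Hearts);     // efficient single-element add
Exclude(Suits, Hearts);     // efficient single-element remove
Suits := Suits + [Hearts];  // slower: builds a set value, then merges it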
Pentium II specific bottlenecks
It has occurred to me that while many of the techniques presented
here are based upon how Pentium II processors bottleneck, I have never
actually stated how this works. The long and detailed version can
be found in Intel's documentation and in Agner Fog's Pentium optimization
guide. Here I present a quickie version slanted towards Delphi's compiler
output. Having a general understanding of this process may help
you decide what needs optimizing and what does not.
First off, the Pentium II is a superscalar pipelined processor with
out-of-order execution capabilities. Basically that means that each
instruction gets "executed" in steps and can march along
one of a few different channels. Specifically, each instruction has
to be loaded, executed and retired with the out-of-order buffer acting
as a sort of "waiting room" between the load and execution
steps. This seems simple enough, but the complications start to build
once the multiple channels part is added in, because not all channels
can handle all instructions. There are 3 loading channels, one of
which can take anything while the other two can only handle "simple"
instructions. There are 5 execution channels (called ports by Intel),
one is general purpose integer, another is general purpose integer
plus floating point, the third handles address math, and the fourth
and fifth load and store data. The retiring step also has 3 channels.
There is also the issue of latency. That is, many instructions take
longer than 1 cycle to execute.
So what does all this mean? Well, it means you can bottleneck in
a whole bunch of different ways. Basically, any channel of any step
can be a bottleneck if too many instructions that require that specific
unit are encountered. Thus, while the CPU can theoretically process
3 instructions per cycle, it may be limited to 2 or 1 or even less
due to the mix of instructions currently being executed. The out-of-order
"waiting room" helps with this situation by allowing instructions
not dependent on a currently executing operation to go around any
that may be waiting for a specific port and execute on a different
port. This helps with the small bottlenecks where there is a temporary
backup on a given port. However, it does nothing for the large scale
backups. For instance, executing a large series of floating point
operations, say a loop around a complex math expression, will typically
be constrained by the limitation of there being only one FP capable
port. Thus, the throughput drops to 1 instruction/cycle. Now the loop
also has some other overhead instructions associated with it (incrementing
the loop variable and jumping). These instructions incur essentially
zero cycles since they can be fitted in around the backed-up FP port.
In Pascal terms, this means that any tight, repetitive operation
will probably be constrained by only one aspect of the operation.
It might be floating point as described above, or it might be integer
math or memory addressing. In any case, the only optimizations that
will have any impact are those that go after that main aspect. Pruning
the ancillary code back will have no effect.
Inside the For statement
Implementing For statements is one of the more complex
jobs the compiler has to deal with. This is in part, because the compiler
goes to great pain to avoid integer multiplication, which was quite
slow on CPUs before the Pentium II. Thus, For loops are
deconstructed into pseudocode that looks something like this:
Original loop:

for i := m to n do
  A[i] := A[i] + B[i];

Becomes:

PA := @A[m];
PB := @B[m];
Counter := n - m + 1;
if Counter > 0 then
  repeat
    PA^ := PA^ + PB^;
    Inc(PA);
    Inc(PB);
    Dec(Counter);
  until Counter = 0;
There are other configurations, but this is the most common, and
it is the one that causes problems. The problem stems from the fact
that the variable i appears nowhere in the deconstructed
version. However, when stepping through this code in the debugger,
watching i will show the value of the new variable most
similar to i, which is Counter. This has
sent many a programmer into fits of hysteria thinking that their loop
is being executed backwards. It is not. The debugger is merely misinforming
you.
This example also illustrates the substantial overhead associated
with for loops. Notice that three variables need to be incremented
on each iteration, and that there is a fair amount of initialization
code. In some cases, this overhead is lessened. For instance, if m
and n are compile-time constants, or if m = 0 and A and
B are already pointers that are not used again in the code, then the
overhead code is reduced.
Interfaces
This is just a beginning pass at the performance implications of
using interfaces. Basically, an interface is implemented as a cross
between a string and a class. Interfaces, like strings, are reference
counted, which means every time you create or copy one, and every time
an interface variable goes out of scope, there is some overhead. Thus,
treat interface variables more like you would string variables than
object variables. (Pass by const where possible, watch
out for using too many temp variables, etc.) Internally, interfaces
behave something like an object with all virtual methods, only worse:
there are in fact two layers of indirection. So treat them accordingly.
Another note on the subject of interfaces:
Hallvard Vassbotn: For any interfaced global variable, the compiler
automatically adds another global variable that contains the address
of this first variable. When external units access the variable, it
is used through the implicit pointer variable. The reason for this
is to support packages and is done even if you are not using packages.
(See article in The Delphi Magazine issue 43 for more details.)
Optimization Techniques
Keep an open mind
Optimization is best approached as a top down issue. The most powerful
concept in optimization can be stated as "If it takes too long
to figure out the answer, then change the question." The best
improvements in performance will always come from changes made at
the design and algorithmic level. By the time you get down to coding
specifics, your options are quite limited. Unfortunately, like the
rest of design, it is rather difficult to break down this high-level
optimization into a nice set of rules. Nonetheless, the first thing
to do if performance needs improvement is to look at the complete
problem at hand, starting optimization at the top and then working
down.
Time your code
Timing code is generally called "profiling". If you want
to improve the performance of your code, you first need to know precisely
what that performance is. Additionally, you need to re-measure
with each change you apply to your code. Do not spend a single second
twiddling code to improve performance until you have analytically
determined exactly where the application is spending its time. I cannot
emphasize this enough.
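Here is a minimal timing sketch, assuming Delphi 4 or later (where the Win32 performance counter can be read into an Int64) and a hypothetical DoWork routine under test:

uses Windows;

procedure TimeDoWork;
var
  Freq, T0, T1: Int64;
begin
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(T0);
  DoWork;                              // the code being measured
  QueryPerformanceCounter(T1);
  WriteLn('Elapsed: ', (T1 - T0) * 1000 div Freq, ' ms');
end;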
Code alignment
Be aware that the exact positioning of your code and its layout in
the executable module can affect its timing. The reason for this is
that there are penalties for jumping to "sub-optimal" address
alignments. There is very little you can do to influence this alignment
(this is a linker task), except to be aware of it. While it is possible
to insert spacing code into a unit, there is no guarantee that your
alignment efforts will be permanently rewarded since 32 bit Delphi
aligns only on 4 byte (DWord) boundaries. Thus, the next change to
any routine in any unit above the current routine may shift your code.
This problem can result in speed penalties as great as 30% in tight
loops. However, in more typical loops the problem is substantially
less. Also note that this can make timing code and therefore optimization
difficult since seemingly innocuous shifting in the code can affect
the performance. Consequently, if you make a change that you "know"
should increase performance but it does not, it may be that shifts
in code alignment are hiding the improvements in your code.
Utilize the CPU window
You do not need to be an assembler programmer to take advantage of
the CPU window. At the very least it will give you an idea of the
underlying complexity involved in each statement. Very often you can
estimate the effectiveness of a particular optimization technique
by simply counting the number of instructions produced for a given
operation. For instance, many references to the ebp register (as in
mov eax,[ebp-$04] ) within a loop are an indication that
variables are continually reloaded. These reloads are unnecessary
and thus a prime target for optimization.
In Delphi 4.0, the CPU window is readily accessed from the main menu.
However, both version 2.0 and 3.0 of Delphi also have hidden CPU windows.
To allow access to these you need to add an entry in the registry,
using the registry editor "RegEdit.exe":
[HKEY_CURRENT_USER\Software\Borland\Delphi\2.0\Debugging]
"EnableCPU"="1"
[HKEY_CURRENT_USER\Software\Borland\Delphi\3.0\Debugging]
"EnableCPU"="1"
Unroll small loops
Loop unrolling is a classic optimization technique, and it can be
easily done in Delphi. However, it is only worth doing on fairly small
loops. Unrolling essentially consists of doing what was originally
multiple iteration operations within a single pass through the unrolled
code. This reduces the relative loop overhead. Since the branch prediction
mechanism on Pentium II CPUs does not perform well (read: causes penalties)
on very tight loops, unrolling might be beneficial there, too.
In Delphi, the best way to unroll is usually with a while
loop. For example:
i := 0;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Inc(i);
end;

becomes:

i := 0;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Data[i+1] := Data[i+1] + 1;
  Inc(i, 2);
end;
The downside to loop unrolling is that you have to worry about what
happens when Count is not divisible by the factor two.
You typically handle it this way:
i := 0;
if Odd(Count) then
begin
  Data[i] := Data[i] + 1;
  Inc(i);
end;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Data[i+1] := Data[i+1] + 1;
  Inc(i, 2);
end;
You can unroll by whatever factor you want. However, the diminishing
marginal return on unrolling by progressively larger values coupled
with the growing complexity of the code makes unrolling by more than
a factor of 4 rather uncommon.
Eliminate conditionals within loops
It is common to find if statements within a
loop whose conditions depend on the loop index.
Frequently, these can be removed from the loop by unrolling the loop
or splitting the loop into two loops. An example of the former would
be when statements must be executed every other iteration. An example
of the latter would be when statements are executed on a specific
iteration.
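A hypothetical sketch of the second case, where a single iteration needs special handling:

var
  i, Seed: Integer;
  A: array[0..99] of Integer;
...
// before: the test executes on every iteration
for i := 0 to 99 do
  if i = 0 then
    A[i] := Seed
  else
    A[i] := A[i - 1] * 2;

// after: peel the special iteration out; the loop body loses the test
A[0] := Seed;
for i := 1 to 99 do
  A[i] := A[i - 1] * 2;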
Reduce the number of looping conditionals
A common coding structure is to loop while some condition is true
and while the loop index is less than some value. If the loop is small
- and it often only consists of incrementing the loop index - then
the bulk of the loop's execution time is spent evaluating the loop
conditionals. It is sometimes possible to reduce the number of these
conditionals by making one condition happen when the other would have.
Take the example of scanning for the occurrence of a specific character
in a string:
i := 1;
l := Length(s);
while (i <= l) and (s[i] <> c) do
  Inc(i);
...

The two conditionals can be combined by placing the desired character
in the last position in the string:

i := 1;
l := Length(s);
lastc := s[l];
s[l] := c;          // sentinel guarantees the scan terminates
while s[i] <> c do
  Inc(i);
s[l] := lastc;
// if i = l, compare lastc with c to tell a match at the end from "not found"
...
This results in nearly a 2x speed improvement. This optimization
requires some forethought to ensure that there is an empty space available
at the end of the data. Strings and PChars always have the null at
the end that can be used. Also, this technique may cause undesired
side effects in a multi-threaded environment because it temporarily
modifies the data being scanned. This technique also works well with unrolling, as it
often simplifies the problems associated with needing partial iterations.
See FindMax as an additional
example of this technique.
Make the default path the "no-jump" path
This technique dates way back. However, there is still a glimmer
of truth in it. The original reason (jumping took a lot of time) really
isn't the problem now. The problem now is more related to code alignment
and branch prediction. On the alignment side, if you do not jump then
there is no problem with alignment. On the branch prediction
side, a branch is not even considered for prediction until it is actually
taken once.
Take advantage of break, exit and continue
These flow control statements are often derided as being "bad
programming". However, they do have their place, especially in
performance optimizing code. The need for these statements typically
arises when some condition is not determined until the middle
of a loop. Often they are avoided by adding boolean variables and
additional conditionals to transfer control. However, these additions
cost execution time, and can often make the code look more, rather
than less, complex.
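A sketch with invented names, contrasting the flag-based version with Break:

// with a flag: an extra variable and an extra test on every iteration
Found := False;
i := 0;
while (i < Count) and not Found do
  if Data[i] = Target then
    Found := True
  else
    Inc(i);

// with Break: one conditional fewer, and arguably clearer
for i := 0 to Count - 1 do
  if Data[i] = Target then
    Break;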
Resorting to assembler
Do not attempt to use assembler to improve performance on Pentium
II CPUs. This is somewhat controversial, but it is a pretty good rule
of thumb. The out-of-order execution capabilities of Pentium II class
CPUs pretty much eliminate any advantage you might gain by recoding
your algorithm in assembler. In several tests, I have found that assembler
vs. optimally coded Pascal rarely exceeds a 10% difference. There
are always exceptions to this rule (for instance this Alpha
Blending code) and even times where arguably assembler
is cleaner than Pascal, but the point is that you should not just
jump to assembler when code seems too slow.
On the other hand, Pentium CPUs can often benefit from assembler
coding. Improvement factors of around two are not uncommon. However,
optimal coding for the Pentium processor can easily result in rather
non-optimal code on other processors. This applies to floating point
code in particular. If you choose to pursue this path then study Agner
Fog's assembler optimization manual carefully.
For versus While loops
Loops with a pre-determined number of iterations can be implemented
either with a For loop or a While loop.
Typically, a For loop would be chosen. However, the underlying
implementation of the For loop is not as efficient as
that of the While loop in some instances (See Inside
the For statement). If the contents of the loop involve
no arrays or only single-dimensional arrays with elements of sizes
1, 2, 4, or 8 bytes, the code generated for a While loop
will be more efficient and cleaner than for the comparable For
loop. On the other hand, multi-dimensional arrays or arrays with elements
of sizes other than those listed above are better handled by For
loops. It is often possible to convert one of the latter into the
former, typically by using pointers. This approach is likely to increase
the efficiency of the code.
Additionally, using a While loop appears to reduce the
complexity factor. Thus it may be possible to trade For
loop usage against splitting up a routine. An example of this is the
row reduction step of Gauss Elimination.
The optimal configuration with For loops is to factor
out the two innermost loops into a separate routine. With While
loops however, all three loops can be kept together.
Of course there has to be an exception. If the index is never used
within the loop then For is usually a better choice.
Also give For a shot if both loop bounds are compile-time
constants.
Note that for a While loop to be as efficient as possible,
the loop condition should be as simple as possible. This means that,
unlike For loops, you need to move any calculation of
the iteration count out of the While statement and into
a temporary variable.
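For example (List and GetCount are invented for illustration):

// slow: the bound is re-evaluated as part of the condition on every pass
i := 0;
while i < List.GetCount do
  Inc(i);

// better: hoist the count into a local (likely register) variable
Count := List.GetCount;
i := 0;
while i < Count do
  Inc(i);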
Large Memory requirement problems
On Pentium II class CPUs it is often the case that cache or memory
bottlenecks are the main optimization problem, especially if the data
set being manipulated is large. If this is the case, then an entirely
different strategy is in order. Focus on reducing the memory requirements
and on reducing the number of passes through the data. In other words,
pack it tight and do as much as possible with it before moving on.
This may be at odds with some of the other suggestions presented here,
so it is necessary to determine which factor is more rate limiting
by experimenting with different implementations and profiling these.
A good indication that a cache and/or memory bottleneck is dominating
is when an apparent improvement in the code being executed does not
increase performance.
Case statement optimization
Case statements are implemented as follows: First, the
compiler sorts the list of enumerated values and ranges. This means
that the placement of individual cases within the Case statement
is irrelevant. Next, the compiler uses a sort of binary comparison
tree strategy along with jump tables to test the cases. The decision
between jump table and comparison tree is based on the "density"
of the enumerated cases. If the density is sufficiently high a jump
table will be generated. If the density is too low then the list will
be split approximately in half (with ranges counting as 1 element
in the list rather than as the number of values spanned). The process
then starts over on each of the sub-branches. That is: check the density,
then either generate a jump table or split. This continues until all
the cases are handled.
So what optimization possibilities exist? Basically, it boils down
to this. Case statements are quite well optimized, but
not perfect. The "splits" on the binary comparison tree
can come at awkward places. Consequently, if you have groups of consecutive
values interspersed with gaps, it is better to make each range of
consecutive values its own Case statement and then make
an overall Case statement with the underlying ranges
each being a single case. This works because ranges won't be split
but sequential cases will be. The impact of this is that it tends
to create jump tables covering each entire sub-range. An example:
Before:
Case x of
  100: DoSomething1;
  101: DoSomething2;
  102: DoSomething3;
  103: DoSomething4;
  104: DoSomething5;
  105: DoSomething6;
  106: DoSomething7;
  107: DoSomething8;
  200: DoSomething9;
  201: DoSomething10;
  202: DoSomething11;
  203: DoSomething12;
  204: DoSomething13;
  205: DoSomething14;
  206: DoSomething15;
  207: DoSomething16;
  208: DoSomething17;
  209: DoSomething18;
  210: DoSomething19;
end;
After:
Case x of
  100..107:
    case x of
      100: DoSomething1;
      101: DoSomething2;
      102: DoSomething3;
      103: DoSomething4;
      104: DoSomething5;
      105: DoSomething6;
      106: DoSomething7;
      107: DoSomething8;
    end;
  200..210:
    case x of
      200: DoSomething9;
      201: DoSomething10;
      202: DoSomething11;
      203: DoSomething12;
      204: DoSomething13;
      205: DoSomething14;
      206: DoSomething15;
      207: DoSomething16;
      208: DoSomething17;
      209: DoSomething18;
      210: DoSomething19;
    end;
end;
Also, Case statements do not have any provision for
weighting by frequency of execution. If you know that some cases are
more likely to be executed than others, you can use this information
to speed up execution. The way to achieve this is to cascade if
and Case statements to prioritize the search order. An
example:
Before:
Case x of
  100: DoSomething1;
  101: DoSomethingFrequently2;
  102: DoSomething3;
  103: DoSomething4;
  104: DoSomething5;
  105: DoSomething6;
  106: DoSomething7;
  107: DoSomething8;
end;

After:

if x = 101 then
  DoSomethingFrequently2
else
  Case x of
    100: DoSomething1;
    102: DoSomething3;
    103: DoSomething4;
    104: DoSomething5;
    105: DoSomething6;
    106: DoSomething7;
    107: DoSomething8;
  end;
Moving and zeroing memory
The built-in methods supplied with Delphi for moving memory and filling
it with zeros are Move and FillChar respectively.
These routines are based around the rep movsd and rep
stosd assembler instructions, which are fairly efficient. However,
there is some extra cleanup code associated with each of the routines
that can reduce their efficiency, especially when working on smaller
amounts of memory. Additionally, there are special data alignment
considerations on Pentium II CPUs that can have a substantial effect.
The first-pass solution to these issues is simply to use a plain
loop to do the task. This is especially effective if the data elements
being handled are 32 or 64 bit or the structure involved is only partially
zeroed/moved (e.g. sub-section of a matrix). However, the loop approach
is less effective for arrays of large records or for smaller elements
like byte or word. You can unroll the loop and use typecasting to
further improve the situation but this complicates the code substantially
and only results in a small improvement.
At this point it is time to start looking at a specialized routine.
The first issue is the boundary size of the structure. Working with
32bit quantities is always the fastest approach. However, if the structure
is not evenly divisible by 4 bytes this can be a problem. The solution
used by Move and FillChar is to round down
the size to the nearest dword (4 byte) boundary, then copy the remainder
separately. As was mentioned, this extra overhead can be costly on
smaller structures. However, many structures are in fact evenly divisible
even if they do not at first appear to be. All memory allocations
are rounded up to the next dword boundary. Thus a string of 2 characters
is really 4 bytes long. It is usually faster to copy or zero this
extra data than to avoid it. Obviously care must be taken when using
this shortcut and it should be well documented.
Dealing with the data alignment issue is more complicated, and mainly
relevant on larger structures. I will skip the description and
simply show the code:
procedure ZeroMem(A: PDataArray; N: Integer);
// PDataArray/TData are the array and element types, declared elsewhere
var
  c: Integer;
  B: PDataArray;
begin
  B := Pointer((Integer(A) + 15) and $FFFFFFF0); // round A up to 16 bytes
  c := Integer(@A[N]) - Integer(B);              // bytes in the aligned tail
  FillChar(A^, N * SizeOf(TData) - c, #0);       // clear the unaligned head
  FillChar(B^, c, #0);                           // clear the aligned remainder
end;
This will align on a 16 byte boundary by skipping over some of the
data. Of course the skipped part must be properly dealt with, thus
the two calls to FillChar. Obviously, this is not the fastest
approach since you now have the overhead of two FillChar calls, but it
does illustrate the technique. For maximum speed, this is one of those
cases where you have to resort to assembler.
procedure ZeroMem32(P: Pointer; Size: Integer);
// Size = number of dword elements to fill
// assumes that Size > 4
asm
  push edi
  mov  ecx,edx
  xor  edx,edx
  mov  dword ptr [eax],edx    // zero the first 16 bytes unconditionally
  mov  dword ptr [eax+4],edx
  mov  dword ptr [eax+8],edx
  mov  dword ptr [eax+12],edx
  mov  edx,eax
  add  edx,15
  and  edx,-16                // round P up to a 16-byte boundary
  mov  edi,edx
  sub  edx,eax
  shr  edx,2                  // dwords skipped by the alignment
  sub  ecx,edx
  xor  eax,eax
  rep  stosd                  // zero the aligned remainder
  pop  edi
end;
The move version is very similar. The Dest pointer is the one aligned:
procedure MoveMem32(Src, Dest: Pointer; Size: Integer);
// Size = number of dword elements to move
// assumes that Size > 4
asm
  push edi
  push esi
  push ebx
  mov  ebx,[eax]              // copy the first 16 bytes from Src to Dest
  mov  [edx],ebx
  mov  ebx,[eax+4]
  mov  [edx+4],ebx
  mov  ebx,[eax+8]
  mov  [edx+8],ebx
  mov  ebx,[eax+12]
  mov  [edx+12],ebx
  mov  ebx,edx
  add  ebx,15
  and  ebx,-16                // round Dest up to a 16-byte boundary
  mov  edi,ebx
  sub  ebx,edx
  shr  ebx,2                  // dwords skipped by the alignment
  sub  ecx,ebx
  lea  esi,[eax+4*ebx]        // advance Src by the same amount
  rep  movsd                  // copy the aligned remainder
  pop  ebx
  pop  esi
  pop  edi
end;
Global Data revisited
There is a case where using a global structure is especially advantageous,
namely two-dimensional (or rather double-indexed) arrays with elements
of a simple type and where access is non-sequential for both indices.
By making the data structure global, both indices can be applied simultaneously
to access the structure, thereby avoiding additional instructions
to combine the indices.
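A small sketch (the table is invented): because the base address of a global array is a compile-time constant, both indices can be folded into a single addressing computation:

var
  Table: array[0..15, 0..15] of Byte;  // global lookup table

function LookUp(Row, Col: Integer): Byte;
begin
  Result := Table[Row, Col];  // the base is a constant; no register holds it
end;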
While loops revisited
An additional register-saving technique for While
loops operating on arrays is to shift all references
to the index variable so that you count from a negative number
up toward zero. This frees the register that would otherwise
have been needed to hold the iteration count. For example:
i := 0;
while i < Count do
begin
  Data[i] := Data[i] + 1;
  Inc(i);
end;

becomes:

type
  TRef = array[0..0] of TheSameThingAsData;
  PRef = ^TRef;
var
  Ref: PRef;
...
// Note: the negative index requires range checking off ($R-).
Ref := @Data[Count];
i := -Count;       // assign NEGATIVE count here
while i < 0 do     // and count UP to zero
begin
  Ref[i] := Ref[i] + 1;
  Inc(i);
end;
Pointer variables revisited
In addition to the reduced dereferencing discussed above, using a
pointer variable can also serve to increase the "priority"
of an already existing pointer variable. In the example shown below,
taken from a Sub-String replacement routine,
the PChar variable pSub1 was being reloaded within the
loop (this could be seen in the CPU Window). By assigning it to pTemp
and then using pTemp within the loop, the loading was
shifted outside the loop, saving instruction cycles.
pTemp := pSub1;   // increases the "priority" of pSub1
while iStr[k] = pTemp[k] do
  Inc(k);
Avoid checking method pointers with Assigned
Checking method pointers for nil is a common operation, typically
associated with calling events. Unfortunately, if Assigned(FSomeEvent)
then ... does a 16-bit compare of the high word of the code
address of the method pointer. This is rather odd and completely
unnecessary, and I can only guess that it is some sort of holdover
from 16-bit Delphi 1. The workaround is to check the code address directly:
if Assigned(TMethod(FSomeEvent).Code) then .... This
is a bit ugly, so you may only want to use it in particularly
time-critical sections.
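A sketch of the workaround (the event name is invented; TMethod is declared in SysUtils):

type
  TNotifyProc = procedure of object;
var
  FOnChange: TNotifyProc;
...
if Assigned(TMethod(FOnChange).Code) then  // full 32-bit test of the code address
  FOnChange;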
Controlling the size of enumerated types
If you use enumerated types (such as TSuits=(Diamonds,Hearts,Clubs,Spades)
) include the {$MinEnumSize 4} (or {$Z4} )
directive to force all enum variables to be 32-bit. If you have compatibility
issues you can simply turn it on for the type declarations of interest.
For instance:
type
  {$Z4}
  TSuits = (hearts, clubs, diamonds, spades);
  {$Z1}
Utilization of this directive is especially effective for enumerated
types with more than 256 elements, which otherwise result in word-sized
variables; word-sized variables are quite slow.
Virtual methods
It should not be surprising that virtual methods incur more overhead
than static methods. Calling a virtual method requires two pointer
dereferences and an indirect call which, if the method has a couple of
parameters, approximately doubles the total call overhead. However,
there is the potential for much more severe penalties. The indirect
call can suffer from what amounts to branch misprediction, which is
a fairly stiff penalty on Pentium II processors. The penalty is incurred
each time the target of the call changes. Thus, calling a virtual method
within a loop where the method might change on every iteration could
incur a substantial number of penalties. The workaround is essentially
to sort the method calls.
Example:
TBaseClass = class
public
  procedure VMethod; virtual;
  procedure SMethod;
end;

TDerivedClass = class(TBaseClass)
  procedure VMethod; override;
end;

TDerived2Class = class(TBaseClass)
  procedure VMethod; override;
end;

implementation

type
  TArray = array[0..100] of TBaseClass;

procedure DoStuff;
var
  b: Integer;
  j: Integer;
  A: TArray;
begin
  A[0] := TBaseClass.Create;
  b := 0;
  for j := 1 to 99 do
  begin
    b := (1 + Random(2) + b) mod 3;  // mix em up
    case b of
      0: A[j] := TBaseClass.Create;
      1: A[j] := TDerivedClass.Create;
      2: A[j] := TDerived2Class.Create;
    end;
  end;
  for j := 0 to 99 do
    A[j].VMethod;   // indirect calls whose target may change every iteration
  for j := 0 to 99 do
    A[j].SMethod;   // static calls for comparison
end;
Sorting the calls is somewhat complicated; an example is shown below:
type
  TSomeVirtualMethod = procedure of object;
  TSomeMethodArray = array[0..100] of TSomeVirtualMethod;
var
  SomeMethodArray: TSomeMethodArray;

// Initialization pass
for i := 0 to Count - 1 do
  SomeMethodArray[i] := Item[i].SomeVirtualMethod;

// Do-something passes
for i := 0 to Count - 1 do
  SomeMethodArray[i];
This, by itself, saves an underwhelming 1 clock cycle per call,
but you can sort the array by the code that's called (using TMethod)
to minimize the oh-so-painful branch prediction failures that can really
dominate this kind of method calling. Additionally, if the base class
method is some sort of do nothing routine it could be eliminated from
the procedure list entirely.
// initialization pass
for i := 0 to Count - 1 do
begin
  Hold := ClassArray[i].SomeVirtualMethod;
  // skip the do-nothing base implementation entirely
  if TMethod(Hold).Code <> @TBaseClass.SomeVirtualMethod then
  begin
    // insertion sort by code address, so identical targets end up adjacent
    j := 0;
    while (j < ArrayCount) and
      (Longint(TMethod(Hold).Code) < Longint(TMethod(SomeMethodArray[j]).Code)) do
      Inc(j);
    for k := ArrayCount - 1 downto j do
      SomeMethodArray[k + 1] := SomeMethodArray[k];
    SomeMethodArray[j] := Hold;
    Inc(ArrayCount);
  end;
end;
This obviously isn't for the faint of heart, and is only useful
in certain situations, but it can be a big time saver in cases where
there are many objects but only a few versions of the method, and
the method is relatively small or empty.