Style Guidelines
Use 32 bit variables whenever possible
In 32 bit code, such as generated by Delphi 2 and later, things
just get better when the values being manipulated have a size of
32 bits. 16 bit variables (Word, ShortInt, WideChar) are especially
slow as they require the processor to temporarily slip into 16 bit
mode to work with them. This can double the time it takes to work
with these values. 8 bit variables (Byte, SmallInt, Char) are not
as bad, especially, if you do not mix their usage with 32 bit values.
However, they can still cause the inclusion of additional instructions
in order to zero out the rest of the 32 bit register.
If you must use a smaller type for compatibility, convert it to
32 bit as soon as possible, and back to the smaller size (if necessary)
just prior when it is needed. You do this simply by assigning it
to a 32 variable.
Avoid ordinal subranges
One of the advantages of Pascal has traditionally been its strong
typing, and so the ability to create special subrange types and
enumerations has been part of this. Unfortunately, subranges and
enumerations can cause trouble when attempting to optimize for performance.
The problem lies in the fact that the underlying variable type choosen
hold a subrange or enumeration variable is based on the size of
the subrange. For example, enumerations with less than 256 elements
or subranges with boundary values ranging between 0 and 255 will
be stored as byte. This can lead to trouble in that the underlying
variable size may not be handled as efficiently. For instance, consider
the following subrange:
type
TYear=1900-2000;
Variables of type TYear will saved as 16bit quantities. As already
discussed, 16bit variables are particularly slow.
Optimization Techniques
Play around with adding temporary variables to split up complex
expressions
Typically, cramming everything into a single expression is the
best way to optimize, but not always. At some point, the expression
will become so complex that the compiler will be forced to break
it up on its own. But frequently you can do a better job than the
compiler. Try it!
Integer multiplication
Prior to the Pentium II, integer multiplication was quite expensive.
With the arrival of the Pentium II however, integer multiplication
has dropped down to the same one-cycle execution time as most other
instructions. Additionally, the compiler will avoid doing multiplication
by a constant if the same can be accomplished by utilizing addition,
shifting and the lea instruction (mentioned below). Thus, you need
to take your target processor into account when choosing whether
to use multiplication or use some other equivalent method.
Comparison of a variable against multiple ordinal constants
This topic sounds heavy but only boils down to statements like
these:
if (x > = 0) and (x < = 10) then
DoSomething;
if (((c > = 'a') and (c < = 'z')) or
((c > = '0') and (c < = '9'))) then
DoSomething;
In each case, there is only a single variable and it is compared
to multiple constants. It may be slightly more efficient and arguably
clearer when the above code is expressed as:
if x in [0..10] then
DoSomething;
if c in ['0'..'9', 'a'..'z'] then
DoSomething;
The improvement in efficiency depends upon the likelihood of, for
instance x bein within the range versus being out of the range.
If it is more likely to be within the range then the set notation
is better. Efficiency of the set notation increases as the number
of sub-ranges increases. However, there is an inherent limitation
on sets that limit them to 256 elements. This restricts this usage
to values between 0 and 255 for integer types. For the full range
of integers you can use this notation:
case x of
0..10: DoSomething;
end;
case c of
'a'..'z',
'0'..'9' : DoSomething;
end;
which produces equivalent code, but is not as elegant.
Advanced note: This operation may use an additional CPU register.
movzx vs xor/mov
A common requirement is to load values smaller than 32 bits in
to a register. Since they do not overwrite the entire register it
is necessary to zero out the register first. Alternatively, you
can use the built in instruction movzx (move with zero extend).
On Pentiums and before this instruction was slower than using xor
reg,reg/mov reg,{value}. However, The PII has streamlined this instruction
so that now it is prefered over the xor/mov combination. Note that
the compiler chooses between these two options based on a set of
rules that is apparently fairly complicated, as I have yet to figure
them out.
Utilizing the LEA assembler instruction
There is an assembler instruction called LEA (Load
Effective Address) that can do
a couple operations at once. The only way to consciously take advantage
of this instruction in Delphi is with array notation. For it to
be fully effective, the array variable itself must be used again
after the desired "trick" location. For example the following
snippet is from a routine that calculates
the length of a Pchar (StrLen) string (i.e. the position of
the first #0 character). A total of four characters are
processed at a time. Notice the usage of q in the calculation of
r2.
function StrLenPas(tStr: PChar): integer;
var
p: ^Cardinal;
q: PChar;
bytes, r1, r2: Cardinal;
begin
...
q := PChar(p^); // load 4 characters into q
r2 := Cardinal(@q[-$01010101]); // subtract 1 from each char (utilizing LEA)
r1 := Cardinal(q) xor $80808080; // check top bit (q must be used again)
bytes := r1 and r2; // distinguish between chars>127 and zero.
inc(p);
...
end;
Performance of large integer types
If you need to work with integers larger than can fit in longint,
you have a couple of options (int64 , comp ,
double , and extended ). Three of these
types are actually floating point types. Consequently they are not
completely interchangeable. Only the relatively new int64
is completely handled as an integer. comp is sort of
a hybrid in that it is stored as an 8 byte integer (the same as
int64 ) but all operations are performed as floating
point. Borland has officially designated comp as obsolete,
and instead favors int64 . However, as can be seen in
the following table, comp enjoys a substantial performance
lead in some circumstances. Extended and double
can also be used to operate on large intergers although care must
be taken to ensure that the lack of periodic rounding doesn't accumulate
to the point of changing the answer. Shown below are the measured
CPU Pentium II cycles for each operation with random values in the
range (0 < x < 2^63). "Ovhd" refers to the overhed
associated with making a function call and assignment with this
type. LongInt is included for comparison purposes only.
ovhd add mult div
Longint 2 1 1 4.7
Comp 40 4.3 4.4 34
int64 19 2.6 26.2 804
double 25 3.1 1.3 35.8
extended 43 4.1 3.2 34.4
Note: that the out-of-order execution capabilities of the Pentium
II make precise timing measurements nearly impossible for individual
operations. Consequently, the cycle counts shown above should only
be considered approximate.
Aside from the horrid division performance for int64
there is no obvious best choice. Of the three Floating point based
types double is best but it actually has slightly fewer
digits (15 vs 19). int64 is better for addition than
comp but worse otherwise.
So what to do. Well the best answer is to blend a bit. Use int64
as the base type. Division is easily handled by using trunc(Int64A/Int64B)
instead of Int64A div Int64B . Getting the best performance
is somewhat more complicated. Since comp and int64
have the same format, converting between formats is free. Using
this you can force what would have been an integer based multiplication
to a floating point one. This is shown below:
var
A,B,C,D:int64;
CA:comp absolute A;
CC:comp absolute C;
begin
// Result:=A*B*C*D; // Original expression
Result:=round(CA*B*C*D); // blended version
end;
The usage of CA above forces the calculation to be
done in floating point. Note that only one floating point type is
needed to force the entire term into floating point. However, if
there are multiple terms they each need a floating point variable:
Result:=round(CA*B+CC*D) . Also note that round
is used instead of trunc as was used in division. while
it would be possible to convert back into integer within an expression,
it will typically not result in a speed increase unless two or three
additions can be done in for each round .
|