First look at XE2 floating point performance

With XE2 now officially out, it’s time for a first look at Delphi XE2 compiler floating point performance (see previous episode).

For a first look I’ll reuse a Mandelbrot benchmark, based on the code from “Mandelbrot Set in HTML 5 Canvas”. It tests basic double-precision floating-point operations (add, sub, mult) in a tight loop; there is relatively little in the way of memory accesses (or shouldn’t be, to be more accurate).
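For readers who don’t want to dig through the source, here is a minimal sketch of the kind of loop being timed. Only the five assignments inside the repeat loop are taken from the benchmark (they show up in the CPU view listings); the program name, the MandelIterations helper, the constants (iteration cap, bailout, view window) and the output array are assumptions made for illustration, not the benchmark’s actual code.

```pascal
program MandelSketch;

{$APPTYPE CONSOLE}
{$O+} // force optimizations on from the source, as done for this article

const
  cSize = 480;      // the 480 x 480 case
  cMaxIter = 255;   // assumed iteration cap
  cBailout = 4.0;   // assumed escape threshold on x*x + y*y

var
  // results go to a plain array (direct memory access, no Canvas.Pixels[])
  Iterations: array[0..cSize - 1, 0..cSize - 1] of Integer;

// Hypothetical helper: escape-time iteration count for the point (p, q).
function MandelIterations(const p, q: Double): Integer;
var
  x, y, x0, y0, r: Double;
begin
  Result := 0;
  x0 := 0;
  y0 := 0;
  repeat
    x := x0 * x0 - y0 * y0 + p;
    y := 2 * x0 * y0 + q;
    x0 := x;
    y0 := y;
    r := x * x + y * y;
    Inc(Result);
  until (r > cBailout) or (Result >= cMaxIter);
end;

var
  i, j: Integer;
  p, q: Double;
begin
  // assumed view window: [-2.0, 1.0] x [-1.5, 1.5]
  for i := 0 to cSize - 1 do
  begin
    p := -2.0 + i * (3.0 / cSize);
    for j := 0 to cSize - 1 do
    begin
      q := -1.5 + j * (3.0 / cSize);
      Iterations[i, j] := MandelIterations(p, q);
    end;
  end;
  Writeln(Iterations[cSize div 2, cSize div 2]);
end.
```

The loop body is nothing but adds, subs and mults on doubles, with one integer increment and one array store per point, which is why the quality of the floating-point codegen dominates the timings.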

You can find the source code there; it compiles pretty much straight away in XE2 (just comment out the asm for Win64).

NOTE: when this article was originally posted, I had stumbled upon an XE2 Trial version “trap” (or feature?) which basically deactivated the Win64 optimizations defined through the project options. Kenji Matumoto pointed out the issue, and this is an updated article where I used {$O+} in the code to “force” optimizations. The outcome is a *much* prettier picture, I’m happy to say! The reservations from the initial article are gone, good job Embarcadero!

Edit 05/09: after further tests, I’m adding one reservation: single-precision floating point doesn’t look so hot. More on the subject there.

Benchmark results

Without further ado, here are the raw figures on my machine for the 480 x 480 case. Keep in mind the Delphi versions do NOT use Canvas.Pixels[], but direct memory access in an array:

Execution time in milliseconds, lower is better

Or if you prefer hard figures:

Delphi XE2 – 32 bits: 193 ms
Delphi XE2 – 64 bits: 67 ms — fastest Delphi
Delphi XE: 196 ms
FireFox 6: 121 ms
Chrome 13: 74 ms
(out of competition: XE 32bit hand-made assembly: 57 ms)

So what gives?

The XE2 32bit compiler still uses the old FPU code; the performance delta with XE is minimal and could just be an alignment issue (pseudo-random, since the compiler doesn’t pro-actively align). Let’s hope the SSE2 codegen will be retrofitted in XE3.
The XE2 64bit compiler gets a nice boost from using SSE2, allowing it to catch up with and overtake all the JavaScript JITters.
Chrome V8 makes a good showing in this benchmark but loses the crown: native Delphi is back on top!

A peek under the hood

What does the compiler generate for the two following lines?

x := x0 * x0 - y0 * y0 + p;
y := 2 * x0 * y0 + q;

Once you pop up the CPU view, you’ll see:

FMandelTest.pas.193: x := x0 * x0 - y0 * y0 + p;
00000000005A1452 660F28C4         movapd xmm0,xmm4
00000000005A1456 F20F59C4         mulsd xmm0,xmm4
00000000005A145A 660F28CD         movapd xmm1,xmm5
00000000005A145E F20F59CD         mulsd xmm1,xmm5
00000000005A1462 F20F5CC1         subsd xmm0,xmm1
00000000005A1466 F20F58C2         addsd xmm0,xmm2
FMandelTest.pas.194: y := 2 * x0 * y0 + q;
00000000005A146A 660F28CC         movapd xmm1,xmm4
00000000005A146E F20F590DA2000000 mulsd xmm1,qword ptr [rel $000000a2]
00000000005A1476 F20F59CD         mulsd xmm1,xmm5
00000000005A147A F20F58CB         addsd xmm1,xmm3

And further down the code, the compiler makes use of xmm8, so it’s really aware of the 16 xmm registers you have in x86-64, and finally keeps floating point values in registers, something the 32bit compilers (both XE & XE2) don’t do.

So where does it lose out to the hand-made asm version? Well, a handful of minor things:

even though it used up to 9 xmm registers, it didn’t use a 10th, leaving some memory access
with more careful allocation, it could have fit everything in 8 xmm registers, which would have cut unnecessary traffic
it zeroes registers with a move from memory, and didn’t do constant unification or propagation.

Still, those are mostly nitpicks compared to the massive issues of the old FPU code generation (which, alas, XE2 Win32 still suffers from).
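As an illustration of how small these nitpicks are, take the memory operand used for the constant 2 in the listing above. A source-level way to sidestep it would be to double by addition instead, as in this sketch. The program and function names are hypothetical, and whether the XE2 compiler actually emits better code for the second form is an assumption I have not measured.

```pascal
program ConstOperandSketch;

{$APPTYPE CONSOLE}
{$O+}

// Hypothetical variant 1: the benchmark's line as written.
function StepYConst(const x0, y0, q: Double): Double;
begin
  Result := 2 * x0 * y0 + q; // the constant 2 is loaded from memory in the listing
end;

// Hypothetical variant 2: doubling by addition, no constant involved.
function StepYAdd(const x0, y0, q: Double): Double;
begin
  Result := (x0 + x0) * y0 + q;
end;

begin
  // 2 * x0 and x0 + x0 are both exact doublings,
  // so the two variants should agree bit for bit.
  Writeln(StepYConst(0.3, -0.7, 0.1));
  Writeln(StepYAdd(0.3, -0.7, 0.1));
end.
```

Both forms compute the same value, so any difference would show up only in the generated code, which is worth checking in the CPU view before relying on it.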

Conclusion

Support for SSE2 in the XE2 64bit compiler is a significant step forward for Delphi floating point performance. XE2 32bit is still the same old story.

If you’re doing heavy floating point maths, XE2 64bit compiler is a simple ticket to much better performance.

Hopefully Delphi XE3 will retrofit the SSE2 codegen into the 32bit compiler. In the meantime, these results should quell the “we don’t need no 64bit” criticism: if you do any significant floating-point maths, Delphi XE2 64bit is a must!

20 thoughts on “First look at XE2 floating point performance”

Nice benchmark!
Actually I hope they will optimize the compiler as a fix in XE2 too (XE2 64bit compiler is just working, and not optimizing yet, as you can see…)

And what if you optimize by hand? (with asm)

@André
Added the hand-made ASM timing (still not optimal though), in 32bit, it’s 57ms here, or 30% faster than Chrome, twice as fast as XE2’s SSE.

“Performance is a feature”

You are showing the defect and the solution. Good article.

Did you turn on the compiler’s optimization flag?

The following dump is my result with optimize ON.

FMandelTest.pas.191: x := x0 * x0 - y0 * y0 + p;
000000000056F012 660F28C4 movapd xmm0,xmm4
000000000056F016 F20F59C4 mulsd xmm0,xmm4
000000000056F01A 660F28CD movapd xmm1,xmm5
000000000056F01E F20F59CD mulsd xmm1,xmm5
000000000056F022 F20F5CC1 subsd xmm0,xmm1
000000000056F026 F20F58C2 addsd xmm0,xmm2
FMandelTest.pas.192: y := 2 * x0 * y0 + q;
000000000056F02A 660F28CC movapd xmm1,xmm4
000000000056F02E F20F590DA2000000 mulsd xmm1,qword ptr [rel $000000a2]
000000000056F036 F20F59CD mulsd xmm1,xmm5
000000000056F03A F20F58CB addsd xmm1,xmm3
FMandelTest.pas.193: x0 := x;
000000000056F03E 660F29C4 movapd xmm4,xmm0
FMandelTest.pas.194: y0 := y;
000000000056F042 660F29CD movapd xmm5,xmm1
FMandelTest.pas.195: r := x * x + y * y;
000000000056F046 66440F28C0 movapd xmm8,xmm0
000000000056F04B F2440F59C0 mulsd xmm8,xmm0
000000000056F050 660F28C1 movapd xmm0,xmm1
000000000056F054 F20F59C1 mulsd xmm0,xmm1
000000000056F058 F2440F58C0 addsd xmm8,xmm0
000000000056F05D 66440F29C0 movapd xmm0,xmm8

Weird, I’ve added a notice, but I just can’t get the compiler to optimize, I’ve rebuilt, checked, unchecked, it still comes out the same here.

I’m on XE2 Trial, is that a bug of that version?

Hi, Eric.
I am using “real” Architect version, not Trial.
Hmm, Does Trial version disable optimization?

@Kenji Matumoto
Weird, I can get the Win32 bit code optimization turned on & off, and it works on the generated output. Win64 is the problematic one.

Try adding {$O+} in your source code directly, like:

>{$O+}
>procedure TForm20.ComputeMandelDelphi;

Also try to “Clean up” your project.

It would be nice to report this issue to official Embarcadero forum.

@Kenji Matumoto
With {$O+}, optimizations get turned on, but still can’t get code optimized without it.

I posted in non-tech, it would be nice to know if that {$O+} is a workaround against a Trial limitation, or if there is a random bug that could show up in regular versions as well.

It seems that for the 32-bit version EMB tries to retain compatibility with old processors, as even the Pentium MMX meets the minimal requirements for Win2000 and WinXP, and it lacks SSE support.

@Eric
The new result is good! Yes, good job Embarcadero, good job Eric!

How fast is a handtuned BASM version using oldstyle FPU code (non-sse)?

Regards
Dennis

This is reason enough for me to use XE2 64 bit. Finally we have got some decent floating point performance!

ohh, I was waiting for this post so much!

Eric, it might be worth putting up a note for early readers of this post: they should force a refresh of the page to get the bar chart updated with the new 64bit values.

@François
Good point, I’ve now changed the image, a refresh shouldn’t be required anymore.

“XE2 32bit compiler still uses the old FPU code, the performance delta with XE is minimal and could just be an alignment issue (pseudo-random, since the compiler doesn’t pro-actively align). Let’s hope the SSE2 codegen will be retrofitted in XE3.”

There is no way embarcadero will do any optimizations for the 32bit compiler now.

@JED – Hopefully the huge gulf between 64 bit and 32 bit along with the embarrassing (IMO) comparisons to JavaScript will shame them into updating the 32 bit performance. It’s been harder and harder over the years to justify using Delphi, who needs anyone showing management that JavaScript is faster than Delphi. There are still BOATLOADS of 32 bit servers out there that could really use better floating point performance for our server side processes.

+1 for the plea to improve the 32-bit compiler
They don’t even need to generate SSE instructions, just make sure 8 byte values are properly aligned. That way it’s not so hard to code some inline SSE assembly by hand.

oh 64bit performance is exciting
i’m waiting 4 dobai tour to see other good news


DelphiTools

DWS, Profiler and other Pascal tools
