Archive

Archive for the ‘Tips’ Category

A look at the 3D side of FireMonkey

October 6th, 2011

This post was actually written sometime ago, alas XE2 Update 1 didn’t change much.

I’ve been looking at FireMonkey 3D side, by that I mean strictly the 3D side, not the UI components, or the 2D. Here are some observations, most born from maintaining and developing 3D software in C++ and later with GLScene, and with an eye to eventually porting some of GLScene code to FireMonkey (after all, most of GLScene’s code is actually linear algebra stuff, mesh manipulations, file format imports, etc. and not OpenGL-specific).

Note that everything below is fixable, most of it quite trivially, but Embarcadero has to lead by baking it into the framework, in the core classes & components. If they don’t and you implement it yourself, you’ll end up having to duplicate huge portions of the framework. Such a duplication will be a PITA when they realize they need it in the framework, and implement it… in a fashion that will likely be incompatible with yours.

Also note that implementing some of these will require either interface-breaking changes, still possible now IMHO, as FMX is still young, or hacks/workarounds later on (what happened with GLScene, might as well avoid that if they can).

Scope

FireMonkey is officially pitched at “Business” 3D (as opposed to 3D games), which isn’t that far from what GLScene is used for, as even if GLScene is used for gaming purpose, its bread and butter was as much business as it was gaming (cf. the galleries here and here).

Assuming we’re restricting the scope to real-time rendering engines, what differentiates a game engine for a 3D engine? In terms of pure functionality and capability, there is little specific, Unreal Engine f.i. encompasses a broad range of visualization and UI applications, the most differentiating factor is what the engine processes:

  • A 3D game engine typically sits at the end of an assets tool-chain, and handles “ready to use” meshes, textures, shaders and other assets. The tool-chain is supposed to pre-compile and prepare everything, so that the game engine only has to deal with the rendering and interactivity.
  • A business 3D engine on the other end sits at a higher level, it has to handle raw assets, which come out of simulations, data crunching, image libraries, etc. and do what’s needed to render them in robust, quality fashion, while handling gracefully a variety of situations and corner cases.

In the case of FireMonkey, target platforms are mobile devices, iPad and business machine GPUs: all these are rather low-end hardware, in terms of capability, performance and available memory. In other words, FMX can’t rely on having a powerful GPU with plenty of super-fast video RAM, but rather has to deal with paltry integrated chipsets which share RAM with the CPU.

Scene Graph

Like GLScene, FMX is based on a scene-graph, and more precisely a variant with roots and concepts from the ancient 3DStudio (DOS era), you can see cut down versions of them in FMX: the Camera/Target approach and the Dummy being the most obvious. Similar concepts exist in other scene graphs, but typically with different terminology and slightly different (cf. Ogre, Blender, OSG…).

The FMX scene-graph was however simplified/crippled in several ways:

  • the primary design-time orientation is absolute angles, that’s problematic because rotations aren’t commutative in 3D space, and there are such things as gimbal lock to cause trouble. If relative rotations are practical beyond simple demos, for real world use, absolute angles are not, you need well defined orientation, which means vectors (ie. matrix) or quaternions.
  • the camera model has been over-simplified, leaving out such key aspects as field of view, depth of view and near plane bias. While the first two are key for obvious reasons, the near plane bias is just as important. Because of the maths behind the depth buffer, it is the single most governing factor to numerical accuracy of the depth buffer (and minimize artifacts known as Z-fighting).
  • the scene graph is rendered hierarchically (see below).

The absolute angles orientation existed in GLScene from very early on, and over the years, grew to become a major source of frustration for users, ending up with tutorials dedicated to explaining why they were frustrating, and why you should move away from them.
Unless you’re an airman or accustomed to working in a roll/pitch/turn environment, rotations in 3D just won’t always behave as you expect they should. It was kind of a let down to see this mistake repeatedso prominently in FMX.

The hierarchical scene graph rendering was actually GLScene’s original approach, it’s one that feels quite simple and natural, but it also grew over the years to be a factor holding back the library, and had to be worked/hacked around in different ways. Once again, it was a disappointment to see that FMX was based on GLScene’s old approach.

A better solution is to separate the rendering from the scene graph, this is useful and even required in various scenarios:

  • required to render semi-transparent objects, all techniques for handling opacity require a non-pure -hierarchical rendering
  • facilitates handling of visibility culling, ie. not rendering what isn’t visible, because it’s off-camera, beyond the field of view, occluded by opaque objects, etc.
  • required for deferred shading and other multi-pass approaches (either for shading, shadow volume, etc.)

As a consequence, FireMonkey can’t even render scene graphs containing semi-transparent objects correctly if you don’t manually order them… That’s a major letdown.

Materials and Textures

This is another major roadblock, FMX is using approaches similar to the original GLScene one, that proved problematic and later had to be worked around.

The default material is very limited, and defined into the objects. A better approach (like later GLScene) would be to have them in a material library (which is to materials what an action list is to an action, it allows reuses and centralizes them). With Delphi XE2 property editors, this could have been done in a painless and convenient fashion.

The standard texture model is too limited too, not only are textures not shared (they should live in a library) they also lack basic properties. Sharing textures is key when you don’t have a lot of video memory, or when that memory is slow and not dedicated (like on an iPad, or a business PC). You also need at the very least to be able to control texture wrapping/clamping and texture filtering (mipmap generation and trilinear filtering at the very least).
The lack of mipmapping and anisotropic filtering have implications on performance and rendering quality, the lack of texture built-in texture compression support has performance implications. The COLLADA Viewer demo f.i. exhibits aliasing and pixel shimmering issues because of it.

Provision for 3D textures should also be made, those are almost useless in games, but useful in a business 3D engines (f.i. in medical visualization).

Shaders are for all practical purposes handled as if FMX was a pure game engine: you need pre-compiled shaders, you don’t have a unified cross-platform high-level shader compiler (like Cg) or generator component (writing shaders by hand gets old very fast, as it quickly becomes a combinatorial problem).

All in all, materials aren’t well supported by FMX at design-time, you’re left with having to write your own code to manage material libraries. That just isn’t what you would expect from a general purpose or business 3D engine.

Meshes

Once again, FMX goes for a limited approach similar to that of early GLScene versions: having a specific mesh object for each mesh format, and no standardized mesh format (well there is an embryo of one, but it’s too limited, and the COLLADA viewer skirts it f.i, the 3DS demo before it skirted it too). This is compounded by the scene-graph and materials limitations: the mesh object has to handle its own rendering and its own materials.

This is problematic because it means any form of advanced mesh-based algorithms have to be written against specific mesh objects. This impacts everything mesh-related from imports/exports, manipulations, optimizations, animations (skeletal animations, morphs, etc.) to rendering (extracting silhouettes, BSP, bounding boxes, occlusion etc.) to interaction (collision testing, etc.).

Cadencing

When doing animations, be it a simple following of a spline path or simulations, you need to refresh the display periodically, after having all the scene elements updated.

Typically that involves a so called “game timer”, which triggers at a fixed frequency (usually that of the display, or a fraction of it), along with a frame stepping/progression logic that can handle frame skipping (so that you don’t end up having the UI lag the user when the hardware can’t keep up). You also need a time reference, preferably global and not looping.

Well, FMX only has an embryo of such an architecture, and it is not pervasive. Also the cross-platform time reference (TPlatform.GetTick) returns a single precision floating-point value that can loop… double ouch. Might as well take the opportunity of a new framework to Do It Right.

In GLScene, cadencing wasn’t in initially, and adding it after the fact took time, especially to make it pervasive, pity FMX didn’t build it right into the framework.

Conclusion

All the above points are fixable, but they’re also fundamental missing aspects for doing 3D with FireMonkey, if you don’t want to replicate huge portions of the framework (cf. the COLLADA Viewer sample).

They mean that if you want to achieve anything beyond a few poorly texture objects, you’ll need to design and write a lot of custom code rather than rely on the framework… with obvious implications of obsolescence and compatibility issues whenever FMX finally gets the features in standard.

Tips ,

TdwsRttiConnector: read anything, write anything

October 3rd, 2011

…or to be more accurate, many things the Delphi RTTI can reach ;-)

Raw RTTI connectivity for DWS

The new TdwsRttiConnector components exposes a new RttiVariant type to DWScript, which can be used to dynamically read and write fields/properties, or call methods of any RTTI-exposed Delphi type. You can use it to update or “bind” component properties, with the full power of DWS to draw upon in case of need. For instance:

For instance:

var f := ConnectForm('Form1');
f.Label1.Caption := 'Hello ' + f.Edit1.Text;

with ConnectForm() being a function that allows connecting to any TForm registered in the main TApplication, its implementation looks like

var
   c : TComponent;
begin
   c:=Application.FindComponent(Info.ParamAsString[0]);
   if not (c is TForm) then
      Info.ResultAsVariant:=Null
   else Info.ResultAsVariant:=TdwsRTTIVariant.FromObject(c);
end;

So essentially, to expose any instance as an RttiVariant, you just have to pass the result  TdwsRTTIVariant.FromObject() to a script, be it via as function result, a TdwsUnit-based instance or variable, directly via IInfo, etc. choose your poison.

This offers a very lightweight approach to exposing instances to script, the drawbacks being of course that everything is resolved at runtime (no compile-time checks), and that everything goes through RTTI, which limits performance vs “heavier” forms of class and instance exposure in DWScript, and of course, breaks sandboxing.

Cooked RTTI Connectivity

To get some compile-time checks and leverage strong typing, you can however use the new DWS Connector qualifiers, which look like generics, f.i. if you were to go for a strict version of the above sample:

var f := RttiVariant<Forms.TForm> := ConnectForm('Form1');
var lbl := RttiVariant<StdCtrls.TLabel> := f.Label1;
var edt := RttiVariant<StdCtrls.TEdit> := f.Edit1;
lbl.Caption := 'Hello ' + edt.Text;

In that updated version, the script will be type-checked at compile-time, the only dynamic checks remaining being the binding one (to connect ‘Form1′ by name), but you could choose to remove it by exposing the instance directly to the script. All the rest is type-checked, and if for instance a user mistyped and extra ‘s’:

lbl.Captions := 'Hello ' + edt.Text;

then he would get a compile error about TLabel not having a Captions property. In the unqualified version, that would be a runtime error upon executing the offending line.

This being scripts, it still means you’ll have to type-check at runtime, but you can compile all the scripts present in the application to check for errors, the actual forms and components do not have to be up and running!

News, Tips ,

Taming the Chrome Web Store

September 30th, 2011
Comments Off

Chromium LogoWell, “taming” is probably too ambitious given the jungle that the Chrome Web Store is, especially as this post is restricted to publishing a standalone DWScript/JavaScript app into the Web Store in a few simple steps. Interestingly enough, it seems that publishing Metro apps for Windows 8 will follow a similar process, according to the developer preview.

I’ll use the Flock Demo as an illustration, and turn it into a packaged app, that lives in Chrome and can be access off-line.

Bare minimum requirements

First prepare a folder for the app, in that folder you’ll need all your sources and resources (html, js, images, etc.), in our case, that’s just

  • flock.htm
  • jquery.js

For more complex cases, you are allowed to have sub-folders to neatly arrange everything.

The extra required Chrome package files are, at a bare minimum:

  • manifest.json
  • 16×16 icon (PNG format)
  • 128×128 icon (PNG format)

The manifest.json I used is the following

{
  "name": "DWScript Flock Demo",
  "version": "1.5",
  "description": "Interactive Canvas demonstration for DWScript",
  "app": {
    "launch": {
      "local_path": "flock.htm"
    }
  },
  "icons": {
    "16": "icon_16.png",
    "128": "icon_128.png"
  }
}

Note that the specs for manifest.json are rather “raw”, and if you get something wrong, you’ll likely fail your submission, and each failed attempt, if it’s not caught during upload, will be result in an error at install time. When you re-upload you need to specify a new version number (that’s why it’s at 1.5 in the above snippet!).

Testing and publishing the packaged app

To minimize issues, you can test your app as an unpackaged app in Chrome, go to tools/extensions, then activate the “developer mode”. You’ll then be able to specify the folder in which you placed your app. This also allows to run everything with the developer tools (debugger, etc.).

Once you’re confident enough, you can upload to the Web Store, for that just zip your folder and submit it. The zip can hold sub-folders, but make sure your manifest is at the root of the zip. Be aware there is $5 one-time, lifetime fee per Web Store developer account (not per app).

Then you’ll have the choice of publishing to testers or to end users, that choice is currently final, ie. if you publish to testers, you’ll have to create another web store entry for the user version. Note that there is a small delay between publishing an update, and the update being available for installation.

Finally, all that’s left is a bit of marketing, preparing a screen-shot or two, a YouTube video, a promotional tile, etc. If you’re on a budget, you can use CamStudio to grab the video.

See DWScript Flock Demo on the Chrome Web Store.

As a packaged app, it’ll be available in Chrome from the “Applications” bar of a new tab (Ctrl+T), you can also view them in the context menu.

Tips , ,

Delphi XE2-64bit: bottleneck in trigonometric functions?

September 22nd, 2011

Taylor Series and Angle Reduction

In Delphi XE2 64bit, SSE2 is used to compute the trigonometric functions (cos, sin, etc.), and they are computed through what looks like Taylor series (with double-precision literals being coded in hexadecimal, likely to minimize compiler precision issues).

However Taylor series only work for small values, so when you have a large angle value, it has to be reduced in the 0 .. 2PI range, which typically involves a form of floating-point Euclidian division or exponent reduction. For typical SSE2 implementations, this means that computing a trigonometric function for a high angle value is slower, as this reduction has to be performed, typically, you’re looking at something like a 25% slowdown tops.

Bottleneck

That said, here comes iga2iga2 (in the comments) and Ville Krumlinde, which both noticed to a performance issue in Delphi XE2 64bit, especially when facing other compilers. In XE2 64 bit, the reduction is performed through a loop and a fixed-step reduction, which means that the greater the angle value, the slower it gets.

Here are some timings on a sin/cos benchmark (hundreds of thousandths of calls):

Angle value XE2-32 XE2-64
1.0 112 ms 86 ms
100 113 ms 125 ms
1e7 114 ms 3700 ms
1e14 128 ms 7600 ms

Choices, Choices, Choices

But… timing isn’t everything, when computing trigonometry for very large angles, you quickly run into numerical precision issues, and then, you basically have three options:

  • just give up, that’s actually what the FPU does in 32bits, f.i. look at the value of Sin(1e22) in Delphi 32bit, it’s… 1e22. Which is obviously not a valid sine value! And you’ve been living with that potential issue for all your 32bit life…
  • spit out something, anything, under the assumption that if the user went for such an angle, it was garbage, so garbage in, garbage out, no one will notice it, you didn’t see me do it… you can’t prove anything anyway!
  • try to be accurate, damn the timings, damn garbage in, damn the torpedoes, full precision ahead! That’s what XE2-64 is doing. I haven’t checked in details, but XE2 approach seem to be based on this approach: “argument reduction, for huge arguments: good to the last bit“, and it gets Sin(1e22) right.

Just try for Sin(1e22) in your favorite environment, the correct value is -0.8522, Delphi XE2 64bit Gets It Right, where other environments may just flash a bunch of random decimals to fool your eyes.

Update: as pointed by Daniel Bartlett in the comments, the AMD LibM library provides a much faster and similarly accurate implementation of sin/cos and other functions.

So, what gives?

If you’re after raw accuracy, you’ll have to pay for the extra execution cycles to avoid the garbage out. However, chances are, your code doesn’t have anywhere near the numerical accuracy to avoid garbage in, so no matter the precision in the reduction, you’ll still just get garbage out. And if your code was running in 32bit, chances are you had some huge garbage out already, due to the FPU giving up.

If you’re not after accuracy, f.i. if you’re just using sine/cosine for time-based animations, the extra computing precision may bite you, for no benefit, so you’re better off performing the reduction yourself, before calling sin/cos, using whatever low-precision implementation you wish.

In the long run, it might be preferable for Delphi to just adopt the GIGO approach, and keep the high precision implementations for a high precision maths library: in most situations, they won’t avoid GO because of GI, so it might be best to blend with the rest (in benchmarks).

Tips , ,

Taming HTML5 verlets with Object Pascal

September 19th, 2011

Click the image below for a little real-time verlet integration animation.

And the source code for that little bit of HTML5 animation is Object Pascal, Delphi Web Script flavor, and is posted below.

It shows off the inline method implementations (you can use classic implementation style, and if that code were in a unit, you would have to), along with the extended Variant syntax. If you look at the HTML source, it also shows off output from an early version of the obfuscator/minifier.

FireFox6 and IE9 handle it fairly well, but Chrome just laughs at it and taunts you to bring it on. Runs on all major mobile platform too, at a reduced framerate of course, but not in slide-show mode (and on Android, FireFox Mobile is king of the hill).

<meta http-equiv="X-UA-Compatible" content="IE=Edge"/>
<canvas id="canvas" width="300" height="400"></canvas>
<script>
<%pas2js

type 
   TVerletPoint = class
      private
         FX, FY : Float;
         FOldX, FOldY : Float;

      public
         procedure SetPosition(aX, aY : Float);
         begin
            FX := aX; FOldX := aX;
            FY := aY; FOldY := aY;
         end;

         constructor Create(aX, aY : Float);
         begin
            SetPosition(aX, aY);
         end;

         procedure Update;
         begin
            var dx := FX - FOldX;
            var dy := FY - FOldY;
            FOldX := FX;
            FOldY := FY;
            FX := FX + dx;
            FY := FY + dy;
         end;

         function DistanceTo(p : TVerletPoint) : Float;
         begin
            Result := Sqrt(Sqr(p.FX-FX)+Sqr(p.FY-FY));
         end;

         property X : Float read FX write FX;
         property Y : Float read FY write FY;
   end;

type
    TVerletStick = class
      private
         FDistance : Float;
         FA, FB : TVerletPoint;

      public
         constructor Create(pA, pB : TVerletPoint);
         begin
            FA := pA;
            FB := pB;
            FDistance := pA.DistanceTo(pB);
         end;

         procedure Apply;
         begin
            var dx := FA.X - FB.X;
            var dy := FA.Y - FB.Y;
            var d := Sqrt(dx*dx+dy*dy);
            var f := (d - FDistance) * 0.5 / d;
            FA.X := FA.X - dx * f;
            FB.X := FB.X + dx * f;
            FA.Y := FA.Y - dy * f;
            FB.Y := FB.Y + dy * f;
         end;

         property A : TVerletPoint read FA;
         property B : TVerletPoint read FB;
   end;

const cCOLS = 30;
const cROWS = 25;
const cSPACING = 10;
var points : array of TVerletPoint;
var sticks : array of TVerletStick;

procedure InitGrid;
var
   r, c : Integer;
   point : TVerletPoint;
begin
   for r:=0 to cROWS-1 do begin
      for c:=0 to cCOLS-1 do begin
         point:=new TVerletPoint(c*cSPACING, r*cSPACING);
         points.Add(point);
         if c>0 then
            sticks.Add(new TVerletStick(points[r*cCOLS+c-1], point));
         if r>0 then
            sticks.Add(new TVerletStick(point, points[(r-1)*cCOLS+c]));
      end;
   end;
end;

procedure UpdateGrid(canvas : Variant);
var
   i, j : Integer;
   stick : TVerletStick;
begin
   for i:=0 to points.High do begin
      points[i].Y := points[i].Y + 0.1; // gravity
      points[i].Update;
   end;
   for i:=1 to 3 do begin
      points[0].X := 0;
      points[0].Y := 0;
      points[cCOLS-1].X := (cCOLS-1)*cSPACING;
      points[cCOLS-1].Y := 0;
      for j:=0 to sticks.High do begin
         sticks[j].Apply;
      end;
   end;

   canvas.clearRect(0, 0, 300, 400);
   canvas.beginPath();
   for i:=0 to sticks.High do begin
      stick:=sticks[i];
      canvas.moveTo(stick.A.X, stick.A.Y);
      canvas.lineTo(stick.B.X, stick.B.Y);
   end;
   canvas.stroke();
end;

InitGrid();

procedure Animate(stepFunc : procedure(canvas : Variant)); external;

Animate(UpdateGrid);

%>

var canvas = document.getElementById("canvas");
var context = canvas.getContext('2d');
canvas.strokeStyle = "black";
canvas.lineWidth = 0.8;

function Animate(f) {
    setInterval(
        function () {
            f(context);
        },
        10);
}

</script>

Source loosely based on Alex Nino’s.

Tips , ,

What innocuous-looking unit tests can uncover…

August 17th, 2011
Comments Off

I’ve recently been adding DWScript snippets to Rosetta Code, using them as unit tests as well.

Quite a few of Rosetta Code’s tasks consist in mathematical tasks, and I was wondering, how many math tests do you really need?

Well, quite a few! While implementing the Lucas-Lehmer test, it ended up hitting the precision boundary quite sooner than it should theoretically had, given that DWScript’s Integer is actually a 64bit integer.

Some investigations in the CPU view later turned out that the Delphi compiler did not  generate the proper CPU instructions for Sqr() in the case of integers, which DWS was relying upon. Apparently this has been QC’ed many times since Delphi 5, but still exists to this day in Delphi XE. The issue is now worked around in the SVN.

Fixed for XE2? Let’s hope, there may still be time…

 

 

Tips , ,

A Fistful of TMonitors

May 31st, 2011

…or why you can’t hide under the complexity carpet ;-)

As uncovered in previous episodes, one of the keys behind TMonitor performance issues is that it allocates a dynamic block of memory for its locking purposes, and when those blocks end up allocated on the same CPU cache line, the two TMonitor on the same cache line will end up fighting for the cache line, resulting in a drastic drop of performance and thread contention. The technical term for that behavior is false sharing.

A quick fix that can come to mind would be to force the allocation of TMonitor’s blocks early on, so that the blocks don’t end up contiguous, and hope and pray that in more complex situations, this will happen automagically.

Alas, that’s a fragile solution, for instance if you take the code in the link mentioned above, you’ll find it doesn’t work all that well:

  • run the same untouched test on different CPUs with larger cache lines or different cache associativity, and the contention can be back
  • instantiate a different class than TInterfaceList, or subclass it and add a few fields to it, and the contention is back

Why is that?

First, different CPU have different cache lines and associativity, so if you have cache-line size dependent code, you need to ask Windows about it. See for instance “How do I determine the processor’s cache line size?“.

Second, you don’t have control on how contiguous dynamic memory will be. FastMM f.i. is a bucket-based allocator, blocks that fall in the same bucket size will be allocated in sequence, in the previous code, with the empty TInterfaceList, you’ll have (optimistically*) allocated something like:

  • TInterfaceList instance 1
  • TMonitor 1 dynamic data
  • TInterfaceList instance 2
  • TMonitor 2 dynamic data

Which makes both monitor’s dynamic data non-contiguous, and if that’s enough to have both TMonitor’s data end up on different cache lines, the test will fly. But if you don’t have some other dynamic data that is of the appropriate size? the TMonitor’s data will still be contiguous…

*: in practice, even if the same buckets are involved, there is no guarantee the memory order will be the above, as FastMM recycles buckets, so the exact order can depend on the order in which previously allocated buckets of the same size were freed.

Note that if in your application’s code, you don’t have any other dynamic data that happens to fall in just the same bucket size as TMonitor’s data, all your TMonitor are likely to be contiguous (and even more so if you tend to allocate stuff first, and then run it, without manually pre-allocating the TMonitors).

In the above code, raw TInterfaceList instances are 24 bytes in size, and happen to fall in the same bucket as TMonitor’s 28 bytes data (the 32 bytes bucket).

With a linear garbage-collected allocator, similar contiguousness issues can appear after a garbage collection’s compaction, even if linear allocation was used initially and separated the blocks.

An interesting weakness can  also be exposed: a TMonitor’s data (inherently shared) can end up sitting in the middle of thread-specific dynamic data, resulting in another form of false sharing. In that case, TMonitor will not fight with another TMonitor for the cache line, but with your own code and dynamic data.

Why is TRTLCriticalSection not as vulnerable?

After all TRTLCriticalSection is only 24 bytes in size, and thus, smaller than a cache line?

Well it benefits from being a record, and thus usually not dynamically allocated on its own, but as part of a larger structure/object, which reduces the risks of it being on its own cache line (though if you’re not careful, you can easily end up with false sharing with the other owner object’s fields f.i.).

Note that TCriticalSection dynamically allocates the space for a TRTLCriticalSection, and thus can partially exhibit the false sharing issues that can plague TMonitor’s dynamic data.

Conclusions

The only way to be safe from false sharing, is to allocate large enough blocks, so that you guarantee they use a distinct cache line. In TMonitor’s case, the fix would be to allocate a larger block, rather than a small 28 bytes block as is currently the case.

Ideally, TCriticalSection instances should also be made larger, so their only drawback compared to TRTLCriticalSection would be the (rather negligible) virtual call overhead.

Multi-threading is hard, when you spot a simple problem in a simple test, don’t try to hide it under the complexity carpet, fix it while it’s still simple ;-)

Tips ,

Once upon a time in a thread…

May 26th, 2011

Last episode in the TMonitor saga. In the previous episode, Chris Rolliston posted a more complete test case, for which he got surprising results (including that a Critical Section approach wouldn’t scale with the thread count). Starting from  his code I initially also got similar surprising results.

edit: apparently the “crash” part of the TMonitor issues have been acknowledged by the powers that be, and a hotfix could be on the way, though it points back to QC 78415, an issue reported in 2009, ouch. Guess those 4 bytes per instance haven’t seen much use…

Revised Test with Stable Results

I simplified his code (see below), by dropping the usage of several RTL classes and features, and went for a straightforward implementation, in the process, the oddities went away as far as Critical Section is concerned, and partially so as far as TMonitor goes…
The results can be summarized by this chart:

This was measured on a quad-core, as you can see the Critical Section version stays flat until the number of threads gets greater than the core count, at which point, there is a small ramping arising from the workload taking its toll. TMonitor is a different story, if the revised test doesn’t exhibit the poor scaling I was finding in my previous test, there is still a ramping,  as well as a wild jump once there are more threads than cores.

Which RTL class or what exactly was the source of the behavior in Chris’s original code, I don’t know. One possible cause pointed by Krystian in a former comment could be that instances can end up in the same cache line, though that doesn’t explain everything, it could be a start is major factor.

Note that TMonitor allocates its own small block for its locking purposes, distinct from the object instance, and AFAICT there are no provisions in case those blocks end up in the same cache line, though I’m not convinced yet that’s the issue we’re seeing here, this could be a source of contention.

edit: Krystian posted some sample code with cache-line collision avoidance, with it TMonitor becomes much more linear, though half as fast as a CS, and there are still occasional spurious slowdowns showing up in the timings.

Test Code

Here is the test code used for the above, if you test on your machine, make sure you have selected the high performance profile in Windows Power options, and that you don’t have any implicit affinity settings kicking in on the executable.

You can call the above code from a form where you’ll have dropped a TMemo to use as log, as I’m assuming you don’t want to slum it in a command line executable ;-)

const
   cCountdownFrom = $FFFFFF; //increase if necessary...
   cMaxThreads = 10;

type
   TTestThread = class(TThread);

   TTestThreadClass = class of TTestThread;

   TCriticalSectionThread = class(TTestThread)
      FCriticalSection: TRTLCriticalSection;
      procedure Execute; override;
   end;

   TMonitorThread = class(TTestThread)
      procedure Execute; override;
   end;

procedure RunTest(log : TStrings; const testName : String; threadCount : Integer;
                  threadClass : TTestThreadClass);
var
   i : Integer;
   threads : array of TThread;
   tstop, tstart, freq : Int64;
begin
   SetLength(threads, threadCount);

   for i:=0 to threadCount-1 do
      threads[i]:=threadClass.Create(True);

   QueryPerformanceCounter(tstart);

   for i:=0 to threadCount-1 do
      threads[i].Start;
   for i:=0 to threadCount-1 do
      threads[i].WaitFor;

   QueryPerformanceCounter(tstop);
   QueryPerformanceFrequency(freq);

   log.Add(Format('%s: %d thread(s) took %.1f ms',
                  [testName, threadCount, (tstop-tstart)*1000/freq]));

   for i:=0 to threadCount-1 do
      threads[i].Free;
end;

procedure TCriticalSectionThread.Execute;
var
   counter : Integer;
begin
   InitializeCriticalSection(FCriticalSection);

   counter:=cCountdownFrom;
   repeat
      EnterCriticalSection(FCriticalSection);
      try
         Dec(counter);
      finally
         LeaveCriticalSection(FCriticalSection);
      end;
   until counter<=0;

   DeleteCriticalSection(FCriticalSection);
end;

procedure TMonitorThread.Execute;
var
   counter : Integer;
begin
   counter:=cCountdownFrom;
   repeat
      System.TMonitor.Enter(Self);
      try
         Dec(counter);
      finally
         System.TMonitor.Exit(Self);
      end;
   until counter<=0;
end;

procedure RevisedChrisTest(log : TStrings);
var
   i, j : Integer;
begin
   for i:=1 to 3 do begin
      log.Add('*** ROUND '+IntToStr(i)+' ***');
      for j:=1 to cMaxThreads do begin
         RunTest(log, 'TCriticalSection', j, TCriticalSectionThread);
         RunTest(log, 'TMonitor', j, TMonitorThread);
      end;
   end;
end;

Tips ,

TMonitor woes

May 25th, 2011

Primoz Gabrijelcic recently reported a possible bug with TMonitor, in the more advanced side of TMonitor.

However, when experimenting with it for DWS, I bumped on issues in the basic usage scenarios too, and reverted to using critical sections. It seems that as of Delphi XE, short of a patch, TMonitor is just a waste of 4 bytes per object instance.

One of the basic usage of TMonitor I’m referring to would be to replace a critical section:

TMonitor.Enter(someObject);
try
   // protected code
finally
   TMonitor.Exit(someObject);
end;

However, TMonitor is having a problem with the above, and you can quickly run into situations where everything gets serialized, even when there is no need to. Let’s look at a minimal thread:

type
   TMyThread = class(TThread)
      Count : Integer;
      Obj : TObject;
      procedure Execute; override;
   end;

procedure TMyThread.Execute;
begin
   while not Terminated do begin
      System.TMonitor.Enter(Obj);
      try
         Dec(Count); // or do something else
         if Count<=0 then Break;
      finally
         System.TMonitor.Exit(Obj);
      end;
   end;
end;

Assuming you create two instances of the above thread class, which are working on two different “Obj” instances, the two threads should be able to run in parallel, as they don’t operate on the same memory structures at all, right?
Well, if you use plain old critical sections, they will, but if you use TMonitor like in the above code, they won’t, they’ll just run serialized, and all but the first thread will suffer from severe contention, which hints that a race condition is hiding somewhere in TMonitor’s code…

Tips ,

Delphi for JavaScript

May 10th, 2011

A while back, I posted of FireFox 4 JavaScript engine running around Delphi when it came to floating point performance on the Mandebrot set, since then, Chrome got updated to version 11, and further raised the bar by beating FireFox by about 20% in that benchmark. That’s no mean feat: current generation JavaScript engines run not just faster than Delphi, but also .Net and a slew of other compilers, native or not, when it comes to floating point. Only state of the art native compiler still resist.

The figures for Delphi 64 are still unknown, but it’ll face a challenge merely matching the floating point performance of JavaScript, and if the VCL’s TCanvas hasn’t been revamped from the ground up, chances are that out of the box, Delphi 64 won’t be able to beat the HTML5 Canvas on performance (not to mention in features, where HTML5 Canvas is also leading by a few miles).

Jon L Aasenden is investigating an Object Pascal For JavaScript (OP4JS), with mobile devices in sight (if you don’t already know about it, you may also want to check PhoneGap). My experiments with the mobile WebKit that powers iPhone & Android browsers have been very positive, though some library are still a bit bloated for current hardware, using CSS3, HTML5 & libraries like XUI, it’s possible to design some excellent interactive UIs, in reasonably little time. Given the rate of improvements, in 1-2 years, libraries like jQuery Mobile should run smoothly on all the hardware being sold.

WebGL Aquarium

And add to that upcoming goodies, like WebGL, and JavaScript + HTML5 is step by step, with little fuss, despite all its shortcomings, becoming a universal platform with high performance potential. One could only wish JavaScript weren’t a dynamic language, but hey, after all, the x86 instruction set became prevalent despite its shortcoming too, and will still be serving in the 64bit era for the foreseeable future.

Even on the Windows desktop, it is IMHO becoming increasingly questionable to base your UIs on anything else than HTML5 & CSS, the alternatives are not only more proprietary, but either looking responsive but dated (like unskinned VCL, WinAPI controls), or outright messy and sluggish (WPF).

Chromium LogoRight now, ChromiumEmbedded allows you to embark Webkit + Chrome V8 engine, which will work across the board with no update or dependency issues (unlike IE9), using Henri Gourvest’s Delphi ChromiumEmbedded, you can integrate it into your Delphi applications, and use it as an alternative to VCL-based controls for many aspects of an application’s UI.

Tips ,