Exploring 3D Point Clouds for FMX

With Delphi 12 Community Edition released not too long ago, it was an opportunity to test what’s possible with FMX on the 3D side.

And while Delphi FireMonkey has 3D support, it’s not really a core feature. For a test run, the simplest of all 3D entities, a point cloud, can serve as a decent stress test.

Point clouds are now ubiquitous with the advent of 3D capture devices, which range from LIDAR to RGB-D cameras and all kinds of AI-augmented approaches. So there is a plethora of 3D point cloud datasets out there.

For this article I used VILLA DONDI DELL’OROLOGIO by SUPERNOIA ®. It is a photogrammetry capture dating back to when the villa was abandoned. At 4.6 million colored points, it is decently detailed. I used CloudCompare to convert the file to a simpler text format.

Introducing FMXutils

I’ve placed the source code utilities in a new side-project with a creative name: FMXutils. At the moment the code is primarily for Windows DX11, with limited testing on Android for GLSL support. One of the first limitations of the Community Edition is that you get no Linux support, so no luck there. And I do not have a Mac, so no Metal either.

Besides support for TPointCloud3D, FMXutils exposes the Direct3D shader compiler, some utilities for TVertexBuffer, shaders or scene, as well as a set of interposer classes to make measuring framerate more practical.

With that out of the way, let’s get back to our villa!

Rendering a Cloud of Points

One way to render a point cloud is to render… points. With FMX that means invoking Context.DrawPoints and calling it a day. If the points are clustered together, that’s fine. But if they’re spaced out, or you’re zooming in, things get a bit sparse and you will “see through”.

The first piece of good news is that the framerate achieved with FMX is similar to that of CloudCompare!
On my Iris Xe (integrated GPU), this is about 60 FPS for 4.6 million points. Not too shabby.

Rendering a Cloud of Quads

To give the points a little more substance, one way is to render them as squares. In 3D terms, that means quads that always face the camera.

Since FMX doesn’t have a Context.DrawQuad, you need to specify them yourself.
Each point will thus become 4 vertices and 2 triangles.

This results in a vertex buffer that is four times larger and an index buffer that is six times larger, since triangle strips are not supported. Technically speaking, we duplicate the point coordinates 4 times and use the vertex shader to “move” those points to the corners of a camera-facing square.
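The expansion described above can be sketched as follows. This is a language-neutral illustration written in Python, not FMX or FMXutils code: the names `points_to_quads` and `billboard_corner`, the vertex layout, and the corner ordering are all assumptions made for the example. The second function stands in for what the vertex shader would do on the GPU.

```python
# Corner offsets in quad-local units (hypothetical layout). The vertex shader
# would scale them by the point size and apply them along the camera axes.
CORNERS = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]


def points_to_quads(points):
    """Expand each (x, y, z) point into 4 vertices and 6 indices."""
    vertices = []  # (x, y, z, corner_u, corner_v): position repeated 4 times
    indices = []   # two triangles per quad, as a plain triangle list
    for i, (x, y, z) in enumerate(points):
        base = 4 * i
        for u, v in CORNERS:
            vertices.append((x, y, z, u, v))
        indices.extend([base, base + 1, base + 2, base, base + 2, base + 3])
    return vertices, indices


def billboard_corner(vertex, cam_right, cam_up, half_size):
    """CPU stand-in for the vertex shader: move a vertex to its quad corner."""
    x, y, z, u, v = vertex
    return (
        x + half_size * (u * cam_right[0] + v * cam_up[0]),
        y + half_size * (u * cam_right[1] + v * cam_up[1]),
        z + half_size * (u * cam_right[2] + v * cam_up[2]),
    )
```

Because the offsets are applied along the camera's right and up vectors, the quad always faces the camera no matter how the cloud is rotated.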

Distant points will remain points, but closer points will be rendered as larger squares. You will no longer see through walls and foliage.

Yes, the framerate took a hit. We’re down to 24 FPS.

For smooth rendering, 60 FPS is often considered desirable. Assuming the frame time scales linearly with the number of points, the “point budget” for 60 FPS would be about 1.8 million points (4.6 million × 24 / 60).

With slightly more complex shaders, we can make the squares into discs. This gives a less pixelated look.

Rendering a Cloud of Gaussians

Another, smoother way is to render the points as small semi-transparent Gaussian blobs, like the one to the right.
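Both the disc and the Gaussian blob can be seen as a per-pixel opacity computed from the position inside the quad. The sketch below is in Python for illustration only (the real thing would live in a pixel shader), and the function names and the `falloff` constant are assumptions, not values taken from FMXutils.

```python
import math


def disc_alpha(u, v):
    """Opaque inside the unit disc, fully transparent outside (u, v in [-1, 1])."""
    return 1.0 if u * u + v * v <= 1.0 else 0.0


def gaussian_alpha(u, v, falloff=4.0):
    """Opacity that fades smoothly with the squared distance to the quad center."""
    return math.exp(-falloff * (u * u + v * v))
```

The disc only needs a hard cut-off, so it can still be drawn opaque with depth testing. The Gaussian produces partial transparency everywhere, which is what makes the drawing order matter.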

However, to achieve correct blending, you need to render the blobs from back to front.
This means depth-sorting all the points, farthest first from the point of view of the camera.
For each frame.
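A minimal sketch of that per-frame sort, in Python for illustration rather than Delphi. The function names are hypothetical, and sorting by squared distance to the camera position is one possible choice (sorting by depth along the view direction is another). The `blend_over` helper shows why the order matters: standard alpha blending is not commutative.

```python
def sort_back_to_front(points, camera_pos):
    """Return point indices ordered farthest-first from the camera."""
    cx, cy, cz = camera_pos

    def squared_distance(i):
        x, y, z = points[i]
        return (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2

    return sorted(range(len(points)), key=squared_distance, reverse=True)


def blend_over(dst, src, alpha):
    """Standard 'over' blend of one color channel onto the framebuffer value."""
    return src * alpha + dst * (1.0 - alpha)


# Two half-transparent blobs (white = 1.0, black = 0.0) over a black background:
white_then_black = blend_over(blend_over(0.0, 1.0, 0.5), 0.0, 0.5)  # 0.25
black_then_white = blend_over(blend_over(0.0, 0.0, 0.5), 1.0, 0.5)  # 0.5
```

The two results differ, so drawing the blobs in the wrong order gives the wrong color. With millions of points, the sort itself becomes a noticeable part of the frame time.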

We take another significant performance hit there. We’re now down to 8.7 FPS.
That’s a “point budget” for 60 FPS of 660k points.
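The point budgets quoted in this article are ballpark figures. A sketch of the underlying estimate, in Python, assuming the frame time is proportional to the number of points (a simplification, since fixed per-frame costs and fill rate are ignored); `point_budget` is a hypothetical helper name.

```python
def point_budget(num_points, measured_fps, target_fps=60.0):
    """Points renderable at target_fps if frame time scales with point count."""
    return num_points * measured_fps / target_fps


quads_budget = point_budget(4_600_000, 24.0)     # quads, measured at 24 FPS
gaussian_budget = point_budget(4_600_000, 8.7)   # sorted blobs, measured at 8.7 FPS
```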

Note that at this point we’re not too far from brute-force rendering of 3D Gaussian splatting, which can be summarized as using “stretched blobs” of sorts. Of course 3DGS usually ups the ante with heavy precomputation and scene hierarchies.

Performance Bottlenecks

Where are the limiting factors?

When rendering quads/points/discs, the performance profile looks like this:

The bottleneck is unambiguously memory copies in DoDrawPrimitivesBatch.

Even when the point cloud vertices are static, FMX does not have any mechanism to take advantage of that. So at every render, the whole vertex and index buffers are copied to the GPU.
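To get a feel for the volume of data involved, here is a rough estimate in Python. The vertex strides are assumptions for the sake of the example (12 bytes of position plus 4 bytes of color for points, plus 8 bytes of corner offsets for quads), not the actual FMX or FMXutils layouts, and 32-bit indices are assumed because the vertex count exceeds what 16 bits can address.

```python
def upload_bytes_per_frame(num_points, vertices_per_point, indices_per_point,
                           vertex_stride, index_size=4):
    """Bytes copied each frame when both buffers are re-sent in full."""
    vertex_bytes = num_points * vertices_per_point * vertex_stride
    index_bytes = num_points * indices_per_point * index_size
    return vertex_bytes + index_bytes


# Assumed layouts: 16-byte vertices for points, 24-byte vertices for quads.
points_bytes = upload_bytes_per_frame(4_600_000, 1, 1, vertex_stride=16)
quads_bytes = upload_bytes_per_frame(4_600_000, 4, 6, vertex_stride=24)
```

Under these assumptions that is roughly 92 MB per frame for points and 552 MB per frame for quads, which makes it plausible that plain memory copies dominate the profile.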

(Side-note: GetMXCSR here is only a side-effect of my “ghetto” file loader which relies too much on TryStrToFloat)

Even if your points are changing at every frame, and you’re changing them CPU-side, that’s still overhead because you’ll be writing them to CPU memory, and FMX will copy them to GPU memory… While you could be writing them to GPU memory directly.

When rendering Gaussian blobs, depth sorting and preparing the sorted quads take a significant amount of time… but not that much more than copying them to the GPU:

As long as the sorting is CPU-side, we could only shave off the “PointsToQuads” here if we had direct GPU buffer access in FMX. We could probably get a boost by pushing partially sorted points to the GPU, by using an octree for instance. This would allow us to leverage some parallelism in addition to visibility culling.
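The partial sorting idea can be sketched as follows, in Python for illustration. To keep it short this uses a uniform grid instead of an octree, which is a simplification, and all names are hypothetical. Points are grouped into cells once, and only the cells are sorted per frame, so the ordering inside a cell is approximate.

```python
import math
from collections import defaultdict


def build_cells(points, cell_size):
    """Group point indices by the grid cell that contains them (done once)."""
    cells = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        key = (math.floor(x / cell_size),
               math.floor(y / cell_size),
               math.floor(z / cell_size))
        cells[key].append(i)
    return cells


def coarse_back_to_front(cells, cell_size, camera_pos):
    """Per frame: order whole cells farthest-first; points in a cell stay unsorted."""
    def cell_distance(key):
        # Squared distance from the camera to the cell center.
        return sum(((k + 0.5) * cell_size - c) ** 2
                   for k, c in zip(key, camera_pos))

    order = []
    for key in sorted(cells, key=cell_distance, reverse=True):
        order.extend(cells[key])
    return order
```

Sorting a few thousand cells is far cheaper than sorting millions of points, and cells outside the view could be skipped entirely. The price is possible blending artifacts between points that share a cell.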

As far as I could tell, there are too many private vars in FMX.Context.DX11 to be able to introduce support for direct GPU buffers through sub-classing. There may be “enough” that is exposed to piggyback on, but some of the innards would need to be reimplemented… Has anyone attempted that?