TCriticalSection (along with TMonitor*) suffers from a severe design flaw in which entering/leaving different TCriticalSection instances can end up serializing your threads, and the whole thing can even perform worse than if your threads had been serialized outright.
This is because it’s a small, dynamically allocated object, so several TCriticalSection instances can end up in the same CPU cache line, and when that happens, you get false sharing: cache conflicts aplenty between the cores running the threads.
How severe can that be? Well, it depends on how many cores you have: the more cores, the more severe it gets. On a quad core, a bad case of contention can easily result in a 200% slowdown on top of the serialization. And it won’t always be reproducible, since it depends on dynamic memory allocation patterns.
There is thankfully a simple fix for that: use TFixedCriticalSection.
type
  TFixedCriticalSection = class(TCriticalSection)
  private
    FDummy: array [0..95] of Byte;
  end;
That’s it folks. This makes sure the instance size is larger than 96 bytes, which means it’ll be larger than the cache line in all current CPUs, so no more serialization across distinct critical section instances.
As a bonus, it also ends up using one of the larger, better-aligned FastMM buckets, which seems to improve critical section code performance by about 7%. The downside is that you use more RAM… but how many critical sections do you really have?
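Usage is unchanged, by the way; here is a minimal sketch, assuming the declaration above and SyncObjs in the uses clause (UpdateSharedState is just a placeholder name):

var
  Lock: TFixedCriticalSection;

procedure UpdateSharedState;
begin
  Lock.Acquire; // inherited API is untouched: Acquire/Release (or Enter/Leave)
  try
    // ...mutate the shared state here...
  finally
    Lock.Release;
  end;
end;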
* (11-12-01): as noted by Allen Bauer in the comments, the issue is fixed for TMonitor in XE2.
Hi Eric
Thanks for sharing this.
What do you mean by “seems” in
“seems to improve critical section code performance by about 7%”
Can you detail that a little further, please?
Regards
Thanks for sharing Eric. And TMonitor is still not working properly in XE2. See QC101114 (http://qc.embarcadero.com/wc/qcmain.aspx?d=101114). Sometimes I wish that Embarcadero would make their unit testing public.
That’s the speedup I’ve been seeing on a couple of tests, but I haven’t investigated more than that, as it’s a minor gain, and IME the critical section’s own performance is rarely an issue (as long as there aren’t side effects like the one mentioned here, of course).
That’s an “interesting” fix. Only half jokingly, you could dynamically increase the FDummy size based on the current date, to keep up with cache sizes of new CPUs.
Or just use a real memory wall, I guess.
Regarding how many one might use, the HotSpot compiler has about 100 distinct locks, which seems to me to be a realistic upper end case for typical software. Most applications will be substantially simpler than HotSpot WRT concurrency.
@Craig Stuntz
Well, it’s also theoretically possible to query the actual size (http://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size and http://blogs.msdn.com/b/oldnewthing/archive/2009/12/08/9933836.aspx list a few tricks), and then tinker with the InstanceSize field in the VMT of TCriticalSection.
…but that’s probably overkill 🙂
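For the curious, a rough Windows-only sketch of the size-query part, via the GetLogicalProcessorInformation trick from the links above — the import and record layout are declared manually here (older Delphi versions don’t expose them), so treat the translation as an assumption:

uses
  Windows;

type
  TCacheDescriptor = record
    Level: Byte;
    Associativity: Byte;
    LineSize: Word;
    Size: DWORD;
    CacheType: DWORD; // PROCESSOR_CACHE_TYPE
  end;
  TLogicalProcessorInfo = record
    ProcessorMask: NativeUInt;
    Relationship: DWORD; // 2 = RelationCache
    case Integer of
      0: (Cache: TCacheDescriptor);
      1: (Reserved: array [0..1] of Int64); // matches the union’s size/alignment
  end;

function GetLogicalProcessorInformation(Buffer: Pointer;
  var ReturnedLength: DWORD): BOOL; stdcall; external kernel32;

function CacheLineSize: Integer;
var
  len: DWORD;
  buf: array of TLogicalProcessorInfo;
  i: Integer;
begin
  Result := 64; // reasonable default if the query fails
  len := 0;
  GetLogicalProcessorInformation(nil, len); // first call reports the needed size
  SetLength(buf, len div SizeOf(TLogicalProcessorInfo));
  if Length(buf) = 0 then Exit;
  if GetLogicalProcessorInformation(@buf[0], len) then
    for i := 0 to High(buf) do
      if (buf[i].Relationship = 2) and (buf[i].Cache.Level = 1) then
        Exit(buf[i].Cache.LineSize); // L1 cache line size, in bytes
end;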
IIRC the cache line size doesn’t grow that much with time; in the early days, there were some CPUs with cache lines of 512 bytes, but in the x86 world the norm seems to be 64 bytes, with 128 bytes in some cases (coupled cache lines), and the ARM world is using 32 or 64 bytes.
With 96 bytes of fill, the instance size becomes 128, which should be comfortably large enough, and aligned enough to avoid wasting L1 cache on the unused space on CPUs with smaller cache line sizes.
Nice fix Eric, I wonder if this could be achieved using a class helper somehow, so that, until this issue is resolved, including just one unit in the uses clause would suffice… thoughts?
@Dorin Duminica
Class helpers can’t affect instance size AFAIK.
The same problem would exist when using TRTLCriticalSection instead of TCriticalSection, correct?
@Eric
I don’t think it’s possible, but if it were, it would solve a lot of headaches. Probably a better alternative would be to define a unit which contains your fix:
type
  TCriticalSection = class(SyncObjs.TCriticalSection)
  private
    FDummy: array [0..95] of Byte;
  end;
and use that unit wherever you need critical sections, but then it shouldn’t be used at the same time as SyncObjs in the same unit… damn… always a trade-off 🙁
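A minimal sketch of such an interposer unit (the unit name FixedSyncObjs is just an example):

unit FixedSyncObjs;

interface

uses
  SyncObjs;

type
  // same name as SyncObjs.TCriticalSection on purpose: units using
  // FixedSyncObjs pick up the padded class under the familiar name
  TCriticalSection = class(SyncObjs.TCriticalSection)
  private
    FDummy: array [0..95] of Byte;
  end;

implementation

end.

Note that if both units do end up in the same uses clause, Delphi resolves the unqualified name to whichever unit is listed later, so FixedSyncObjs would have to come after SyncObjs for the padded class to win.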
@Darian Miller
TRTLCriticalSection is a structure, and it’s less vulnerable when used in an object, since it’ll be allocated inline with the object’s other fields, which can play the same role as the fill array. If you have several TRTLCriticalSection fields in the same object, then you need to space them out.
Of course if you have a TRTLCriticalSection in a small object, the issue happens, and you really don’t want to have a packed array of TRTLCriticalSection 😉
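To illustrate the spacing, a hypothetical sketch (TDoubleLocked and the 64-byte pad are made-up examples; size the pad to your target cache line):

uses
  Windows;

type
  TDoubleLocked = class
  private
    FLockA: TRTLCriticalSection;
    FPad: array [0..63] of Byte; // keeps FLockB on a different cache line
    FLockB: TRTLCriticalSection;
  public
    constructor Create;
    destructor Destroy; override;
  end;

constructor TDoubleLocked.Create;
begin
  inherited Create;
  InitializeCriticalSection(FLockA);
  InitializeCriticalSection(FLockB);
end;

destructor TDoubleLocked.Destroy;
begin
  DeleteCriticalSection(FLockB);
  DeleteCriticalSection(FLockA);
  inherited;
end;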
Great! Thanks for the tip. Do you already have a test case sample app to demonstrate this issue that you could post? This sounds like something I’ve been chasing for a long while.
TMonitor in XE2 will now always allocate a block the cache line size or the size of the TMonitor, whichever is greater.
Implemented in OmniThreadLibrary revision 1051. Thanks!
@Allen Bauer
Nice to know, but you guys really need to start maintaining more than one version at once, like the rest of us… Since TMonitor is internal stuff, it can’t be fixed in older versions, and that rules out its use in any cross-version library, regardless of the bug fixing that happens in the latest Delphi version.
And there are still issues in XE2 (cf. Leif Uneus post above) that make it a risky proposition, and I’m afraid still more issues will be uncovered when it’s stable enough to be experimented on in production code…
@gabr
It was actually your recent post that reminded me of this issue 😉
@Darian Miller
Just create a bunch of threads, each with its own TCriticalSection instance, and loop on the critical section. If you instantiate the critical sections in sequence, there is a high likelihood they’ll be allocated contiguously, and the issue will pop up reproducibly in your test. (In a real-world application this can be far nastier, as the issue will appear to be random, and thus can be very hard to track down.)
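Roughly along these lines — a bare-bones console sketch (thread and iteration counts are arbitrary, and GetTickCount is a crude timer):

program CSContention;

{$APPTYPE CONSOLE}

uses
  Windows, Classes, SyncObjs;

type
  TLoopThread = class(TThread)
  private
    FLock: TCriticalSection; // swap in TFixedCriticalSection to compare
  protected
    procedure Execute; override;
  end;

procedure TLoopThread.Execute;
var
  i: Integer;
begin
  // each thread hammers its *own* critical section: no logical contention,
  // yet false sharing between adjacent instances still couples the cores
  for i := 1 to 10000000 do
  begin
    FLock.Acquire;
    FLock.Release;
  end;
end;

var
  Locks: array [0..3] of TCriticalSection;
  Threads: array [0..3] of TLoopThread;
  i: Integer;
  t: Cardinal;
begin
  // allocate the critical sections back to back, so they are likely contiguous
  for i := 0 to High(Locks) do
    Locks[i] := TCriticalSection.Create;
  for i := 0 to High(Threads) do
  begin
    Threads[i] := TLoopThread.Create(True);
    Threads[i].FLock := Locks[i];
  end;
  t := GetTickCount;
  for i := 0 to High(Threads) do
    Threads[i].Start; // use Resume in pre-2010 Delphi versions
  for i := 0 to High(Threads) do
    Threads[i].WaitFor;
  WriteLn('Elapsed: ', GetTickCount - t, ' ms');
  for i := 0 to High(Threads) do
  begin
    Threads[i].Free;
    Locks[i].Free;
  end;
end.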
We do allow users to rebuild the System unit and the overall RTL, in which they can move fixes from newer versions. I realize that is not ideal, but that is the best I can offer at this point.
As for the other problem mentioned, the following will fix it:
In TMonitor.RemoveWaiter(), change:
  Last := FWaitQueue;
to:
  Last := PWaitingThread(@FWaitQueue);
I know this isn’t an excuse, but this is the reality of the situation: the number of individuals and departments involved in pushing out any update, for new and/or old versions, is prohibitively large. A fix and rebuild of the product is probably only about 20% of the total effort of an update release. The other 80% is relatively fixed in terms of costs, so it makes far more economic sense to do more fixes within a given update release, in order to better amortize those fixed costs.
As a particular version ages in the market, the number of applicable fixes and the overall number of users quickly fall below the proverbial bar. Yes, this sucks for some, and in an ideal world things would be different.