Previous: Mini-Pool sample code.
Scaling it Up
Sometimes the one-instance pool won’t be enough in a multi-threaded situation, or if the work classes can be allocated recursively (i.e. when you need several at the same time), at which point the mini-pool may progressively degrade to toe the original allocation/de-allocation bottleneck (always measure nonetheless, you may be surprised by the resilience)
The InterlockedExchangePointer strategy is however about twice more efficient than a critical-section approach, and less prone to contention and serialization effect, so here are a few directions to make the mini-pool scale, without introducing much complexity:
- use a threadvar for the vPool variable, so that you get one mini-pool per-thread. You’ll have to clean up the pool manually though when your thread terminates.
- use several variables and a statistic index, based on GetThreadID or an InterlockedAdd, this can work quite well in both threaded and non-threaded applications, and the cleanup can still happen in the unit finalization
For the second approach, if you want to minimize contention, declare your pool like
type
   TCacheLinePointer = record
      Ptr : Pointer;
      Padding : array [1..CACHE_LINE_SIZE-SizeOf(Pointer)] of Byte;
   end;
var
   vPool : array [0..POOL_SIZE-1] of TCacheLinePointer;
with CACHE_LINE_SIZE at 64 or 128, POOL_SIZE a power of two, and then instead of accessing vPool directly, use vPool[index], with index computed like
index := (GetThreadID and (POOL_SIZE-1));
While the improvement is all statistical and not guaranteed, the code stays very simple, lock-free, can scale like there is no tomorrow, and guaranteeing the pool’s cleanup is much simpler than with threadvar.
Generic Troubles
Now, the temptation might be to make a generic version of that code, but alas, it’s one case where the idea doesn’t seem to turn out so well, as you need to be able to cleanup the instance before returning it to the pool (call to Clear in the code above).
With Generics, it’ll mean an interface constraint, which means reference-counting, which will interfere with the pooling and release… An alternative would be to have all your work classes derive from a base class, but that can be limiting as your generic would be not-so-generic, or might involve other forms of overhead if you wrap the work class, which could negate the benefits of the mini-pool.
What would really work in the above case is a template 🙂




“What would really work in the above case is a template”
shameless plug (oldie but goldie):
http://www.dummzeuch.de/delphi/object_pascal_templates/english.html
Have you thought about a single linked list? Via InterlockedExchangePointer one could set the pointers to the next item thread safe. Such a list could grow as large as needed. But you’ll need an Item-Record to hold the pointer and the worker-object. Those items may also need some recycling effort (with a second list).
Single linked list (stack) isn’t as trivial to get right in a lock-free fashion (cf. http://www.boyet.com/articles/lockfreestack.html). If you can build the Next reference pointer into your work class, it can be efficient, but if you have to use an item-record, you’ll be adding extra management which could easily end up defeating all your gains.
Another issue is that even if it is lock-free, it won’t scale that well in multi-core environments, as all cores will be hitting the same cache line (the one with your stack top pointer), so you can still end up with implicit serialization at the memory level (the threadvar solution avoids that, and the hash solution statistically mitigates it).
IME the mini-pool approach is often enough, and it’s simple to get right, while more sophisticated approaches have more sophisticated pitfalls. And anyway, once your code is adapted to alloc & return to a pool, it’s easy to change the pool implementation, so you won’t have wasted time if you tried a mini-pool first 🙂
Very interesting.
Another option, especially if your object does not refer to any reference-counted class, is to define some record/object types with methods, allocated on stack, or within the main object instance.
What I like in your approach is that it started from real profiling, not from a vague guess.
Perhaps using “inline” functions may increase your code speed, especially with inlined asm for x86 platform.
But I’m not convinced that GetThreadID will be much faster than a threadvar.
Both will access the FS: segment: AFAIK GetThreadID uses a kind of “threadvar”, with a reserved TLS slot.
But we use this GetThreadID + hash table for the logging features of mORMot, with pretty amazing speed.
The GetThreadID approach main benefit is that you don’t have problems with freeing the pool. With a threadvar you have to include the pool(s) cleanup(s) as part of every thread’s cleanup, which isn’t very practical.
Using an interface to handle the cleanup automatically isn’t practical because you then need to handle the threadvar initialization for each thread… So that just exchanges a cleanup problem for an initialization one.
Hi Eric,
Nice post.
Just a hint what my approach is defining thread workers over piece of “undefined” data.
About data init, clear, free I use following approach when defining my thread workers queue:
aMsgSysQueueControl
.DedicatedThreads(1, True)
.Name(‘TMyTask ‘ + fContentManager.Name + ‘ Queue’)
.QueueSize(1000)
.NewTimer(1, 1000, OnTimerEvent)
.NewMessage(MSG_SET_CONTENT, SizeOf(Pointer), 100, OnInitContent, OnClearContent, OnFreeContent)
.ListenTo(Self, MSG_SET_CONTENT, OnContent);
At:
.NewMessage(MSG_SET_CONTENT, SizeOf(Pointer), 100, OnInitContent, OnClearContent, OnFreeContent)
This defines 100 bucks of pre-defined slots for “data” sized at Pointer, on which OnInitContent function is called. Of course pre-defined slots can grow on demand.
On finalization OnFreeContent is used.
Thread can be stopped before all data (or work) in the queue is processed, and this data can be complex and need external finalization).
Than everyone can post in the queue a peace of data or work without actually caring about initialization, clearing and finalization of the data, which tends to be fast using this approach.
For internal locking and thread-safety I use Conditional Variables used with Slim reader/writer locks. Good and very fast combination ran in user mode.