The first phase of dynamic arrays refactoring mentioned previously is now complete, with dedicated implementations in place for all the base types. A single interface (IScriptDynArray) is now the only common ground.
- Memory usage is reduced to 1/3 on average, performance is improved up to 2x (without JIT)
- Array indexes have thus been bumped from Integer to NativeInt, as arrays of more than 2 billions elements have become more practical
The current state has been frozen in Release snapshot DWScript v21.3.11 on the GitHub mirror.
Decoupled implementations
While the dedicated implementations are mostly similar right now, generics were not used:
- First because the goal was to decouple, and a generic would have introduced coupling again.
- Second because implementations are expected to wildly diverge.
The implementation divergence is most visible right now for the “array of boolean” type: it is implemented though TBits, with a memory usage of 1 bit per element. This means “array of boolean” will now use 192 times less memory !
TBits will eventually be replaced by a more direct implementation, with bit shifting support (for inserts/deletes). Eventually whole-array Boolean operations will be introduced.
Arrays of String & Interface (which also means arrays of script objects or script dynamic arrays) will probably stay as Delphi-dynamic arrays for the medium term.
Virtual Memory for large arrays
Arrays of Integer and Floats will likely diverge from the Delphi-dynamic-array implementation to a lower level one, which would be able to leverage highly-aligned virtual memory.
This would facilitate AVX use, as AVX instructions work on high-ly aligned memory.
But in 64bits, this would allow a leveraging the huge virtual address space, by pre-reserving virtual memory, so that an array can be grown/reallocated without needing to be copied.
JIT Compiler
The 64bit JIT has partial support for the new dynamic arrays, but this is quite limited at the moment.
The 32bit JIT does not support them yet, so array operations will be interpreted.
Note that most memory managers, e.g. FastMM4 does in-place reallocation if possible using VirtualRealloc on Windows.
Yes, FastMM4 does pre-allocate a 25% margin and attempts to realloc in place if the next virtual segment is empty, but it will still have to move data around as there is no low-level VirtualRealloc in Windows.
I am thinking of taking advantage of the large virtual memory space in x64 to reserve in very large chunks for very large arrays, f.i. reserve (but not commit) 256 MB for a “still 16 MB” but growing array.
Take a closer look at the FastMM4, in the ReallocateLargeBlock function. It indeed allocate 25% more, but if it needs to grows the block, it first tries to reallocate the memory block using VirtualAlloc, and only allocate + move the bigger chunk only if it false (see comment “Can another large block segment be allocated directly after this segment?”).
About large arrays, this is what FastMM4 does somewhat: the large chunks are allocated in reverse order (bottom-down) than medium (therefore small) blocks, to ease reallocation.
Note that “reserving” memory has a cost, even if the memory it not actually used. The TLB have to be maintained, so it should be used on purpose. And a process with GB of reserved memory is not a good practice, even on x86_64. For a server process, it may be reported as aggressive and unstable by most monitoring tools.
Our https://github.com/synopse/mORMot2/blob/master/src/core/mormot.core.fpcx64mm.pas unit uses mremap() on Linux for reallocating large blocks.
Pity there is no Windows equivalent… though I’ve read complaints that mrmap is often slower than manually copying data around, what’s your experience on it?
The VirtualAlloc expansion strategy of FastMM4 works if the virtual space is not fragmented, and if only one array or allocation is growing, if you have several large arrays growing at the same time, it will not succeed much.
In practice when there is only one big allocation growing, one of the first thing that is done is to pre-allocate, because that’s easy to do in the code.
The reallocation issues happen IME when several structures are growing at the same time, as pre-allocating all of them is much less trivial, and there are often smaller (but still largish) objects being allocated all along.
Another strategy I’m wishing to try is memory mapped file, and using MapViewOfFile to perform reallocs. I suspect it may run in the same performance issues as mremap, but it’s probably worth a try for very large reallocs.