Archive

Posts Tagged ‘Delphi’

What innocuous-looking unit tests can uncover…

August 17th, 2011

I’ve recently been adding DWScript snippets to Rosetta Code, using them as unit tests as well.

Quite a few of Rosetta Code’s tasks are mathematical, and I was wondering: how many math tests do you really need?

Well, quite a few! While implementing the Lucas-Lehmer test, it ended up hitting the precision boundary much sooner than it theoretically should have, given that DWScript’s Integer is actually a 64-bit integer.

Some investigation in the CPU view later, it turned out that the Delphi compiler does not generate the proper CPU instructions for Sqr() in the case of integers, which DWS was relying upon. Apparently this has been reported in QC many times since Delphi 5, but it still exists to this day in Delphi XE. The issue is now worked around in the SVN.
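A minimal sketch of the kind of workaround this calls for, assuming the trouble only shows up when Sqr() is applied to a 64-bit integer: spell the multiplication out so the compiler emits a regular Int64 multiply. SqrInt64 is a hypothetical helper name for this illustration, not necessarily what DWScript’s SVN uses.

```pascal
program SqrWorkaround;
{$APPTYPE CONSOLE}

// Hypothetical helper: squares an Int64 with an explicit multiplication
// instead of relying on the Sqr() intrinsic.
function SqrInt64(const v : Int64) : Int64;
begin
   Result:=v*v;
end;

var
   x : Int64;
begin
   x:=3000000000;        // does not fit in 32 bits
   Writeln(SqrInt64(x)); // 9000000000000000000 still fits in a signed 64-bit integer
end.
```

A unit test along those lines (a value above 32 bits whose square still fits in 64 bits) is the sort of innocuous-looking check that catches the problem.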

Fixed for XE2? Let’s hope, there may still be time…

Tips

A Fistful of TMonitors

May 31st, 2011

…or why you can’t hide under the complexity carpet ;-)

As uncovered in previous episodes, one of the keys behind TMonitor’s performance issues is that it allocates a dynamic block of memory for its locking purposes. When two of those blocks end up on the same CPU cache line, the two TMonitors fight for that cache line, resulting in thread contention and a drastic drop in performance. The technical term for that behavior is false sharing.

A quick fix that can come to mind would be to force the allocation of TMonitor’s blocks early on, so that the blocks don’t end up contiguous, and hope and pray that in more complex situations, this will happen automagically.

Alas, that’s a fragile solution. For instance, if you take the code in the link mentioned above, you’ll find it doesn’t work all that well:

  • run the same untouched test on different CPUs with larger cache lines or different cache associativity, and the contention can be back
  • instantiate a different class than TInterfaceList, or subclass it and add a few fields to it, and the contention is back

Why is that?

First, different CPUs have different cache line sizes and associativity, so if you have code that depends on the cache line size, you need to ask Windows about it. See for instance “How do I determine the processor’s cache line size?”.
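As a rough sketch of what asking Windows can look like, the program below calls the WinAPI function GetLogicalProcessorInformation and reports the line size of the first level 1 cache it finds. The record types, the cRelationCache constant and the L1CacheLineSize helper are local declarations made up for this example (your Windows unit may already provide equivalents), and the static import assumes a Windows version that exports the function.

```pascal
program CacheLineQuery;
{$APPTYPE CONSOLE}

uses
   Windows, SysUtils;

type
   // Local mirrors of the WinAPI CACHE_DESCRIPTOR and
   // SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures (names are this sketch's own)
   TCacheDesc = record
      Level : Byte;
      Associativity : Byte;
      LineSize : Word;
      Size : DWORD;
      CacheType : DWORD;
   end;

   TLogicalProcInfo = record
      ProcessorMask : NativeUInt;
      Relationship : DWORD;
      case Integer of
         0 : (Cache : TCacheDesc);
         1 : (Reserved : array [0..1] of UInt64);
   end;

const
   cRelationCache = 2; // value of RelationCache in the WinAPI enumeration

function GetLogicalProcessorInformation(buffer : Pointer;
   var returnedLength : DWORD) : BOOL; stdcall;
   external kernel32 name 'GetLogicalProcessorInformation';

// Returns the L1 cache line size in bytes, or 0 if it could not be determined
function L1CacheLineSize : Integer;
var
   infos : array of TLogicalProcInfo;
   len : DWORD;
   i : Integer;
begin
   Result:=0;
   len:=0;
   // the first call is expected to fail and report the required buffer size
   GetLogicalProcessorInformation(nil, len);
   SetLength(infos, len div SizeOf(TLogicalProcInfo));
   if Length(infos)=0 then Exit;
   if not GetLogicalProcessorInformation(@infos[0], len) then Exit;
   for i:=0 to High(infos) do begin
      if (infos[i].Relationship=cRelationCache) and (infos[i].Cache.Level=1) then begin
         Result:=infos[i].Cache.LineSize;
         Break;
      end;
   end;
end;

begin
   Writeln('L1 cache line size: ', L1CacheLineSize, ' bytes');
end.
```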

Second, you don’t have control over how contiguous dynamic memory will be. FastMM f.i. is a bucket-based allocator: blocks that fall in the same bucket size will be allocated in sequence. In the previous code, with the empty TInterfaceList, you’ll have (optimistically*) allocated something like:

  • TInterfaceList instance 1
  • TMonitor 1 dynamic data
  • TInterfaceList instance 2
  • TMonitor 2 dynamic data

Which makes both monitors’ dynamic data non-contiguous, and if that’s enough to have both TMonitors’ data end up on different cache lines, the test will fly. But if you don’t have some other dynamic data of the appropriate size, the TMonitors’ data will still be contiguous…

*: in practice, even if the same buckets are involved, there is no guarantee the memory order will be the above, as FastMM recycles buckets, so the exact order can depend on the order in which previously allocated buckets of the same size were freed.

Note that if in your application’s code, you don’t have any other dynamic data that happens to fall in just the same bucket size as TMonitor’s data, all your TMonitors are likely to be contiguous (and even more so if you tend to allocate stuff first, and then run it, without manually pre-allocating the TMonitors).

In the above code, raw TInterfaceList instances are 24 bytes in size, and happen to fall in the same bucket as TMonitor’s 28 bytes of data (the 32-byte bucket).

With a linear garbage-collected allocator, similar contiguousness issues can appear after a garbage collection’s compaction, even if linear allocation was used initially and separated the blocks.

An interesting weakness can also be exposed: a TMonitor’s data (inherently shared) can end up sitting in the middle of thread-specific dynamic data, resulting in another form of false sharing. In that case, TMonitor will not fight with another TMonitor for the cache line, but with your own code and dynamic data.

Why is TRTLCriticalSection not as vulnerable?

After all, TRTLCriticalSection is only 24 bytes in size, and thus smaller than a cache line?

Well, it benefits from being a record, and thus usually not dynamically allocated on its own, but as part of a larger structure/object, which reduces the risk of it sharing a cache line with another lock (though if you’re not careful, you can easily end up with false sharing with the owner object’s other fields f.i.).

Note that TCriticalSection dynamically allocates the space for a TRTLCriticalSection, and thus can partially exhibit the false sharing issues that can plague TMonitor’s dynamic data.

Conclusions

The only way to be safe from false sharing is to allocate large enough blocks, so that you guarantee they use a distinct cache line. In TMonitor’s case, the fix would be to allocate a larger block, rather than a small 28-byte block as is currently the case.

Ideally, TCriticalSection instances should also be made larger, so their only drawback compared to TRTLCriticalSection would be the (rather negligible) virtual call overhead.
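To illustrate what “larger” could mean, here is a minimal sketch of a critical section wrapper padded on both sides. TPaddedCriticalSection and cAssumedCacheLine are hypothetical names, and the 64-byte figure is an assumption that should be replaced by whatever the target CPU reports. With at least one cache line of padding before and after the lock, no neighbouring heap block can land on a cache line that holds part of the lock.

```pascal
program PaddedLock;
{$APPTYPE CONSOLE}

uses
   Windows;

const
   cAssumedCacheLine = 64; // assumption: query the real value on the target CPU

type
   // Hypothetical wrapper: the padding fields are never read or written,
   // they only keep other data away from the lock's cache line(s)
   TPaddedCriticalSection = class
   private
      FPadBefore : array [0..cAssumedCacheLine-1] of Byte;
      FLock : TRTLCriticalSection;
      FPadAfter : array [0..cAssumedCacheLine-1] of Byte;
   public
      constructor Create;
      destructor Destroy; override;
      procedure Enter;
      procedure Leave;
   end;

constructor TPaddedCriticalSection.Create;
begin
   inherited Create;
   InitializeCriticalSection(FLock);
end;

destructor TPaddedCriticalSection.Destroy;
begin
   DeleteCriticalSection(FLock);
   inherited;
end;

procedure TPaddedCriticalSection.Enter;
begin
   EnterCriticalSection(FLock);
end;

procedure TPaddedCriticalSection.Leave;
begin
   LeaveCriticalSection(FLock);
end;

var
   cs : TPaddedCriticalSection;
begin
   cs:=TPaddedCriticalSection.Create;
   try
      cs.Enter;
      try
         Writeln('instance size: ', cs.InstanceSize, ' bytes');
      finally
         cs.Leave;
      end;
   finally
      cs.Free;
   end;
end.
```

The price is memory: each lock now costs well over a hundred bytes, which is the trade-off the conclusion above argues for.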

Multi-threading is hard. When you spot a simple problem in a simple test, don’t try to hide it under the complexity carpet, fix it while it’s still simple ;-)

Tips

Once upon a time in a thread…

May 26th, 2011

Last episode in the TMonitor saga. In the previous episode, Chris Rolliston posted a more complete test case, for which he got surprising results (including that a Critical Section approach wouldn’t scale with the thread count). Starting from his code, I initially got similar surprising results too.

edit: apparently the “crash” part of the TMonitor issues have been acknowledged by the powers that be, and a hotfix could be on the way, though it points back to QC 78415, an issue reported in 2009, ouch. Guess those 4 bytes per instance haven’t seen much use…

Revised Test with Stable Results

I simplified his code (see below) by dropping the usage of several RTL classes and features, and went for a straightforward implementation. In the process, the oddities went away as far as Critical Section is concerned, and partially so as far as TMonitor goes…
The results can be summarized by this chart:

This was measured on a quad-core. As you can see, the Critical Section version stays flat until the number of threads gets greater than the core count, at which point there is a small ramp arising from the workload taking its toll. TMonitor is a different story: while the revised test doesn’t exhibit the poor scaling I was finding in my previous test, there is still a ramp, as well as a wild jump once there are more threads than cores.

Which RTL class, or what exactly, was the source of the behavior in Chris’s original code, I don’t know. One possible cause, pointed out by Krystian in a former comment, could be that instances can end up in the same cache line. Though that doesn’t explain everything, it could be a major factor.

Note that TMonitor allocates its own small block for its locking purposes, distinct from the object instance, and AFAICT there are no provisions in case those blocks end up in the same cache line. Though I’m not convinced yet that’s the issue we’re seeing here, this could be a source of contention.

edit: Krystian posted some sample code with cache-line collision avoidance, with it TMonitor becomes much more linear, though half as fast as a CS, and there are still occasional spurious slowdowns showing up in the timings.

Test Code

Here is the test code used for the above. If you test on your machine, make sure you have selected the high performance profile in Windows Power options, and that you don’t have any implicit affinity settings kicking in on the executable.

You can call the code below from a form where you’ll have dropped a TMemo to use as log, as I’m assuming you don’t want to slum it in a command line executable ;-)

// requires Windows, SysUtils and Classes in the uses clause
const
   cCountdownFrom = $FFFFFF; //increase if necessary...
   cMaxThreads = 10;

type
   TTestThread = class(TThread);

   TTestThreadClass = class of TTestThread;

   TCriticalSectionThread = class(TTestThread)
      FCriticalSection: TRTLCriticalSection;
      procedure Execute; override;
   end;

   TMonitorThread = class(TTestThread)
      procedure Execute; override;
   end;

procedure RunTest(log : TStrings; const testName : String; threadCount : Integer;
                  threadClass : TTestThreadClass);
var
   i : Integer;
   threads : array of TThread;
   tstop, tstart, freq : Int64;
begin
   SetLength(threads, threadCount);

   for i:=0 to threadCount-1 do
      threads[i]:=threadClass.Create(True);

   QueryPerformanceCounter(tstart);

   for i:=0 to threadCount-1 do
      threads[i].Start;
   for i:=0 to threadCount-1 do
      threads[i].WaitFor;

   QueryPerformanceCounter(tstop);
   QueryPerformanceFrequency(freq);

   log.Add(Format('%s: %d thread(s) took %.1f ms',
                  [testName, threadCount, (tstop-tstart)*1000/freq]));

   for i:=0 to threadCount-1 do
      threads[i].Free;
end;

procedure TCriticalSectionThread.Execute;
var
   counter : Integer;
begin
   InitializeCriticalSection(FCriticalSection);

   counter:=cCountdownFrom;
   repeat
      EnterCriticalSection(FCriticalSection);
      try
         Dec(counter);
      finally
         LeaveCriticalSection(FCriticalSection);
      end;
   until counter<=0;

   DeleteCriticalSection(FCriticalSection);
end;

procedure TMonitorThread.Execute;
var
   counter : Integer;
begin
   counter:=cCountdownFrom;
   repeat
      System.TMonitor.Enter(Self);
      try
         Dec(counter);
      finally
         System.TMonitor.Exit(Self);
      end;
   until counter<=0;
end;

procedure RevisedChrisTest(log : TStrings);
var
   i, j : Integer;
begin
   for i:=1 to 3 do begin
      log.Add('*** ROUND '+IntToStr(i)+' ***');
      for j:=1 to cMaxThreads do begin
         RunTest(log, 'TCriticalSection', j, TCriticalSectionThread);
         RunTest(log, 'TMonitor', j, TMonitorThread);
      end;
   end;
end;

Tips

TMonitor woes

May 25th, 2011

Primoz Gabrijelcic recently reported a possible bug with TMonitor, on the more advanced side of TMonitor’s features.

However, when experimenting with it for DWS, I bumped into issues in the basic usage scenarios too, and reverted to using critical sections. It seems that as of Delphi XE, short of a patch, TMonitor is just a waste of 4 bytes per object instance.

One of the basic usages of TMonitor I’m referring to would be to replace a critical section:

TMonitor.Enter(someObject);
try
   // protected code
finally
   TMonitor.Exit(someObject);
end;

However, TMonitor is having a problem with the above, and you can quickly run into situations where everything gets serialized, even when there is no need to. Let’s look at a minimal thread:

type
   TMyThread = class(TThread)
      Count : Integer;
      Obj : TObject;
      procedure Execute; override;
   end;

procedure TMyThread.Execute;
begin
   while not Terminated do begin
      System.TMonitor.Enter(Obj);
      try
         Dec(Count); // or do something else
         if Count<=0 then Break;
      finally
         System.TMonitor.Exit(Obj);
      end;
   end;
end;

Assuming you create two instances of the above thread class, working on two different “Obj” instances, the two threads should be able to run in parallel, as they don’t operate on the same memory structures at all, right?
Well, if you use plain old critical sections, they will. But if you use TMonitor like in the above code, they won’t: they’ll just run serialized, and all but the first thread will suffer from severe contention, which hints that a race condition is hiding somewhere in TMonitor’s code…
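To try this for yourself, a minimal console harness could look like the sketch below. It reuses the thread class above and times one thread, then two threads that each lock their own object. TimeThreads and cIterations are names made up for this example, and no particular timings are implied: if the locks were truly independent, the two figures should be close on a multi-core machine.

```pascal
program MonitorPairTest;
{$APPTYPE CONSOLE}

uses
   Windows, SysUtils, Classes;

const
   cIterations = $FFFFFF; // arbitrary workload, increase if necessary

type
   TMyThread = class(TThread)
      Count : Integer;
      Obj : TObject;
      procedure Execute; override;
   end;

procedure TMyThread.Execute;
begin
   while not Terminated do begin
      System.TMonitor.Enter(Obj);
      try
         Dec(Count);
         if Count<=0 then Break;
      finally
         System.TMonitor.Exit(Obj);
      end;
   end;
end;

// Runs threadCount threads, each locking its own object, returns elapsed ms
function TimeThreads(threadCount : Integer) : Double;
var
   i : Integer;
   threads : array of TMyThread;
   tstart, tstop, freq : Int64;
begin
   SetLength(threads, threadCount);
   for i:=0 to threadCount-1 do begin
      threads[i]:=TMyThread.Create(True);
      threads[i].Count:=cIterations;
      threads[i].Obj:=TObject.Create; // one distinct lock object per thread
   end;

   QueryPerformanceCounter(tstart);
   for i:=0 to threadCount-1 do
      threads[i].Start;
   for i:=0 to threadCount-1 do
      threads[i].WaitFor;
   QueryPerformanceCounter(tstop);
   QueryPerformanceFrequency(freq);
   Result:=(tstop-tstart)*1000/freq;

   for i:=0 to threadCount-1 do begin
      threads[i].Obj.Free;
      threads[i].Free;
   end;
end;

begin
   Writeln(Format('1 thread : %.1f ms', [TimeThreads(1)]));
   Writeln(Format('2 threads: %.1f ms', [TimeThreads(2)]));
end.
```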

Tips

Delphi for JavaScript

May 10th, 2011

A while back, I posted about Firefox 4’s JavaScript engine running circles around Delphi when it came to floating point performance on the Mandelbrot set. Since then, Chrome got updated to version 11, and further raised the bar by beating Firefox by about 20% in that benchmark. That’s no mean feat: current generation JavaScript engines run faster not just than Delphi, but also than .Net and a slew of other compilers, native or not, when it comes to floating point. Only state-of-the-art native compilers still resist.

The figures for Delphi 64 are still unknown, but it’ll face a challenge merely matching the floating point performance of JavaScript, and if the VCL’s TCanvas hasn’t been revamped from the ground up, chances are that out of the box, Delphi 64 won’t be able to beat the HTML5 Canvas on performance (not to mention in features, where HTML5 Canvas is also leading by a few miles).

Jon L Aasenden is investigating an Object Pascal For JavaScript (OP4JS), with mobile devices in sight (if you don’t already know about it, you may also want to check PhoneGap). My experiments with the mobile WebKit that powers iPhone & Android browsers have been very positive. Though some libraries are still a bit bloated for current hardware, using CSS3, HTML5 & libraries like XUI, it’s possible to design some excellent interactive UIs in reasonably little time. Given the rate of improvements, in 1-2 years, libraries like jQuery Mobile should run smoothly on all the hardware being sold.


Add to that upcoming goodies like WebGL, and JavaScript + HTML5 is step by step, with little fuss, and despite all its shortcomings, becoming a universal platform with high performance potential. One could only wish JavaScript weren’t a dynamic language, but hey, after all, the x86 instruction set became prevalent despite its shortcomings too, and will still be serving in the 64-bit era for the foreseeable future.

Even on the Windows desktop, it is IMHO becoming increasingly questionable to base your UIs on anything other than HTML5 & CSS: the alternatives are not only more proprietary, but either responsive yet dated-looking (like unskinned VCL, WinAPI controls), or outright messy and sluggish (WPF).

Right now, ChromiumEmbedded allows you to embed WebKit + Chrome’s V8 engine, which will work across the board with no update or dependency issues (unlike IE9). Using Henri Gourvest’s Delphi ChromiumEmbedded, you can integrate it into your Delphi applications, and use it as an alternative to VCL-based controls for many aspects of an application’s UI.

Tips

Delphi 64 beta update

April 27th, 2011

The Delphi 64 beta, which used to be an April Fools’ joke and a few days later incidentally turned out to be official after all, might finally begin next week.

Answering a question by Tobias Giesen in the non-tech forum, Tim Del Chiaro (from Embarcadero Product Marketing) wrote:

> I applied with XE serial, but haven't heard back yet.
> Can I still hope to get beta access?

You should see an invite later this week when we start sending them out.

So apparently the announcement was made somewhat early by Embarcadero, and they haven’t yet started sending invites, but things could be happening sometime next month.

Links: Embarcadero’s page on Delphi 64-bit compiler preview and beta sign-up.

News

Delphi 64 beta official after all

April 5th, 2011

Did the attention that my little April 1st post drew help set things in motion a little earlier than planned?

It’ll be denied, it wasn’t seen, and it can’t be proven, it’s just that the slides and the video look a bit rushed. But anyway, on a just-released Delphi 64 sneak preview page on the Embarcadero website, there is now a “join the beta” link, where you can apply for the beta. However, the beta isn’t open, and Delphi XE users will apparently have priority.

As for the preview key points themselves, Marco Cantu has an executive summary.

It seems the beta will be under the usual NDA terms, so it may be some more time before more details filter out.

News

Delphi 64 open beta now available!

April 1st, 2011

In a bold (or unintentional?) move, a Delphi 64 beta has been released on the Embarcadero servers. There is no official announcement yet, so hit the link below while it’s still there, it may not last long!

Delphi 64 (maybe open?) beta download

…alas this was just an April Fools’ joke… Let’s hope the next announcement about Delphi 64 availability will happen before April 2012!

edit 04/04: coincidence? They’ll deny it, no one saw them do it, and anyway no one can prove anything, but the following announcements have just been made:
Delphi 64-bit Compiler Sneak Preview and Beta – Official Announcement
Delphi 64-bit Compiler Preview and Beta Program
And Marco Cantu has the exec summary: Delphi 64 bit Sneak Preview
With none other than David “let’s-have-a-drink” Intersimone as host. The Embarcadero page for the “Pulsar” beta is http://www.embarcadero.com/products/delphi/64-bit

News

Kudos to the Firefox 4 TraceMonkey team!

March 24th, 2011

I’ve been quite impressed with the JavaScript floating point performance in Firefox 4, which puts the Delphi compiler to shame. See for yourself with this fractal rendering demo:

Mandelbrot Set in HTML 5 Canvas

I’ve made a version of the same code in Delphi XE (source + pre-compiled executable, 331 kB ZIP). On my machine here, at the 480×480 resolution, Firefox 4 gets the default view rendered in 124 ms, whereas the “regular” Delphi version, which is limited to the old FPU, takes about 200 ms.

It takes manually SSE-enhanced Delphi code to get back on top, with an 87 ms render time. It’s quick, non-optimized scalar SSE code, sure, and could likely be improved, but the point remains that without asm, Delphi XE’s native compiler trails TraceMonkey in the floating point department…

So Embarcadero, how is that Delphi 64 version coming? Is it properly SSE-enabled?

News

FieldByName, or why a Profiler is your friend

November 30th, 2010

I recently bumped into a post by François on FieldByName performance, and was a bit surprised by the magnitude of speedups reported by Marco Cantu in a comment:

” I keep fixing similar code I see my clients use, and in some case the performance can increase 5 to 10 times, for large loops. Good you are raising this problem. ”

We have similar-looking code being used here in our datasets (which aren’t TDataset-related at all however), and yet, repeatedly looking up fields by name isn’t a performance issue (it hardly registers in the profiling results, even in worst case situations, like in-memory SQLite DBs).

Out of curiosity, I had a look at DB.pas… Suffice it to say that the VCL code makes a good case study of why a Profiler is your friend, and what “out of thin air” optimization can lead to.

The FindField case of Unicode comparisons

The DB.pas code being copyrighted, I won’t post any excerpts here, but you can find it easily enough yourself, so I’ll just describe what happens.

FindField’s purpose is to find a TField by its name, in a case-insensitive fashion.
A naive implementation would thus look like this:

for i := 0 to FFields.Count-1 do begin
   Result := FFields[i];
   if AnsiCompareText(Result.Name, FieldName) = 0 then
      Exit;
end;
Result := nil;

I then made a quick benchmark, consisting of three cases:

  • “best” case consists of finding the first field
  • “worst” case consists of finding the 20th field
  • “all” case consists of finding 20 fields out of the 20 fields

Field names were like “MyFieldName1”, “MyFieldName2”, etc. up to “MyFieldName20”. You’ll note that the differing characters are at the end of the string, so the situation is quite unfavorable in terms of string comparisons, but neutral if you were to hash those strings f.i.
Also keep in mind I’ve got a recent CPU (at the date of writing), and the timings below are for 100k lookups. On a regular end-user machine, you could probably double or quadruple the figures.

With the naive implementation, the “best” case performance is 19 ms, “worst” case 400 ms, and “all” is 210 ms.

This is quite lengthy, as with Unicode, case insensitive comparison (AnsiCompareText) is quite complex and expensive time-wise. There can be an obvious performance issue with FindField if used in a loop.

Case study of an optimization gone wrong

To cut down on that complexity, the VCL implementors chose to go for a hash. A risky choice to begin with: a hash has to be good enough to limit collisions, it has to be computed fast enough to actually bring a benefit, and last but not least, it results in sometimes complex extra code (and here you need a case insensitive hash, a plain old binary hash won’t do).

So the VCL code goes on to compute a hash for each of the fields, and alters the naive implementation above by checking the hash before performing the AnsiCompareText, doing the comparison only if the hash matches. So far so good, eh? Well, here the trouble begins.

First, the VCL code is still facing one AnsiCompareText per FindField hit, plus an AnsiUpperCase (which is almost as expensive) to compute the hash.

Second, the TNamedItem.HashName implementation is a collection of “don’ts”; look for yourself in the code:

  • it includes a custom GetStringLength inline, to cut down on the access to the string character count probably, access which happens twice in the implementation (and which a profiling would have revealed to be negligible, especially in light of the following)
  • since the introduction of FrankenStrings, a String can hold Ansi as well as UTF16, and the hash is computed on an UpperCase’d UTF-16, so you’ve got extra conversion code in there, in case it is Ansi, including dynamic allocation to serve as buffer for UnicodeFromLocaleChars, which is invoked no less than twice
  • in case the String is already UTF-16, a buffer is still dynamically allocated, and the AnsiUpperCase’d name copied to it, I guess that’s to preserve the “efficiency” of the loop that computes the actual hash…
  • then comes the hash computation loop, that was obviously where the optimization effort went, it works on a PChar, does a ROL and XOR, and it is certainly efficient, except that a mere profiling would have shown its efficiency didn’t matter, unless your field names are several hundred characters long…

The VCL implementation has a “best” case performance of 42 ms (2 times slower than naive), a “worst” case of 50 ms (8 times faster than naive), and “all” of 46 ms (5 times faster than naive).

Thing is, you likely won’t often have 20 fields in your queries, and the VCL implementation needs at least 3 fields to pull ahead of the naive implementation. Given the size and complexity of the VCL code involved, I would say that’s quite an under-achievement.

Last but not least, if you profile the VCL code, you’ll see that HashName and a whole bunch of memory allocation and string management code from System.pas are quite stressed. Given the above, that’s not too surprising, but it means performance in a multi-threaded situation will only get worse.
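For readers who don’t have DB.pas at hand, the general shape of a hash-guarded lookup can be sketched as follows. This is emphatically not the VCL code: NameHash, TNamedEntry and FindEntry are made up for the illustration, and the rotate-and-xor hash is only a stand-in. It does show where the costs sit: every lookup pays for an AnsiUpperCase to build the hash, and every hit still pays for an AnsiCompareText.

```pascal
program HashGuardedLookup;
{$APPTYPE CONSOLE}

uses
   SysUtils;

type
   TNamedEntry = record
      Name : String;
      Hash : Cardinal; // precomputed when the entry is stored
   end;

// Illustrative case-insensitive hash: rotate left by 5 and xor,
// over the upper-cased name (an assumption of this sketch)
function NameHash(const aName : String) : Cardinal;
var
   i : Integer;
   upper : String;
begin
   upper:=AnsiUpperCase(aName);
   Result:=0;
   for i:=1 to Length(upper) do
      Result:=((Result shl 5) or (Result shr 27)) xor Cardinal(Ord(upper[i]));
end;

// Returns the index of the entry matching aName, or -1 if there is none.
// The expensive comparison only runs when the hashes match.
function FindEntry(const entries : array of TNamedEntry; const aName : String) : Integer;
var
   i : Integer;
   h : Cardinal;
begin
   h:=NameHash(aName);
   for i:=0 to High(entries) do begin
      if (entries[i].Hash=h) and (AnsiCompareText(entries[i].Name, aName)=0) then begin
         Result:=i;
         Exit;
      end;
   end;
   Result:=-1;
end;

var
   entries : array of TNamedEntry;
   i : Integer;
begin
   SetLength(entries, 20);
   for i:=0 to High(entries) do begin
      entries[i].Name:='MyFieldName'+IntToStr(i+1);
      entries[i].Hash:=NameHash(entries[i].Name);
   end;
   Writeln(FindEntry(entries, 'myfieldname20')); // 19: case-insensitive hit on the last entry
   Writeln(FindEntry(entries, 'NoSuchField'));   // -1: no entry by that name
end.
```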

Doing it the efficient way

Let’s do it with the help of a profiler, and a bit of laziness.

Initially, AnsiCompareText is the obvious, overwhelming culprit the profiler will point to in the naive implementation. There are two roads from that point:

  • optimizing AnsiCompareText, this is complex, involves quite a bit of code, and we’re lazy, remember?
  • the fastest AnsiCompareText is the one you don’t do, that’s the lazy road.

How to not do the AnsiCompareText?

One reason there are so many of them in the first place is that there is a loop on the fields. And when optimizing, loops are good, they mean you’ve got big O optimization potential, and big O optimization is how you achieve orders of magnitude speedups.

In this case, it’s a simple O(n) string search loop, for which one of the classic optimizations is a reduction to O(ln n) using binary search. That however requires an ordered list, and the Fields list isn’t sorted, and can’t be sorted.
So we need a good old fashioned index.

One such readily available index is good old TStringList, with Sorted set to True: place the field names in the Strings[], the TField object in the Objects[]. Use IndexOf() to find a field. That’s all. You have reduced the number of AnsiCompareText calls from O(n) to O(ln n).

// after filling up or altering the Fields
FIndex.Clear; // FIndex assumed created, set to sorted, case insensitive
for i := 0 to FFields.Count-1 do
   FIndex.AddObject( FFields[i].Name, FFields[i] );
...
// Find a field with
i := FIndex.IndexOf( fieldName );
if i >=0 then
   field := TField( FIndex.Objects[i] )
else field := nil;

With the above code, in a 20-field situation, the best/worst/all cases all benchmark at around 78 ms, which is consistent with the O(ln n) expectation (there is no best or worst case).
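If you’d rather count than time, a small sketch like the one below makes the O(ln n) behavior visible: it subclasses TStringList to count CompareStrings calls and reports the worst number of comparisons a single IndexOf needs among 20 sorted names, to be compared with the up to 20 comparisons of the naive loop. TCountingStringList and CompareCount are names invented for this example.

```pascal
program CompareCounting;
{$APPTYPE CONSOLE}

uses
   SysUtils, Classes;

type
   // Hypothetical helper: a TStringList that counts its string comparisons
   TCountingStringList = class(TStringList)
   protected
      function CompareStrings(const S1, S2 : String) : Integer; override;
   public
      CompareCount : Integer;
   end;

function TCountingStringList.CompareStrings(const S1, S2 : String) : Integer;
begin
   Inc(CompareCount);
   Result:=inherited CompareStrings(S1, S2);
end;

var
   list : TCountingStringList;
   i, worst : Integer;
begin
   list:=TCountingStringList.Create;
   try
      for i:=1 to 20 do
         list.Add('MyFieldName'+IntToStr(i));
      list.Sorted:=True; // sorts the list, IndexOf then uses a binary search

      worst:=0;
      for i:=1 to 20 do begin
         list.CompareCount:=0;
         list.IndexOf('myfieldname'+IntToStr(i)); // case-insensitive by default
         if list.CompareCount>worst then
            worst:=list.CompareCount;
      end;
      Writeln('worst case comparisons for one lookup among 20 names: ', worst);
   finally
      list.Free;
   end;
end.
```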

A quick look in the profile reveals AnsiCompareText is still the overwhelming bottleneck, instead of being the one in our code, it’s now the one in the TStringList.CompareStrings. The profiler tells us the AnsiCompareText is the key, no need to worry about optimizing Length() ;-)

How to go further from there?

We are O(ln n), could we go to O(1)? That would involve a hash list, which means a hash, a case-insensitive hash. We don’t have one handy, they’re complex, this is not lazy. Also a hash list can be costly memory-wise, we don’t want a huge setup for what could be a two-fields-dataset affair in practice.

The fastest AnsiCompareText is the one you don’t do… so, do we really need AnsiCompareText? No.
Why? Because we only really need to index the names that are looked up, and FindField is troublesome when it’s invoked in a loop, looking up the very same name strings again and again, ad nauseam.

Rather than indexing the field names, we can index the field names that are actually looked up, but this time in a case sensitive TStringList, thus changing our index to a cache of sorts:

// after filling up or altering the Fields
FIndex.Clear; // FIndex assumed created, set to sorted, case sensitive
...
// Find a field with
k:=FIndex.IndexOf(FieldName);
if k>=0 then
   Result:=TField(FIndex.Objects[k])
else begin
   // not in index, find it with naive implementation
   Result:=nil;
   for i:=0 to FFields.Count-1 do begin
      if AnsiCompareText(FFields[i].Name, FieldName) = 0 then begin
         Result:=FFields[i];
         Break;
      end;
   end;
   // add to index (a nil Result caches the miss as well)
   FIndex.AddObject(FieldName, Result);
end;

After benchmarking, however, there is no speedup… A look at the profiling results shows that TStringList.CompareStrings is still the bottleneck, this time because of AnsiCompareStr… is there no end to them slow Unicode functions???

Why is AnsiCompareStr slow? In part because in Unicode the same character can be encoded differently, and in part because the WinAPI implementation is just plain no good.

In our case however, the Unicode encodings details don’t matter, it’s a cache, the ordering is meaningless as long as it is consistent, so we can just subclass TStringList and override CompareStrings:

function TIndexStringList.CompareStrings(const S1, S2: string): Integer;
begin
   Result:=CompareStr(S1, S2);
end;

In Delphi XE, CompareStr() is fast, it is still based on Pierre le Riche’s FastCode version (but who knows what will happen to it in Delphi 64 where there is no BASM? but I digress…).

Wrapping it up

The new benchmark figures are now around 8 ms, in all cases. That’s five times faster than the VCL’s best case with a 20 fields dataset, and it scales nicely with field counts, as we’re still O(ln n). Just to illustrate:

  • with 3 fields: VCL takes 42 to 45 ms, our code 4 to 5.6 ms
  • with 100 fields: VCL takes 42 to 81 ms, our code 10 to 18 ms.

This is also achieved with a lot less code than the VCL, no asm or fancy tricks, and we achieved speedups of the same magnitude as what Marco Cantu reported, what François and all TDataset users have to labor for by optimizing on the spot every time they have a loop on a dataset.
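Put together, the whole approach fits in a few dozen lines. The sketch below is self-contained and therefore uses stand-ins rather than DB.pas: TItem plays the role of TField and TItemLookup the role of the fields container, both hypothetical names. The cache is cleared whenever the item list changes, and misses are cached as nil.

```pascal
program LookupCacheSketch;
{$APPTYPE CONSOLE}

uses
   SysUtils, Classes, Contnrs;

type
   TItem = class // stand-in for TField
      Name : String;
   end;

   // Cache list ordered with the binary CompareStr: the order is
   // meaningless for a cache, it only needs to be consistent
   TIndexStringList = class(TStringList)
   protected
      function CompareStrings(const S1, S2 : String) : Integer; override;
   end;

   TItemLookup = class
   private
      FItems : TObjectList;      // owns the items
      FCache : TIndexStringList; // looked-up spelling -> item (or nil)
   public
      constructor Create;
      destructor Destroy; override;
      procedure Add(const aName : String);
      function Find(const aName : String) : TItem;
   end;

function TIndexStringList.CompareStrings(const S1, S2 : String) : Integer;
begin
   Result:=CompareStr(S1, S2);
end;

constructor TItemLookup.Create;
begin
   inherited Create;
   FItems:=TObjectList.Create(True);
   FCache:=TIndexStringList.Create;
   FCache.Sorted:=True;
end;

destructor TItemLookup.Destroy;
begin
   FCache.Free;
   FItems.Free;
   inherited;
end;

procedure TItemLookup.Add(const aName : String);
var
   item : TItem;
begin
   item:=TItem.Create;
   item.Name:=aName;
   FItems.Add(item);
   FCache.Clear; // the item list changed, cached lookups may be stale
end;

function TItemLookup.Find(const aName : String) : TItem;
var
   i, k : Integer;
begin
   k:=FCache.IndexOf(aName);
   if k>=0 then begin
      Result:=TItem(FCache.Objects[k]);
      Exit;
   end;
   // not cached yet: one naive case-insensitive scan, then remember the outcome
   Result:=nil;
   for i:=0 to FItems.Count-1 do begin
      if AnsiCompareText(TItem(FItems[i]).Name, aName)=0 then begin
         Result:=TItem(FItems[i]);
         Break;
      end;
   end;
   FCache.AddObject(aName, Result);
end;

var
   lookup : TItemLookup;
   i : Integer;
begin
   lookup:=TItemLookup.Create;
   try
      for i:=1 to 20 do
         lookup.Add('MyFieldName'+IntToStr(i));
      Writeln(lookup.Find('myfieldname7').Name); // first call scans, prints MyFieldName7
      Writeln(lookup.Find('myfieldname7').Name); // second call is served by the cache
      Writeln(lookup.Find('NoSuchField')=nil);   // misses are cached too
   finally
      lookup.Free;
   end;
end.
```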

Is there some more fat to be trimmed?

The final profiling of the benchmark looks like this:

16.5% is lost to the overhead (a simple loop that calls FindField repeatedly), we can assume CompareStr is optimal enough, and the rest is spent in TStringList methods which are decently implemented.

As a bonus, you’ll notice there is no noteworthy stress on dynamic memory allocation, meaning things will scale all the better when multi-threading.

As an alternative to TStringList, you could wonder about the new generic TDictionary<>, if you’re feeling adventurous.
However, you wouldn’t be rewarded for your risk, as it’s slower than the solution presented here (up to 5 times slower when there are few fields, and slightly slower when there are hundreds of fields). A look at the profiler shows it wastes most of its time… computing the hash. Its memory overhead is also quite a bit higher and would probably bite in a multi-threaded situation (I’ll let the astute reader figure out why the TStringList approach gets away with very little memory allocation).

The optimization is done.
You could of course go further, there are a couple of low-hanging percents to grab, but to really improve, it would involve libraries or extra code that would go beyond the scope of an isolated optimization case such as this one, IMHO.

Tips