Just committed to the SynEdit SVN a few enhancements:
- much improved performance for long & large files,
still not quite notepad++-class yet, but my profiler tells me there is plenty of juicy low-hanging fruit left
- improvements to the TSynGeneralSyn highlighter (single & double quote mode, token callback, and a few other niceties)
- DWScript syntax highlighter has been moved to the SynEdit SVN, the copy in the DWScript SVN will no longer be the primary reference
edit 27/12: committed another optimization; AFAICT, when working on large files, SynEdit is now faster than the Delphi IDE code editor and *way* faster than Scintilla/notepad++ (with and without a syntax highlighter like DWS’s)
- Fixed gathering of samples in Monte-Carlo multi-threading mode.
- Fixed crash when closing a results page with Ctrl+F4.
- No longer restores to maximized or minimized state.
As you may be aware, for an unknown reason SamplingProfiler doesn’t work under Windows 7. The technical details as to what doesn’t work under Windows 7 are in this stackoverflow question:
If you’ve any idea of the answer, or know someone who might know the answer, any help would be appreciated! The documentation on the subject is very limited both on the Microsoft site and on the Internet…
edit: following a suggestion by Dan Bartlett on stackoverflow, it seems the issue was an alignment constraint introduced (or enforced?) in Windows 7! I’ve compiled a beta version with the change, grab it here: SamplingProfiler-win7-beta.zip (679 kB). If the fix is confirmed, I’ll make a full release. Update: a full release is now available.
DWScript includes a debugging facility, in the form of the IDebugger interface. The TdwsSimpleDebugger component implements that interface and can be used to simply surface the events.
You activate a debugger by merely attaching it to a compiled script (a TdwsProgram), you’ll then get notified of debugging events. Note that the events will be invoked from the thread the script runs in, so if you’ve got some UI updates involved f.i., you’ll have to handle the cross-thread synchronization.
- StartDebug: invoked when the program execution (under debug) starts.
- DoDebug: invoked for each instruction executed.
- StopDebug: invoked when the program execution (under debug) ends.
- EnterFunc: invoked when the script enters a function.
- LeaveFunc: invoked when the script leaves a function.
For aborting script execution, you have the standard TdwsProgram.Stop method.
Standard debugging tasks
Breakpoints can be implemented by merely checking the position of the instruction you get in DoDebug against a list of breakpoints (you can use a TBits for that). You can then suspend or check conditions (see below).
To suspend execution, just don’t return from DoDebug and the script will effectively be suspended.
When stepping, DoDebug will fire on instructions, so if you step from DoDebug to DoDebug, and the user placed two instructions in the same script line f.i., you’ll step twice on the same script line. If that isn’t desirable, you can filter and step only if the source line changed.
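As a sketch of the above (the handler, field names and position accessor are illustrative; the exact DoDebug signature depends on your DWScript version), a breakpoint check with suspension could look like:

```pascal
procedure TMyDebugger.DoDebug(exec : TdwsExecution; expr : TExprBase);
var
   line : Integer;
begin
   // FBreakpointLines is assumed to be a TBits with one bit per source line
   line := expr.ScriptPos.Line;   // illustrative accessor for the current position
   if (line < FBreakpointLines.Size) and FBreakpointLines[line] then begin
      // suspend: simply don't return from DoDebug until the user resumes
      FResumeEvent.WaitFor(INFINITE);   // FResumeEvent : TEvent, signaled from the UI
   end;
end;
```

Remember that this runs in the script’s thread, so the event object is what lets the UI thread resume execution safely.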
Step into and step over can be implemented with the help of the EnterFunc/LeaveFunc notifications.
If you want to evaluate a symbol, you can use Expr.Prog.Table.FindSymbol(), with Expr being the expression you got passed in the DoDebug. That will find the symbol in the current context (the current method if you’re in a method f.i.). You can also find a symbol from the root by going through the Root property.
Note that FindSymbol will return any symbol, not just variables (TDataSymbol), so you can use this to evaluate functions (with side-effects) if you wish. Data symbols are stored in the stack, so the relevant part for you will be a data symbol’s stack address (Addr).
The stack is accessible via Expr.Prog.Stack, you’ll need it to evaluate data symbols. You can of course also use it to modify a variable (just write to the variable’s stack location).
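Putting the above together, a hedged sketch of evaluating a variable from within DoDebug (ReadValue and the exact accessor names are assumptions for illustration; check the actual API of your DWScript version):

```pascal
var
   sym : TSymbol;
   value : Variant;
begin
   // look up the symbol in the current context
   sym := Expr.Prog.Table.FindSymbol('myVar');
   if sym is TDataSymbol then begin
      // data symbols live on the stack, at the symbol's stack address
      value := Expr.Prog.Stack.ReadValue(TDataSymbol(sym).Addr);
      // writing to that same stack location would modify the variable
   end;
end;
```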
DWS 2.1: If you want the call stack, the most pragmatic approach is IME to do it via EnterFunc/LeaveFunc; the script call stack structures exist, but they aren’t really geared towards ease of use when debugging. So you can just maintain your own simplified call stack.
DWS 2.2+: a stack trace is available directly in the execution, both as a raw expression CallStack array, and as a textual version via CallStackToString. You also have access to any script Exception’s call stack via its StackTrace method.
Note on calls to Delphi-functions
Keep in mind the debugger can only operate on the script code. If the script has invoked a Delphi function, and your execution is stuck there, the debugger won’t help. You’ll have to handle suspension in your Delphi code.
That’s a reason why it’s considered good practice here to wrap calls to Delphi code with a safety net, as failures, crashes and incorrect data are the norm when scripting. Raw exposure of Delphi (or external) functions can often be problematic when the user is writing and debugging his scripts.
Using the debugger for profiling
You can easily make use of IDebugger for profiling purposes via DoDebug and EnterFunc/LeaveFunc. Instrumenting is implicitly there, so it’s just a matter of performing the timings.
In our mini-IDE for scripting here (not open-source), a sampling profiler is always active when running code: it periodically suspends the script, notes the current expression, call stack and resumes execution ASAP. For scripting purposes, a sampling frequency of 100 Hz is more than enough IME, and it’s low-frequency enough to have no measurable impact on the script execution time.
Another cheap profiling tool is to add a function call counter (via EnterFunc), with the two combined, you can pretty much identify all bottlenecks at a glance.
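As an illustration, a minimal call counter via EnterFunc might look like this (the handler and the accessor for the function’s name are illustrative assumptions):

```pascal
procedure TMyProfiler.EnterFunc(exec : TdwsExecution; funcExpr : TExprBase);
var
   name : String;
   n : Integer;
begin
   // FCallCounts is assumed to be a TDictionary<String, Integer>
   name := funcExpr.FuncSym.Name;   // illustrative accessor for the function symbol
   if FCallCounts.TryGetValue(name, n) then
      FCallCounts[name] := n+1
   else FCallCounts.Add(name, 1);
end;
```

Combined with the sampling timings, sorting FCallCounts by count is usually enough to spot the hot functions.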
Finally, another tool in your chest can be to implement an execution monitor with the debugger, like the one in SamplingProfiler, which can be useful not just at the IDE level, but also at runtime. A typical use is to make a watchdog screen, where you can have an overview of what all the running scripts are doing: useful to diagnose in-production slowdowns, see if they’re all stuck on the same database access, shared resources, or whatever.
“I keep fixing similar code I see my clients use, and in some cases the performance can increase 5 to 10 times, for large loops. Good you are raising this problem.”
We have similar-looking code being used here in our datasets (which aren’t TDataset-related at all however), and yet, repeatedly looking up fields by name isn’t a performance issue (it hardly registers in the profiling results, even in worst case situations, like in-memory SQLite DBs).
Out of curiosity, I had a look at DB.pas… Suffice it to say that the VCL code makes a good case study of why a profiler is your friend, and of what “out of thin air” optimization can lead to.
The FindField case of Unicode comparisons
The DB.pas code being copyrighted, I won’t post any excerpts here, but you can find it easily enough yourself, so I’ll just describe what happens.
FindField’s purpose is to find a TField by its name, in a case-insensitive fashion.
A naive implementation would thus look like this:
for i := 0 to FFields.Count-1 do begin
   Result := FFields[i];
   if AnsiCompareText(Result.Name, FieldName) = 0 then Exit;
end;
Result := nil;
I then made a quick benchmark, consisting of three cases:
- “best” case consists in finding the first field
- “worst” case consists in finding the 20th field
- “all” case consists in finding 20 fields out of the 20 fields
Field names were like “MyFieldName1”, “MyFieldName2”, etc. up to “MyFieldName20”. You’ll note that the differentiating characters are at the end of the string, so the situation is quite unfavorable in terms of string comparisons, but neutral if you were to hash those strings f.i.
Also keep in mind I’ve got a recent CPU (at the date of writing), and the timings below are for 100k lookups. On a regular end-user machine, you could probably double or quadruple the figures.
With the naive implementation, the “best” case performance is 19 ms, “worst” case 400 ms, and “all” is 210 ms.
This is quite lengthy, as with Unicode, case insensitive comparison (AnsiCompareText) is quite complex and expensive time-wise. There can be an obvious performance issue with FindField if used in a loop.
Case study of an optimization gone wrong
To cut down on that complexity, the VCL implementors chose to go for a hash. A risky choice to begin with: a hash has to be good enough to limit collisions, it has to be computed fast enough to actually bring a benefit, and last but not least, it results in sometimes complex extra code (and here you need a case insensitive hash, a plain old binary hash won’t do).
So the VCL code goes on to compute a hash for each of the fields, and alters the naive implementation above by checking the hash before performing the AnsiCompareText, doing the comparison only if the hash matches. So far so good, eh? Well, here the trouble begins.
First, the VCL code is still facing one AnsiCompareText per FindField hit, plus an AnsiUpperCase (which is almost as expensive) to compute the hash.
Second, the TNamedItem.HashName implementation is a collection of don’ts; look for yourself in the code:
- it includes a custom GetStringLength inline, probably to cut down on the access to the string character count, an access which happens twice in the implementation (and which profiling would have revealed to be negligible, especially in light of the following)
- since the introduction of FrankenStrings, a String can hold Ansi as well as UTF16, and the hash is computed on an UpperCase’d UTF-16, so you’ve got extra conversion code in there, in case it is Ansi, including dynamic allocation to serve as buffer for UnicodeFromLocaleChars, which is invoked no less than twice
- in case the String is already UTF-16, a buffer is still dynamically allocated, and the AnsiUpperCase’d name copied to it, I guess that’s to preserve the “efficiency” of the loop that computes the actual hash…
- then comes the hash computation loop, that was obviously where the optimization effort went, it works on a PChar, does a ROL and XOR, and it is certainly efficient, except that a mere profiling would have shown its efficiency didn’t matter, unless your field names are several hundred characters long…
The VCL implementation has a “best” case performance of 42 ms (2 times slower than naive), a “worst” case of 50 ms (8 times faster than naive), and “all” of 46 ms (5 times faster than naive).
Thing is, you likely won’t often have 20 fields in your queries, and the VCL implementation needs at least 3 fields to pull ahead of the naive implementation. Given the size and complexity of the VCL code involved, I would say that’s quite an under-achievement.
Last but not least, if you profile the VCL code, you’ll see that HashName and a whole bunch of memory allocation and string management code from System.pas are quite stressed. Given the above, that’s not too surprising, but it means performance in a multi-threaded situation will only get worse.
Doing it the efficient way
Let’s do it with the help of a profiler, and a bit of laziness.
Initially, AnsiCompareText is the obvious, overwhelming culprit the profiler will point to in the naive implementation. There are two roads from that point:
- optimizing AnsiCompareText, this is complex, involves quite a bit of code, and we’re lazy, remember?
- the fastest AnsiCompareText is the one you don’t do, that’s the lazy road.
How to not do the AnsiCompareText?
One reason there are so many of them in the first place is that there is a loop on the fields. And when optimizing, loops are good, they mean you’ve got big O optimization potential, and big O optimization is how you achieve orders of magnitude speedups.
In this case, it’s a simple O(n) string search loop, for which one of the classic optimizations is a reduction to O(ln n) using binary search. That however requires an ordered list, and the Fields list isn’t sorted, and can’t be sorted.
So we need a good old fashioned index.
One such readily available index is good old TStringList, with Sorted set to True: place the field names in the Strings, the TField object in the Objects. Use IndexOf() to find a field. That’s all. You have reduced the AnsiCompareText from O(n) to O(ln n).
// after filling up or altering the Fields
// FIndex assumed created, set to sorted, case insensitive
FIndex.Clear;
for i := 0 to FFields.Count-1 do
   FIndex.AddObject(FFields[i].Name, FFields[i]);
...
// Find a field with
i := FIndex.IndexOf(fieldName);
if i >= 0 then
   field := TField(FIndex.Objects[i])
else field := nil;
With the above code, on a 20 fields situation, the best/worst/all cases all benchmark around 78 ms, which is coherent with the O(ln n) expectation (there is no best or worst case).
A quick look at the profile reveals AnsiCompareText is still the overwhelming bottleneck; instead of being the one in our code, it’s now the one in TStringList.CompareStrings. The profiler tells us the AnsiCompareText is the key, no need to worry about optimizing Length().
How to go further from there?
We are O(ln n), could we go to O(1)? That would involve a hash list, which means a hash, a case-insensitive hash. We don’t have one handy, they’re complex, this is not lazy. Also a hash list can be costly memory-wise, we don’t want a huge setup for what could be a two-fields-dataset affair in practice.
The fastest AnsiCompareText is the one you don’t do… so, do we really need AnsiCompareText? No.
Why? Because we only really need to index the names that are looked up, and FindField is troublesome when it’s invoked in a loop, looking up the very same name strings again and again, ad nauseam.
Rather than indexing the field names, we can index the field names that are actually looked up, but this time in a case sensitive TStringList, thus changing our index to a cache of sorts:
// after filling up or altering the Fields
// FIndex assumed created, set to sorted, case sensitive
FIndex.Clear;
...
// Find a field with
k := FIndex.IndexOf(FieldName);
if k >= 0 then
   Result := TField(FIndex.Objects[k])
else begin
   // not in index, find it with naive implementation
   Result := nil;
   for i := 0 to FFields.Count-1 do begin
      Result := FFields[i];
      if AnsiCompareText(Result.Name, FieldName) = 0 then Break;
   end;
   // add to index
   FIndex.AddObject(FieldName, Result);
end;
After benchmarking, however, there is no speedup… A look at the profiling results shows that TStringList.CompareStrings is still the bottleneck, this time because of AnsiCompareStr… is there no end to these slow Unicode functions?
Why is AnsiCompareStr slow? In part because in Unicode the same character can be encoded differently, and in part because the WinAPI implementation is just plain no good.
In our case however, the Unicode encodings details don’t matter, it’s a cache, the ordering is meaningless as long as it is consistent, so we can just subclass TStringList and override CompareStrings:
function TIndexStringList.CompareStrings(const S1, S2 : string) : Integer;
begin
   Result := CompareStr(S1, S2);
end;
In Delphi XE, CompareStr() is fast, it is still based on Pierre le Riche’s FastCode version (but who knows what will happen to it in Delphi 64 where there is no BASM? but I digress…).
Wrapping it up
The new benchmark figures are now around 8 ms, in all cases. That’s five times faster than the VCL’s best case with a 20 fields dataset, and it scales nicely with field counts, as we’re still O(ln n). Just to illustrate:
- with 3 fields: VCL takes 42 to 45 ms, our code 4 to 5.6 ms
- with 100 fields: VCL takes 42 to 81 ms, our code 10 to 18 ms.
This is also achieved with a lot less code than the VCL, no asm or fancy tricks, and we got speedups of the same magnitude as what Marco Cantu reported, and as what François and all TDataset users have to labor for by optimizing on the spot every time they have a loop on a dataset.
Is there some more fat to be trimmed?
The final profiling of the benchmark looks like this:
As a bonus, you’ll notice there is no noteworthy stress on dynamic memory allocation, meaning things will scale all the better when multi-threading.
As an alternative to TStringList, you could wonder about the new generic TDictionary<>, if you’re feeling adventurous.
However, you wouldn’t be rewarded for your risk, as it’s slower than the solution exposed here (up to 5 times slower when there are few fields, and slightly slower when there are hundreds of fields). A look at the profiler shows it wastes most of its time… computing the hash. Its memory overhead is also quite a bit higher and would probably bite in a multi-thread situation (I’ll let the astute reader figure out why the TStringList approach gets away with very little memory allocation).
The optimization is done.
You could of course go further, there are a couple low hanging percents to grab, but to really improve, it would involve libraries or extra code that would go beyond the scope of an isolated optimization case such as this one, IMHO.
A new version has been released which adds support for the new Delphi XE paths, you can download it here.
Note that there is still a pseudo-random issue of unknown origin under Windows 7, where the utility may or may not be able to gather profiling information. If that happens, close the application and launch it again. The issue doesn’t happen with Windows versions prior to 7.
Passing parameters as “const” is a classic Delphi optimization trick, but the mechanisms behind that “trick” go beyond cargo-cult recipes, and may actually stumble into the “good practice” territory.
Why does it work?
The most well known case is “const String“, for which the compiler then takes advantage of the “const” to pass the String reference directly (as a pointer)… and without increasing the reference counter.
To illustrate what the reference counting implies, here are two screen-shots from the CPU disassembly view, taken in Delphi 6, but the recent compilers up to Delphi 2010 at least behave in a similar fashion (just with the Unicode version of the functions).
I’ve made a pseudo-function with only one String parameter that calls a single function (here Length(), which isn’t inlined in Delphi 6 and corresponds to LStrLen). Guess which CPU disassembly corresponds to “Test(const a : String)” and which to “Test(a : String)”!
What you’re seeing on the left is an implicit try…finally construct which is used to protect the reference counting (LStrAddRef/LStrClr) on the implicit local variable used to store the String parameter. If you have such calls nested a few levels deep (not uncommon in object-oriented code), you could accumulate quite some overhead.
Also not seen here, but hidden within the AddRef and Clr calls, are bus locks, which may hit you disproportionately in multi-threading scenarios.
And what if you’re modifying the passed String? Can you forego the “const“? Well no, the extra overhead is still there, and using a “const” is still beneficial.
String isn’t the only type affected, there are similar gains for all reference counted types (interfaces, dynamic arrays, records holding reference-counted types). And when passing a record as “const“, there is an additional gain in the lack of defensive copying of the record (the larger the record, the greater the gain).
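To illustrate the record case, a minimal sketch (the type and function names are hypothetical): without “const”, the compiler makes a defensive copy of the whole record at each call; with “const”, only a reference is passed.

```pascal
type
   TBigRec = record
      Data : array [0..255] of Double;   // 2 kB payload
   end;

// without "const": the 2 kB record is defensively copied on every call
function Sum(r : TBigRec) : Double;

// with "const": only a reference is passed, no copy is made
function SumConst(const r : TBigRec) : Double;
var
   i : Integer;
begin
   Result := 0;
   for i := 0 to High(r.Data) do
      Result := Result + r.Data[i];
end;
```

The larger the record and the hotter the call site, the more that defensive copy shows up in the profiler.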
For ordinal and numeric types, there is no compiler optimization (yet), but it can be good practice to mark them as “const“, not in the optimistic hope that someday the compiler will optimize it, but to ease up debugging and code writing. Copies of those parameters (when you want to modify them in the function body) are typically handled adequately by the compiler (and often enough come for free), so don’t let performance considerations hold you back.
There is one case in which using “const” could have adverse effects, but it involves global variables which you would modify or release during the call… so you don’t really need to know more about it, do you?
How does it manifest itself in Profiling?
A missing “const” can manifest itself in different ways during profiling. For String, Interface and dynamic array types, you’ll usually see it via the relevant xxxAddRef or xxxClr functions (the obvious case) and via time spent in begin/end (the less obvious case). Depending on how many function calls with reference counting are involved in your inner loop, none of the above may come out ahead, even if calls and reference counting are an issue: the time spent will be spread over the above functions and your various begin/end sections.
For record parameters, the lack of “const” could materialize itself in “Move“, or other data copying costs (large records will be moved, smaller ones can be copied via ad-hoc in-place code by the compiler).
In multi-threading, things can be a little less obvious, as rather than the reference counting and exception frame, it could be the bus locks hitting you, or sometimes worse, inter-CPU traffic to pass around the values of the reference counter.
Multi-threading’s worst case would be referring to the same string field from multiple threads, and passing that string around in functions without “const” parameters, thus repeatedly hitting the reference counter of that string from multiple threads at the same time, resulting in extra inter-CPUs traffic to share the value of that reference counter.
In such cases, “const” can help, but it will often not be enough, you’ll also have to look for functions/methods that return strings with an unnecessary reference counting, like when getting your strings from a TStringList.
Usually I tend to “const” just about every parameter that isn’t “var” or an object. This isn’t just to pre-emptively alleviate possible performance issues, but also to gain something valuable when debugging and writing a function’s body: untouched input parameters, wherever you are. That holds not just in the function’s body itself, but also while in the functions called by the function.
And that’s the key point to remember about “const” parameters: it’s there to state that a parameter will be left unchanged inside the body, something which allows the compiler to optimize its output, and the developer to optimize his thought process too. Qualifying all your parameters as either “const” or “var” makes your intentions obvious.
So IME, it’s not worth it to hoard your keystrokes, best spend them on “const” and explicit local variables when you really need to alter those input parameters.
Whoever has to maintain your code will be thankful.
There is one case I’ve not mentioned in all the above: object parameters (and class references). It’s a special case because when you pass an object as “const“, you’re not making any promise of keeping the object constant, just of keeping the underlying variable (the pointer) constant. This is unlike other languages. There is also a risk of confusion when your code is read by developers not really familiar with the underlying pointer-based mechanics of objects (and they are more common than you may think, maybe even more so in highly OO languages).
So for objects, unless you want to spend a lot of time educating, my rule of thumb is to avoid “const” or “var” for object parameters.
Beyond “const” parameters
Const’ing your parameters will eliminate some defensive copying, implicit try…finally, and reference counting overhead, but it leaves open another field of similar inefficiencies: that of return values.
Alas there is no “const Result” syntax, which would allow you to state that your result can’t be altered in any way (a sort of strict immutability if you will). Thus whenever you have a function returning a String f.i., you’ll end up triggering the implicit reference counting and exception machinery if you’re not careful (like in this case)…
There are strategies to tackle this issue, but none of them are as simple as adding a “const“, and most of them come with sacrifices… so, let them be food for other articles.
SamplingProfiler v1.7.4 is now available. This version adds an option for Delphi 2010 paths, and fixes a bug with the silent mode execution that would render it inoperative. There also have been other minor changes, mostly cosmetic.
This release also includes preparation for an “attach to process” option, which is currently not enabled, but should hopefully make it into the next version (available “when ready”).