Chapter 36. Investigating Managed Code Performance

 

You cannot teach beginners top-down programming, because they don’t know which end is up.

 
 --C. A. R. Hoare

An important element of software development, especially when dealing with critical tools and tools that perform complex calculations, is performance and optimization. Performance testing and optimization techniques for pretty much every language have been around for quite some time, and developers making the switch to .NET, or hesitating to make it, worry about code performance and about not knowing how to write optimized managed code. The truth for developers worried about managed code optimization is that much of the advice remains the same as it was prior to the introduction of .NET and the managed runtime.

In this chapter, I will briefly cover two approaches for investigating performance and then lead into a substantial number of considerations for writing efficient code for the .NET platform.

Investigating Performance

There are two discrete approaches for investigating performance: white box investigation and black box investigation. Both approaches have different strengths and weaknesses, and sometimes it is beneficial to use both approaches to conduct a robust and thorough performance investigation.

Using the white box approach involves studying the implementation details behind a particular component or function, and deriving a list of characteristics to factor into performance testing based on the complexity and perceived cost of completing a particular task. White box testing is great when trying to understand the technology in greater detail, and it allows you to easily identify and address performance pitfalls related to implementation. However, white box testing makes it very difficult to produce quantitative performance metrics. Relative performance can be measured in orders of magnitude, but only for a piece of functionality in isolation from other components in the system or from various software and hardware configurations. Code that appears slow or overly complex may run quite fast on different hardware, with a caching strategy, or with an optimizing compiler. Code that appears efficient has the potential to run extremely slowly or inefficiently on a different hardware or software configuration.

Using the black box approach involves disregarding the implementation details and instead basing test results on overall execution time. Black box testing is great in the sense that you can end up with a set of strong and unambiguous metrics to measure performance, including a fairly precise understanding of the capacity of a particular test case. The downside to black box testing results from the nearly infinite number of software and hardware combinations, also known as a combinatorial explosion. The sheer volume of combinations makes it nearly impossible to determine a set of distinct systems that tests can be performed against.

Choosing the best approach depends on the type of investigation being conducted and on how large the system is. White box testing will not suffice for large systems where there is a massive amount of code, or for systems that rely on third-party components for which source code is not available. Black box testing can really only determine which operations or tasks are problematic, making it difficult to pinpoint exactly where a problem originates, unless the slowdown results from the architecture and the communication between distributed and isolated components. Quite often, it makes sense to perform a little bit of white box testing in complex areas and then conduct a thorough investigation of performance using the black box approach.

Avoid Manual Optimization

Many developers, especially the ones who have been in the industry for a while, are used to optimizing code so tightly that they are concerned with how the compiler is going to execute each segment of critical code, and so they use alternate syntax in an attempt to compile more efficient code. There is no point in trying to perform low-level optimizations manually. Current compilers on the market are quite intelligent when it comes to low-level optimizations, even smarter than you. Just make sure that you enable compiler optimizations when building a final release so that the compiler can work its magic. Be sure to test your application in release mode because preprocessor symbols and conditionally compiled function calls can perform unexpectedly when executed in a different compilation mode. Certain bugs may be masked in debug mode and only appear in an optimized compilation.

The optimizations that you should be concerned with are high-level, such as memory allocation, network traffic, and using inappropriate data structures and algorithms. Make sure that you profile your code before attempting to optimize. It is a waste of time to incorrectly guess where code is suffering from poor performance and attempt to optimize in an area that does not need it.

Use code analysis tools such as FxCop to help identify performance bottlenecks. Some bottlenecks are very hard to identify without a robust tool to help you. Lastly, pass your assemblies through a commercial obfuscator. The main purpose of an obfuscator is to make it difficult to decompile your application by mangling your private and internal type and variable names, but some obfuscators can increase performance by a slight amount through shorter names and optimized memory layouts.

String Comparison

Pretty much every application performs comparisons between strings; the only real variable is the number of comparisons performed. String equality can be defined as two strings containing an identical sequence of characters, also known as binary equality. This type of string comparison works for most situations, but binary equality does not suffice when multiple locales are used or when case sensitivity matters. The term logical equality is used to describe two strings that are equivalent despite binary differences.

The System.String data type of the .NET framework provides numerous ways to store and manipulate string data, including methods for performing binary and logical equality comparisons. Three methods exist that check for binary equality: two instance methods and one static method. The first instance method is strongly typed to accept a string parameter, and the other instance method overrides the Equals method inherited from System.Object. The overridden method is not recommended unless you are comparing more than just strings, because it suffers a performance penalty from the type checking it must perform. The static method uses a performance-tuned approach that is employed throughout the .NET framework. First, a check is performed to see whether either string is null. Then a reference equality check is performed to see whether the two strings refer to the same object. If no result has been returned yet, the virtual instance method is called.
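
The following code is a minimal sketch showing the three binary equality checks side by side.

string string1 = "managed";
string string2 = "managed";

// Strongly typed instance method; no type check is required.
bool equal1 = string1.Equals(string2);

// Overridden System.Object method; pays for a runtime type check.
bool equal2 = string1.Equals((object)string2);

// Static method; checks for null and reference equality first, and
// falls back to the virtual instance method only when necessary.
bool equal3 = String.Equals(string1, string2);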

Note

The C# equality operator ==, represented by the MSIL op_Equality, simply makes a call to the static Equals method of System.String, so you do not have to worry about performance with this one.

Logical equality is provided through the use of the overloaded Compare method, which has parameters for locale formatting and case-insensitive comparisons. Unlike the Equals method, which returns a boolean value, the Compare method returns an integer that describes the lexical difference between the two strings, with a value of zero indicating that the two strings are lexicographically identical.
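
As a brief illustration, the following sketch performs a case-insensitive comparison under an explicit locale. Whether two particular strings compare as equal depends on the collation rules of the culture supplied.

using System.Globalization;

// Case-insensitive comparison under German collation rules; the return
// value describes the lexical difference between the two strings.
int result = String.Compare("straße", "STRASSE",
                            true,                     // ignore case
                            new CultureInfo("de-DE"));
if (result == 0)
{
    // The strings are logically equal under this locale.
}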

Being able to perform locale-aware and case-insensitive comparisons comes at a significant performance cost. The cost depends on the locale used and the complexity of the rules related to that locale. Because of this fairly significant performance hit, it is important to minimize calls to the Compare method whenever possible. One approach is to identify comparisons where case and locale rules can be ignored, and use String.Equals() instead of String.Compare(). This approach works well in situations where the data originates from back-end or embedded systems, to name a couple of examples. Situations where case and locale rules cannot be ignored, but binary equality is common, are best served by calling String.Equals() before String.Compare(). Doing so can result in a considerable performance gain if most of the comparisons exhibit binary equality. The following code shows this.

string string1 = "This is from one place";
string string2 = "This is from elsewhere";
if (string1 == string2 || String.Compare(string1, string2) == 0)
{
    // Handle identical strings
}

For case-insensitive comparisons, String.Compare() is able to perform them without allocating new strings, whereas a call to String.Equals() combined with calls to String.ToUpper() will result in two new strings being allocated.

It is a common mistake to think that checking whether the length of a string is zero is significantly faster than comparing the string to the empty string constant. There is a sliver of a performance gain when checking the length, because only the string metadata must be examined, but the increase is marginal and it is doubtful that you will see any measurable benefit. Note that you should compare against the canonical String.Empty instance rather than the literal "", which avoids any chance of an extra string allocation.
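
Both forms are shown below. One difference worth noting: the Length check throws a NullReferenceException when the string is null, whereas the equality check simply returns false.

string input = "";

if (input.Length == 0)          // examines only the string metadata
{
    // Handle the empty string
}

if (input == String.Empty)      // compares against the canonical instance
{
    // Handle the empty string
}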

String Formatting

Having the ability to insert, remove, and replace data inside strings is a common task in almost every application. This functionality is provided through a couple of mechanisms of the .NET platform and class framework. The first mechanism, also the easiest to use, is the ToString() method, which is available on every type because all types inherit from System.Object. The default behavior of this method is to return the full name of the type, though it can be overridden to support extended functionality. The typical implementation returns a partial representation of the class built from the member variables inside the class. Pretty much all the string formatting features of the .NET framework boil down to using ToString() at some point, so it is important to make sure that this method performs efficiently.

Another formatting mechanism is String.Format, which functions like sprintf in the C realm. This method formats a string against a pattern in which tokens are substituted with values from the supplied argument array. It is perhaps the slowest of the bunch: the arguments are passed as an object array, causing a number of boxing and unboxing operations to occur, and it lacks the efficiency of the typed overloads provided by a class like StringBuilder. Classes can also implement IFormattable to extend the formatting capabilities of ToString().
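
The following sketch shows a class implementing IFormattable; the Temperature type and its "F" format specifier are hypothetical, chosen only to illustrate the pattern.

using System;

public class Temperature : IFormattable
{
    private double celsius;

    public Temperature(double celsius)
    {
        this.celsius = celsius;
    }

    // Extends ToString() with format string and culture support.
    public string ToString(string format, IFormatProvider provider)
    {
        if (format == "F")  // hypothetical Fahrenheit format specifier
        {
            return (celsius * 9.0 / 5.0 + 32.0).ToString(provider) + " F";
        }
        return celsius.ToString(provider) + " C";
    }

    public override string ToString()
    {
        return ToString(null, null);
    }
}

A call such as String.Format("{0:F}", new Temperature(21.5)) routes the "F" specifier through the IFormattable implementation.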

One of the biggest slowdowns when dealing with strings is that they are immutable, which means that any time a string is modified, a new string object is created. Immutable strings are great when they are shared frequently and are modified infrequently. Reading is cheap in terms of performance because locking or reference counting is unnecessary, and you can avoid abstraction and sharing schemes. The class framework provides the StringBuilder class, which is used to perform high-performance string operations against a mutable Unicode character array. After modifications on the string within a StringBuilder are complete, you can call ToString() to retrieve the contents of the internal string.

Note

Calling ToString() on a StringBuilder object will simply reference the internal character array, but a copy operation occurs as soon as the result is assigned to a string object and further operations are performed on the StringBuilder. The recommended approach is to only call ToString() after all modifications on the StringBuilder are complete. Otherwise, it is advisable to use the ToString() overload that allows only a substring of the internal character array to be returned.

StringBuilder manages an internal character buffer that is allocated during instantiation. The initial capacity defaults to an array of 16 characters, but you can specify a different value for the initial capacity as a parameter in the constructor. When an operation requires that the internal size of the StringBuilder be increased, a new array that is double the size of the old one is created, and the old data is copied into the new array. The reallocation is quite expensive, and it should be avoided as much as possible. It is highly recommended that you explicitly set the initial capacity if you have enough information to estimate the value that works best for your situation.
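
For example, assuming the final text is known to be roughly two thousand characters, pre-sizing the buffer avoids a whole chain of doubling reallocations.

using System.Text;

// Pre-sizing the buffer avoids the 16 -> 32 -> ... -> 2048 doubling chain.
StringBuilder builder = new StringBuilder(2048);
for (int index = 0; index < 100; index++)
{
    builder.Append("a fragment of the final text; ");
}
string result = builder.ToString();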

There are some downsides to working with a StringBuilder object over a String object. The first downside is that StringBuilder only implements a fraction of the functions offered by String. StringBuilder also incurs a significant overhead cost when first initialized, so there are times where using StringBuilder for only a handful of manipulations can actually decrease your performance. The rule of thumb is to only consider using StringBuilder when the number of manipulations reaches double digits. StringBuilder is great to use when appending strings within a loop. There is no real definitive answer on when to use StringBuilder, because the performance is dependent on system parameters and design. It may not be such a bad idea to profile critical code when using StringBuilder or another formatting mechanism to determine which approach is faster.

Concatenating strings is a form of string formatting, and there are additional performance gains to investigate here. String.Concat() is, by far, the fastest and most efficient way to join a couple of strings together. Use this method over anything else if you can combine all your strings in a single call to String.Concat(). Otherwise, resort to a more flexible mechanism like StringBuilder when you need to join many strings together. Never join many strings by repeatedly applying the concatenation operator to normal String instances; this is the most inefficient way you could possibly accomplish the task.
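
The sketch below contrasts the two approaches; the string values are arbitrary.

string first = "one ";
string second = "two ";
string third = "three";

// Fastest: a single call joins all the pieces with one allocation.
string joined = String.Concat(first, second, third);

// Slowest: each += allocates a brand-new string and copies both halves.
string slow = first;
slow += second;
slow += third;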

String Reversal

The unification of text storage and manipulation into a single String data type was an excellent decision, although the .NET Framework is missing a way to efficiently reverse the contents of a string. String reversal is an uncommon activity but not extremely rare. There are a number of ways to accomplish string reversal, such as appending each string character to a StringBuilder in reverse order, generating a character array and calling Array.Reverse, or calling the StrReverse method in the Microsoft.VisualBasic.Strings library. All three methods will perform the task, but they are not the most efficient way to accomplish it.

The fastest way to perform string reversal is to copy each character from the input string into a character array in reverse order, and afterwards construct a new string from the reversed character array.

The following code shows how to do this.

string ReverseString(string input)
{
    // Copy each character from the input into the array in reverse order.
    char[] chars = new char[input.Length];
    int index1 = input.Length - 1;
    int index2 = 0;
    while (index1 >= 0)
    {
        chars[index1--] = input[index2++];
    }

    // Construct a new string from the reversed character array.
    return new string(chars);
}

Compiling Regular Expressions

In a nutshell, regular expressions are a very powerful text manipulation tool that compresses verbose and suboptimal matching and manipulation logic into a couple of lines composing an efficient regular expression. The .NET framework provides a number of robust classes for working with regular expressions, such as the Regex type in the System.Text.RegularExpressions namespace. Regex provides a mechanism to execute a regular expression against a text string. When a regular expression is set on a Regex object, it is converted to a partially compiled representation, which is cached for execution during the application lifetime.

In order to further increase performance when executing a regular expression, there is support for pre-compiling a regular expression to MSIL, which will then be JIT’ed (Just-in-Time compiled) to native code before execution. Pre-compiled regular expressions are placed in dynamically generated assemblies that can be loaded at runtime within an application domain. Assemblies cannot be unloaded, so there is a potential problem with this approach where you will not be able to unload regular expressions from memory until the application domain itself is released. To solve this problem, you can persist the dynamically generated assemblies to the hard drive and load them at runtime into a second application domain. This functionality is available through the Regex.CompileToAssembly() method.

The following code shows how to compile a regular expression to a dynamically generated assembly.

using System;
using System.Reflection;
using System.Text.RegularExpressions;
string name = "AlphaNumericTest";
string nameSpace = "CompiledExpression";
string assembly = "RegularExpressionTest";
string expression = "[^a-zA-Z0-9]";
RegexOptions options = RegexOptions.None;
RegexCompilationInfo info = new RegexCompilationInfo(expression,
                                                     options,
                                                     name,
                                                     nameSpace,
                                                     true);
AssemblyName assemblyName = new AssemblyName();
assemblyName.Name = assembly;
Regex.CompileToAssembly(new RegexCompilationInfo[]{info}, assemblyName);

The following code shows how to use the regular expression that has been compiled into the RegularExpressionTest assembly.

using System;
using System.Text.RegularExpressions;
string searchString = "Your Search String Here";
CompiledExpression.AlphaNumericTest expression
                                      = new CompiledExpression.AlphaNumericTest();
foreach (Match match in expression.Matches(searchString))
{
    // Do something with match.Value
}

Note

The performance improvement from using precompiled expressions is dependent on the regular expression used.

Use the Most Specific Type

In the majority of object-oriented programming languages that support inheritance, it is generally possible to declare a variable using any ancestor of the type actually being stored. For example, you could instantiate a SpeedBoat object and reference it with a variable of type Boat. Unless there is a specific reason to do otherwise, the general rule is to use the most specific type possible, because doing otherwise can cause performance problems. An example would be declaring a variable of type Object and storing an integer in it. In this particular example, Object is a reference type and integer is a value type; treat an integer as an object and you end up with boxing operations.

VB.NET is more prone to errors of this nature than C#, because C# forces you to explicitly cast a reference type storing a value type back to the correct type before, for example, using it in arithmetic operations. The explicit cast gives the C# compiler enough information to generate relatively efficient code, although using the most specific type in the first place is still the most efficient option.
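
The following sketch contrasts the two declarations described above.

// Declared as Object: the integer is boxed on assignment, and an
// explicit cast is required to unbox it before the arithmetic.
object looseCount = 5;
int next = (int)looseCount + 1;

// Declared with the most specific type: no boxing, direct arithmetic.
int count = 5;
int nextValue = count + 1;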

Avoid Boxing and Unboxing

There are two data types in the .NET platform—value and reference—and new developers can introduce significant performance penalties without fully understanding the implications behind boxing and unboxing operations.

Value types are lightweight objects that are allocated on the stack, unless the value type is allocated as an array element, or if the value type is a field of a reference type. All primitives and structures are value types that are derived from System.ValueType. Value types are stack-based, which means that allocating and accessing them is much more efficient than using reference types.

Reference types are heavyweight objects that are allocated on the heap, unless the stackalloc keyword is used. Reference types impart a level of indirection, meaning that a reference is required to access their storage location. These types cannot be accessed directly, so a variable always holds a reference to the actual object or it is null. Reference types are allocated on the heap, so the runtime must check that each allocation is successful.

A boxing operation occurs when a value type needs to behave like a reference type. The Common Language Runtime allocates enough memory to hold a copy of the value type, including the necessary information to create a valid reference type. There is a significant amount of performance overhead because of the heap allocation and storage of the value type state. This conversion can occur explicitly through a cast operation, or implicitly by an assignment operation or a method call.

An unboxing operation occurs when a boxed value type is to be explicitly converted back to a value type on the stack. The Common Language Runtime returns a pointer to the referenced data, and then the data is typically copied to a location on the stack through an assignment operation. The boxed value type will still remain on the heap until the garbage collector is able to reclaim it.

It is important to be aware of areas of your code where large numbers of boxing and unboxing operations occur. Also be aware that the .NET framework has many methods and properties that can cause implicit conversions to occur when used with value types. If a method takes an object as a parameter, then value type instances will be boxed.
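
The following sketch shows a common example of the problem: ArrayList.Add() takes an object parameter, so every integer added is boxed, and every read back out is unboxed.

using System.Collections;

ArrayList list = new ArrayList();
for (int index = 0; index < 1000; index++)
{
    list.Add(index);             // Add(object): each int is boxed on the heap
}

int total = 0;
foreach (int value in list)      // each element is unboxed back to the stack
{
    total += value;
}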

Use Value Types Sensibly

Using value types in performance-critical code can lead to some performance gain, but only if they are used correctly; performance can be significantly decreased if value types are overused or used inefficiently. Value types are much faster to allocate and deallocate, and they also take up less space in memory. The size difference between a value type and a reference type on a 32-bit machine is three words, because reference types store the reference to the object data, a sync block index, and a method table index. Three words may seem insignificant, but consider situations where you have a large number of objects. You also need to consider the performance implications when value types need to behave as objects, resulting in boxing and unboxing operations.

Working with structures can also offer the potential for performance improvements. Classes are specified as auto layout so that the CLR can arrange fields in the optimal manner for speed and memory size, taking byte alignment into account. Structures are specified as sequential layout by default, which makes things easy when passing structures to P/Invoke and unmanaged code, because the layout of the structure easily maps to the structure in unmanaged code. Performance in this situation is ideal because hardly any marshaling is required. However, using structures with sequential layout without interacting with unmanaged code is very inefficient. If you are using structures for performance reasons, without the intent to communicate with legacy code, you can explicitly declare a struct as auto layout with the following code.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Auto)]
public struct MyStructure
{
    // ...
}

The Myth About Foreach Loops

A common misconception with code optimization is that using a for loop instead of a foreach loop will offer better performance. This advice used to be correct back when compilers were not intelligent enough to recognize the logical equivalence between a for and a foreach loop in like situations. The belief is based on the assumption that an enumerator object must be instantiated inside a foreach loop to iterate through the elements of a collection; when iterating over an array, however, the compiler emits simple indexed access, so no enumerator is involved. Using a foreach loop to iterate through the elements of an array will make no substantial difference in performance, if any at all.

To review, writing a for loop like the following code:

for (int index = 0; index < array.Length; index++)
{
    // Do processing on array[index]
}

will perform the same as a foreach loop like the following code (assuming an array of bytes for the sake of argument).

foreach (byte element in array)
{
    // Do processing on element
}

There is one situation where a for loop might be more efficient than a foreach loop, and that is when the size of the collection is a fixed value that you are aware of when writing the code. Consider the following code.

for (int index = 0; index < 15; index++)

Being able to write a for loop with a constant iteration count will give the JIT compiler a lot more flexibility and scope for optimization.

Use Asynchronous Calls

The .NET platform offers mechanisms for both asynchronous and synchronous execution. Typically, synchronous execution is used for the bulk of your application, though some situations warrant an asynchronous model in order to increase performance and responsiveness. An example would be downloading a file from a network or the Internet, which is generally a time-consuming task, depending on the size of the file. The file could be downloaded asynchronously while the user interface displays the running progress of the operation.

An asynchronous model can be extremely advantageous when used correctly, although it can destroy your performance if used incorrectly.

Note

There is a small overhead penalty incurred when using asynchronous calls. When an asynchronous call is invoked, the security state of the call stack is copied and attached to the thread that is executing the asynchronous call. This penalty is insignificant if the callback executes a fair chunk of code, or if the asynchronous calls are infrequently executed.
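
As a rough sketch of the asynchronous model, the following code uses the delegate-based pattern to run a hypothetical Download method on a thread-pool thread while the caller continues working.

using System;

public delegate byte[] DownloadHandler(string url);

public class Downloader
{
    static byte[] Download(string url)
    {
        // ... lengthy network operation ...
        return new byte[0];
    }

    static void Main()
    {
        DownloadHandler handler = new DownloadHandler(Download);

        // Begin the call on a thread-pool thread; the caller is not blocked.
        IAsyncResult result = handler.BeginInvoke("http://example.com/file",
                                                  null, null);

        // ... update the user interface, do other work ...

        // Harvest the result; blocks only if the call has not completed.
        byte[] data = handler.EndInvoke(result);
    }
}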

Efficient IO Buffer Sizes

The .NET framework provides a number of buffered stream classes, such as BufferedStream. These buffers have a default size, but you are able to set the value to any size that you want. Even with this freedom, in almost every case you will get sub-optimal performance unless the buffer size is set to a value between 4,000 and 8,000 bytes. Generally, the only time a large buffer size is efficient is when a very predictable size is being managed, such as files that are usually around the same size.
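
A minimal sketch follows: a FileStream wrapped in a BufferedStream with an explicit 4 KB buffer (the file name is hypothetical).

using System.IO;

FileStream file = new FileStream("data.bin", FileMode.Open);
BufferedStream buffered = new BufferedStream(file, 4096);  // 4 KB buffer
try
{
    int current;
    while ((current = buffered.ReadByte()) != -1)
    {
        // Process each byte; most reads are served from the buffer.
    }
}
finally
{
    buffered.Close();   // closes the underlying FileStream as well
}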

Minimize the Working Set

Managed code takes care of many low-level responsibilities and handles them transparently, but managed code does not always handle things in the most efficient way possible. External assemblies are loaded into the main application domain when they are used for the first time, which increases memory usage and decreases performance by a slight amount. Therefore, it is important to minimize the number of assemblies that you use in order to keep the working set small.

The more types an assembly contains, the more memory it will take up and the more time it will take to be JIT'ed. Consider moving types that are rarely used into separate assemblies that can be loaded into a second AppDomain on demand. The same goes for large resources; keep them in external assemblies instead of embedding them in the main assembly. Lastly, if you are only using a couple of methods out of a fairly large assembly, you might consider implementing your own copy of those methods to avoid having to load the assembly.

Note

You can use the VaDump tool, downloadable from Microsoft.com, to track your working set. You can also use Performance Counters (perfmon.exe) to give you detailed feedback about a number of useful statistics like the number of classes that you load.

The .NET platform provides transparent support for automatic memory management, but there are some tasks that you should explicitly do in order to design for optimum performance. The first task is to ensure that Dispose() is called on the appropriate objects as soon as possible. Also, ensure that you do not reference objects once you are done using them. References to unused objects will prevent the garbage collector from collecting and removing the objects from the application memory.
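
The C# using statement is the easiest way to guarantee that Dispose() is called promptly, as this brief sketch shows (the file name is hypothetical).

using System.IO;

// Dispose() is guaranteed to run as soon as the block exits,
// even if an exception is thrown inside it.
using (StreamReader reader = new StreamReader("log.txt"))
{
    string line = reader.ReadLine();
    // ...
}
// The reader is no longer referenced here, so the garbage
// collector is free to reclaim it.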

Perform Chunky Calls

There are generally two types of calls when working with data across managed and unmanaged interfaces: “chatty” and “chunky.” Chatty calls are those that occur quite often and do very little work, while chunky calls are those that occur less frequently, but generally do more work when they occur.

I should mention that chunky calls are not always the best solution. A chatty call that passes simple data may be less computationally expensive than a chunky call. The incurred performance costs are cheaper because the data marshaling is not as complex. P/Invoke, Interop, and Remoting calls all carry significant overhead, so you want to minimize the number of calls using them. The best approach is to prototype both call types early in the development phase so that you can make the best decision for the solution.

When a call is sent between managed and unmanaged code, several events transpire to facilitate the communication. First, data marshaling must be performed to get the source data into the appropriate target format for the receiver. Next, the calling-convention signatures must be fixed up to pass data between the sender and receiver. The next step is to protect callee-saved registers and switch the threading mode so that the garbage collector does not block unmanaged threads. Lastly, a frame is created to supervise exceptions that cross into managed code. These events equate to roughly 30 x86 instructions when using P/Invoke (roughly 10 when marshaling is not required), and roughly 60 x86 instructions when using COM Interop. Therefore, it is important to use P/Invoke over COM Interop whenever possible to speed up the calls between managed and unmanaged code.

The biggest slowdown occurs during data translation, such as converting text from ASCII to Unicode. Classes with explicit layout are extremely cheap, and primitive types require almost no marshaling at all. Blittable types are those that can be transferred directly across managed and unmanaged code with no marshaling at all. These types are byte, sbyte, double, float, long, ulong, int, uint, short, and ushort. You can also freely pass value types and single-dimensional arrays that contain blittable types.
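
For example, the following P/Invoke declaration calls a Win32 function whose return value, a uint, is blittable, so the call crosses the boundary with no marshaling work at all.

using System;
using System.Runtime.InteropServices;

public class NativeCalls
{
    // uint is blittable, so no data translation is required.
    [DllImport("kernel32.dll")]
    static extern uint GetTickCount();

    static void Main()
    {
        Console.WriteLine("Milliseconds since boot: {0}", GetTickCount());
    }
}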

Minimize Exception Throwing

One of the best features of the .NET platform is the exception handling model that is available to all applications. This model offers the ability to develop robust applications that can handle and respond to exceptions and errors gracefully in almost all situations. However, this model must be used carefully, or some significant performance costs can be introduced into your application. Throwing exceptions is expensive in terms of performance, so throw as few as possible. You can check how many exceptions your application throws at runtime through the use of Performance Counters (perfmon.exe). Also, be aware that the .NET runtime can throw its own exceptions. It is advisable to use Performance Counters to check this, and use the debugger to locate the source of the exceptions.

One myth that circulates around developers working with managed code is that try/catch blocks introduce performance overhead. The truth is that you only incur a performance cost when an actual exception is thrown. You can use as many try/catch blocks as you want. Do not use exceptions to control program flow.
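
As a small sketch of the last point, prefer an explicit test over catching an exception when invalid input is an expected case (int.TryParse is available from .NET 2.0 onward).

string userInput = "not a number";
int value;

// Costly: the exception machinery runs on every non-numeric input.
try
{
    value = int.Parse(userInput);
}
catch (FormatException)
{
    value = 0;
}

// Cheaper: test without throwing.
if (!int.TryParse(userInput, out value))
{
    value = 0;
}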

Thoughts About NGen

The methods of a managed application are Just-in-Time compiled (JIT’ed) the first time they are used during runtime. This dynamic compilation can lead to a significant startup penalty if the application invokes a lot of methods during startup.

Also, there are many shared libraries in the .NET class framework that incur significant overhead on top of your own code. There is a tool provided with the .NET framework (ngen.exe) that can generate native images of assemblies and store them in the Global Assembly Cache, essentially precompiling your code for faster startup times and overall runtime execution in certain situations.

While NGen sounds like the silver bullet for increasing runtime performance, there are only certain situations when performance can be improved through its use. Native images cannot be used when crossing application domains, so there is no real benefit from using NGen for ASP.NET applications. However, generating native images for Windows Forms can result in a performance increase.

Note

NGen must be run on the assemblies after they have been deployed to the target machine. Doing so allows the application to be optimized for the machine it is installed on.

There are some situations where your application may perform better with JIT compilation instead of native images. Some optimizations cannot be done with native images, so make sure that you profile the startup and operating times of your application while using native images and JIT compilation. You should also profile combinations of native images and regular assemblies.

Conclusion

This chapter examined performance considerations when developing applications for the .NET platform. First, two approaches for investigating performance were discussed: white box and black box. The rest of the chapter focused on performance considerations for commonly used and abused areas of everyday .NET development. A misconception regarding performance optimization is that a considerable amount of time should be spent on optimizing code down to the compiler level. In reality, especially with .NET, the majority of performance loss results from application architecture and design. These problems occur at a high level, and can be identified using black box performance testing.

 
