Chapter 3 JIT Compilation

.NET code is distributed as assemblies of Microsoft Intermediate Language (MSIL, or just IL for short). This language is somewhat assembly-like, but simpler. If you wish to learn more about IL or other CLR standards, see the extensive documentation Microsoft provides at http://www.writinghighperf.net/go/17.

When your managed program executes, it loads the CLR, which starts executing some wrapper code, all of which is native assembly code. The first time a managed method in your assembly is called, what actually runs is a stub that invokes the Just-in-Time (JIT) compiler, which converts the IL for that method to the hardware's assembly instructions. This process is called just-in-time compilation ("JITting"). The stub is then replaced, and the next time that method is called, the assembly instructions execute directly. This means the first call to any method always incurs a performance hit. In most cases, this hit is small and can be ignored. Every call after that executes the code directly and incurs no overhead.

This process can be summarized in the following diagram:


Figure 3-1. Compilation and JITting flow.

While all code in a method will be converted to assembly instructions when JITted, some pieces may be placed into “cold” code sections of memory, separate from the method’s normal execution path. These rarely executed paths will thus not push out other code from the “warm” sections, allowing for better overall performance as the commonly executed code is kept in memory, but the cold pages may be paged out. For this reason, rarely used things like error and exception-handling paths can be expensive.

In most cases, a method will only need to be JITted once. The exception is methods with generic type parameters: the JIT may need to compile a separate copy of the method for each distinct value type argument, while all reference type arguments share a single compiled version.
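
As an illustration, here is a minimal sketch (the method and values are illustrative) of how generic instantiations multiply the JIT's work:

static T Max<T>(T a, T b) where T : IComparable<T>
{
    return a.CompareTo(b) >= 0 ? a : b;
}

// Each distinct value type argument gets its own compiled body:
int maxInt = Max(1, 2);          // JIT compiles Max<int>
double maxDbl = Max(1.0, 2.0);   // JIT compiles Max<double> separately
// All reference type arguments share a single compiled body:
string maxStr = Max("a", "b");   // shared code for all reference types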

You need to be concerned about JIT costs if this first-time warm-up cost matters to your application or its users. Most applications only care about steady-state performance, but if you must have extremely high availability, JIT can be an issue that you will need to optimize for. This chapter will tell you how.

Benefits of JIT Compilation

Code that is just-in-time compiled has some significant advantages over compiled unmanaged code.

  1. Good Locality of Reference—Code that is used together will often be in the same page of memory, preventing expensive page faults.
  2. Reduced Memory Usage—It only compiles those methods that are actually used.
  3. Cross-assembly Inlining—Methods from other DLLs, including the .NET Framework, can be inlined into your own application, which can be a significant savings.

There is also a benefit of hardware-specific optimizations, but in practice there are only a few actual optimizations for specific platforms. However, it is becoming increasingly possible to target multiple platforms with the same code, and it is likely we will see more aggressive platform-specific optimizations in the future.

Most code optimizations in .NET do not take place in the language compiler (the transformation from C#/VB.NET to IL). Rather, they occur on-the-fly in the JIT compiler.

Costs of JIT Compilation

You can easily see the IL-to-assembly-code transformation in action. As a simple example, here is the JitCall sample program, which demonstrates the code fix-up that the JIT does behind the scenes:

using System;
using System.Runtime.CompilerServices;

static void Main(string[] args)
{
    int val = A();
    int val2 = A();
    Console.WriteLine(val + val2);
}

[MethodImpl(MethodImplOptions.NoInlining)]
static int A()
{
    return 42;
}

To see what happens, first get the disassembly of Main. Getting to this point is a little bit of a trick.

1. Launch Windbg.

2. File | Open Executable… (Ctrl+E).

3. Navigate to the JitCall binary. Make sure you pick the Release version of the binary or the assembly code will look quite different than what is printed here.

4. The debugger will immediately break.

5. Run the command: sxe ld clrjit. This will cause the debugger to break when clrjit.dll is loaded. This is convenient because once this is loaded you can set a breakpoint on the Main method before it is executed.

6. Run the command: g.

7. The program will execute until clrjit.dll is loaded and you see output similar to the following:

(1a74.2790): Unknown exception - code 04242420 (first chance)
ModLoad: 6fe50000 6fecd000
C:\Windows\Microsoft.NET\Framework\v4.0.30319\clrjit.dll

8. Run the command: .loadby sos clr.

9. Run the command: !bpmd JitCall Program.Main. This sets the breakpoint at the beginning of the Main function.

10. Run the command: g.

11. Windbg will break right inside the Main method. You should see output similar to this:

(11b4.10f4): CLR notification exception - code e0444143 (first chance)
JITTED JitCall!JitCall.Program.Main(System.String[])
Setting breakpoint: bp 007A0050 [JitCall.Program.Main(System.String[])]
Breakpoint 0 hit

12. Now open the Disassembly window (Alt+7). You may also find the Registers window interesting (Alt+4).

The disassembly of Main looks like this:

push ebp
mov ebp,esp
push edi
push esi

; Call A
call dword ptr ds:[0E537B0h] ds:002b:00e537b0=00e5c015
mov edi,eax
call dword ptr ds:[0E537B0h]
mov esi,eax

call mscorlib_ni+0x340258 (712c0258)
mov ecx,eax
add edi,esi
mov edx,edi
mov eax,dword ptr [ecx]
mov eax,dword ptr [eax+38h]

; Call Console.WriteLine
call dword ptr [eax+14h]
pop esi
pop edi
pop ebp
ret

There are two calls through the same pointer; these are the calls to A. Set breakpoints on both of these lines and step through the code one instruction at a time, making sure to step into the calls. The pointer at 0E537B0h will be updated after the first call.

Stepping into the first call to A, you can see that it is little more than a jmp to the CLR method ThePreStub. There is no return from this method here because ThePreStub will do the return.

mov al,3
jmp 00e5c01d
mov al,6
jmp 00e5c01d
(00e5c01d) movzx eax,al
shl eax,2
add eax,0E5379Ch
jmp clr!ThePreStub (72102af6)

On the second call to A, you can see that the function address of the original pointer was updated and the code at the new location looks more like a real method. Notice the 2Ah (our decimal 42 constant value from the source) being assigned and returned via the eax register.

012e0090 55          push ebp
012e0091 8bec        mov ebp,esp
012e0093 b82a000000  mov eax,2Ah
012e0098 5d          pop ebp
012e0099 c3          ret

For most applications, this first-time, or warm-up, cost is not significant, but there are certain types of code that lend themselves to high JIT time, which we will examine in the next few sections.

As an exercise, what happens to the JITted code when you remove the NoInlining attribute from A? You should see a few compiler optimizations in action.

JIT Compiler Optimizations

The JIT compiler will perform some standard optimizations such as method inlining and array range check elimination, but there are constructs that can prevent the JIT compiler from optimizing your code, and you should be aware of them. Some of these topics have their own treatments in Chapter 5. Note that because the JIT compiler executes at runtime, it is limited in how much time it can spend on optimization. Despite this, it can do many important optimizations.

One of the biggest classes of optimizations is method inlining, which puts the code from the method body into the call site, avoiding a method call in the first place. Inlining is critical for small methods which are called frequently, where the overhead of a function call is larger than the function’s own code.

All of the following prevent inlining (a short illustration follows the list):

  • Virtual methods.
  • Interfaces with diverse implementations in a single call site. See Chapter 5 for a discussion of the interface dispatch problem.
  • Loops.
  • Exception handling.
  • Recursion.
  • Method bodies larger than 32 bytes of IL.
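
Here is a small sketch (types and values are illustrative) contrasting inlining-friendly and inlining-hostile methods, plus the AggressiveInlining hint available since .NET 4.5:

using System.Runtime.CompilerServices;

static class InliningExamples
{
    // Tiny, non-virtual, no loops or exception handling: a prime
    // candidate for inlining at every call site.
    static int Square(int x) { return x * x; }

    // Contains a loop, so the JIT will not inline it.
    static int SumTo(int n)
    {
        int sum = 0;
        for (int i = 1; i <= n; i++)
        {
            sum += i;
        }
        return sum;
    }

    // .NET 4.5+: asks the JIT to inline even past its size heuristic.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Dot(int x1, int y1, int x2, int y2)
    {
        return x1 * x2 + y1 * y2;
    }
}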

As of this writing, the next version of the JIT compiler, code-named RyuJIT, features significantly improved code generation performance as well as improved generated code quality, particularly for 64-bit code. Read more at http://www.writinghighperf.net/go/18. RyuJIT has been released as a Community Technology Preview (CTP), which you can test now.

Reducing JIT and Startup Time

The other major factor in considering the JIT compiler is the amount of time it takes to generate the code. This mostly comes down to how much code needs to be JITted.

In particular, be careful of the use of:

  • LINQ
  • The dynamic keyword
  • Regular expressions
  • Code generation

All of these have one simple fact in common: far more code may be generated and executed than is obvious from your source. All of that hidden code may require significant time to JIT. With regular expressions and generated code in particular, there is likely to be a pattern of large, repetitive blocks of code.

While code generation is usually something you would write for your own purposes, there are some areas of the .NET Framework that will do this for you, the most common being regular expressions. Before execution, regular expressions can be converted to an IL state machine in a dynamic assembly and then JITted (via the RegexOptions.Compiled flag). This takes more time up front, but saves a lot of time with repeated execution. You usually want to enable this option, but you probably want to defer it until it is needed so that the extra compilation does not impact application start time. Regular expressions can also trigger some complex algorithms in the JIT that take longer than normal, most of which are improved in RyuJIT. As with everything else in this book, the only way to know for sure is to measure. See Chapter 6 for more discussion of regular expressions.
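
One common mitigation is to pay the compilation cost lazily rather than at startup. A sketch, with a hypothetical pattern and class name:

using System;
using System.Text.RegularExpressions;

static class Patterns
{
    // RegexOptions.Compiled generates and JITs an IL state machine, so
    // Lazy<T> defers that cost until the first actual match attempt
    // instead of paying it during application startup.
    private static readonly Lazy<Regex> Email = new Lazy<Regex>(
        () => new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$", RegexOptions.Compiled));

    public static bool IsEmail(string input)
    {
        return Email.Value.IsMatch(input);
    }
}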

Even though code generation is implicated here as a source of JIT cost, code generation in a different context can get you out of some other performance problems, as we will see in Chapter 5.

LINQ's syntactic simplicity can belie the amount of code that actually runs for each query. It can also hide things like delegate creation, memory allocations, and more. Simple LINQ queries may be fine, but as in most things, you should definitely measure.
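
For a sense of what hides behind a query, consider this rough expansion (a sketch of what the compiler generates, not the exact code), assuming numbers is an int[]:

int[] numbers = { 1, 2, 3, 4, 5, 6 };

// What you write:
List<int> evens = numbers.Where(n => n % 2 == 0).ToList();

// Roughly what runs underneath: a delegate allocation (cached after the
// first use), a lazy iterator object from Where, and a new List<int>
// populated by ToList.
Func<int, bool> predicate = n => n % 2 == 0;
IEnumerable<int> filtered = Enumerable.Where(numbers, predicate);
List<int> evens2 = Enumerable.ToList(filtered);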

The primary issue with dynamic code is, again, the sheer amount of code that it translates to. Jump to Chapter 5 to see what dynamic code looks like under the covers.
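
As a quick taste before Chapter 5, even a one-line use of dynamic pulls in the DLR's call-site machinery, all of which must be JITted (the expansion described in the comments is approximate):

using System;

dynamic a = 1, b = 2;
// The '+' below is not a single IL add instruction. The compiler emits
// a cached call site (CallSite<Func<CallSite, object, object, object>>)
// plus a C# runtime binder invocation to resolve the operation at runtime.
Console.WriteLine(a + b);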

There are other factors besides JIT, such as I/O, that can increase your warm-up costs, and it behooves you to do an accurate investigation before assuming JIT is the only issue. Each assembly has a cost in terms of disk access for reading the file, internal overhead in the CLR data structures, and type loading. You may be able to reduce some of the load time by combining many small assemblies into one large one, but type loading is likely to consume as much time as JITting.

If you do have a lot of JIT happening, you should see stacks like the following show up in CPU profiles of your application:


Figure 3-2. PerfView's CPU profiling will show you any JIT stubs that are being called.

Also see the Measurement section in this chapter for a demonstration of how PerfView can show you exactly which methods are being JITted and how long each one took.

Optimizing JITting with Profiling

.NET 4.5 includes an API that tells the runtime to profile your application's startup and store the results on disk for future reference. On subsequent startups, this profile is used to start generating the assembly code on a separate thread before it is executed. The saved profile allows this generated code to have all the same locality benefits as JITting, and it is updated automatically on each execution of your program.

To use it, simply call this at the beginning of your program:

ProfileOptimization.SetProfileRoot(@"C:\MyAppProfile");
ProfileOptimization.StartProfile("default");

Note that the profile root folder must already exist, and you can name your profiles, which is useful if your app has different modes with substantially different execution profiles.
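
Since the API will not create the folder for you, a defensive startup sequence might look like this (the path and method name are illustrative):

using System.IO;
using System.Runtime;

static void EnableStartupProfiling()
{
    const string profileRoot = @"C:\MyAppProfile";

    // SetProfileRoot requires an existing folder, so create it first.
    Directory.CreateDirectory(profileRoot);

    ProfileOptimization.SetProfileRoot(profileRoot);
    ProfileOptimization.StartProfile("default");
}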

When to Use NGEN

NGEN stands for Native Image Generator. It works by converting your IL assembly to a native image—in effect, running the JIT compiler and saving the results to a native image assembly cache. This native image should not be confused with native code in the sense of unmanaged code. Despite the fact that the image is now mostly assembly language, it is still a managed assembly because it must run under the CLR.

If your original assembly is called foo.dll, NGEN will generate a file called foo.ni.dll and put it in the native image cache. Whenever foo.dll is loaded, the CLR will check the cache for a matching foo.ni.dll file and verify that it matches the IL exactly, using a combination of timestamps, names, and GUIDs to guarantee that it is the correct file to load.

In general, NGEN should be your last resort. While it has its place, it does have some disadvantages. The first is that you lose locality of reference, as all the code in an assembly is placed sequentially, regardless of how it is actually executed. In addition, you can lose certain optimizations such as cross-assembly inlining. You can get most of these optimizations back if all of the assemblies are available to NGEN at the same time.

That said, if application startup or warm-up costs are too high and the profile optimization mentioned above does not satisfy your performance requirements, then NGEN may be appropriate. Before making such a decision, remember the prime directive of performance: Measure, Measure, Measure! See the tips at the end of this chapter for how to measure JIT costs in your application.

NGENed applications can have an advantage in faster load time, but there are drawbacks. You must also update the native images every time there is a change—not a big deal, but an extra step for deployment. NGEN can be very slow and native images can be significantly larger than their managed counterparts. Sometimes, JIT will produce more optimized code, especially for commonly executed paths.

Most usages of generics can be successfully NGENed, but there are cases where the compiler cannot figure out the right generic types ahead of time. This code will still be JITted at runtime. And of course, any code that relies on dynamic type loading or runtime code generation cannot always be NGENed ahead of time.
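
For example, a closed generic type constructed only at runtime cannot be pre-compiled (the configuredTypeName lookup here is hypothetical):

using System;
using System.Collections.Generic;

// NGEN cannot know this instantiation ahead of time, so the List<T>
// code for it is JITted on first use.
Type elementType = Type.GetType(configuredTypeName); // read at runtime
Type listType = typeof(List<>).MakeGenericType(elementType);
object list = Activator.CreateInstance(listType);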

To NGEN an assembly from the command line, execute this command:

D:\Book\ReflectionExe\bin\Release>ngen install ReflectionExe.exe

1> Compiling assembly D:\Book\ReflectionExe\bin\Release\ReflectionExe.exe (CLR v4.0.30319) ...
2> Compiling assembly ReflectionInterface, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null (CLR v4.0.30319) ...

From the output you can see that there are actually two files being processed. NGEN will automatically look in the target file's directory and NGEN any dependencies it finds. It does this by default to allow the code to make cross-assembly calls in an efficient way (such as inlining small methods). You can suppress this behavior with the /NoDependencies flag, but there may be a significant performance hit at runtime.

To remove an assembly’s native image from the machine’s native image cache, you can run:

D:\Book\ReflectionExe\bin\Release>ngen uninstall ReflectionExe.exe

Uninstalling assembly D:\Book\ReflectionExe\bin\Release\ReflectionExe.exe

You can verify that a native image was created by displaying the native image cache:

D:\Book\ReflectionExe\bin\Release>ngen display ReflectionExe

NGEN Roots:
D:\Book\ReflectionExe\bin\Release\ReflectionExe.exe
NGEN Roots that depend on "ReflectionExe":
D:\Book\ReflectionExe\bin\Release\ReflectionExe.exe
Native Images:
ReflectionExe, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

You can also display all cached native images by running the command ngen display.

Optimizing NGEN Images

I said above that one of the things you lose with NGEN is locality of reference. Starting with .NET 4.5, you can use a tool called Managed Profile Guided Optimization (MPGO) to fix this problem to a large extent. Similar to Profile Optimization for JIT, this is a tool that you manually run to profile your application’s startup (or whatever scenario you want). NGEN will then use the profile to create a native image that is better optimized for the common function chains.

MPGO is included with Visual Studio 2012 and higher. To use it, run this command:

Mpgo.exe -scenario MyApp.exe -assemblyList *.* -OutDir C:\Optimized

This will cause MPGO to run on some framework assemblies and then execute MyApp.exe. The application is now in training mode. Exercise the application appropriately and then shut it down. A new, optimized assembly will be created in the C:\Optimized directory.

To take advantage of the optimized assembly, you must run NGEN on it:

Ngen.exe install C:\Optimized\MyApp.exe

This will create optimized images in the native image cache. Next time the application is run, these new images will be used.

To use the MPGO tool effectively, you will need to incorporate it into your build system so that its output is what gets shipped with your application.

The Future of Native Code Generation

On April 2, 2014, Microsoft announced the .NET Native Developer Preview. This is a technology that radically changes how the CLR works by statically linking core framework libraries into applications. It completely removes the need for JIT and uses the Visual C++ compiler optimizer to produce extremely high quality assembly code in a small footprint. In effect, we get to have our cake and eat it too—all the benefits of rapid development in .NET as well as a native image that needs no JITting at runtime!

The bad news is that this only works for Windows Store apps now, but I am hopeful that the target platform will expand to eventually include all .NET applications.

When JIT Can't Compete

The JIT is great for most applications. If there are performance problems, there are usually bigger issues than pure code generation quality or speed. However, there are some areas where the JIT has room to improve.

For example, there are some processor instructions the JIT will not use, even though they are available on the current processor. A big example is the category of SSE, or SIMD, instructions, which execute a single instruction across multiple data elements at once. Most modern x64-platform processors from both Intel and AMD support these instructions, and they are critical for the parallel computations required in gaming, scientific, and mathematical applications. The current JIT compiler as of this writing (4.5.2) makes very limited use of these instructions and registers, but the good news is that RyuJIT will support more SIMD instructions with SSE2, bringing many of these advantages to the managed world.
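
For reference, here is a sketch using the SIMD-enabled vector types previewed alongside RyuJIT (the System.Numerics.Vectors package); on supporting hardware, each Vector<float> operation can compile to a single SSE instruction covering several elements:

using System.Numerics;

static void AddArrays(float[] a, float[] b, float[] result)
{
    int i = 0;
    int width = Vector<float>.Count;   // elements per SIMD register

    // Vectorized main loop: one operation handles 'width' elements.
    for (; i <= a.Length - width; i += width)
    {
        var va = new Vector<float>(a, i);
        var vb = new Vector<float>(b, i);
        (va + vb).CopyTo(result, i);
    }

    // Scalar loop for the remaining tail elements.
    for (; i < a.Length; i++)
    {
        result[i] = a[i] + b[i];
    }
}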

The other major situation where the JIT compiler is not going to be quite as good as a native code compiler is with direct native memory access vs. managed array access. For one, accessing native memory directly usually means you can avoid the memory copy that will come with marshalling it to managed code. While there are ways around this with things like UnmanagedMemoryStream, which will wrap a native buffer inside a Stream, you are really just making an unsafe memory access.

If you do transfer the bytes to a managed buffer, the code that accesses the buffer is subject to array bounds checks. In many cases, these checks can be optimized away, but this is not guaranteed. To get around the remaining checks, you can wrap a pointer around the managed buffer and do unsafe accesses.
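
A sketch of both approaches: the first uses the loop pattern the JIT can recognize to remove the check, and the second bypasses checks entirely with a pinned pointer (requires compiling with /unsafe):

static long SumChecked(int[] data)
{
    long sum = 0;
    // The JIT recognizes 'i < data.Length' over the same array and can
    // eliminate the per-element bounds check in this pattern.
    for (int i = 0; i < data.Length; i++)
    {
        sum += data[i];
    }
    return sum;
}

static unsafe long SumUnchecked(int[] data)
{
    long sum = 0;
    fixed (int* p = data)   // pin the array so the GC cannot move it
    {
        for (int i = 0; i < data.Length; i++)
        {
            sum += p[i];    // pointer access: no bounds check at all
        }
    }
    return sum;
}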

With applications that do an extreme amount of array or matrix manipulation, you will have to consider this tradeoff between performance and safety. For most applications, frankly, you will not have to care and the boundary checks are not a significant overhead.

If you find that native code really is more efficient at this kind of processing, you can try marshalling the entire data set to a native function via P/Invoke, compute the results with a highly optimized C++ DLL, and then return the results back to managed code. You will have to profile to see if the data transfer cost is worth it.
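
A minimal sketch of that pattern, assuming a hypothetical native DLL named MathNative.dll that exports a C-style function:

using System.Runtime.InteropServices;

static class NativeMath
{
    // For a blittable array like double[], the marshaller pins the
    // managed buffer and passes a raw pointer, so the native code
    // operates on it in place with no copy.
    [DllImport("MathNative.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void Transform(double[] data, int length);
}

// Usage:
// double[] buffer = LoadData();
// NativeMath.Transform(buffer, buffer.Length);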

Mature C++ compilers may also be better at other types of optimizations such as inlining or optimal register usage, but this is more likely to change with RyuJIT and future versions of the JIT compiler.

Measurement

Performance Counters

The CLR publishes a number of counters in the .NET CLR Jit category, including:

  • # of IL Bytes Jitted
  • # of Methods Jitted
  • % Time in Jit
  • IL Bytes Jitted / sec
  • Standard Jit Failures
  • Total # of IL Bytes Jitted (exactly the same as “# of IL Bytes Jitted”)

Those are all fairly self-explanatory except for Standard Jit Failures. Failures can occur if the IL fails verification or there is an internal JIT error.
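
If you want to read these values programmatically, the PerformanceCounter class will do it. This sketch assumes a single instance of the process, so the counter instance name matches the process name:

using System;
using System.Diagnostics;

static void PrintJitCounters()
{
    string instance = Process.GetCurrentProcess().ProcessName;

    using (var timeInJit = new PerformanceCounter(
        ".NET CLR Jit", "% Time in Jit", instance, readOnly: true))
    using (var methodsJitted = new PerformanceCounter(
        ".NET CLR Jit", "# of Methods Jitted", instance, readOnly: true))
    {
        Console.WriteLine("% Time in Jit: {0}", timeInJit.NextValue());
        Console.WriteLine("# of Methods Jitted: {0}", methodsJitted.NextValue());
    }
}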

Closely related to JITting, there is also a category for loading, called .NET CLR Loading. A few of them are:

  • % Time Loading
  • Bytes in Loader Heap
  • Total Assemblies
  • Total Classes Loaded

ETW Events

With ETW events, you can get extremely detailed performance information on every single method that gets JITted in your process, including the IL size, native size, and the amount of time it took to JIT.

  • MethodJittingStarted—A method is being JIT-compiled. Fields include:
      • MethodID—Unique ID for this method.
      • ModuleID—Unique ID for the module to which this method belongs.
      • MethodILSize—Size of the method's IL.
      • MethodNameSpace—Full class name to which this method belongs.
      • MethodName—Name of the method.
      • MethodSignature—Comma-separated list of type names from the method signature.
  • MethodLoad—A method is done JITting and has been loaded. Generic and dynamic methods do not use this version. Fields include:
      • MethodID—Unique ID for this method.
      • ModuleID—Unique ID for the module to which this method belongs.
      • MethodSize—Size of the compiled assembly code after JIT.
      • MethodStartAddress—Start address of the method.
      • MethodFlags:
          • 0x1—Dynamic method
          • 0x2—Generic method
          • 0x4—JIT-compiled (if missing, it was NGENed)
          • 0x8—Helper method
  • MethodLoadVerbose—A generic or dynamic method has been JITted and loaded. It has most of the same fields as MethodLoad and MethodJittingStarted.

What Methods and Modules Take the Longest to JIT?

In general, JIT time is directly proportional to the amount of IL instructions in a method, but this is complicated by the fact that type loading time can also be included in this time, especially the first time a module is used. Some patterns can also trigger complex algorithms in the JIT compiler, which may run longer. You can use PerfView to get very detailed information about JITting activity in your process. If you collect the standard .NET events, you will get a special view called “JITStats.” Here is some of the output from running it on the PerfCountersTypingSpeed sample project:

JitTime msec   IL Size   Native Size   Name
        12.9     1,756         3,156   Module: PerfCountersTypingSpeed.exe (Num Methods: 8)
         9.7        22            45   PerfCountersTypingSpeed.Program.Main()
         0.3       176           313   PerfCountersTypingSpeed.Form1..ctor()
         1.4     1,236         2,178   PerfCountersTypingSpeed.Form1.InitializeComponent()
         0.8       107           257   PerfCountersTypingSpeed.Form1.CreateCustomCategories()
         0.3       143           257   PerfCountersTypingSpeed.Form1.timer_Tick(class System.Object,class System.EventArgs)
         0.1        23            27   PerfCountersTypingSpeed.Form1.OnKeyPress(class System.Object,class System.Windows.Forms.KeyPressEventArgs)
         0.2        19            36   PerfCountersTypingSpeed.Form1.OnClosing(class System.ComponentModel.CancelEventArgs)
         0.1        30            43   PerfCountersTypingSpeed.Form1.Dispose(bool)

The only method that takes more time to JIT than its IL size would suggest is Main, which makes sense because this is where you will pay for more loading costs.

Summary

To minimize the impact of JIT, carefully consider large amounts of generated code, whether from regular expressions, code generation, the dynamic keyword, or any other source. Use profile-guided optimization to decrease application startup time by pre-JITting the most useful code in parallel.

To encourage function inlining, avoid things like virtual methods, loops, exception handling, recursion, or large method bodies. But do not sacrifice the integrity of your application by over-optimizing in this area.

Consider using NGEN for large applications or situations where you cannot afford the JIT cost during startup. Use MPGO to optimize the native images before using NGEN.
