Visualize with KCachegrind (QCachegrind)

Ruby-prof can generate profiles in the so-called callgrind format. This is the format used by the Callgrind profiler, part of the Valgrind suite from the C and C++ world. With the CallTreePrinter, you can obtain a callgrind-formatted profile and reuse some of the best visualization tools developed by C/C++ programmers.

KCachegrind[11] is, in my opinion, the best such tool. It comes as a part of all major Linux distributions. Its version for Mac OS and Windows is called QCachegrind. On Mac you can install it via MacPorts or Homebrew. On Windows you can use precompiled packages.[12]

If you want to follow along, install it now if necessary. Then let’s take our example application and generate the callgrind output. Add this code snippet to the end of app.rb:

chp4/app.rb

printer = RubyProf::CallTreePrinter.new(result)
printer.print(File.open("callgrind.out.app", "w+"))

This will generate a file called callgrind.out.app. Callgrind has a strange naming convention where the file takes a callgrind.out prefix instead of an extension (suffix). After the prefix comes the name of the executable file, app in our case.

The file format is textual, but it’s optimized for machine reading. So let’s open KCachegrind or QCachegrind and see how it looks and what visualizations we have.

images/src/kcachegrind1_whole_window.png

Flat Profile and Callers List

The list at the left looks like the same flat profile that ruby-prof’s flat printer generates, except that it is sorted by Incl. time (total time in ruby-prof’s parlance) by default. Clicking on a column header changes the sorting.

While the profile looks the same as ruby-prof’s flat profile, in reality it’s not. Timings are different. So let’s take a closer look at this profile and try to understand why and how it differs.

images/src/kcachegrind2_flat_profile.png

The first notable difference is that Incl. and Self times are in microseconds. If you divide the Global#[No method] time you see in KCachegrind by 10^6, you should get the same total time reported by the flat and graph profiles.
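A quick sanity check of the unit conversion (the Incl. value below is hypothetical, not from our example app):

```ruby
# KCachegrind shows times in microseconds; this value is hypothetical.
incl_time_us = 2_750_000                  # Incl. time as displayed by KCachegrind
incl_time_s  = incl_time_us / 1_000_000.0 # divide by 10^6 to get seconds
puts incl_time_s                          # prints 2.75, matching ruby-prof's seconds
```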

The second difference is in the total time for the Array#map function. It’s way larger than what we see in the flat profile. Confusingly, it is also larger than total app execution time! Self time appears to be the same, though. What gives?

Remember that when we looked at the graph profile, we discussed the double-counting prevention that ruby-prof has. KCachegrind does no such thing. This is why the Incl. time for Array#map includes its time when called from Object#find_youngest, Object#parse_data, and Object#parse_col. The last is called from parse_data and should not have been counted.

Thus the total time for Array#map is the sum of its times inside all caller functions. You can see this clearly in KCachegrind in the second list view on the right, in the Callers tab.

images/src/kcachegrind3_callers.png

If you add the process_time numbers, you’ll get the same Incl. time that you see in the flat profile view.
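To make that arithmetic concrete, here is a sketch with hypothetical per-caller numbers. KCachegrind’s Incl. time for Array#map is simply their sum, even though Object#parse_col’s share is already contained in Object#parse_data’s:

```ruby
# Hypothetical process_time values (in microseconds) for Array#map,
# one entry per caller, as the Callers tab would list them.
per_caller_us = {
  "Object#find_youngest" => 150_000,
  "Object#parse_data"    => 2_100_000,
  "Object#parse_col"     => 1_700_000,  # already included in parse_data's time
}

# KCachegrind adds them all up, producing the double-counted Incl. time.
incl_time_us = per_caller_us.values.sum
puts incl_time_us
```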

Except for the differences we have just seen, all other data should be the same as in ruby-prof’s flat profile.

Callee and Caller Maps

When all you want is to quickly see where the whole application or any given function spends its time, look at the callee map. It shows functions as nested blocks. The area of each block is proportional to its Incl. time.

You can get the callee map for any function. For example, this is the map for Global#[No method], the name KCachegrind gives to the whole block of code we profile.

images/src/kcachegrind4_callee_map.png

As expected, we see Object#parse_col, Date#parse, and Regexp#match taking most of the space. Object#parse_col has both significant self and total times. Its self time is visible in this report as the area not occupied by nested functions.

This view makes it easy to get a graphical idea of what takes time without digging through the numbers in the flat profile. But beware of two shortcomings.

First, it tends to visualize nested calls as recursive. In reality Array#map calls Object#parse_row in a loop, which, in turn, calls Array#map again to parse columns. There’s no recursion here, just two nested loops. What we see in the chart instead is a recursive call from Array#map to Object#parse_row, and then recursively to Array#map and, again, Object#parse_row. KCachegrind is not smart enough to see that the map iterating over rows is different from the map iterating over columns, and mistakes the nesting for recursion.
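A stripped-down sketch of that call structure (the method bodies and data here are simplified stand-ins for the example app’s code) shows there is no recursion, only two nested maps:

```ruby
# Simplified stand-ins for the example app's parsing methods.
def parse_col(col)
  col.strip                                     # leaf work: no further map calls
end

def parse_row(row)
  row.split(",").map { |col| parse_col(col) }   # inner Array#map, over columns
end

data = "1985-01-01,Alice\n1990-05-20,Bob\n"
rows = data.lines.map { |row| parse_row(row) }  # outer Array#map, over rows
p rows
```

The two Array#map calls never re-enter each other; KCachegrind merely cannot tell them apart because they are the same method.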

Second, it double counts. Select the Array#map line in the Flat Profile view at the left and look at its callee map.

images/src/kcachegrind5_callee_map_array_map.png

See how the largest box for Object#parse_col already reports 100% of its total time, and subsequent calls in the loop show more and more Object#parse_col boxes. Remember, we saw the double counting of Array#map’s Incl. time in the previous section. Column parsing is done within that map iterator, so its time is miscalculated as well and distributed among the Array#map boxes in the chart.

Once you’re aware of the potential miscalculations, the Callee Map view can be very useful in the first stages of optimization. The Caller Map at the bottom of the KCachegrind window is a similar view that shows an area chart of the function’s callers. Exercise similar caution when interpreting its results: incorrect nesting and double counting are possible.

Source Code

The callgrind format lets profilers store the full paths and line numbers of profiled functions. Ruby-prof does that for Ruby code, so you can look at the source inside KCachegrind. For example, this is how the source code view for Global#[No method] looks:

images/src/kcachegrind6_source_code_global.png

Lines that we are not profiling are grayed out. Each line for which ruby-prof has performance data is annotated with Incl. time. Double-clicking on the annotation line lets you dive into the callee function.

Unfortunately, KCachegrind doesn’t show you the source of Ruby standard library functions implemented in C. Ruby-prof knows neither where to find the Ruby sources nor how to locate the function implementations there. Despite that, browsing the profile alongside the actual code can reveal information that you wouldn’t otherwise get. For example, you can spot code that does nothing, or leftover debug output.

Call Graph

The last view I’m going to talk about here is the call graph, and it’s the most useful one in KCachegrind. It represents the same information that we have already seen in ruby-prof’s graph report, but as an actual graph rendered by the dot tool from the GraphViz package.

KCachegrind can draw the call graph for any function that you select in the Flat Profile view. Choose Global#[No method] to get the graph for the entire code that you profile. This is how it looks for our example program:

images/src/kcachegrind7_call_graph_global.png

Here you can see the distribution of execution time between functions even more easily than with ruby-prof’s default graph report. Also, all boxes in the chart are clickable, so you can dive in and see the call graph for any given function.

Double counting is a concern here, similar to other visualizations. Array#map has extra Incl. time once again.

Your intuition may suggest that you’ll get the most value from the call graph by reading it from top to bottom and focusing on the branch where your program spends the most time. But double counting invalidates this approach.

The right way to look at a call graph is a bottom-up, breadth-first search and optimization. First, you look at the graph leaves and optimize them. Then you move up and try to optimize the next level of nodes, and so on. In the next chapter I’ll show you how we can optimize our example application using this approach.
