We know that assert_performance should measure the current performance, compare it with the performance from the previous run, and store the current measurements as the reference value for the next run. Of course, the first test run should just store the results because there’s no previous data to compare with.
Now let’s think through success and failure scenarios for such tests. Failure is easy. If performance is significantly worse, then report the failure. The success scenario, though, has two possible outcomes: one when performance is not significantly different, and another when it has significantly improved.
It looks like it’s not enough just to report failure/success. We need to report the current measurement, as well as any significant difference in performance.
So let’s get back to the editor and try to do exactly that.
chp8/assert_performance.rb | |
| require 'minitest/autorun' |
| |
| class Minitest::Test |
| def assert_performance(current_performance) |
| self.assertions += 1 # increase Minitest assertion counter |
| |
| benchmark_name, current_average, current_stddev = *current_performance |
| past_average, past_stddev = load_benchmark(benchmark_name) |
| save_benchmark(benchmark_name, current_average, current_stddev) |
| |
| optimization_mean, optimization_standard_error = compare_performance( |
| past_average, past_stddev, current_average, current_stddev |
| ) |
| |
| optimization_confidence_interval = [ |
| optimization_mean - 2*optimization_standard_error, |
| optimization_mean + 2*optimization_standard_error |
| ] |
| |
| conclusion = if optimization_confidence_interval.all? { |i| i < 0 } |
| :slowdown |
| elsif optimization_confidence_interval.all? { |i| i > 0 } |
| :speedup |
| else |
| :unchanged |
| end |
| |
| print "%-28s %0.3f ± %0.3f: %-10s" % |
| [benchmark_name, current_average, current_stddev, conclusion] |
| if conclusion != :unchanged |
| print " by %0.3f..%0.3f with 95%% confidence" % |
| optimization_confidence_interval |
| end |
| print "
" |
| |
| if conclusion == :slowdown |
| raise Minitest::Assertion.new("#{benchmark_name} got slower") |
| end |
| end |
| |
| private |
| |
| def load_benchmark(benchmark_name) |
| return [nil, nil] unless File.exist?("benchmarks/#{benchmark_name}") |
| benchmark = File.read("benchmarks/#{benchmark_name}") |
| benchmark.split(" ").map { |value| value.to_f } |
| end |
| |
| def save_benchmark(benchmark_name, current_average, current_stddev) |
| File.open("benchmarks/#{benchmark_name}", "w+") do |f| |
| f.write "%0.3f %0.3f" % [current_average, current_stddev] |
| end |
| end |
| |
| def compare_performance(past_average, past_stddev, |
| current_average, current_stddev) |
| # when there's no past data, just report no performance change |
| past_average ||= current_average |
| past_stddev ||= current_stddev |
| |
| optimization_mean = past_average - current_average |
| optimization_standard_error = (current_stddev**2/30 + |
| past_stddev**2/30)**0.5 |
| |
| # drop non-significant digits that our calculations might introduce |
| optimization_mean = optimization_mean.round(3) |
| optimization_standard_error = optimization_standard_error.round(3) |
| |
| [optimization_mean, optimization_standard_error] |
| end |
| end |
Again, this includes some simplifications you can easily undo. First, we save the benchmark results to a file in a predefined, hard-coded location. Second, we hard-code the number of measurement repetitions to 30, exactly as in the performance_benchmark function. And third, our assert_performance works only with Minitest 5.0 and later, so you may need to install the minitest gem if your Ruby ships with an older version.
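If you want to undo the first two simplifications, one possibility is to read the storage directory and the repetition count from environment variables with sensible defaults. This is only a sketch, not part of the listing above, and the BENCHMARK_DIR and BENCHMARK_REPETITIONS names are made up here:
| require 'fileutils' |
| |
| # Hypothetical knobs: override the storage location and repetition count |
| # via environment variables, falling back to the book's defaults. |
| BENCHMARK_DIR = ENV.fetch("BENCHMARK_DIR", "benchmarks") |
| BENCHMARK_REPETITIONS = Integer(ENV.fetch("BENCHMARK_REPETITIONS", "30")) |
| |
| def load_benchmark(benchmark_name) |
|   path = File.join(BENCHMARK_DIR, benchmark_name) |
|   return [nil, nil] unless File.exist?(path) |
|   File.read(path).split(" ").map { |value| value.to_f } |
| end |
| |
| def save_benchmark(benchmark_name, current_average, current_stddev) |
|   FileUtils.mkdir_p(BENCHMARK_DIR) # create the directory on the first run |
|   File.open(File.join(BENCHMARK_DIR, benchmark_name), "w+") do |f| |
|     f.write "%0.3f %0.3f" % [current_average, current_stddev] |
|   end |
| end |
You would then also replace the hard-coded 30 in compare_performance's standard error formula with BENCHMARK_REPETITIONS, and pass the same count to performance_benchmark.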
But now that we have our assert, we can write our first performance test.
chp8/test_assert_performance1.rb | |
| require 'assert_performance' |
| require 'performance_benchmark' |
| |
| class TestAssertPerformance < Minitest::Test |
| |
| def test_assert_performance |
| actual_performance = performance_benchmark("string operations") do |
| result = "" |
| 700.times do |
| result += ("x"*1024) |
| end |
| end |
| assert_performance actual_performance |
| end |
| |
| end |
Let’s run it (don’t forget to gem install minitest first).
| $ ruby -I . test_assert_performance1.rb |
| # Running: |
| string operations 0.172 ± 0.011: unchanged |
| . |
| Finished in 2.294557s, 0.4358 runs/s, 0.4358 assertions/s. |
| 1 runs, 1 assertions, 0 failures, 0 errors, 0 skips |
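The first run saves these measurements to the benchmarks/string operations file. If you're curious, you can peek at what was stored: save_benchmark writes the average and the standard deviation as two space-separated numbers, so after the run above the file should contain something like this (your numbers will match your own first run):
| $ cat "benchmarks/string operations" |
| 0.172 0.011 |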
If we rerun the test without making any changes, it should report no change.
| $ ruby -I . test_assert_performance1.rb |
| # Running: |
| string operations 0.168 ± 0.016: unchanged |
| . |
| Finished in 2.313815s, 0.4322 runs/s, 0.4322 assertions/s. |
| 1 runs, 1 assertions, 0 failures, 0 errors, 0 skips |
As expected, the test reports that performance hasn’t changed despite the difference in average numbers. That’s statistical analysis at work! Now you know why we spent so much time talking about it.
Now let’s optimize the program. I’ll take my own advice from Chapter 2 and replace String#+= with String#<<.
chp8/test_assert_performance2.rb | |
| require 'assert_performance' |
| require 'performance_benchmark' |
| |
| class TestAssertPerformance < Minitest::Test |
| |
| def test_assert_performance |
| actual_performance = performance_benchmark("string operations") do |
| result = "" |
| 700.times do |
* | result << ("x"*1024) |
| end |
| end |
| assert_performance actual_performance |
| end |
| |
| end |
Let’s run the performance test again.
| $ bundle exec ruby -I . test_assert_performance2.rb |
| # Running: |
| string operations 0.004 ± 0.000: speedup by 0.161..0.167 with 95% confidence |
| . |
| Finished in 1.089948s, 0.9175 runs/s, 0.9175 assertions/s. |
| 1 runs, 1 assertions, 0 failures, 0 errors, 0 skips |
And of course the test reports the huge optimization. That’s exactly what we like to see when we optimize.
However, if the execution environment isn’t perfect, our performance test might report a slowdown or optimization even if we did nothing. For example, I can get the slowdown error from the first unoptimized test on my laptop when it gets busy doing something else. This is one such test run:
| $ ruby -I . test_assert_performance1.rb |
| # Running: |
| string operations 0.201 ± 0.059: slowdown by -0.044..-0.022 with 95% confidence |
| F |
| Finished in 2.456716s, 0.4070 runs/s, 0.4070 assertions/s. |
| |
| 1) Failure: |
| TestAssertPerformance#test_assert_performance [test_assert_performance1.rb:10]: |
| string operations got slower |
| |
| 1 runs, 1 assertions, 1 failures, 0 errors, 0 skips |
See how big my standard deviation is? It's almost a third of my average. This means that some of the measurements were outliers, and they made the test fail.
We already talked about two ways of dealing with that. One is to further minimize external factors. Another is to exclude outliers. But there’s one more: you can increase the confidence level for the optimization interval.
The 95% confidence interval we use is roughly plus/minus two standard errors from the mean of the difference between before and after numbers. We can demand 99% confidence. This increases the interval to about plus/minus three standard errors.
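In assert_performance that's a one-line change: widen the interval from two to three standard errors. Here's a minimal sketch of that change; the CONFIDENCE_MULTIPLIER constant is my own addition, not part of the listing above, but it makes the knob easy to experiment with:
| # roughly 95% confidence: 2 standard errors; roughly 99%: 3 standard errors |
| CONFIDENCE_MULTIPLIER = 3 |
| |
| optimization_confidence_interval = [ |
|   optimization_mean - CONFIDENCE_MULTIPLIER*optimization_standard_error, |
|   optimization_mean + CONFIDENCE_MULTIPLIER*optimization_standard_error |
| ] |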
Let’s do some quick math to see whether that helps with my failing test. My before and after numbers are 0.168 ± 0.016 and 0.201 ± 0.059.
The mean of the difference is 0.168 - 0.201 = -0.033.
The standard error of the mean of the difference is √(0.016²/30 + 0.059²/30) ≈ 0.011.
The three standard error interval is (-0.066..0). Because that interval includes zero, we can't be 99% confident that the last run was either slower or faster. So the new conclusion is that nothing has changed.
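You can double-check this arithmetic in a couple of lines of Ruby; the values in the comments are rounded to three decimals, the same way compare_performance rounds them:
| mean = (0.168 - 0.201).round(3)                       # -0.033 |
| se   = Math.sqrt(0.016**2/30 + 0.059**2/30).round(3)  # 0.011 |
| [mean - 3*se, mean + 3*se]                            # roughly [-0.066, 0.0] |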
Note how a simple tweak of the confidence level changed the test outcome. So I recommend that you play with it and come up with the confidence level that works reliably for your performance tests.
There is, of course, a limit to how far you can increase the confidence level. See how we were barely able to determine that performance in our test stayed the same: had the standard deviation been one millisecond less, we would have declared this run a slowdown.
You might be tempted to increase the interval size to four or five standard errors from the mean. But in practice, three standard errors (99%) is the highest confidence you should aim for. You can't demand the confidence of the Large Hadron Collider experiments from your Ruby tests. If your tests are still not reliable, step back and look for more external factors, or start excluding outliers in measurements.
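Outlier exclusion has to happen before the numbers reach assert_performance, that is, on the raw timings that performance_benchmark collects. Here's one possible approach, purely illustrative and not this book's implementation: sort the raw timings, drop the extreme 10% on each side, and compute the average and standard deviation from what's left. The trim_outliers and average_and_stddev helpers are made-up names for this sketch:
| # Drop the fastest and slowest timings before computing statistics. |
| def trim_outliers(measurements, trim_fraction = 0.1) |
|   drop = (measurements.size * trim_fraction).floor |
|   sorted = measurements.sort |
|   drop.zero? ? sorted : sorted[drop...-drop] |
| end |
| |
| # Sample average and standard deviation of the remaining timings. |
| def average_and_stddev(measurements) |
|   average = measurements.sum / measurements.size |
|   variance = measurements.sum { |m| (m - average)**2 } / (measurements.size - 1) |
|   [average, Math.sqrt(variance)] |
| end |
| |
| # With 30 raw timings and a 10% trim, 24 measurements remain: |
| # average, stddev = average_and_stddev(trim_outliers(raw_timings)) |
If you go this way, remember that compare_performance divides by a hard-coded 30 when it computes the standard error, so you'd adjust that denominator to the trimmed measurement count as well.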