Benchmarks are useful for measuring the performance of specific operations, but when we’re trying to make a slow program faster, we often have no idea where to begin. Every programmer knows Donald Knuth’s aphorism about premature optimization, which appeared in “Structured Programming with go to Statements” in 1974. Although often misinterpreted to mean performance doesn’t matter, in its original context we can discern a different meaning:
When we wish to look carefully at the speed of our programs, the best technique for identifying the critical code is profiling. Profiling is an automated approach to performance measurement based on sampling a number of profile events during execution, then extrapolating from them during a post-processing step; the resulting statistical summary is called a profile.
Go supports many kinds of profiling, each concerned with a different
aspect of performance, but all of them involve recording
a sequence of events of interest, each of which has an accompanying
stack trace—the stack of function calls active at the moment of
the event.
The go test
tool has built-in support for several
kinds of profiling.
A CPU profile identifies the functions whose execution requires the most CPU time. The currently running thread on each CPU is interrupted periodically by the operating system every few milliseconds, with each interruption recording one profile event before normal execution resumes.
A heap profile identifies the statements responsible for allocating the most memory. The profiling library samples calls to the internal memory allocation routines so that on average, one profile event is recorded per 512KB of allocated memory.
A blocking profile identifies the operations responsible for blocking goroutines the longest, such as system calls, channel sends and receives, and acquisitions of locks. The profiling library records an event every time a goroutine is blocked by one of these operations.
Gathering a profile for code under test is as easy as enabling one of the flags below. Be careful when using more than one flag at a time, however: the machinery for gathering one kind of profile may skew the results of others.
$ go test -cpuprofile=cpu.out $ go test -blockprofile=block.out $ go test -memprofile=mem.out
It’s easy to add profiling support to non-test
programs too, though the details of how we do that vary between
short-lived command-line tools and long-running server applications.
Profiling is especially useful in long-running applications, so the
Go runtime’s profiling features can be enabled under programmer
control using the runtime
API.
Once we’ve gathered a profile, we need to analyze it using the
pprof
tool.
This is a standard part of the Go distribution, but since it’s not an
everyday tool, it’s accessed indirectly using go tool pprof
.
It has dozens of features and options, but basic use
requires only two arguments, the executable that produced the profile and
the profile log.
To make profiling efficient and to save space, the log
does not include function names; instead, functions are identified by
their addresses.
This means that pprof
needs the executable in order to make
sense of the log.
Although go test
usually discards the test executable once the
test is complete, when profiling is enabled it saves the
executable as foo.test
, where foo
is
the name of the tested package.
The commands below show how to gather and display a simple CPU profile.
We’ve selected one of the benchmarks from the net/http
package.
It is usually better to profile specific benchmarks that have been
constructed to be representative of workloads one cares about.
Benchmarking test cases is almost never representative, which is why
we disabled them by using the filter -run=NONE
.
$ go test -run=NONE -bench=ClientServerParallelTLS64 -cpuprofile=cpu.log net/http PASS BenchmarkClientServerParallelTLS64-8 1000 3141325 ns/op 143010 B/op 1747 allocs/op ok net/http 3.395s $ go tool pprof -text -nodecount=10 ./http.test cpu.log 2570ms of 3590ms total (71.59%) Dropped 129 nodes (cum <= 17.95ms) Showing top 10 nodes out of 166 (cum >= 60ms) flat flat% sum% cum cum% 1730ms 48.19% 48.19% 1750ms 48.75% crypto/elliptic.p256ReduceDegree 230ms 6.41% 54.60% 250ms 6.96% crypto/elliptic.p256Diff 120ms 3.34% 57.94% 120ms 3.34% math/big.addMulVVW 110ms 3.06% 61.00% 110ms 3.06% syscall.Syscall 90ms 2.51% 63.51% 1130ms 31.48% crypto/elliptic.p256Square 70ms 1.95% 65.46% 120ms 3.34% runtime.scanobject 60ms 1.67% 67.13% 830ms 23.12% crypto/elliptic.p256Mul 60ms 1.67% 68.80% 190ms 5.29% math/big.nat.montgomery 50ms 1.39% 70.19% 50ms 1.39% crypto/elliptic.p256ReduceCarry 50ms 1.39% 71.59% 60ms 1.67% crypto/elliptic.p256Sum
The -text
flag specifies the output format, in this case, a
textual table with one row per function, sorted so the “hottest”
functions—those that consume the most CPU cycles—appear first.
The -nodecount=10
flag limits the result to 10 rows.
For gross performance problems, this textual format may be enough to
pinpoint the cause.
This profile tells us that elliptic-curve cryptography is
important to the performance of this particular HTTPS benchmark.
By contrast, if a profile is dominated by memory allocation functions from the
runtime
package, reducing memory consumption may be a
worthwhile optimization.
For more subtle problems, you may be better off
using one of pprof
’s graphical displays.
These require GraphViz, which can be downloaded
from www.graphviz.org
.
The -web
flag then
renders a directed graph of the functions of the program, annotated
by their CPU profile numbers and colored to indicate the hottest
functions.
We’ve only scratched the surface of Go’s profiling tools here. To find out more, read the “Profiling Go Programs” article on the Go Blog.
18.118.200.136