Chapter 15

BPF

This chapter describes the BCC and bpftrace tracing front ends for extended BPF. These front ends provide a collection of performance analysis tools, and these tools were used in previous chapters. The BPF technology was introduced in Chapter 3, Operating System, Section 3.4.4, Extended BPF. In summary, extended BPF is a kernel execution environment that can provide programmatic capabilities to tracers.

This chapter, along with Chapter 13, perf, and Chapter 14, Ftrace, are optional reading for those who wish to learn one or more system tracers in more detail.

Extended BPF tools can be used to answer questions such as:

  • What is the latency of disk I/O, as a histogram?

  • Is CPU scheduler latency high enough to cause problems?

  • Are applications suffering file system latency?

  • What TCP sessions are occurring and with what durations?

  • What code paths are blocking and for how long?

What makes BPF different from other tracers is that it is programmable. It allows user-defined programs to be executed on events, programs that can perform filtering, save and retrieve information, calculate latency, perform in-kernel aggregation and custom summaries, and more. While other tracers may require dumping all events to user space and post-processing them, BPF allows such processing to occur efficiently in kernel context. This makes it practical to create performance tools that would otherwise cost too much overhead for production use.
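
As a simple illustration of this difference, consider the following two hypothetical bpftrace one-liners (bpftrace is covered in Section 15.2). The first emits a line of output for every vfs_read() call, which must be passed to user space; the second aggregates the calls into an in-kernel count that is only read and printed when tracing ends. This is a sketch to show the idea, not a recommended tool:

bpftrace -e 'kprobe:vfs_read { printf("read by %s\n", comm); }'
bpftrace -e 'kprobe:vfs_read { @[comm] = count(); }'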

This chapter has a major section for each recommended front end. Key sections are:

  • 15.1: BCC

  • 15.2: bpftrace

The differences between BCC and bpftrace may be obvious from their usage in prior chapters: BCC is suited for complex tools, and bpftrace is suited for ad hoc custom programs. Some tools are implemented in both, as shown in Figure 15.1.

Figure 15.1 BPF tracing front ends

Specific differences between BCC and bpftrace are summarized in Table 15.1.

Table 15.1 BCC versus bpftrace

Characteristic                        | BCC                                                                  | bpftrace
Number of tools by repository         | >80 (bcc)                                                            | >30 (bpftrace), >120 (bpf-perf-tools-book)
Tool usage                            | Typically supports complex options (-h, -P PID, etc.) and arguments  | Typically simple: no options, and zero or one argument
Tool documentation                    | Man pages, example files                                             | Man pages, example files
Programming language                  | User-space: Python, Lua, C, or C++; kernel-space: C                  | bpftrace
Programming difficulty                | Difficult                                                            | Easy
Per-event output types                | Anything                                                             | Text, JSON
Summary types                         | Anything                                                             | Counts, min, max, sum, avg, log2 histograms, linear histograms; by zero or more keys
Library support                       | Yes (e.g., Python import)                                            | No
Average program length1 (no comments) | 228 lines                                                            | 28 lines

1Based on the tools provided in the official repository and my BPF book repository.

Both BCC and bpftrace are in use at many companies including Facebook and Netflix. Netflix installs them by default on all cloud instances, and uses them for deeper analysis after cloud-wide monitoring and dashboards, specifically [Gregg 18e]:

  • BCC: Canned tools are used at the command line to analyze storage I/O, network I/O, and process execution, when needed. Some BCC tools are automatically executed by a graphical performance dashboard system to provide data for scheduler and disk I/O latency heat maps, off-CPU flame graphs, and more. Also, a custom BCC tool is always running as a daemon (based on tcplife(8)) logging network events to cloud storage for flow analysis.

  • bpftrace: Custom bpftrace tools are developed when needed to understand kernel and application pathologies.

The following sections explain BCC tools, bpftrace tools, and bpftrace programming.

15.1 BCC

The BPF Compiler Collection (or “bcc” after the project and package names) is an open-source project containing a large collection of advanced performance analysis tools, as well as a framework for building them. BCC was created by Brenden Blanco; I’ve helped with its development and created many of the tracing tools.

As an example of a BCC tool, biolatency(8) shows the distribution of disk I/O latency as power-of-two histograms, and can break it down by I/O flags:

# biolatency.py -mF
Tracing block device I/O... Hit Ctrl-C to end.
^C

flags = Priority-Metadata-Read
     msecs               : count     distribution
         0 -> 1          : 90       |****************************************|

flags = Write
     msecs               : count     distribution
         0 -> 1          : 24       |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 8        |*************                           |

flags = ReadAhead-Read
     msecs               : count     distribution
         0 -> 1          : 3031     |****************************************|
         2 -> 3          : 10       |                                        |
         4 -> 7          : 5        |                                        |
         8 -> 15         : 3        |                                        |

This output shows a bi-modal write distribution, and many I/Os with the flags “ReadAhead-Read”. This tool uses BPF to summarize the histograms in kernel space for efficiency, so the user-space component only needs to read the already-summarized histograms (the count columns) and print them.

These BCC tools typically have usage messages (-h), man pages, and example files in the BCC repository:

https://github.com/iovisor/bcc

This section summarizes BCC and its single- and multi-purpose performance analysis tools.

15.1.1 Installation

Packages of BCC are available for many Linux distributions, including Ubuntu, Debian, RHEL, Fedora, and Amazon Linux, making installation trivial. Search for “bcc-tools” or “bpfcc-tools” or “bcc” (package maintainers have named it differently).

You can also build BCC from source. For the latest install and build instructions, check INSTALL.md in the BCC repository [Iovisor 20b]. The INSTALL.md also lists kernel configuration requirements (which include CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_BPF_EVENTS=y). BCC requires at least Linux 4.4 for some of the tools to work; for most of the tools, 4.9 or newer is required.
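
To check whether a kernel was built with these options, one approach (assuming the distribution installs the kernel config under /boot) is:

grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_BPF_EVENTS=' /boot/config-$(uname -r)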

15.1.2 Tool Coverage

BCC tracing tools are pictured in Figure 15.2 (some are grouped using wildcards: e.g., java* is for all tools beginning with “java”).

Figure 15.2 BCC tools

Many are single-purpose tools shown with a single arrowhead; some are multi-purpose tools listed on the left with a double arrow to show their coverage.

15.1.3 Single-Purpose Tools

I developed many of these according to the same “do one job and do it well” philosophy as those in perf-tools in Chapter 14. This design includes making their default output concise and often just sufficient. You can “just run biolatency” without needing to learn any command line options, and usually get just enough output to solve your problem without clutter. Options typically do exist for customization, such as biolatency(8) -F to break down by I/O flags, shown earlier.

A selection of single-purpose tools is described in Table 15.2, including their location in this book if present. See the BCC repository for the full list [Iovisor 20b].

Table 15.2 Selected single-purpose BCC tools

Tool              | Description                                                | Section
biolatency(8)     | Summarize block I/O (disk I/O) latency as a histogram      | 9.6.6
biotop(8)         | Summarize block I/O by process                             | 9.6.8
biosnoop(8)       | Trace block I/O with latency and other details             | 9.6.7
bitesize(8)       | Summarize block I/O size as process histograms             | -
btrfsdist(8)      | Summarize btrfs operation latency as histograms            | 8.6.13
btrfsslower(8)    | Trace slow btrfs operations                                | 8.6.14
cpudist(8)        | Summarize on- and off-CPU time per process as a histogram  | 6.6.15, 16.1.7
cpuunclaimed(8)   | Show CPU that is unclaimed and idle despite demand         | -
criticalstat(8)   | Trace long atomic critical kernel sections                 | -
dbslower(8)       | Trace database slow queries                                | -
dbstat(8)         | Summarize database query latency as a histogram            | -
drsnoop(8)        | Trace direct memory reclaim events with PID and latency    | 7.5.11
execsnoop(8)      | Trace new processes via execve(2) syscalls                 | 1.7.3, 5.5.5
ext4dist(8)       | Summarize ext4 operation latency as histograms             | 8.6.13
ext4slower(8)     | Trace slow ext4 operations                                 | 8.6.14
filelife(8)       | Trace the lifespan of short-lived files                    | -
gethostlatency(8) | Trace DNS latency via resolver functions                   | -
hardirqs(8)       | Summarize hardirq event times                              | 6.6.19
killsnoop(8)      | Trace signals issued by the kill(2) syscall                | -
klockstat(8)      | Summarize kernel mutex lock statistics                     | -
llcstat(8)        | Summarize CPU cache references and misses by process       | -
memleak(8)        | Show outstanding memory allocations                        | -
mysqld_qslower(8) | Trace MySQL slow queries                                   | -
nfsdist(8)        | Summarize NFS operation latency as histograms              | 8.6.13
nfsslower(8)      | Trace slow NFS operations                                  | 8.6.14
offcputime(8)     | Summarize off-CPU time by stack trace                      | 5.5.3
offwaketime(8)    | Summarize blocked time by off-CPU stack and waker stack    | -
oomkill(8)        | Trace the out-of-memory (OOM) killer                       | -
opensnoop(8)      | Trace open(2)-family syscalls                              | 8.6.10
profile(8)        | Profile CPU usage using timed sampling of stack traces     | 5.5.2
runqlat(8)        | Summarize run queue (scheduler) latency as a histogram     | 6.6.16
runqlen(8)        | Summarize run queue length using timed sampling            | 6.6.17
runqslower(8)     | Trace long run queue delays                                | -
syncsnoop(8)      | Trace sync(2)-family syscalls                              | -
syscount(8)       | Summarize syscall counts and latencies                     | 5.5.6
tcplife(8)        | Trace TCP sessions and summarize their lifespan            | 10.6.9
tcpretrans(8)     | Trace TCP retransmits with details including kernel state  | 10.6.11
tcptop(8)         | Summarize TCP send/recv throughput by host and PID         | 10.6.10
wakeuptime(8)     | Summarize sleep to wakeup time by waker stack              | -
xfsdist(8)        | Summarize xfs operation latency as histograms              | 8.6.13
xfsslower(8)      | Trace slow xfs operations                                  | 8.6.14
zfsdist(8)        | Summarize zfs operation latency as histograms              | 8.6.13
zfsslower(8)      | Trace slow zfs operations                                  | 8.6.14

For examples of these, see previous chapters as well as the *_example.txt files in the BCC repository (many of which I also wrote). For the tools not covered in this book, also see [Gregg 19].

15.1.4 Multi-Purpose Tools

The multi-purpose tools are listed on the left of Figure 15.2. These support multiple event sources and can perform many roles, similar to perf(1), although this also makes them complex to use. They are described in Table 15.3.

Table 15.3 Multi-purpose BCC tools

Tool           | Description                                               | Section
argdist(8)     | Display function parameter values as a histogram or count | 15.1.5
funccount(8)   | Count kernel or user-level function calls                 | 15.1.5
funcslower(8)  | Trace slow kernel or user-level function calls            | -
funclatency(8) | Summarize function latency as a histogram                 | -
stackcount(8)  | Count stack traces that led to an event                   | 15.1.5
trace(8)       | Trace arbitrary functions with filters                    | 15.1.5

To help you remember useful invocations, you can collect one-liners. I have provided some in the next section, similar to my one-liner sections for perf(1) and trace-cmd.

15.1.5 One-Liners

The following one-liners trace system-wide until Ctrl-C is typed, unless otherwise specified. They are grouped by tool.

funccount(8)

Count VFS kernel calls:

funccount 'vfs_*'

Count TCP kernel calls:

funccount 'tcp_*'

Count TCP send calls per second:

funccount -i 1 'tcp_send*'

Show the rate of block I/O events per second:

funccount -i 1 't:block:*'

Show the rate of libc getaddrinfo() (name resolution) per second:

funccount -i 1 c:getaddrinfo

stackcount(8)

Count stack traces that created block I/O:

stackcount t:block:block_rq_insert

Count stack traces that led to sending IP packets, with responsible PID:

stackcount -P ip_output

Count stack traces that led to the thread blocking and moving off-CPU:

stackcount t:sched:sched_switch

trace(8)

Trace the kernel do_sys_open() function with the filename:

trace 'do_sys_open "%s", arg2'

Trace the return of the kernel function do_sys_open() and print the return value:

trace 'r::do_sys_open "ret: %d", retval'

Trace the kernel function do_nanosleep() with mode and user-level stacks:

trace -U 'do_nanosleep "mode: %d", arg2'

Trace authentication requests via the pam library:

trace 'pam:pam_start "%s: %s", arg1, arg2'

argdist(8)

Summarize VFS reads by return value (size or error):

argdist -H 'r::vfs_read()'

Summarize libc read() by return value (size or error) for PID 1005:

argdist -p 1005 -H 'r:c:read()'

Count syscalls by syscall ID:

argdist.py -C 't:raw_syscalls:sys_enter():int:args->id'

Summarize the kernel function tcp_sendmsg() size argument using counts:

argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'

Summarize tcp_sendmsg() size as a power-of-two histogram:

argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'

Count the libc write() call for PID 181 by file descriptor:

argdist -p 181 -C 'p:c:write(int fd):int:fd'

Summarize reads by process where latency was >100 μs:

argdist -C 'r::__vfs_read():u32:$PID:$latency > 100000'

15.1.6 Multi-Tool Example

As an example of using a multi-tool, the following shows the trace(8) tool tracing the kernel function do_sys_open(), and printing the second argument as a string:

# trace 'do_sys_open "%s", arg2'
PID     TID     COMM        FUNC             -
28887   28887   ls          do_sys_open      /etc/ld.so.cache
28887   28887   ls          do_sys_open      /lib/x86_64-linux-gnu/libselinux.so.1
28887   28887   ls          do_sys_open      /lib/x86_64-linux-gnu/libc.so.6
28887   28887   ls          do_sys_open      /lib/x86_64-linux-gnu/libpcre2-8.so.0
28887   28887   ls          do_sys_open      /lib/x86_64-linux-gnu/libdl.so.2
28887   28887   ls          do_sys_open      /lib/x86_64-linux-gnu/libpthread.so.0
28887   28887   ls          do_sys_open      /proc/filesystems
28887   28887   ls          do_sys_open      /usr/lib/locale/locale-archive
[...]

The trace syntax is inspired by printf(3), supporting a format string and arguments. In this case arg2, the second argument, was printed as a string because it contains the filename.

Both trace(8) and argdist(8) support syntax that allows many custom one-liners to be created. bpftrace, covered in the following sections, takes this further, providing a fully fledged language for writing one-line or multi-line programs.

15.1.7 BCC vs. bpftrace

The differences were summarized at the start of this chapter. BCC is suited for custom and complex tools, which support a variety of arguments, or use a variety of libraries. bpftrace is well suited for one-liners or short tools that accept no arguments, or a single-integer argument. BCC allows the BPF program at the heart of the tracing tool to be developed in C, enabling full control. This comes at the cost of complexity: BCC tools can take ten times as long to develop as bpftrace tools, and can have ten times as many lines of code. Since developing a tool typically requires multiple iterations, I’ve found that it saves time to first develop tools in bpftrace, which is quicker, and then port them to BCC if needed.

The difference between BCC and bpftrace is like the difference between C programming and shell scripting, where BCC is like C programming (some of it is C programming) and bpftrace is like shell scripting. In my daily work I use many pre-built C programs (top(1), vmstat(1), etc.) and develop custom one-off shell scripts. Likewise, I also use many pre-built BCC tools, and develop custom one-off bpftrace tools.

I have provided material in this book to support this usage: many chapters show the BCC tools you can use, and the later sections in this chapter show how you can develop custom bpftrace tools.

15.1.8 Documentation

Tools typically have a usage message to summarize their syntax. For example:

# funccount -h
usage: funccount [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T] [-r] [-D]
                 pattern

Count functions, tracepoints, and USDT probes

positional arguments:
  pattern               search expression for events

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -i INTERVAL, --interval INTERVAL
                        summary interval, seconds
  -d DURATION, --duration DURATION
                        total duration of trace, seconds
  -T, --timestamp       include timestamp on output
  -r, --regexp          use regular expressions. Default is "*" wildcards
                        only.
  -D, --debug           print BPF program before starting (for debugging
                        purposes)
examples:
    ./funccount 'vfs_*'             # count kernel fns starting with "vfs"
    ./funccount -r '^vfs.*'         # same as above, using regular expressions
    ./funccount -Ti 5 'vfs_*'       # output every 5 seconds, with timestamps
    ./funccount -d 10 'vfs_*'       # trace for 10 seconds only
    ./funccount -p 185 'vfs_*'      # count vfs calls for PID 181 only
    ./funccount t:sched:sched_fork  # count calls to the sched_fork tracepoint
    ./funccount -p 185 u:node:gc*   # count all GC USDT probes in node, PID 185
    ./funccount c:malloc            # count all malloc() calls in libc
    ./funccount go:os.*             # count all "os.*" calls in libgo
    ./funccount -p 185 go:os.*      # count all "os.*" calls in libgo, PID 185
    ./funccount ./test:read*        # count "read*" calls in the ./test binary

Every tool also has a man page (man/man8/funccount.8) and an examples file (examples/funccount_example.txt) in the bcc repository. The examples file contains output examples with commentary.

I have also created the following documentation in the BCC repository [Iovisor 20b]:

  • Tutorial for end users: docs/tutorial.md

  • Tutorial for BCC developers: docs/tutorial_bcc_python_developer.md

  • Reference Guide: docs/reference_guide.md

Chapter 4 in my earlier book focuses on BCC [Gregg 19].

15.2 bpftrace

bpftrace is an open-source tracer built upon BPF and BCC, which provides not only a suite of performance analysis tools, but also a high-level language to help you develop new ones. The language has been designed to be simple and easy to learn, and can be considered the awk(1) of tracing, on which it is based: in awk(1) you write a program stanza to process an input line, and with bpftrace you write a program stanza to process an input event. bpftrace was created by Alastair Robertson, and I have become a major contributor.

As an example of bpftrace, the following one-liner shows the distribution of TCP receive message size by process name:

# bpftrace -e 'kr:tcp_recvmsg /retval >= 0/ { @recv_bytes[comm] = hist(retval); }'
Attaching 1 probe...
^C

@recv_bytes[sshd]:
[32, 64)               7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)              2 |@@@@@@@@@@@@@@                                      |
@recv_bytes[nodejs]:
[0]                   82 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[1]                  135 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[2, 4)               153 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
[4, 8)                12 |@@@                                                 |
[8, 16)                6 |@                                                   |
[16, 32)              32 |@@@@@@@@@@                                          |
[32, 64)             158 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)            155 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256)            14 |@@@@                                                |

This output shows nodejs processes that have a bi-modal receive size, with one mode roughly 0 to 4 bytes and another between 32 and 128 bytes.

Using a concise syntax, this bpftrace one-liner has used a kretprobe to instrument tcp_recvmsg(), filtered on the return value being non-negative (to exclude negative error codes), and populated a BPF map object called @recv_bytes with a histogram of the return value, saved using the process name (comm) as a key. When Ctrl-C is typed and bpftrace receives the signal (SIGINT), it ends and automatically prints out BPF maps. This syntax is explained in more detail in the following sections.

As well as enabling you to code your own one-liners, bpftrace ships with many ready-to-run tools in its repository:

https://github.com/iovisor/bpftrace

This section summarizes bpftrace tools and the bpftrace programming language. This is based on my bpftrace material in [Gregg 19], which explores bpftrace in more depth.

15.2.1 Installation

Packages of bpftrace are available for many Linux distributions, including Ubuntu, making installation trivial. Search for packages named “bpftrace”; they exist for Ubuntu, Fedora, Gentoo, Debian, OpenSUSE, and CentOS. RHEL 8.2 has bpftrace as a Technology Preview.

Apart from packages, there are also Docker images of bpftrace, bpftrace binaries that require no dependencies other than glibc, and instructions for building bpftrace from source. For documentation on these options see INSTALL.md in the bpftrace repository [Iovisor 20a], which also lists kernel requirements (which include CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_BPF_EVENTS=y). bpftrace requires Linux 4.9 or newer.

15.2.2 Tools

bpftrace tracing tools are pictured in Figure 15.3.

Figure 15.3 bpftrace tools

The tools in the bpftrace repository are shown in black. For my prior book, I developed many more bpftrace tools and released them as open source in the bpf-perf-tools-book repository: they are shown in red/gray [Gregg 19g].

15.2.3 One-Liners

The following one-liners trace system-wide until Ctrl-C is typed, unless otherwise specified. Apart from their intrinsic usefulness, they can also serve as mini-examples of the bpftrace programming language. These are grouped by target. Longer lists of bpftrace one-liners can be found in each resource chapter.

CPUs

Trace new processes with arguments:

bpftrace -e 'tracepoint:syscalls:sys_enter_execve { join(args->argv); }'

Count syscalls by process:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[pid, comm] = count(); }'

Sample user-level stacks at 49 Hertz, for PID 189:

bpftrace -e 'profile:hz:49 /pid == 189/ { @[ustack] = count(); }'

Memory

Count process heap expansion (brk()) by code path:

bpftrace -e 'tracepoint:syscalls:sys_enter_brk { @[ustack, comm] = count(); }'

Count user page faults by user-level stack trace:

bpftrace -e 'tracepoint:exceptions:page_fault_user { @[ustack, comm] =
    count(); }'

Count vmscan operations by tracepoint:

bpftrace -e 'tracepoint:vmscan:* { @[probe]++; }'

File Systems

Trace files opened via openat(2) with process name:

bpftrace -e 't:syscalls:sys_enter_openat { printf("%s %s\n", comm,
    str(args->filename)); }'

Show the distribution of read() syscall read bytes (and errors):

bpftrace -e 'tracepoint:syscalls:sys_exit_read { @ = hist(args->ret); }'

Count VFS calls:

bpftrace -e 'kprobe:vfs_* { @[probe] = count(); }'

Count ext4 tracepoint calls:

bpftrace -e 'tracepoint:ext4:* { @[probe] = count(); }'

Disk

Summarize block I/O size as a histogram:

bpftrace -e 't:block:block_rq_issue { @bytes = hist(args->bytes); }'

Count block I/O request user stack traces:

bpftrace -e 't:block:block_rq_issue { @[ustack] = count(); }'

Count block I/O type flags:

bpftrace -e 't:block:block_rq_issue { @[args->rwbs] = count(); }'

Networking

Count socket accept(2)s by PID and process name:

bpftrace -e 't:syscalls:sys_enter_accept* { @[pid, comm] = count(); }'

Count socket send/receive bytes by on-CPU PID and process name:

bpftrace -e 'kr:sock_sendmsg,kr:sock_recvmsg /retval > 0/ {
    @[pid, comm] = sum(retval); }'

TCP send bytes as a histogram:

bpftrace -e 'k:tcp_sendmsg { @send_bytes = hist(arg2); }'

TCP receive bytes as a histogram:

bpftrace -e 'kr:tcp_recvmsg /retval >= 0/ { @recv_bytes = hist(retval); }'

UDP send bytes as a histogram:

bpftrace -e 'k:udp_sendmsg { @send_bytes = hist(arg2); }'

Applications

Sum malloc() requested bytes by user stack trace (high overhead):

bpftrace -e 'u:/lib/x86_64-linux-gnu/libc-2.27.so:malloc { @[ustack(5)] =
    sum(arg0); }'

Trace kill() signals showing sender process name, target PID, and signal number:

bpftrace -e 't:syscalls:sys_enter_kill { printf("%s -> PID %d SIG %d\n",
    comm, args->pid, args->sig); }'

Kernel

Count system calls by syscall function:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter {
    @[ksym(*(kaddr("sys_call_table") + args->id * 8))] = count(); }'

Count kernel function calls starting with “attach”:

bpftrace -e 'kprobe:attach* { @[probe] = count(); }'

Frequency count the third argument to vfs_write() (the size):

bpftrace -e 'kprobe:vfs_write { @[arg2] = count(); }'

Time the kernel function vfs_read() and summarize as a histogram:

bpftrace -e 'k:vfs_read { @ts[tid] = nsecs; } kr:vfs_read /@ts[tid]/ {
    @ = hist(nsecs - @ts[tid]); delete(@ts[tid]); }'

Count context switch stack traces:

bpftrace -e 't:sched:sched_switch { @[kstack, ustack, comm] = count(); }'

Sample kernel-level stacks at 99 Hertz, excluding idle:

bpftrace -e 'profile:hz:99 /pid/ { @[kstack] = count(); }'

15.2.4 Programming

This section provides a short guide to using bpftrace and programming in the bpftrace language. The format of this section was inspired by the original paper for awk [Aho 78][Aho 88], which covered that language in six pages. The bpftrace language itself is inspired by both awk and C, and by tracers including DTrace and SystemTap.

The following is an example of bpftrace programming: It measures the time in the vfs_read() kernel function and prints the time, in microseconds, as a histogram.

#!/usr/local/bin/bpftrace

// this program times vfs_read()

kprobe:vfs_read
{
        @start[tid] = nsecs;
}

kretprobe:vfs_read
/@start[tid]/
{
        $duration_us = (nsecs - @start[tid]) / 1000;
        @us = hist($duration_us);
        delete(@start[tid]);
}

The following sections explain the components of this tool, and can be treated as a tutorial. Section 15.2.5, Reference, is a reference guide summary including probe types, tests, operators, variables, functions, and map types.

1. Usage

The command

bpftrace -e program

will execute the program, instrumenting any events that it defines. The program will run until Ctrl-C, or until it explicitly calls exit(). A bpftrace program run as a -e argument is termed a one-liner. Alternatively, the program can be saved to a file and executed using:

bpftrace file.bt

The .bt extension is not necessary, but is helpful for later identification. By placing an interpreter line at the top of the file2

2Some people prefer using #!/usr/bin/env bpftrace, so that bpftrace can be found from $PATH. However, env(1) comes with various problems, and its usage in other projects has been reverted.

#!/usr/local/bin/bpftrace

the file can be made executable (chmod a+x file.bt) and run like any other program:

./file.bt

bpftrace must be executed by the root user (super-user).3 For some environments, the root shell may be used to execute the program directly, whereas other environments may have a preference for running privileged commands via sudo(1):

sudo ./file.bt

3bpftrace checks for UID 0; a future update may check for specific privileges.

2. Program Structure

A bpftrace program is a series of probes with associated actions:

probes { actions }
probes { actions }
...

When the probes fire, the associated action is executed. An optional filter expression can be included before the action:

probes /filter/ { actions }

The action only fires if the filter expression is true. This resembles the awk(1) program structure:

/pattern/ { actions }

awk(1) programming is also similar to bpftrace programming: Multiple action blocks can be defined, and they may execute in any order, triggered when their pattern, or probe + filter expression, is true.
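
For example, the following sketch has two action blocks: the first prints a message when tracing begins, and the second counts vfs_read() calls, filtered to processes named "bash" (the probe and filter here are only illustrative):

BEGIN { printf("Tracing vfs_read() by bash... Hit Ctrl-C to end.\n"); }
kprobe:vfs_read /comm == "bash"/ { @reads = count(); }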

3. Comments

For bpftrace program files, single-line comments can be added with a “//” prefix:

// this is a comment

These comments will not be executed. Multi-line comments use the same format as those in C:

/*
 * This is a
 * multi-line comment.
 */

This syntax can also be used for partial-line comments (e.g., /* comment */).

4. Probe Format

A probe begins with a probe type name and then a hierarchy of colon-delimited identifiers:

type:identifier1[:identifier2[...]]

The hierarchy is defined by the probe type. Consider these two examples:

kprobe:vfs_read
uprobe:/bin/bash:readline

The kprobe probe type instruments kernel function calls, and only needs one identifier: the kernel function name. The uprobe probe type instruments user-level function calls, and needs both the path to the binary and the function name.

Multiple probes can be specified with comma separators to execute the same actions. For example:

probe1,probe2,... { actions }

There are two special probe types that require no additional identifiers: BEGIN and END fire for the beginning and the end of the bpftrace program (just like awk(1)). For example, to print an informational message when tracing begins:

BEGIN { printf("Tracing. Hit Ctrl-C to end.
"); }

To learn more about the probe types and their usage, see Section 15.2.5, Reference, under the heading 1. Probe Types.

5. Probe Wildcards

Some probe types accept wildcards. The probe

kprobe:vfs_*

will instrument all kprobes (kernel functions) that begin with “vfs_”.

Instrumenting too many probes may cost unnecessary performance overhead. To avoid hitting this by accident, bpftrace has a tunable maximum number of probes it will enable, set via the BPFTRACE_MAX_PROBES environment variable (it currently defaults to 512).4

4More than 512 currently makes bpftrace slow to start up and shut down, as it instruments them one by one. Future kernel work is planned to batch probe instrumentation. At that point, this limit can be greatly increased, or even removed.
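
If a wildcard legitimately needs to match more probes than the default allows, the limit can be raised for a single invocation by setting this environment variable (the value shown is only illustrative):

BPFTRACE_MAX_PROBES=1000 bpftrace -e 'kprobe:vfs_* { @[probe] = count(); }'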

You can test your wildcards before using them by running bpftrace -l to list the matched probes:

# bpftrace -l 'kprobe:vfs_*'
kprobe:vfs_fallocate
kprobe:vfs_truncate
kprobe:vfs_open
kprobe:vfs_setpos
kprobe:vfs_llseek
[…]
# bpftrace -l 'kprobe:vfs_*' | wc -l
56

This matched 56 probes. The probe name is in quotes to prevent unintended shell expansion.

6. Filters

Filters are Boolean expressions that gate whether an action is executed. The filter

/pid == 123/

will execute the action only if the pid built-in (process ID) is equal to 123.

If a test is not specified

/pid/

the filter will check that the contents are non-zero (/pid/ is the same as /pid != 0/). Filters can be combined with Boolean operators, such as logical AND (&&). For example:

/pid > 100 && pid < 1000/

This requires that both expressions evaluate to “true.”
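
As a concrete example, the following one-liner applies a combined filter, counting openat(2) syscalls only for processes named "bash" with a PID greater than 1000 (the values here are arbitrary, for illustration):

bpftrace -e 't:syscalls:sys_enter_openat /comm == "bash" && pid > 1000/ { @opens = count(); }'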

7. Actions

An action can be a single statement or multiple statements separated by semicolons:

{ action one; action two; action three }

The final statement may also have a semicolon appended. The statements are written in the bpftrace language, which is similar to the C language, and can manipulate variables and execute bpftrace function calls. For example, the action

{ $x = 42; printf("$x is %d", $x); }

sets a variable, $x, to 42, and then prints it using printf(). For a summary of other available function calls, see Section 15.2.5, Reference, under headings 4. Functions and 5. Map Functions.

8. Hello, World!

You should now understand the following basic program, which prints “Hello, World!” when bpftrace begins running:

# bpftrace -e 'BEGIN { printf("Hello, World!
"); }'
Attaching 1 probe...
Hello, World!
^C

As a file, it could be formatted as:

#!/usr/local/bin/bpftrace

BEGIN
{
        printf("Hello, World!
");
}

Spanning multiple lines with an indented action block is not necessary, but it improves readability.

9. Functions

In addition to printf() for printing formatted output, other built-in functions include:

  • exit(): Exits bpftrace

  • str(char *): Returns a string from a pointer

  • system(format[, arguments ...]): Runs a command at the shell

The action

printf("got: %llx %s
", $x, str($x)); exit();

will print the $x variable as a hex integer, and then treat it as a NULL-terminated character array pointer (char *) and print it as a string, and then exit.

10. Variables

There are three variable types: built-ins, scratch, and maps.

Built-in variables are pre-defined and provided by bpftrace, and are usually read-only sources of information. They include pid for the process id, comm for the process name, nsecs for a timestamp in nanoseconds, and curtask for the address of the current thread’s task_struct.

Scratch variables can be used for temporary calculations and have the prefix “$”. Their name and type is set on their first assignment. The statements:

$x = 1;
$y = "hello";
$z = (struct task_struct *)curtask;

declare $x as an integer, $y as a string, and $z as a pointer to a struct task_struct. These variables can only be used in the action block in which they were assigned. If variables are referenced without an assignment, bpftrace prints an error (which can help you catch typos).

Map variables use the BPF map storage object and have the prefix “@”. They can be used for global storage, passing data between actions. The program:

probe1 { @a = 1; }
probe2 { $x = @a; }

assigns 1 to @a when probe1 fires, then assigns @a to $x when probe2 fires. If probe1 fired first and then probe2, $x would be set to 1; otherwise 0 (uninitialized).

A key can be provided with one or more elements, using maps as a hash table (an associative array). The statement

@start[tid] = nsecs;

is frequently used: the nsecs built-in is assigned to a map named @start and keyed on tid, the current thread ID. This allows threads to store custom timestamps that won’t be overwritten by other threads.

@path[pid, $fd] = str(arg0);

is an example of a multi-key map, one using both the pid builtin and the $fd variable as keys.

11. Map Functions

Maps can be assigned to special functions. These functions store and print data in custom ways. The assignment

@x = count();

counts events, and when printed will print the count. This uses a per-CPU map, and @x becomes a special object of type count. The following statement also counts events:

@x++;

However, this uses a global map, instead of a per-CPU map, to provide @x as an integer. This global integer type is sometimes necessary for some programs that require an integer and not a count, but bear in mind that there may be a small error margin due to concurrent updates.

The assignment

@y = sum($x);

sums the $x variable, and when printed will print the total. The assignment

@z = hist($x);

stores $x in a power-of-two histogram, and when printed will print bucket counts and an ASCII histogram.

Some map functions operate directly on a map. For example:

print(@x);

will print the @x map. This can be used, for example, to print map contents on an interval event. This is not used often because, for convenience, all maps are automatically printed when bpftrace terminates.5

5There is also less overhead in printing maps when bpftrace terminates, as at runtime the maps are experiencing updates, which can slow down the map walk routine.
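
When per-interval output is wanted, print() can be combined with an interval probe. A minimal sketch, counting vfs_read() calls and printing and clearing the count every second:

bpftrace -e 'kprobe:vfs_read { @reads = count(); }
    interval:s:1 { print(@reads); clear(@reads); }'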

Some map functions operate on a map key. For example:

delete(@start[tid]);

deletes the key-value pair from the @start map where the key is tid.

12. Timing vfs_read()

You have now learned the syntax needed to understand a more involved and practical example. This program, vfsread.bt, times the vfs_read kernel function and prints out a histogram of its duration in microseconds (us):

#!/usr/local/bin/bpftrace

// this program times vfs_read()

kprobe:vfs_read
{
        @start[tid] = nsecs;
}

kretprobe:vfs_read
/@start[tid]/
{
        $duration_us = (nsecs - @start[tid]) / 1000;
        @us = hist($duration_us);
        delete(@start[tid]);
}

This times the duration of the vfs_read() kernel function by instrumenting its start using a kprobe and storing a timestamp in a @start hash keyed on thread ID, and then instrumenting its end by using a kretprobe and calculating the delta as: now - start. A filter is used to ensure that the start time was recorded; otherwise, the delta calculation becomes bogus for vfs_read() calls that were in progress when tracing begins, as the end is seen but not the start (the delta would become: now - 0).

Sample output:

# bpftrace vfsread.bt
Attaching 2 probes...
^C

@us:
[0]                   23 |@                                                   |
[1]                  138 |@@@@@@@@@                                           |
[2, 4)               538 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[4, 8)               744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)              641 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[16, 32)             122 |@@@@@@@@                                            |
[32, 64)              13 |                                                    |
[64, 128)             17 |@                                                   |
[128, 256)             2 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              1 |                                                    |

The program ran until Ctrl-C was entered; then it printed this output and terminated. This histogram map was named “us” as a way to include units with the output, since the map name is printed out. By giving maps meaningful names like “bytes” and “latency_ns” you can annotate the output and make it self-explanatory.

This script can be customized as needed. Consider changing the hist() assignment line to:

@us[pid, comm] = hist($duration_us);

This stores one histogram per process ID and process name pair. With traditional system tools, like iostat(1) and vmstat(1), the output is fixed and cannot be easily customized. But with bpftrace, the metrics you see can be further broken down into parts and enhanced with metrics from other probes until you have the answers you need.

See Chapter 8, File Systems, Section 8.6.15, bpftrace, heading VFS Latency Tracing, for an extended example that breaks down vfs_read() latency by type: file system, socket, etc.

15.2.5 Reference

The following is a summary of the main components of bpftrace programming: probe types, flow control, variables, functions, and map functions.

1. Probe Types

Table 15.4 lists available probe types. Many of these also have a shortcut alias, which helps create shorter one-liners.

Table 15.4 bpftrace probe types

Type       | Shortcut | Description
tracepoint | t        | Kernel static instrumentation points
usdt       | U        | User-level statically defined tracing
kprobe     | k        | Kernel dynamic function instrumentation
kretprobe  | kr       | Kernel dynamic function return instrumentation
kfunc      | f        | Kernel dynamic function instrumentation (BPF based)
kretfunc   | fr       | Kernel dynamic function return instrumentation (BPF based)
uprobe     | u        | User-level dynamic function instrumentation
uretprobe  | ur       | User-level dynamic function return instrumentation
software   | s        | Kernel software-based events
hardware   | h        | Hardware counter-based instrumentation
watchpoint | w        | Memory watchpoint instrumentation
profile    | p        | Timed sampling across all CPUs
interval   | i        | Timed reporting (from one CPU)
BEGIN      |          | Start of bpftrace
END        |          | End of bpftrace

Most of these probe types are interfaces to existing kernel technologies. Chapter 4 explains how these technologies work: kprobes, uprobes, tracepoints, USDT, and PMCs (used by the hardware probe type). The kfunc/kretfunc probe type is a new low-overhead interface based on eBPF trampolines and BTF.
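
For example, on a kernel and bpftrace build with BTF and kfunc support, a kfunc probe can be used much like a kprobe; this sketch counts vfs_read() calls via the BPF-based interface:

bpftrace -e 'kfunc:vfs_read { @calls = count(); }'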

Some probes may fire frequently, such as for scheduler events, memory allocations, and network packets. To reduce overhead, try to solve your problems by using less-frequent events wherever possible. If you are unsure about probe frequency, you can measure it using bpftrace. For example, counting vfs_read() kprobe calls for one second only:

# bpftrace -e 'k:vfs_read { @ = count(); } interval:s:1 { exit(); }'

I chose a short duration to minimize the overhead cost, in case it was significant. What I would consider high or low frequency depends on your CPU speed, count, and headroom, and the cost of the probe instrumentation. As a rough guide for the computers of today, I would consider less than 100k kprobe or tracepoint events per second to be low frequency.

Probe Arguments

Each probe type provides different types of arguments for further context on events. For example, tracepoints provide the fields from the tracepoint format file, accessible by their field names via an args data structure. As an example, the following instruments the syscalls:sys_enter_read tracepoint and uses the args->count argument to record a histogram of the count argument (the requested size):

bpftrace -e 'tracepoint:syscalls:sys_enter_read { @req_bytes = hist(args->count); }'

These fields can be listed from the format file in /sys or from bpftrace using -lv:

# bpftrace -lv 'tracepoint:syscalls:sys_enter_read'
tracepoint:syscalls:sys_enter_read
    int __syscall_nr;
    unsigned int fd;
    char * buf;
    size_t count;

See the online “bpftrace Reference Guide” for a description of each probe type and its arguments [Iovisor 20c].

2. Flow Control

There are three types of tests in bpftrace: filters, ternary operators, and if statements. These tests conditionally change the flow of the program based on Boolean expressions, which support those shown in Table 15.5.

Table 15.5 bpftrace Boolean expressions

Expression | Description
==         | Equal to
!=         | Not equal to
>          | Greater than
<          | Less than
>=         | Greater than or equal to
<=         | Less than or equal to
&&         | And
||         | Inclusive or

Expressions may be grouped using parentheses.

Filter

Introduced earlier, these gate whether an action is executed. Format:

probe /filter/ { action }

Boolean operators may be used. The filter /pid == 123/ only executes the action if the pid built-in equals 123.

Ternary Operators

A ternary operator is a three-element operator composed of a test and two outcomes. Format:

test ? true_statement : false_statement

As an example, you can use a ternary operator to find the absolute value of $x:

$abs = $x >= 0 ? $x : -$x;

If Statements

If statements have the following syntax:

if (test) { true_statements }
if (test) { true_statements } else { false_statements }

One use case is with programs that perform different actions on IPv4 than on IPv6. For example (for simplicity, this ignores families other than IPv4 and IPv6):

if ($inet_family == $AF_INET) {
    // IPv4
    ...
} else {
    // assume IPv6
    ...
}

Since bpftrace v0.10.0, else if statements are also supported.6

6Thanks Daniel Xu (PR#1211).

Loops

bpftrace supports unrolled loops using unroll(). For Linux 5.3 and later kernels, while() loops are also supported7:

7Thanks Bas Smit for adding the bpftrace logic (PR#1066).

while (test) {
    statements
}

This uses the kernel BPF loop support added in Linux 5.3.
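
For example, this sketch uses a while() loop in a BEGIN block to print the numbers 0 to 4 and then exit (it requires the loop support described above):

bpftrace -e 'BEGIN { $i = 0; while ($i < 5) { printf("%d\n", $i); $i++; } exit(); }'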

Operators

An earlier section listed Boolean operators for use in tests. bpftrace also supports the operators shown in Table 15.6.

Table 15.6 bpftrace operators

Operator                             | Description
=                                    | Assignment
+, -, *, /                           | Addition, subtraction, multiplication, division (integers only)
++, --                               | Auto-increment, auto-decrement
&, |, ^                              | Binary and, binary or, binary exclusive or
!                                    | Logical not
<<, >>                               | Logical shift left, logical shift right
+=, -=, *=, /=, %=, &=, ^=, <<=, >>= | Compound operators

These operators were modeled after similar operators in the C programming language.
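
As an example of the compound operators, the following sketch uses += on a map to sum successful read(2) return sizes by process name (the sum() map function would also work; this shows the operator form):

bpftrace -e 't:syscalls:sys_exit_read /args->ret > 0/ { @bytes[comm] += args->ret; }'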

3. Variables

The built-in variables provided by bpftrace are usually for read-only access of information. Important built-in variables are listed in Table 15.7.

Table 15.7 bpftrace selected built-in variables

Built-in Variable | Type           | Description
pid               | integer        | Process ID (kernel tgid)
tid               | integer        | Thread ID (kernel pid)
uid               | integer        | User ID
username          | string         | Username
nsecs             | integer        | Timestamp, in nanoseconds
elapsed           | integer        | Timestamp, in nanoseconds, since bpftrace initialization
cpu               | integer        | Processor ID
comm              | string         | Process name
kstack            | string         | Kernel stack trace
ustack            | string         | User-level stack trace
arg0, ..., argN   | integer        | Arguments to some probe types
args              | struct         | Arguments to some probe types
sarg0, ..., sargN | integer        | Stack-based arguments to some probe types
retval            | integer        | Return value for some probe types
func              | string         | Name of the traced function
probe             | string         | Full name of the current probe
curtask           | struct/integer | Kernel task_struct (either as a task_struct or an unsigned 64-bit integer, depending on the availability of type info)
cgroup            | integer        | Default cgroup v2 ID for the current process (for comparisons with cgroupid())
$1, ..., $N       | int, char *    | Positional parameters for the bpftrace program

All integers are currently uint64. These variables all refer to the currently running thread, probe, function, and CPU when the probe fires.

Various builtins have been demonstrated earlier in this chapter: retval, comm, tid, and nsecs. See the online “bpftrace Reference Guide” for the full and updated list of built-in variables [Iovisor 20c].
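
One group of builtins not yet demonstrated is the positional parameters ($1, ..., $N), which take their values from extra command-line arguments to the program. For example, this sketch samples user-level stacks only for a PID passed as the first argument (189 is just a placeholder PID):

bpftrace -e 'profile:hz:49 /pid == $1/ { @[ustack] = count(); }' 189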

4. Functions

Table 15.8 lists selected built-in functions for various tasks. Some of these have been used in earlier examples, such as printf().

Table 15.8 bpftrace selected built-in functions

Function                                | Description
printf(char *fmt [, ...])               | Prints formatted output
time(char *fmt)                         | Prints formatted time
join(char *arr[])                       | Prints the array of strings, joined by a space character
str(char *s [, int len])                | Returns the string from the pointer s, with an optional length limit
buf(void *d [, int length])             | Returns a hexadecimal string version of the data pointer
strncmp(char *s1, char *s2, int length) | Compares two strings up to length characters
sizeof(expression)                      | Returns the size of the expression or data type
kstack([int limit])                     | Returns a kernel stack up to limit frames deep
ustack([int limit])                     | Returns a user stack up to limit frames deep
ksym(void *p)                           | Resolves the kernel address and returns the string symbol
usym(void *p)                           | Resolves the user-space address and returns the string symbol
kaddr(char *name)                       | Resolves the kernel symbol name to an address
uaddr(char *name)                       | Resolves the user-space symbol name to an address
reg(char *name)                         | Returns the value stored in the named register
ntop([int af,] int addr)                | Returns a string representation of an IPv4/IPv6 address
cgroupid(char *path)                    | Returns a cgroup ID for the given path (/sys/fs/cgroup/...)
system(char *fmt [, ...])               | Executes a shell command
cat(char *filename)                     | Prints the contents of a file
signal(char[] sig | u32 sig)            | Sends a signal to the current task (e.g., SIGTERM)
override(u64 rc)                        | Overrides a kprobe return value8
exit()                                  | Exits bpftrace

8WARNING: Only use this if you know what you are doing: a small mistake could panic or corrupt the kernel.

Some of these functions are asynchronous: The kernel queues the event, and a short time later it is processed in user space. The asynchronous functions are printf(), time(), cat(), join(), and system(). The functions kstack(), ustack(), ksym(), and usym() record addresses synchronously, but do symbol translation asynchronously.

As an example, the following uses both the printf() and str() functions to show the filename of openat(2) syscalls:

# bpftrace -e 't:syscalls:sys_enter_openat { printf("%s %s\n", comm,
    str(args->filename)); }'
Attaching 1 probe...
top /etc/ld.so.cache
top /lib/x86_64-linux-gnu/libprocps.so.7
top /lib/x86_64-linux-gnu/libtinfo.so.6
top /lib/x86_64-linux-gnu/libc.so.6
[...]

See the online “bpftrace Reference Guide” for the full and updated list of functions [Iovisor 20c].

5. Map Functions

Maps are special hash table storage objects from BPF that can be used for different purposes—for example, as hash tables to store key/value pairs or for statistical summaries. bpftrace provides built-in functions for map assignment and manipulation, mostly for supporting statistical summary maps. The most important map functions are listed in Table 15.9.

Table 15.9 bpftrace selected map functions

Function                                             | Description
count()                                              | Counts occurrences
sum(int n)                                           | Sums the value
avg(int n)                                           | Averages the value
min(int n)                                           | Records the minimum value
max(int n)                                           | Records the maximum value
stats(int n)                                         | Returns the count, average, and total
hist(int n)                                          | Prints a power-of-two histogram of values
lhist(int n, const int min, const int max, int step) | Prints a linear histogram of values
delete(@m[key])                                      | Deletes the map key/value pair
print(@m [, top [, div]])                            | Prints the map, with optional limits and a divisor
clear(@m)                                            | Deletes all keys from the map
zero(@m)                                             | Sets all map values to zero

Some of these functions are asynchronous: The kernel queues the event, and a short time later it is processed in user space. The asynchronous actions are print(), clear(), and zero(). Bear this delay in mind when you are writing programs.
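
As a sketch of a summary function not demonstrated elsewhere in this chapter, the following uses stats() to record the count, average, and total of non-error vfs_read() return values:

bpftrace -e 'kretprobe:vfs_read /retval >= 0/ { @bytes = stats(retval); }'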

As another example of using a map function, the following uses lhist() to create a linear histogram of syscall read(2) file descriptors by process name, with a step size of one so that each file descriptor number can be seen independently:

# bpftrace -e 'tracepoint:syscalls:sys_enter_read {
    @fd[comm] = lhist(args->fd, 0, 100, 1); }'
Attaching 1 probe...
^C
[...]
@fd[sshd]:
[4, 5)                22 |                                                    |
[5, 6)                 0 |                                                    |
[6, 7)                 0 |                                                    |
[7, 8)                 0 |                                                    |
[8, 9)                 0 |                                                    |
[9, 10)                0 |                                                    |
[10, 11)               0 |                                                    |
[11, 12)               0 |                                                    |
[12, 13)            7760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

The output shows that on this system sshd processes were typically reading from file descriptor 12. The output uses set notation, where “[” means >= and “)” means < (aka a bounded left-closed, right-open interval).

See the online “bpftrace Reference Guide” for the full and updated list of map functions [Iovisor 20c].

15.2.6 Documentation

There is more bpftrace in prior chapters of this book: many of the resource chapters include bpftrace tools and one-liners, and there are also bpftrace examples in Chapter 4, Observability Tools, and Chapter 11, Cloud Computing.

In the bpftrace repository I have also created the following documentation:

  • The bpftrace One-Liner Tutorial: docs/tutorial_one_liners.md [Iovisor 20d]

  • bpftrace Reference Guide: docs/reference_guide.md [Iovisor 20c]

For much more on bpftrace, please refer to my earlier book BPF Performance Tools [Gregg 19] where Chapter 5, bpftrace, explores the programming language with many examples, and later chapters provide more bpftrace programs for the analysis of different targets.

Note that some bpftrace capabilities described in [Gregg 19] as “planned” have since been added to bpftrace and are included in this chapter. They are: while() loops, else-if statements, signal(), override(), and watchpoint events. Other additions that have been added to bpftrace are the kfunc probe type, buf(), and sizeof(). Check the release notes in the bpftrace repository for future additions, although not many more are planned: bpftrace already has enough capabilities for the 120+ published bpftrace tools.

15.3 References

[Aho 78] Aho, A. V., Kernighan, B. W., and Weinberger, P. J., “Awk: A Pattern Scanning and Processing Language (Second Edition),” Unix 7th Edition man pages, 1978. Online at http://plan9.bell-labs.com/7thEdMan/index.html.

[Aho 88] Aho, A. V., Kernighan, B. W., and Weinberger, P. J., The AWK Programming Language, Addison Wesley, 1988.

[Gregg 18e] Gregg, B., “YOW! 2018 Cloud Performance Root Cause Analysis at Netflix,” http://www.brendangregg.com/blog/2019-04-26/yow2018-cloud-performance-netflix.html, 2018.

[Gregg 19] Gregg, B., BPF Performance Tools: Linux System and Application Observability, Addison-Wesley, 2019.

[Gregg 19g] Gregg, B., “BPF Performance Tools (book): Tools,” http://www.brendangregg.com/bpf-performance-tools-book.html#tools, 2019.

[Iovisor 20a] “bpftrace: High-level Tracing Language for Linux eBPF,” https://github.com/iovisor/bpftrace, last updated 2020.

[Iovisor 20b] “BCC - Tools for BPF-based Linux IO Analysis, Networking, Monitoring, and More,” https://github.com/iovisor/bcc, last updated 2020.

[Iovisor 20c] “bpftrace Reference Guide,” https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md, last updated 2020.

[Iovisor 20d] Gregg, B., et al., “The bpftrace One-Liner Tutorial,” https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md, last updated 2020.
