Chapter 3. eBPF Programs

In this chapter, let’s turn to what’s involved in writing eBPF code. We need to consider both the eBPF program itself, which runs in the kernel, and the user space code that interacts with it.

Kernel and User Space Code

First of all, what programming languages can you use to write eBPF programs?

The kernel accepts eBPF programs in bytecode form.1 It’s possible to write this bytecode by hand, in much the same way that it’s possible to write application code in assembly language—but it’s generally more practical for humans to use a higher-level language that can be compiled (that is, translated automatically) into bytecode.

eBPF programs can’t be written in arbitrary high-level languages for a couple of reasons. First, the language compiler needs to have support for emitting the eBPF bytecode format that the kernel expects. Second, many compiled languages have runtime features—for example, Go’s memory management and garbage collection—that make them unsuitable. At time of writing the only options for writing eBPF programs are C (compiled with clang/llvm) and, more recently, Rust. The vast majority of eBPF code published to date is in C, and this makes sense given that it’s the language of the Linux kernel.

At a minimum, something in user space needs to load the program into the kernel and attach it to the right event. There are utilities such as bpftool to help with this, but these are low-level tools that assume detailed knowledge of eBPF and are designed more for eBPF specialists than for the average user. In most eBPF-based tools, there is a user space application that takes care of loading the eBPF program into the kernel, passes in any configuration parameters, and displays information collected by the eBPF program in a user-friendly way.

The user space part of an eBPF tool can, at least in theory, be written in any language, though in practice there are libraries to support this in a fairly small set of languages: C, Go, Rust, and Python among them. This language choice is further complicated because not all languages have libraries that support libbpf, which has become a popular option for making eBPF programs portable across different versions of the kernel. (We’ll discuss libbpf in Chapter 4.)

Custom Programs Attached to Events

The eBPF program itself is typically written in C or Rust and compiled into an object file.2 This is a standard ELF (Executable and Linkable Format) file that can be inspected with tools like readelf, and it contains both the program bytecode and the definition of any maps (which we’ll discuss shortly). As shown in Figure 3-1, a user space program reads this file and loads it into the kernel, if allowed by the verifier that you met in the previous chapter.

Figure 3-1. A user space application uses the bpf() system call to load eBPF programs from an ELF file into the kernel

Once you have an eBPF program loaded into the kernel, it has to be attached to an event. Whenever the event happens, the associated eBPF program(s) are run. There’s a very wide range of events that you can attach programs to; I won’t cover them all, but the following are some of the more commonly used options.

Entry to/Exit from Functions

You can attach an eBPF program to be triggered whenever a kernel function is entered or exited. Many of today’s eBPF examples use the mechanism of kprobes (attached to a kernel function entry point) and kretprobes (function exit). In more recent kernel versions, there is a more efficient alternative called fentry/fexit.3

Note that you can’t guarantee that all functions defined in one kernel version will necessarily be available in future versions unless they are part of a stable API such as the syscall interface.

You can also attach eBPF programs to user space functions with uprobes and uretprobes.

Tracepoints

You can also attach eBPF programs to tracepoints4 defined within the kernel. Find the events on your machine by looking under /sys/kernel/debug/tracing/events.

Perf Events

Perf5 is a subsystem for collecting performance data. You can hook eBPF programs to all the places where perf data is collected, which can be determined by running perf list on your machine.

Linux Security Module Interface

The LSM interface allows for security policies to be checked before the kernel allows certain operations. You may have come across AppArmor or SELinux, both of which make use of this interface. With eBPF, you can attach custom programs to the same checkpoints, allowing for flexible, dynamic security policies and some new approaches to runtime security tooling.

Network Interfaces—eXpress Data Path

eXpress Data Path (XDP) allows attaching an eBPF program to a network interface, so that it is triggered whenever a packet is received. It can inspect or even modify the packet, and the program’s exit code can tell the kernel what to do with that packet: pass it on, drop it, or redirect it. This can form the basis of some very efficient networking functionality.6
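To make the idea of a verdict concrete, here is a plain user space sketch of the kind of decision such a program makes. It is not a loadable XDP program. The numeric values mirror the kernel's enum xdp_action, while the name demo_verdict and the policy itself (drop IPv6 frames and truncated frames, pass everything else) are hypothetical choices made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror the kernel's enum xdp_action. */
enum demo_xdp_action {
    DEMO_XDP_ABORTED  = 0,
    DEMO_XDP_DROP     = 1,
    DEMO_XDP_PASS     = 2,
    DEMO_XDP_TX       = 3,
    DEMO_XDP_REDIRECT = 4,
};

#define DEMO_ETH_HLEN       14      /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define DEMO_ETHERTYPE_IPV6 0x86DD

/* Hypothetical policy: drop IPv6 and truncated frames, pass the rest. */
enum demo_xdp_action demo_verdict(const uint8_t *data, size_t len)
{
    assert(data != NULL);

    /* Check bounds before touching the header; a real XDP program must do
     * the same or the verifier rejects it. */
    if (len < DEMO_ETH_HLEN)
        return DEMO_XDP_DROP;

    /* The EtherType is big-endian at byte offsets 12 and 13. */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype == DEMO_ETHERTYPE_IPV6)
        return DEMO_XDP_DROP;

    return DEMO_XDP_PASS;
}
```

In a real XDP program the same return value would be handed back to the kernel, which then acts on the packet accordingly.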

Sockets and Other Networking Hooks

You can attach eBPF programs to run when applications open or perform other operations on a network socket, as well as when messages are sent or received. There are also hooks called traffic control or tc within the kernel’s network stack where eBPF programs can run after initial packet processing.

Some features can be implemented with an eBPF program alone, but in many cases we want the eBPF code to receive information from, or pass data to, a user space application. The mechanism that allows data to pass between eBPF programs and user space, or between different eBPF programs, is called maps.

eBPF Maps

The development of maps is one of the significant differences that justify the e (for extended) in the eBPF acronym.

Maps are data structures that are defined alongside eBPF programs. There are a variety of different types of maps, but they are all essentially key–value stores. eBPF programs can read and write to them, as can user space code. Common uses for maps include:

  • An eBPF program writing metrics and other data about an event, for user space code to later retrieve

  • User space code writing configuration information, for an eBPF program to read and behave accordingly

  • An eBPF program writing data into a map, for later retrieval by another eBPF program, allowing the coordination of information across multiple kernel events

If both the kernel and user space code will access the same map, they will need a common understanding of the data structures stored in that map. This can be done by including header files that define those data structures in both the user space and kernel code, but if these aren’t written in the same language, the author(s) will need to carefully create structure definitions that are byte-for-byte compatible.
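One way to guard against layout drift is to pin the structure down with fixed-width types and compile-time checks. The following is a minimal sketch, assuming a hypothetical struct demo_event that both sides would include from a shared header; the field names and sizes are invented for illustration and are not taken from any real tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_COMM_LEN 16
#define DEMO_NAME_LEN 64

/* Hypothetical record shared between kernel-side and user space code.
 * Fixed-width types keep the layout independent of the compiler's idea
 * of int and long. */
struct demo_event {
    uint64_t ts_ns;                 /* offset 0  */
    uint32_t pid;                   /* offset 8  */
    int32_t  ret;                   /* offset 12 */
    char     comm[DEMO_COMM_LEN];   /* offset 16 */
    char     fname[DEMO_NAME_LEN];  /* offset 32 */
};

/* Fail the build, rather than misread data at runtime, if the layout changes. */
static_assert(sizeof(struct demo_event) == 96, "unexpected size or padding");
static_assert(offsetof(struct demo_event, pid) == 8, "pid moved");
static_assert(offsetof(struct demo_event, comm) == 16, "comm moved");
static_assert(offsetof(struct demo_event, fname) == 32, "fname moved");

/* Number of bytes a reader on the other side should expect per record. */
size_t demo_event_wire_size(void)
{
    return sizeof(struct demo_event);
}
```

If the user space side is written in another language, the same offsets and total size are the values its structure definition would need to reproduce.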

We’ve discussed the main constituents of an eBPF tool: eBPF programs that run in the kernel, user space code to load and interact with those programs, and maps that allow programs to share data. To make things concrete, let’s look at an example.

Opensnoop Example

For this example of an eBPF program, I’ve chosen opensnoop, a utility that shows you what files any process opens. The first version of this utility was one of many BPF tools that Brendan Gregg wrote for the BCC project, which you can find on GitHub. It was later rewritten for libbpf (which you’ll meet in the next chapter), and in this example I’m using the newer version under the libbpf-tools directory.

When you run opensnoop, the output you’ll see depends a lot on what’s happening on the virtual machine at the time, but it should look something like this:

PID    COMM         FD ERR PATH
93965  cat           3   0 /etc/ld.so.cache
93965  cat           3   0 /lib/x86_64-linux-gnu/libc.so.6
93965  cat           3   0 /usr/lib/locale/locale-archive
93965  cat           3   0 /usr/share/locale/locale.alias
...

Each line of output indicates that a process opened (or attempted to open) a file. The columns show the process ID, the command being run, the file descriptor, an indication of any error code, and the path of the file being opened.

Opensnoop works by attaching eBPF programs to the open() and openat() system calls that any application has to make to ask the kernel to open a file. Let’s dig in to see how this is implemented. For brevity, we won’t look at every line of the code, but I hope it’s sufficient to give you an idea of how it works. (Feel free to skip to the next chapter if you’re not interested in diving this deep!)

Opensnoop eBPF Code

The eBPF code is written in C, in the file opensnoop.bpf.c. Near the beginning of this file you can see the definitions of two eBPF maps—start and events:

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, struct args_t);
} start SEC(".maps");
struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(u32));
    __uint(value_size, sizeof(u32));
} events SEC(".maps");

When the ELF object file is created, it contains a section for each map and each program to be loaded into the kernel, and the SEC() macro defines these sections.

As you’ll see when we look into the program, the start map is used to temporarily store the arguments to the syscall—including the name of the file being opened—while the syscall is being processed. The events map7 is used for passing event information from the eBPF code in the kernel to the user space executable. This is illustrated in Figure 3-2.

Figure 3-2. Calling open() triggers eBPF programs that store data in opensnoop’s eBPF maps
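The role of the start map can be pictured with a plain user space sketch of the pattern: stash the arguments when the call begins, then look them up by the same key when it ends. Everything here is a hypothetical stand-in (a small fixed-size table instead of a BPF hash map, and demo_ names throughout), so read it as an illustration of the idea rather than as the opensnoop code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ENTRIES 8

/* Hypothetical stand-in for the value type stored in the start map. */
struct demo_args {
    const char *fname;
    int flags;
};

struct demo_slot {
    int used;
    uint32_t key;
    struct demo_args value;
};

/* A tiny fixed-size table standing in for a BPF hash map. */
static struct demo_slot demo_start[DEMO_MAX_ENTRIES];

/* Insert or overwrite the entry for key; returns 0, or -1 if the table is full. */
int demo_update(uint32_t key, const struct demo_args *value)
{
    assert(value != NULL);
    struct demo_slot *free_slot = NULL;

    for (size_t i = 0; i < DEMO_MAX_ENTRIES; i++) {
        if (demo_start[i].used && demo_start[i].key == key) {
            demo_start[i].value = *value;
            return 0;
        }
        if (!demo_start[i].used && !free_slot)
            free_slot = &demo_start[i];
    }
    if (!free_slot)
        return -1;
    free_slot->used = 1;
    free_slot->key = key;
    free_slot->value = *value;
    return 0;
}

/* Returns a pointer to the stored value, or NULL if key is absent. */
struct demo_args *demo_lookup(uint32_t key)
{
    for (size_t i = 0; i < DEMO_MAX_ENTRIES; i++)
        if (demo_start[i].used && demo_start[i].key == key)
            return &demo_start[i].value;
    return NULL;
}

/* Removes the entry for key, if present. */
void demo_delete(uint32_t key)
{
    for (size_t i = 0; i < DEMO_MAX_ENTRIES; i++)
        if (demo_start[i].used && demo_start[i].key == key)
            demo_start[i].used = 0;
}

/* Entry handler: remember the arguments under the caller's ID. */
void demo_on_enter(uint32_t pid, const char *fname, int flags)
{
    struct demo_args args = { .fname = fname, .flags = flags };
    demo_update(pid, &args);
}

/* Exit handler: returns 1 and fills *out if a matching entry existed, else 0.
 * This sketch removes the entry once it has been consumed. */
int demo_on_exit(uint32_t pid, struct demo_args *out)
{
    assert(out != NULL);
    struct demo_args *ap = demo_lookup(pid);

    if (!ap)
        return 0;   /* the entry event was missed or filtered out */
    *out = *ap;
    demo_delete(pid);
    return 1;
}
```

The NULL check in the exit handler matters in the real setting too: a map lookup returns a pointer that may be NULL, and the verifier expects the program to test it before dereferencing.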

Later in the opensnoop.bpf.c file, you’ll find two extremely similar functions:

SEC("tracepoint/syscalls/sys_enter_open")
int tracepoint__syscalls__sys_enter_open(struct trace_event_raw_sys_enter* ctx)

and

SEC("tracepoint/syscalls/sys_enter_openat")
int tracepoint__syscalls__sys_enter_openat(struct trace_event_raw_sys_enter* ctx)

There are two different system calls for opening files:8 openat() and open(). They are identical except that openat() has an extra argument for a directory file descriptor, and the path name for the file to be opened is taken relative to that directory. Likewise, the two functions in opensnoop are identical except for handling this difference in the arguments.

As you can see, they both take a parameter that is a pointer to a structure called trace_event_raw_sys_enter. You’d find the definition for this structure in the vmlinux header file generated for the particular kernel you’re running on. The art of writing eBPF programs includes working out what structure each program receives as its context, and how to access the information within it.
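For syscall entry tracepoints, the context carries the raw syscall arguments as an array, so the difference between open() and openat() comes down to which array slots hold the path and the flags. The sketch below uses a simplified, hypothetical stand-in for the context structure (the real trace_event_raw_sys_enter also has common tracepoint fields in front), and the demo_ function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the syscall-entry context. */
struct demo_sys_enter {
    long id;                /* syscall number */
    unsigned long args[6];  /* raw syscall arguments */
};

struct demo_open_args {
    const char *fname;
    int flags;
};

/* open(pathname, flags, mode): pathname is argument 0, flags is argument 1. */
struct demo_open_args demo_args_from_open(const struct demo_sys_enter *ctx)
{
    assert(ctx != NULL);
    struct demo_open_args a = {
        .fname = (const char *)ctx->args[0],
        .flags = (int)ctx->args[1],
    };
    return a;
}

/* openat(dirfd, pathname, flags, mode): the leading dirfd shifts the other
 * arguments along by one slot. */
struct demo_open_args demo_args_from_openat(const struct demo_sys_enter *ctx)
{
    assert(ctx != NULL);
    struct demo_open_args a = {
        .fname = (const char *)ctx->args[1],
        .flags = (int)ctx->args[2],
    };
    return a;
}
```

Note that fname here is only a pointer value copied out of the argument array. In the kernel that pointer refers to user space memory, which is why the real program reads the string later through a helper instead of dereferencing it.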

These two functions use a BPF helper function to retrieve the ID of the process that’s calling this syscall:

u64 id = bpf_get_current_pid_tgid();
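The 64-bit value returned by this helper packs two IDs: the upper 32 bits hold the tgid (what user space usually calls the process ID) and the lower 32 bits hold the pid (what user space calls the thread ID). The sketch below shows the split in plain C; the demo_ function names are hypothetical. The pid variable that appears as the map key in the next excerpt is presumably the lower half obtained this way, but treat that as an assumption and check it against the source file.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits: tgid, the process ID as seen from user space. */
uint32_t demo_tgid_of(uint64_t id)
{
    return (uint32_t)(id >> 32);
}

/* Lower 32 bits: pid in kernel terms, the thread ID as seen from user space. */
uint32_t demo_pid_of(uint64_t id)
{
    return (uint32_t)id;
}

/* Inverse operation, useful only for building test values in user space. */
uint64_t demo_pack(uint32_t tgid, uint32_t pid)
{
    uint64_t id = ((uint64_t)tgid << 32) | pid;

    assert(demo_tgid_of(id) == tgid && demo_pid_of(id) == pid);
    return id;
}
```

Keying a map by the lower half means each thread gets its own entry, which keeps concurrent calls from different threads of one process from overwriting each other.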

The code gets the filename and any flags that were passed to the syscall, and puts them in a structure called args:

args.fname = (const char *)ctx->args[0];
args.flags = (int)ctx->args[1];

This structure is written into the start map using the current process ID as the key:

bpf_map_update_elem(&start, &pid, &args, 0);

And that’s all that the eBPF programs do on entry to the syscall. But there’s another pair of eBPF programs defined in opensnoop.bpf.c that get triggered when the syscalls exit:

SEC("tracepoint/syscalls/sys_exit_open")
int tracepoint__syscalls__sys_exit_open

This program and its openat() twin share common code in the function trace_exit(). Have you noticed that all the functions called by eBPF programs are prefixed by static __always_inline? That forces the compiler to put the instructions for these functions inline, because in older kernels a BPF program is not allowed to jump to a separate function. Newer kernels and versions of LLVM can support noninlined function calls, but this is a safe way to ensure the BPF verifier stays happy. (Nowadays there is also the concept of a BPF tail call, where execution jumps from one BPF program to another. You can read more about BPF function calls and tail calls in the eBPF documentation.)

The trace_exit() function creates an empty event structure:

struct event event = {};

This will get populated with information about the open/openat syscall that’s coming to a conclusion and sent to user space via the events map.

There should be an entry in the start hash map that corresponds to the current process ID:

ap = bpf_map_lookup_elem(&start, &pid);

This has the information about the filename and flags that was written earlier during the sys_enter_open(at) call. The flags field is an integer stored directly in the structure, so it’s OK to read it directly from the structure:

event.flags = ap->flags;

In contrast, the filename is written into some number of bytes in user space memory, and the verifier needs to be sure that it’s safe for this eBPF program to read that number of bytes from that location in memory. This is done using another helper function, bpf_probe_read_user_str():

bpf_probe_read_user_str(&event.fname, sizeof(event.fname), 
                    ap->fname);
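The copy semantics of this helper can be modeled in user space: at most size bytes are written, the destination is always NUL-terminated, and the return value is the length written including the terminating NUL. The function below, with the hypothetical name demo_read_str, is only a model of that behavior. The real helper can also return a negative error when the source address is not readable, which this sketch has no way to reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* User space model of a bounded, always-terminated string copy.
 * Returns the number of bytes written, including the trailing NUL.
 * Returning 0 for a zero-sized destination is this model's own choice. */
long demo_read_str(char *dst, size_t size, const char *src)
{
    assert(dst != NULL && src != NULL);

    if (size == 0)
        return 0;

    size_t i = 0;
    /* Leave room for the terminator: copy at most size - 1 characters. */
    while (i < size - 1 && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    return (long)(i + 1);
}
```

A long path is therefore truncated to fit the fixed-size field in the event structure instead of overrunning it.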

The current command name (that is, the name of the executable that made the open(at) syscall) is also copied into the event structure, using another BPF helper function:

bpf_get_current_comm(&event.comm, sizeof(event.comm));

The event structure gets written into the events perf buffer map:

bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                    &event, sizeof(event));

The user space code reads event information out of this map. Before we get to that, let’s look briefly at the Makefile.

libbpf-tools Makefile

When you build eBPF code, you get an object file containing the binary definitions of the eBPF programs and maps. You also need an additional user space executable that will load those programs and maps into the kernel, and act as the interface for the user.9 Let’s look at the Makefile that builds opensnoop to see how it creates both the eBPF object file and the executable.

Makefiles comprise a set of rules, and the syntax for these can be a bit opaque, so if you’re not familiar with Makefiles and don’t particularly care about the details, please do feel free to skip over this section!

The opensnoop example that we’re looking at is one of a large set of example tools that are all built using one Makefile that you’ll find in the libbpf-tools directory. Not everything in this file is particularly of interest, but there are a few rules I’d like to highlight. The first is a rule that takes a bpf.c file and uses the clang compiler to create a BPF target object file:

$(OUTPUT)/%.bpf.o: %.bpf.c $(LIBBPF_OBJ) $(wildcard %.h) $(AR..
    $(call msg,BPF,$@)
    $(Q)$(CLANG) $(CFLAGS) -target bpf -D__TARGET_ARCH_$(ARCH) \
          -I$(ARCH)/ $(INCLUDES) -c $(filter %.c,$^) -o $@ && \
    $(LLVM_STRIP) -g $@

So, opensnoop.bpf.c gets compiled into $(OUTPUT)/opensnoop.bpf.o. This object file contains the eBPF programs and maps that will get loaded into the kernel.

Another rule uses bpftool gen skeleton to create a skeleton header file from the map and program definitions contained in that bpf.o object file:

$(OUTPUT)/%.skel.h: $(OUTPUT)/%.bpf.o | $(OUTPUT)
   $(call msg,GEN-SKEL,$@)
   $(Q)$(BPFTOOL) gen skeleton $< > $@

The opensnoop.c user space code includes this opensnoop.skel.h header file to get the definitions of the maps that it shares with the eBPF programs in the kernel. This allows the user space and kernel code to know about the layout of the data structures that get stored in these maps.

The following rule compiles the user space code from opensnoop.c into a binary object called $(OUTPUT)/opensnoop.o:

$(OUTPUT)/%.o: %.c $(wildcard %.h) $(LIBBPF_OBJ) | $(OUTPUT)
   $(call msg,CC,$@)
   $(Q)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@

Finally, there is a rule that uses cc to link the user space application objects (in our case, opensnoop.o) into a set of executables:

$(APPS): %: $(OUTPUT)/%.o $(LIBBPF_OBJ) $(COMMON_OBJ) | $(OUT...
   $(call msg,BINARY,$@)
   $(Q)$(CC) $(CFLAGS) $^ $(LDFLAGS) -lelf -lz -o $@

Now that you have seen how the eBPF and user space programs are generated separately, let’s look at the user space code.

Opensnoop User Space Code

As I mentioned, the user space code that interacts with eBPF code could be written in pretty much any programming language. The example that we’ll discuss in this section is written in C, but if you’re interested, you could compare it with the original BCC version written in Python, which you’ll find in bcc/tools.

The user space code is in the opensnoop.c file. The first half of the file has #include directives (one of them being the autogenerated opensnoop.skel.h file), various definitions, and the code to handle different command line options, which we won’t dwell on here. Let’s also gloss over functions like print_event() which writes the information about an event to the screen. From an eBPF perspective, all the interesting code is in the main() function.

You will see functions like opensnoop_bpf__open(), opensnoop_bpf__load(), and opensnoop_bpf__attach(). These are all defined in the autogenerated code created by bpftool gen skeleton.10 This autogenerated code handles all the individual eBPF programs, maps, and attachment points defined in the eBPF object file.

Once opensnoop is up and running, its job is to listen on the events perf buffer and write the information contained in each event to the screen. First, it opens the file descriptor associated with the perf buffer and sets handle_event() as the function to be called when a new event arrives:

pb = perf_buffer__new(bpf_map__fd(obj->maps.events), 
    PERF_BUFFER_PAGES, handle_event, handle_lost_events, 
    NULL, NULL);

Then it polls on buffer events until either a time limit is reached, or the user interrupts the program:

while (!exiting) {
    err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS);
    ...
}
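The shape of this loop, a flag set by a signal handler and checked between bounded waits, can be sketched without libbpf. In the code below, demo_poll_fn is a hypothetical stand-in for one poll call with a timeout, and demo_fake_poll simulates the user pressing Ctrl-C during its third call. None of these names come from opensnoop.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Written by the signal handler, read by the loop. */
static volatile sig_atomic_t demo_exiting = 0;

/* The handler only sets a flag, which is safe to do from a signal context. */
static void demo_on_sigint(int signo)
{
    (void)signo;
    demo_exiting = 1;
}

/* Hypothetical stand-in for one bounded wait on the buffer. Returns a
 * negative value on error, otherwise zero or more. */
typedef int (*demo_poll_fn)(int timeout_ms);

/* Polls until the flag is set or polling fails; returns iterations completed. */
int demo_run_loop(demo_poll_fn poll_once, int timeout_ms)
{
    assert(poll_once != NULL);
    int iterations = 0;

    signal(SIGINT, demo_on_sigint);
    while (!demo_exiting) {
        int err = poll_once(timeout_ms);
        if (err < 0)
            break;
        iterations++;
    }
    return iterations;
}

/* Simulated poll: pretends the user interrupted the program on the third call. */
int demo_fake_poll(int timeout_ms)
{
    static int calls = 0;

    (void)timeout_ms;
    if (++calls == 3)
        raise(SIGINT);
    return 0;
}
```

Because each wait is bounded by a timeout, the loop gets a chance to notice the flag even when no events arrive.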

The data parameter passed to handle_event() points to the event structure that the eBPF program wrote into the map for this event. The user space code can retrieve this information, format it and write it out for the user to see.
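As a rough sketch of what such a callback does with its data pointer, the code below casts the raw bytes to an event structure and formats the columns shown in the earlier output. The struct demo_open_event layout is a trimmed-down, hypothetical one, and the demo_ functions are invented for illustration. The mapping from the syscall return value to the FD and ERR columns (a non-negative value is a file descriptor, a negative value is a negated error code) is assumed here to match what the tool prints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, trimmed-down event layout written by the kernel side. */
struct demo_open_event {
    int  pid;
    int  ret;        /* fd on success, negated error code on failure */
    char comm[16];
    char fname[64];
};

/* Splits a syscall return value into the FD and ERR columns. */
void demo_fd_err(int ret, int *fd, int *err)
{
    assert(fd != NULL && err != NULL);
    if (ret >= 0) {
        *fd = ret;
        *err = 0;
    } else {
        *fd = -1;
        *err = -ret;
    }
}

/* Formats one event as "PID COMM FD ERR PATH" into out. Returns the
 * snprintf result, or -1 if the record is too short to be an event. */
int demo_format_event(const void *data, size_t data_sz, char *out, size_t out_sz)
{
    assert(data != NULL && out != NULL);

    if (data_sz < sizeof(struct demo_open_event))
        return -1;

    const struct demo_open_event *e = data;
    int fd, err;

    demo_fd_err(e->ret, &fd, &err);
    return snprintf(out, out_sz, "%d %s %d %d %s",
                    e->pid, e->comm, fd, err, e->fname);
}
```

Checking data_sz before the cast is a cheap safeguard against the two sides disagreeing about the structure's size.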

As you’ve seen, opensnoop registers eBPF programs that are called every time any application calls the open() or openat() system call. These eBPF programs running in the kernel collect information about the context of that system call—the executable name and process ID—and about the file being opened. This information is written into a map, from which user space can read it and display it to the user.

You’ll find dozens more examples of eBPF tools like this in the libbpf-tools directory, each of which typically instruments one syscall, or a family of related syscalls like open() and openat().

System calls are a stable kernel interface, and they offer a very powerful way to observe what’s happening on a (virtual) machine. But don’t be fooled into thinking that eBPF programming begins and ends at intercepting system calls. There are plenty of other stable interfaces, including LSM and various points in the networking stack, to which eBPF can be attached. If you’re willing to risk or work around changes between kernel versions, the range of places where you can attach eBPF programs is absolutely vast.

1 See the BPF instruction set documentation.

2 It’s also possible to skip the object file and load bytecode directly into the kernel using the bpf() system call.

3 fentry/fexit is described in an article by Alexei Starovoitov: “Introduce BPF Trampoline” (LWN.net, November 14, 2019).

4 Oracle Linux Blog, “Taming Tracepoints in the Linux Kernel,” by Matt Keenan, posted March 9, 2020.

5 Brendan Gregg’s site is a good source of information about perf events.

6 If you’re interested in seeing a concrete example of this, you might like to watch my talk at eBPF Summit 2021 where I implement a very basic load balancer in a few minutes, as an illustration of how we can use eBPF to change the way the kernel handles network packets.

7 At the time of writing, this code uses a perf buffer for the events map. If you were writing this code today for recent kernels, you would get better performance from a ring buffer, which is a newer alternative.

8 In some kernels you’ll also find openat2(), but this isn’t handled in this version of opensnoop, at least at time of writing.

9 You could use a general-purpose tool like bpftool, which can read BPF object files and perform operations on them, but that requires the user to know details about what to load and what events to attach programs to. For most applications, it makes sense to write a specific tool that simplifies this for the end user.

10 See Andrii Nakryiko’s post describing BPF skeleton code generation.
