This is an enhanced transcription of my C++Russia talk. This research was done during my work on a userspace JIT for eBPF (not open source yet).
Recap: What is eBPF?
From a user’s perspective, eBPF is a virtual machine inside the Linux kernel. Originally, the idea for eBPF came from the network subsystem, where it was designed for fast and safe packet filtering. Safety is the key part here – it is the main difference between eBPF’s VM approach and the raw binary approach of kernel modules. It is also much simpler to communicate with eBPF than with the kernel or kernel modules, because the “loader” process and the eBPF program can communicate via various types of maps (ring buffers, arrays, hash maps, etc.) instead of using sysfs, procfs, ioctls, and other filesystem-based mechanisms.
Today, eBPF is quite popular and allows you to:
- Write your own CPU scheduler via sched_ext.
- Unwind DWARF stacks in the kernel, as in Perforator.
- Write network stacks that fully bypass kernel routing, as in Cilium.
- Perform low-overhead syscall and activity monitoring, as in bpftrace, seccomp, or various AV software.
There are also plans to extend sched_ext’s approach to other subsystems, for example for NUMA/CXL migration policies.
Writing your first eBPF program
For our first program we will write a simple network filter, because it is easy to load and as a nod to eBPF’s origins. Let’s say an imaginary manager asks you to mitigate a DDoS attack by blocking all packets smaller than 100 bytes. In a first attempt you would probably write something like this:
```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

const int MIN_SIZE = 100;

SEC("xdp")
int xdp_ddos(struct xdp_md *ctx) {
    void *end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    if (end - data < MIN_SIZE) {
        return XDP_DROP;
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```
Any eBPF program has one argument, containing the context of its hook. For XDP filters, it consists of pointers to the packet (and some minor fields to store metadata), which is more than enough for filtering.
But what if your imaginary manager wants to adjust the threshold a little for another DDoS attack? You can’t ship source code plus compiler to each server and teach every engineer how to use your build system, not to mention build dependencies… We should make MIN_SIZE a load-time parameter, and there is a neat trick for this:
```c
const volatile int MIN_SIZE = 100;
```
Marking this variable as volatile keeps it as a single object inside the .rodata ELF section instead of inlining its value into the code. As far as I understand, this behavior is not specified by the C standard, but the trick has become so popular that any attempt to change the compilers’ behavior gets reverted.
With your code approved by the imaginary manager, it is now time to compile it into ELF. But for which architecture should we compile it?
eBPF Virtual ISA
The eBPF virtual CPU has a two-operand 64-bit RISC architecture with 10 general-purpose registers and one read-only stack pointer. Fortunately, this time RISC really means reduced instruction set, with the simplest instructions and encodings possible. You can read some semi-misleading documentation on the bytecode here, but documentation is not required for reading the bytecode in this post.
What is more interesting is the ABI of eBPF:
- Functions have no more than five arguments
- All arguments are passed via registers r1-r5
- r0 is the return register
- r6-r9 are callee-saved (r10 is the read-only stack pointer)
If you look a little closer, this is a simplified version of the x64 and ARM64 ABI (sometimes with stronger guarantees). This is a logical technical decision – we want the FFI between eBPF and C to be a no-op without any ABI conversions, at least on mainline architectures. We also want to easily verify the code, so the ISA should be very limited and primitive.
While all these reasons are valid, this introduces performance problems to eBPF as we will see later. The main reason for those problems is hiding the capabilities and nuances of the underlying CPU, and each architecture is represented poorly:
- ARM64 loses its three-operand instructions, as well as two thirds of its registers.
- x64 loses powerful complex instructions like the `base + index * scale` address encoding, which now requires 2 ALU eBPF instructions and pollutes one register for the computation.
Building and running your code
With that said, we can now compile our code with clang:
```shell
$ clang -O2 -target bpf -c xdp.c -o xdp.bpf.o
```
and view the resulting bytecode with
```shell
$ llvm-objdump -d xdp.bpf.o
```
Another option is to experiment with eBPF on godbolt.
For the DDoS filter above, clang will produce the following bytecode (comments added by me):
```
xdp_ddos:
    ; Load pkt to w2 (lower half of r2)
    w2 = *(u32 *)(r1 + 0)
    ; Load pkt_end to w1 (lower half of r1)
    w1 = *(u32 *)(r1 + 4)
    ; Arithmetic to get r1 = pkt size
    r1 -= r2
    ; Get MIN_SIZE address and load it
    ; (see libbpf section)
    r2 = MIN_SIZE ll
    w2 = *(u32 *)(r2 + 0)
    ; Sign extending int (MIN_SIZE) to i64 (ptr diff)
    r2 <<= 32
    r2 s>>= 32
    ; Conditional return
    w0 = 1
    if r1 s< r2 goto .LBB0_2
    w0 = 2
.LBB0_2:
    exit

MIN_SIZE:
    .long 100
```
As you can see, eBPF assembly language is not far from C and is easy to read without any preparation. You can extract eBPF bytecode for any currently loaded eBPF program with bpftool:
```shell
# Look for prog IDs and names
$ sudo bpftool prog list
$ sudo bpftool prog dump xlated id <ID>  # or name <NAME>
```
Let’s load our DDoS filter:
```shell
$ clang -O2 -target bpf -c xdp.c -o xdp.bpf.o
$ sudo bpftool prog load xdp.bpf.o /sys/fs/bpf/test_filter
# You can unload your program with `rm /sys/fs/bpf/test_filter`

# You can attach it to any network interface.
# CAUTION: It will really filter out any packets smaller than 100 bytes,
# and can be destructive for your SSH connection.
# Note: We are using xdpgeneric instead of xdp to avoid any unpleasant
# limitations of a particular driver's XDP path.
$ sudo bpftool net attach xdpgeneric pinned /sys/fs/bpf/test_filter dev eth0

# You can dump the bytecode of your filter
$ sudo bpftool prog dump xlated name xdp_ddos
```
If you read the returned assembly carefully enough, you will notice that bpftool returns slightly different bytecode than llvm-objdump. In particular, the MIN_SIZE address load translates into a pointer to an eBPF map:
```
# Instead of 'r2 = MIN_SIZE'
3: (18) r2 = map[id:3][0]+0
```
This transformation was done by the loading layer, in our case by bpftool and its underlying library libbpf. Libbpf is responsible for all load-time transformations, as well as parameter patching in the .rodata section. It can even codegen skeleton headers with a nice API for such parameters.
libbpf
The Linux kernel does not accept ELF binaries in the bpf() syscall, so any compiled ELF has to be parsed and converted into the expected format. libbpf (or any other alternative implementation) does this in three passes:
- Extract functions and group them into eBPF programs
A single ELF can contain several XDP filters and even mix different eBPF program types. Since the kernel loads one program per syscall, libbpf has to split the ELF into separate programs. It parses symbol tables to find top-level definitions, as well as the static functions used in each program, to copy them as well.
- Translate .rodata to an eBPF map
The kernel doesn’t have a .rodata section for eBPF programs, so libbpf converts it into a single-element eBPF map. This single element contains the whole section content, allowing libbpf to rewrite

```
r2 = <.rodata + 0x16>
```

with

```
r2 = &rodata_map[0] + 0x16
# or more accurately
r2 = &map[id:<id of rodata map>][0] + 0x16
```
This map expression is actually a single eBPF instruction (intrinsic). It will be converted back into an address assignment much later in the compilation pipeline, but for now libbpf only knows about abstract eBPF maps, not about their memory location or layout.
- Resource initialization
libbpf is responsible for creating and populating maps. It can also freeze them, telling the kernel that their content is immutable. This lets the verifier apply some optimizations in the kernel.
After all those transformations, the eBPF program is passed to the kernel, where it will be validated by the kernel’s verifier, transformed by some optimizations, and translated into native instructions. We will explore the kernel side in future posts.
Next part: We have a compiler at home