In the previous part we walked through the userspace part of loading an eBPF program and stopped right before the kernel.
What does the kernel do with eBPF?
The kernel’s task is to compile eBPF into native instructions. As with any compiler, there are 3 stages:
- Front-end: eBPF verifier. It parses the received eBPF code and checks its validity. You can think of it as the borrow checker in Rust or static checks in Clang.
eBPF provides the following guarantees:
- Bounded execution (no infinite loops)
- Memory safety (no read/writes from arbitrary/uninitialized memory)
- Type safety (no loads with mismatched size)
- Middle-end: optimizations and transformations on eBPF bytecode. Ideally, all machine-independent optimizations would happen here (that’s not the case in Linux). Currently, the kernel performs:
- Dead Code Elimination (DCE): Unreachable instructions are removed, or at least poisoned.
- Inlining: Selected small helper functions are inlined to avoid the cost of call/ret and improve code locality.
- Simplification for JIT: “Linking” kernel functions and adapting instructions to common CPU semantics
- Back-end: ahead-of-time translation of eBPF bytecode into native instructions for fast execution. This is an optional step, because eBPF can be interpreted, but any reasonable setup runs with translation enabled.
Note: Although the kernel documentation calls this step JIT, it is ahead-of-time compilation from the perspective of eBPF execution. There is no compilation after the program is loaded, and no dynamic switching between the interpreter and JIT. We will stick with “JIT” in this post for consistency.
Stage 1: Verifier
The verifier is the most important part of the eBPF ecosystem and the reason for its popularity. It is much easier to deploy someone’s extension if you are sure it will not break your system. But those safety guarantees impose a high bar on the verifier itself: it should be as robust as possible, without any bugs, and treat any unverifiable program as incorrect.
For this section we will write a filter that drops packets smaller than 36 bytes, or packets whose 16th byte isn’t screaming:
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_scream(struct xdp_md *ctx) {
    void *pkt = (void *)(long)ctx->data;
    void *pkt_end = (void *)(long)ctx->data_end;
    __u8 val = 0;

    if (pkt_end > pkt + 36) {
        val = *(__u8 *)(pkt + 16);
    }

    return val == 0xAA ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";
This program has one “possibly unsafe” memory read, the val = *(__u8 *)(pkt + 16) load, because there is no guarantee that the packet is bigger than 16 bytes. This instruction is actually safe, because it is hidden behind a packet size check, so any minimally usable verifier has to be able to reason about control flow.
To understand how this reasoning is performed, let’s step down to the bytecode level with this stripped example:
xdp_scream:
r2 = *(r1 + 4)
r1 = *(r1 + 0)
r3 = r1
r3 += 36
if r3 >= r2 jmp ... ; jumps to return(DROP)
r1 = *(r1 + 16)
; ... Scream check + return
The verifier “emulates” function execution while tracking the following state (sketched in code right after this list):
- register type: integer (8bit/16bit/…), pointer (to packet/map/arena, nullable or not)
- range of register’s value (max/min, both for signed and unsigned interpretations)
- stack map to preserve metadata on register spills
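As a rough mental model, the per-register state looks something like this. This is a deliberately simplified sketch, not the kernel’s actual struct bpf_reg_state / struct bpf_func_state, which track much more:

/* Simplified sketch of the verifier's per-register state (not kernel code). */
enum reg_kind {
    REG_UNINIT,            /* never written */
    REG_SCALAR,            /* plain integer */
    REG_PTR_TO_CTX,        /* pointer to the program's context struct */
    REG_PTR_TO_PACKET,     /* pointer into packet data */
    REG_PTR_TO_PACKET_END, /* pointer to the end of packet data */
    REG_PTR_TO_MAP_VALUE,  /* pointer into a map value */
};

struct reg_sketch {
    enum reg_kind kind;
    long long smin, smax;             /* value range, signed interpretation   */
    unsigned long long umin, umax;    /* value range, unsigned interpretation */
    int off;                          /* constant offset added to a pointer   */
};

struct state_sketch {
    struct reg_sketch regs[11];       /* r0..r10 */
    struct reg_sketch stack[512 / 8]; /* spill slots keep their metadata too  */
};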
At the start of any function we know the register state, because we know its signature:
xdp_scream: ; r1 = ptr_to_ctx r2..r10 = uninit
And because we know kernel types, we can understand which fields were read:
r2 = *(r1 + 4) ; r2 = pkt_end
r1 = *(r1 + 0) ; r1 = pkt
We can track arithmetic and pointer offsets:
r3 = r1 ; r3 = pkt
r3 += 36 ; r3 = pkt(offset = 36)
And on conditional branches we can deduce packet size (or range of register values in general):
; r2=pkt_end r3 = pkt(offset=36)
if r3 >= r2 jmp ...
r1 = *(r1 + 16) ; sizeof(packet) > 36
Therefore we can allow this memory read, because all execution paths leading to it check the packet size. And because of this “execution simulation” we can easily guarantee bounded execution time via a limit on simulated instructions.
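Here is a minimal sketch of that packet-size deduction, with made-up helper names (the kernel’s real logic lives in check_cond_jmp_op() and find_good_pkt_pointers() and handles far more cases): on the fall-through edge of if r3 >= r2, where r3 = pkt + 36 and r2 = pkt_end, the verifier records that at least the first 36 bytes of the packet are readable, and later loads are checked against that bound.

/* Sketch only: how a packet-pointer comparison widens the provably
 * readable range on the fall-through path. */
struct pkt_range { int readable_bytes; };

static void on_fallthrough_pkt_lt_end(struct pkt_range *pkt, int cmp_off)
{
    /* "pkt + cmp_off < pkt_end" is known to hold on this path */
    if (cmp_off > pkt->readable_bytes)
        pkt->readable_bytes = cmp_off;
}

static int pkt_load_allowed(const struct pkt_range *pkt, int off, int size)
{
    /* every accessed byte must fall inside the proven range */
    return off >= 0 && off + size <= pkt->readable_bytes;
}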
For more info on the verifier you can read the explanation comment in the source code. However, be aware that it sounds much smarter than it actually is. You should look at the source code instead, and understand that it’s no more than thousands of hardcoded checks.
Verifier: problem of fragility
Personally, I would never write a packet size check via pointer comparison like in the previous example. I would write it more directly:
if (pkt_end - pkt > 36) {
    val = *(__u8 *)(pkt + 16);
}

// or even with a separate variable:
size_t pkt_size = pkt_end - pkt;
if (pkt_size > 36) {
    val = *(__u8 *)(pkt + 16);
}
A hopeful developer would assume their program can still be loaded, because nothing has really changed. But the verifier will happily reject this code, because from its point of view:
- pkt_size is a random scalar with no connection to the packet
- It is indeed greater than 36 inside the if block
- Memory access is invalid because there was no packet size check
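For completeness, the shape the verifier does accept is the direct pointer-vs-pointer comparison from the first listing (same variables as above):

/* Verifier-friendly variant: the bound is expressed as a comparison between
 * two packet pointers, which the verifier recognizes as a size check. */
if (pkt + 36 < pkt_end) {
    val = *(__u8 *)(pkt + 16);
}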
So, the pkt_size version breaks the verifier because its packet size check deduction is based on a hardcoded “pointer vs pointer” comparison. And while this particular example results in a minor inconvenience, it illustrates a broad set of problems with the verifier. For example, a paper on Rex (a Rust-based eBPF-like extension framework) highlighted the following real-world problems:
- The verifier can forget some metadata if you reuse stack slots
- Control-flow joins can introduce metadata conflicts
- LLVM’s optimizations (loop invariants, for example) can be smarter than the verifier, so you have to lobotomize your compiler
My favorite example is the following code snippet from Cilium:
// Copy 4 bytes of dst addr into internal struct
addr->p1 = ctx->dst_ip6[0];
asm volatile("" ::: "memory");
addr->p2 = ctx->dst_ip6[1];
asm volatile("" ::: "memory");
addr->p3 = ctx->dst_ip6[2];
asm volatile("" ::: "memory");
addr->p4 = ctx->dst_ip6[3];
asm volatile("" ::: "memory");
They’re basically telling Clang to forget everything it knew about memory on each line, because otherwise it will merge these four 1-byte moves into one single 4-byte move. And the problem with a 4-byte move (or a 2-byte move) is that it doesn’t pass the verifier’s type check, forcing the use of compiler barriers which have global effects. Clang will then have to reload other variables from memory instead of caching them in registers, hurting performance, especially in loops.
But compiler barriers are not the only technique used to pass the verifier. Other commonly used methods are:
- volatile asm to write a non-optimized version of code in assembly.
- volatile reads to prevent caching or separation from usage.
- Copy-pasting code to prevent control-flow complications, sometimes by placing common code in an always_inline function.
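As an illustration of the last point, here is a hedged sketch (a made-up helper, not taken from any real project): because the helper is __always_inline, every call site carries its own private copy of the bounds check, and the control flow the verifier has to analyze stays trivially linear.

#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical helper: forced inline so the verifier never sees a shared
 * out-of-line function with merged control flow. */
static __always_inline int load_u16(void *data, void *data_end,
                                    int off, __u16 *out)
{
    if (data + off + sizeof(__u16) > data_end)
        return -1;
    *out = *(__u16 *)(data + off);
    return 0;
}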
Obviously, all those methods hurt performance. But maybe optimization passes in the kernel will help us?
Dead Code Elimination
The very first question that should pop into your head is: “Why does eBPF have dead code at all?”. Indeed, the bytecode was originally compiled from C/Rust by LLVM, which eliminated all redundant functions and checks. But there are two sources of dead code with a common motive: not all information is known at compile time, and because the kernel knows more than LLVM, it can eliminate more code.
The first source of such information is, of course, const volatile flags and arguments (see part 1). Their actual value is injected into the ELF binary during load time and is unknown from Clang’s point of view. The kernel, however, knows their values because .rodata is already patched and passed via an eBPF map, which is thoughtfully “frozen” by libbpf during resource initialization.
One such example from Cilium:
if (CONFIG(encryption_strict_ingress) && !ctx_is_decrypt(ctx)) {
    ret = DROP_UNENCRYPTED_TRAFFIC;
    goto out;
}
A funny note here: the CONFIG() macro is actually an inline assembly read. Cilium developers were afraid that otherwise LLVM would separate the memory load from the branch itself for performance reasons, confusing the verifier and preventing the optimization from being applied.
#define CONFIG(name)                                             \
    (*({                                                         \
        void *out;                                               \
        asm volatile("%0 = " __stringify(__config_##name) " ll"  \
                     : "=r"(out));                               \
        (typeof(__config_##name) *)out;                          \
    }))
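For comparison, the plain const volatile pattern from part 1 looks like this (a minimal sketch with a made-up flag name): libbpf patches the value into .rodata and freezes the map before the verifier runs, so the flag is a known constant at verification time.

#include <linux/types.h>

/* Hypothetical load-time option, patched by the loader before the program
 * is submitted to the kernel. */
const volatile __u32 enable_strict_mode = 0;

static int handle(void *ctx)
{
    if (enable_strict_mode) {
        /* this whole branch is dead when the flag is left at 0 */
        return -1;
    }
    return 0;
}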
Another source of new information is the kernel itself. While writing an eBPF extension you do not know on which kernel version it will be launched, so you might decide to write something like this to support multiple kernel versions:
if (bpf_ksym_exists(some_new_fancy_kernel_function)) {
    // Use some_new_fancy_kernel_function
} else {
    // slow fallback with lots of logic
    // partially reimplements kernel
}
Due to the active development of kernel interfaces around sched_ext, you can find many such fallbacks in its schedulers. Sometimes even as a ternary operator for each function call:
#define scx_bpf_now() \
(bpf_ksym_exists(scx_bpf_now) ? \
scx_bpf_now() : \
bpf_ktime_get_ns())
However, writing dead code isn’t zero cost and it leaves traces…
DCE: How it happens
The DCE algorithm is very similar to a mark-and-sweep garbage collector. The mark stage is done by the verifier, which labels every reachable instruction during its simulation. Let’s say we translated CONFIG_TRUE ? 100 : 200 into eBPF, where CONFIG_TRUE is a const volatile flag set to true:
r1 = &CONFIG_TRUE ; r1 = const ptr (into .rodata map)
r1 = *(r1 + 0) ; r1 = scalar(min=?? max=??)
if r1 != 0 jmp +2
r2 = 200
jmp +1
r2 = 100
Because the verifier knows that the .rodata map is frozen and will not change, it can read the config’s value directly instead of its usual speculation and understand that this condition is always true:
r1 = *(r1 + 0) ; r1 = scalar(min=1 max=1)
if r1 != 0 jmp +2 ; 0 is out of [1, 1]
The verifier will also skip all false branches without validation, allowing developers to call non-existing functions under bpf_ksym_exists checks and perform invalid operations under CONFIG() checks.
The sweep stage of DCE has two options:
- If the eBPF program came from an unprivileged process, code layout and control flow aren’t altered for “security” reasons (note: eBPF does not support untrusted code, by the maintainer’s own words link1 link2). However, dead code is replaced with trap instructions to prevent garbage execution.
- For other programs, several optimizations are applied:
- Conditional branches with a known result are simplified:
r1 = *(r1 + 0) ; r1 = scalar(min=1 max=1)
jmp +2
- All dead code is removed:
r1 = &CONFIG_TRUE
r1 = *(r1 + 0)
jmp +0
r2 = 100
- All no-op jumps are eliminated too:
r1 = &CONFIG_TRUE
r1 = *(r1 + 0)
r2 = 100
A funny note: because the r1 load was actually reached by the verifier, and the verifier has no analysis of whether this value is needed or not, it will be preserved in the final eBPF bytecode and JITted versions. bpf_ksym_exists will also leave traces of function address loads.
DCE: What could be better
DCE can eliminate quite a large part of the code: up to 80% in the sched_ext::scx_bpfland scheduler, for example. That 80% comes from:
- Simplified configuration (no NUMA, disabled advanced policies) that eliminated additional branches and functions
- A modern kernel, which does not need large fallback paths
The sad part is that the original compiler was fooled into thinking that all this code would be executed. And it probably:
- Allocated registers for deleted calculations, increasing register pressure even more
- Calculated inline budgets without knowing that a function will be 8x smaller, missing performance gains where possible
- Failed to apply some control flow optimizations because of more complex code and flow than actually existed
Therefore, while the kernel’s DCE is useful, it can’t fix all performance issues introduced by compatibility checks and const volatile options.
Inline
The second optimization implemented in the kernel’s pipeline is inlining. To understand what’s wrong with eBPF’s approach, we first need to recap why inlining is beneficial at all.
The direct benefit is eliminating the call/ret pair, but on modern processors the return instruction isn’t that expensive thanks to the return stack buffer. And duplicating your code can actually hurt performance due to less efficient instruction cache sharing.
However, inlining opens the door for further optimizations:
- Inlined functions aren’t limited by their ABI. You can avoid register moves and spills in prologues and epilogues, or even replace pointer-based passing with register-based.
- An inlined function can contain dead code from the caller’s perspective. Null checks, for example, are obsolete for most call sites (sketched after this list).
- Code of an inlined function can be shuffled independently, moving “producing” blocks closer to “consuming” blocks.
- Inlining reveals more about the function’s output parameters than just its type, enabling more optimizations.
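A quick illustration of the dead-code point; this is a generic C sketch, nothing eBPF-specific:

/* The NULL check is required for arbitrary callers, but once the function is
 * inlined into caller() the compiler can prove it is dead and drop it. */
static int get_len(const int *len_ptr)
{
    if (!len_ptr)
        return -1;
    return *len_ptr;
}

int caller(void)
{
    int len = 42;
    return get_len(&len);   /* after inlining this reduces to "return 42;" */
}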
Here’s an example of how control flow can get streamlined: a called function has three execution paths (normal and two different errors), each handled separately.
Before:
---- Err1 -- --- Handle(Err1) --
/ \ / \
Call --- Ok ------ ret --- switch ----- Handle(Ok) ------ ret
\ / \ /
---- Err2 --- --- Handle(Err2) --
After:
---- Err1 --- Handle(Err1) --
/ \
Call --- Ok ---- Handle(Ok) ------ ret
\ /
---- Err2 --- Handle(Err2) --
So inlining is not profitable just by itself, but very profitable if combined with other optimizations. With that in mind, let’s look at how the kernel currently handles inlining:
// Simplified version with some comments or checks removed
if (insn->imm == BPF_FUNC_get_branch_snapshot) {
    /* if (unlikely(flags)) return -EINVAL */
    insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
    /* Transform size (bytes) into number of entries (cnt = size / 24).
     *
     * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
     */
    insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
    insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
    insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
    /* call perf_snapshot_branch_stack implementation */
    insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
    /* if (entry_cnt == 0) return -ENOENT */
    insn_buf[5] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4);
    /* return entry_cnt * sizeof(struct perf_branch_entry) */
    insn_buf[6] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size);
    insn_buf[7] = BPF_JMP_A(3);
    /* return -EINVAL; */
    insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL);
    insn_buf[9] = BPF_JMP_A(1);
    /* return -ENOENT; */
    insn_buf[10] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT);
    cnt = 11;

    new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
    // ...
}
They’re simply replacing the call instruction with hand-written eBPF assembly… This has several major downsides:
- It runs after the verifier has already marked reachable instructions for DCE, so even redundant safety checks are not eliminated.
- It’s very error-prone and expensive to extend. Indeed, such inlining is only done for a tiny handful of functions that someone ran into performance problems with.
- It’s not even consistent: for x64, bpf_get_smp_processor_id gets inlined in the verifier, while on arm64 and everything else it happens in the JIT.
Therefore, the only benefit of the kernel’s inliner is the removal of a relatively cheap call/ret pair, which isn’t enough to fix the performance issues introduced by the bytecode and the verifier. Not to mention all the optimizations that could have happened: for example, not treating six out of ten registers as clobbered when the inlined bpf_get_smp_processor_id() call boils down to a single mov instruction.
Lowering
The lowering step combines multiple small transformations that simplify the JIT’s work.
- Linking kernel functions
In the “high-level” eBPF bytecode the verifier works with, all helper functions are referred to by their ids, which are used to resolve their signatures. But the JITs and the interpreter don’t care about a function’s id or signature; they only need its address to call it. Since this is a common requirement of all “backends”, the resolution happens before the JIT.
- Intrinsics lowering
eBPF’s ISA contains several “intrinsics”, like getting a map’s pointer by its id. This exists because the userspace compiler can’t know anything about the kernel’s map pointer and must refer to maps via their id. These intrinsics are replaced with actual pointers and values.
- Zero division elimination
Hardware and eBPF disagree on division by zero: x86, for example, raises an exception, while eBPF defines the result precisely (division by zero yields 0, and modulo by zero leaves the dividend). So any “unsafe” division has to be transformed into a guarded form:
// division
a / b -> (b == 0) ? 0 : a / b;
// modulo
a % b -> (b == 0) ? a : a % b;
A division can be considered safe if the verifier can prove b != 0, so divisions by a non-zero constant are not affected.
- Zero extending
Not all architectures automatically zero the upper 32 bits on 32-bit operations (a mov from u32 to u64, for example), so the kernel can insert additional zero-extension instructions into the bytecode to simplify the JIT’s work.
- JIT spraying protection
As already mentioned, eBPF doesn’t really support secure execution. Still, for an unknown reason you might want to protect yourself from JIT spraying attacks, where some bug lets execution jump into the middle of an attacker-controlled constant embedded in the JITed code and run it as code. To prevent this, eBPF randomizes all constants with XOR:
// Before
a = CONST          // <- attacker-controlled constant embedded in the code

// After (SALT is a random value chosen when the constants are blinded)
a = (CONST ^ SALT) // <- the embedded immediate is no longer controlled
a = a ^ SALT       // <- restores the original value of CONST
For any reasonable setup and eBPF program (without non-constant divisions), only linking and intrinsics lowering actually happen, without any performance regressions. Division, however, has other reasons to be avoided – see the next section.
Translation
As you might remember, eBPF’s ABI and instruction set were deliberately chosen to be as close to real hardware as possible. As a result, translation of eBPF bytecode is pretty straightforward:
- Each eBPF register is mapped to a real register with the same properties: caller/callee saved, n-th argument, etc.
- Each instruction is translated one-by-one, with each eBPF instruction typically requiring no more than 2 real instructions.
For example:
r6 += r3 => add %rdx, %rbx
r2 = *(r1 + 8) => mov 0x8(%rdi), %rsi
if r3 >= r2 jmp +5 => cmp %rsi, %rdx
jae <+5>
The upside of this approach is that it is straightforward and simple. The downside, as already mentioned, is that you can’t use any CPU feature that can’t be expressed in eBPF bytecode.
For example, instead of a single x64 instruction mov (%rsi,%rdi,8), %rax, which loads from %rsi + 8 * %rdi, we end up with:
r_tmp = %rdi
r_tmp *= 8
r_tmp += %rsi
load (r_tmp)
which is inefficient and pollutes one more register, increasing register pressure in eBPF bytecode even more.
Translation: When it breaks
While eBPF tries to be as close to a real CPU as possible, that’s not always the case. Sometimes eBPF diverges from the x64 subset it models for performance, ARM semantics, or logical reasons, significantly hurting JITed performance.
For example, let’s try to translate the simple r9 /= 100 instruction into x64 assembly. Ideally, we want the following translation:
; r9 -> r15 mapping
div 100, %r15
Sadly, x64 has fixed operands for the div instruction: the dividend lives in rdx:rax and the quotient is written to rax. So you have to move your values around a little:
xor %rdx, %rdx
mov %r15, %rax
div 100
mov %rax, %r15
But x64 can’t divide by a constant either! So you have to save 100 into some JIT temporary register to avoid conflicts with “mapped” registers:
mov 100, %r11
xor %rdx, %rdx
mov %r15, %rax
div %r11
mov %rax, %r15
You’ve finally managed to issue the correct instructions, but they clobber rax and rdx, which are already mapped to other live eBPF registers. You have to preserve them via push/pop pairs:
push %rax
push %rdx
mov 100, %r11
xor %rdx, %rdx
mov %r15, %rax
div %r11
mov %rax, %r15
pop %rdx
pop %rax
And with 9 instructions and 4 memory accesses, you’ve finally managed to emulate a single division, hooray! To be fair:
- This example is extreme and represents the worst possible translation
- eBPF instructions require 1 or 2 real instructions on average
- Translation is simple and error-proof
However:
- You are missing all the CPU features you paid for, even basic ones such as complex addressing
- You aren’t reallocating registers to avoid rare but huge penalties of mismatched semantics
- You aren’t applying minor optimizations, such as instruction scheduling for your particular CPU
Final thoughts on the current eBPF translation state
In this and the previous part we kinda reviewed the current compilation pipeline of eBPF extensions, encountering performance degrading nuances on each step:
- Userspace side: Limited compiler awareness with patchable flags and values instead of compile-time constants. volatile reads also can’t be cached in registers, resulting in more memory operations than actually needed.
- Verifier: The fragility of its checks forces users to write slower code and apply questionable hacks.
- DCE: Through promises of zero-cost checks, it encourages users to write more fallback paths and introduce more load-time options. But dead code isn’t actually zero cost: unneeded memory operations are left behind, and the original compiler was fooled, reducing its optimizations.
- Inline: Performance gains are minor, while it bloats code size without taking any of the usual advantages of inlining, leaving a lot of possible improvements on the table.
- Translation: Implemented in the simplest possible way, taking no advantage of the underlying CPU and occasionally resulting in bad emulations.
The performance loss at each of those steps isn’t that big, but accumulated together it can slow down your code by 2-5x relative to normal C->assembly compilation. So, in a desire to gain performance out of thin air with no radical changes, we will try to use LLVM from the kernel in the next part.