This is an enhanced transcription of my C++Russia talk. This research was done during my work on a userspace JIT for eBPF (not open source yet).
Recap: What is eBPF?
From a user’s perspective, eBPF is a virtual machine inside the Linux kernel. Originally, the idea for eBPF came from the network subsystem, where it was designed for fast and safe packet filtering. Safety is the key part here – it is the main difference between eBPF’s VM approach and the raw binary approach of kernel modules. It is also much simpler to communicate with eBPF than with the kernel or kernel modules, because the “loader” process and the eBPF program can communicate via various types of maps (ring buffers, arrays, hash maps, etc.) instead of using sysfs, procfs, ioctls, and other filesystem-based mechanisms.
Today, eBPF is quite popular and allows you to:
- Write your own CPU scheduler via sched_ext.
- Unwind DWARF stacks in the kernel, as in Perforator.
- Write network stacks that fully bypass kernel routing, as in Cilium.
- Perform low-overhead syscall and activity monitoring, as in bpftrace, seccomp, or various AV software.
There are also plans to extend sched_ext’s approach to other subsystems, for example for NUMA/CXL migration policies.
Writing your first eBPF program
For our first program we will write a simple network filter, because it is easy to load and as a nod to eBPF’s origins. Let’s say an imaginary manager asks you to mitigate a DDoS attack by blocking all packets smaller than 100 bytes. In a first attempt you would probably write something like this:
```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

const int MIN_SIZE = 100;

SEC("xdp")
int xdp_ddos(struct xdp_md *ctx) {
    void *end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    if (end - data < MIN_SIZE) {
        return XDP_DROP;
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```
Any eBPF program has one argument, containing the context of its hook. For XDP filters, it consists of pointers to the packet (and some minor fields to store metadata), which is more than enough for filtering.
But what if your imaginary manager wants to adjust the threshold a little for another DDoS attack? You can’t ship source code plus compiler to each server and teach every engineer how to use your build system, not to mention build dependencies… We should make MIN_SIZE a load-time parameter, and there is a neat trick for this:
```c
const volatile int MIN_SIZE = 100;
```
Marking this variable as volatile keeps it as a single object inside the .rodata ELF section instead of inlining its value into the code. As far as I understand, this behavior is not specified by the C standard, but the trick has become so popular that any attempt to change the compilers’ behavior gets reverted.
With your code approved by the imaginary manager, it is now time to compile it into ELF. But for which architecture should we compile it?
eBPF Virtual ISA
The eBPF virtual CPU has a two-operand 64-bit RISC architecture with 10 general-purpose registers and one read-only stack pointer. Fortunately, this time RISC really means reduced instruction set, with the simplest instructions and encodings possible. You can read some semi-misleading documentation on the bytecode here, but documentation is not required for reading the bytecode in this post.
What is more interesting is the ABI of eBPF:
- Functions have no more than five arguments
- All arguments are passed via registers r1-r5
- r0 is the return register
- r6-r9 are callee-saved (r10 is the read-only stack pointer)
If you look a little closer, this is a simplified version of the x64 and ARM64 ABI (sometimes with stronger guarantees). This is a logical technical decision – we want the FFI between eBPF and C to be a no-op without any ABI conversions, at least on mainline architectures. We also want to easily verify the code, so the ISA should be very limited and primitive.
While all these reasons are valid, this introduces performance problems to eBPF as we will see later. The main reason for those problems is hiding the capabilities and nuances of the underlying CPU, and each architecture is represented poorly:
- ARM64 loses its three-operand instructions, as well as two thirds of its registers.
- x64 loses powerful complex instructions like the `base + index * scale` address encoding, which now requires 2 ALU eBPF instructions and pollutes one register for the computation.
Building and running your code
With that said, we can now compile our code with clang:
```shell
$ clang -O2 -target bpf -c xdp.c -o xdp.bpf.o
```
and view the resulting bytecode with
```shell
$ llvm-objdump -d xdp.bpf.o
```
Another option is to experiment with eBPF on godbolt.
For the DDoS filter above, clang will produce the following bytecode (comments added by me):
```
xdp_ddos:
    ; Load pkt to w2 (lower half of r2)
    w2 = *(u32 *)(r1 + 0)
    ; Load pkt_end to w1 (lower half of r1)
    w1 = *(u32 *)(r1 + 4)
    ; Arithmetic to get r1 = pkt size
    r1 -= r2
    ; Get MIN_SIZE address and load it
    ; (see libbpf section)
    r2 = MIN_SIZE ll
    w2 = *(u32 *)(r2 + 0)
    ; Sign extending int (MIN_SIZE) to i64 (ptr diff)
    r2 <<= 32
    r2 s>>= 32
    ; Conditional return
    w0 = 1
    if r1 s< r2 goto .LBB0_2
    w0 = 2
.LBB0_2:
    exit

MIN_SIZE:
    .long 100
```
As you can see, eBPF assembly language is not far from C and is easy to read without any preparation. You can extract eBPF bytecode for any currently loaded eBPF program with bpftool:
```shell
# Look for prog IDs and names
$ sudo bpftool prog list
$ sudo bpftool prog dump xlated id <ID>  # or name <NAME>
```
Let’s load our DDoS filter:
```shell
$ clang -O2 -target bpf -c xdp.c -o xdp.bpf.o
$ sudo bpftool prog load xdp.bpf.o /sys/fs/bpf/test_filter
# You can unload your program with `rm /sys/fs/bpf/test_filter`

# You can attach it to any network interface.
# CAUTION: It will really filter out any packets smaller than 100 bytes,
# and can be destructive for your SSH connection.
# Note: We are using xdpgeneric instead of xdp to avoid any unpleasant
# limitations of a particular driver's XDP path.
$ sudo bpftool net attach xdpgeneric pinned /sys/fs/bpf/test_filter dev eth0

# You can dump the bytecode of your filter
$ sudo bpftool prog dump xlated name xdp_ddos
```
If you read the returned assembly carefully enough, you will notice that bpftool returns slightly different bytecode than llvm-objdump. In particular, the MIN_SIZE address load translates into a pointer to an eBPF map:
```
# Instead of 'r2 = MIN_SIZE'
3: (18) r2 = map[id:3][0]+0
```
This transformation was done by the loading layer, in our case by bpftool and its underlying library libbpf. Libbpf is responsible for all load-time transformations, as well as parameter patching in the .rodata section. It can even codegen skeleton headers with a nice API for such parameters.
libbpf
The Linux kernel does not accept ELF binaries in the bpf() syscall, so any compiled ELF has to be parsed and converted into the expected format. libbpf (or any other alternative implementation) does this in three passes:
- Extract functions and group them into eBPF programs
A single ELF can contain several XDP filters and even mix different eBPF program types. Since the kernel loads one program per syscall, libbpf has to split the ELF into separate programs. It parses symbol tables to find top-level definitions, as well as the static functions used in each program, to copy them as well.
- Translate .rodata to an eBPF map
The kernel doesn’t have a .rodata section for eBPF programs, so libbpf converts it into a single-element eBPF map. This single element contains the whole section content, allowing libbpf to rewrite

```
r2 = <.rodata + 0x16>
```

with

```
r2 = &rodata_map[0] + 0x16
# or more accurately
r2 = &map[id:<id of rodata map>][0] + 0x16
```
This map expression is actually a single eBPF instruction (intrinsic). It will be converted back into an address assignment much later in the compilation pipeline, but for now libbpf only knows about abstract eBPF maps, not about their memory location or layout.
- Resource initialization
libbpf is responsible for creating and populating maps. It can also freeze them, telling the kernel that their content is immutable. This lets the verifier apply some optimizations in the kernel.
After all those transformations, the eBPF program is passed to the kernel, where it will be validated by the kernel’s verifier, transformed by some optimizations, and translated into native instructions. We will explore the kernel side in future posts.
Next part: We have a compiler at home