An introduction to the BPF Compiler Collection

December 22, 2017

This article was contributed by Matt Fleming

In the previous article of this series, I discussed how to use eBPF to safely run code supplied by user space inside of the kernel. Yet one of eBPF's biggest challenges for newcomers is that writing programs requires compiling and linking to the eBPF library from the kernel source. Kernel developers might always have a copy of the kernel source within reach, but that's not so for engineers working on production or customer machines. Addressing this limitation is one of the reasons that the BPF Compiler Collection was created. The project consists of a toolchain for writing, compiling, and loading eBPF programs, along with example programs and battle-hardened tools for debugging and diagnosing performance issues.

Since its release in April 2015, many developers have worked on BCC, and the 113 contributors have produced an impressive collection of over 100 examples and ready-to-use tracing tools. For example, scripts that use User Statically-Defined Tracing (USDT) probes (a mechanism from DTrace to place tracepoints in user-space code) are provided for tracing garbage collection events, method calls and system calls, and thread creation and destruction in high-level languages. Many popular applications, particularly databases, also have USDT probes that can be enabled with configuration switches like --enable-dtrace. These probes are inserted into user applications, as the name implies, statically at compile-time. I'll be dedicating an entire LWN article to covering USDT probes in the near future.

The project documentation shows how to use the existing scripts and tools to conduct a thorough performance investigation without writing a line of code, and a handy tutorial is provided in the BCC repository. Another useful guide to some of the BCC tools was written by Brendan Gregg, who has the second highest number of patches to bcc/tools (Sasha Goldshtein holds the number one spot as of this writing).

Front-ends for the Python and Lua programming languages are available in BCC. Using these high-level languages, it's possible to write short but expressive programs with all the data-manipulation advantages that are missing with C. For example, developers can treat eBPF maps as Python dictionaries and access map contents directly, which is implemented internally by using the BPF helper functions. This helps to lower the bar for would-be developers using eBPF because they can use the standard patterns that they're used to for processing data.
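
As a rough sketch of what treating a map like a dictionary can look like, the snippet below counts clone() calls per process in a hash map and reads the totals back from Python. The map name counts, the helper busiest(), and the five-second sleep are all made up for illustration, and the kprobe__sys_clone name assumes that the running kernel exposes a sys_clone symbol. The BCC import lives inside main() only so that the helper can be tried on a machine without BCC.

```python
#!/usr/bin/env python

# Hypothetical eBPF program: counts clone() calls per process ID in a hash map.
PROGRAM = r'''
BPF_HASH(counts, u32, u64);

int kprobe__sys_clone(void *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
'''


def busiest(table, limit=5):
    """Return up to `limit` (pid, count) pairs, highest count first.

    `table` is anything with a dict-style items() method whose keys and
    values expose the integer as `.value`, as ctypes integers do.
    """
    pairs = [(key.value, leaf.value) for key, leaf in table.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]


def main():
    # Imported here so that busiest() can be used without BCC installed.
    from time import sleep
    from bcc import BPF

    b = BPF(text=PROGRAM)   # compile, load, and attach the kprobe
    sleep(5)                # give some processes time to call clone()
    for pid, count in busiest(b["counts"]):
        print("pid %d called clone() %d times" % (pid, count))

# Call main() as root on a machine with BCC installed to try it for real.
```

The point of the sketch is that b["counts"] is indexed and iterated much like an ordinary dictionary; the only wrinkle is that keys and values arrive as ctypes objects, hence the .value accesses.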

BCC invokes the LLVM Clang compiler, which has a BPF back end, to translate C into eBPF bytecode. BCC then takes care of loading the eBPF bytecode into the kernel with the bpf() system call. If loading fails, for example if the in-kernel verifier checks fail, then BCC provides hints as to why loading failed, e.g. "HINT: The 'map_value_or_null' error can happen if you dereference a pointer value from a map lookup without first checking if that pointer is NULL." This is another motivation for creating BCC — it's difficult to write obviously correct BPF programs; BCC tells you when you've made a mistake.

A really quick "Hello, World!" example

To demonstrate how quickly you can start working with BCC, here's the "Hello, World!" program example from the BCC repository. It prints into the trace buffer every time the clone() system call runs. I've reformatted it slightly to make it easier to read.

    #!/usr/bin/env python
    from bcc import BPF

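    # eBPF program (restricted C); BCC compiles it when the BPF object is created.
    # The backslash is doubled so that the C compiler, not Python, sees "\n".
    program = '''
    int kprobe__sys_clone(void *ctx)
    {
        bpf_trace_printk("Hello, World!\\n");
        return 0;
    }
    '''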

The entire eBPF program is contained in the program variable; this is the code that runs inside the kernel on the eBPF virtual machine. The format of the function name, "kprobe__sys_clone()", is important: the kprobe__ prefix directs the BCC toolchain to attach a kprobe to the kernel symbol that follows it. In this case, that's sys_clone(). When sys_clone() is called and this kprobe fires, the eBPF program runs and bpf_trace_printk() prints "Hello, World!" into the kernel's trace buffer.

The remainder of the Python program causes the eBPF code to be loaded into the kernel and run:

    b = BPF(text=program)
    b.trace_print()

The previously cumbersome task of compiling the program to eBPF bytecode and loading it into the kernel is handled entirely by instantiating a new BPF object; all the low-level work is done behind the scenes, inside the Python bindings and BCC's libbpf.
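
One small piece of that behind-the-scenes work is the name-based attachment described earlier. The following is only an illustration of the naming convention, not BCC's actual implementation: split_probe_name() is a hypothetical helper, and the kretprobe__ entry is included on the assumption that return probes follow the same pattern.

```python
# Prefixes assumed for this sketch; BCC recognizes others as well.
AUTO_ATTACH_PREFIXES = {
    "kprobe__": "kprobe",
    "kretprobe__": "kretprobe",
}


def split_probe_name(function_name):
    """Split an auto-attach style name into (probe_type, kernel_symbol).

    Returns None when the name carries none of the known prefixes, in
    which case the function would have to be attached explicitly.
    """
    for prefix, probe_type in AUTO_ATTACH_PREFIXES.items():
        if function_name.startswith(prefix):
            return probe_type, function_name[len(prefix):]
    return None


print(split_probe_name("kprobe__sys_clone"))   # ('kprobe', 'sys_clone')
print(split_probe_name("count_events"))        # None
```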

BPF.trace_print() performs a blocking read on the kernel's trace buffer file (/sys/kernel/debug/tracing/trace_pipe) and prints the contents to the standard output. Here's the output:

    gnome-terminal--3210  [003] d..2 19252.369014: 0x00000001: Hello, World!
    gnome-terminal--3210  [003] d..2 19252.369080: 0x00000001: Hello, World!
    pool-21543 [001] d..2 19252.382317: 0x00000001: Hello, World!
    bash-21545 [002] d..2 19252.385535: 0x00000001: Hello, World!
    bash-21546 [003] d..2 19252.385752: 0x00000001: Hello, World!
    bash-21545 [002] d..2 19252.386883: 0x00000001: Hello, World!

The output shows:

The name of the application running when the kprobe fired
Its PID
The CPU it was running on (in [brackets])
Various process context flags
A timestamp

The final field is our "Hello, World!" string that we passed to bpf_trace_printk(). The penultimate field contains the address 0x00000001. Normally, when kernel code writes to the trace buffer, the instruction pointer address following the call to trace_printk() is printed in that field. Unfortunately, this isn't implemented for bpf_trace_printk(), so the hard-coded address 0x00000001 is always used.
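
For anyone who wants to post-process these lines, the fields described above can be pulled apart with a regular expression. This is a minimal sketch that assumes the exact layout shown in the output above; the pattern and the parse_trace_line() helper are illustrative only, and other kernels or trace options may format the line differently.

```python
import re

# One named group per field in the trace line layout shown above.
TRACE_LINE = re.compile(
    r"^\s*(?P<comm>.+)-(?P<pid>\d+)\s+"   # task name and PID, joined by '-'
    r"\[(?P<cpu>\d+)\]\s+"                # CPU number in brackets
    r"(?P<flags>\S+)\s+"                  # process context flags
    r"(?P<timestamp>\d+\.\d+):\s+"        # timestamp in seconds
    r"(?P<address>\S+):\s+"               # address field (0x00000001 here)
    r"(?P<message>.*)$"                   # the string passed to bpf_trace_printk()
)


def parse_trace_line(line):
    """Return the fields of one trace line as a dict, or None if it doesn't match."""
    match = TRACE_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["cpu"] = int(fields["cpu"])
    fields["timestamp"] = float(fields["timestamp"])
    return fields


sample = "bash-21545 [002] d..2 19252.385535: 0x00000001: Hello, World!"
print(parse_trace_line(sample))
```

Because the task name may itself contain hyphens (as in the gnome-terminal lines), the pattern lets the name match greedily and treats only the last hyphen before the bracketed CPU number as the separator.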

More examples

argdist.py inserts a probe (uprobe, kprobe, tracepoint, or USDT) into a given function, which can be in the kernel or in user-space code. When the probe fires, argdist.py prints the function's parameter values, either as a count or as a histogram. It runs until interrupted by the user. For example, the following command prints the number of times irq_handler_entry() is called, along with which interrupt was raised:

    $ tools/argdist.py -C 't:irq:irq_handler_entry():int:args->irq'
    [14:14:24]
    t:irq:irq_handler_entry():int:args->irq
    COUNT      EVENT
    12         args->irq = 45
    16         args->irq = 53
    52         args->irq = 48
    [14:14:25]
    t:irq:irq_handler_entry():int:args->irq
    COUNT      EVENT
    1          args->irq = 49
    5          args->irq = 53
    24         args->irq = 45
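
To make the two aggregation modes concrete, here is a small sketch that replays the interrupt numbers from the first interval above and summarizes them both ways. It is not how argdist.py is implemented; log2_bucket() is a hypothetical helper that mimics power-of-two bucketing in plain Python.

```python
from collections import Counter


def log2_bucket(value):
    """Return the (low, high) power-of-two bucket holding a non-negative integer."""
    exponent = max(value.bit_length() - 1, 0)
    if exponent == 0:
        return (0, 1)               # 0 and 1 share the first bucket
    return (1 << exponent, (1 << (exponent + 1)) - 1)


# Interrupt numbers replayed from the first interval of the output above.
samples = [45] * 12 + [53] * 16 + [48] * 52

counts = Counter(samples)                                # one entry per distinct value
histogram = Counter(log2_bucket(v) for v in samples)     # one entry per bucket

for irq, count in sorted(counts.items(), key=lambda item: item[1]):
    print("%-10d args->irq = %d" % (count, irq))
for (low, high), count in sorted(histogram.items()):
    print("%d -> %d : %d" % (low, high, count))
```

Counting keeps the three interrupt numbers apart, while bucketing places all 80 samples in the single 32 -> 63 bucket.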

Because the histogram option (-H) uses buckets to group multiple interrupts together, it's less useful than the count option (-C) in this case. One scenario where histogram output is helpful, however, is for the btrfsdist.py tool, which summarizes the latency of Btrfs reads, writes, opens, and fsync operations into power-of-two buckets:

    $ tools/btrfsdist.py
    Tracing btrfs operation latency... Hit Ctrl-C to end.
    ^C

    operation = 'read'
         usecs               : count     distribution
             0 -> 1          : 775      |**************************************  |
             2 -> 3          : 60       |*                                       |
             4 -> 7          : 20       |*                                       |
             8 -> 15         : 3        |                                        |
            16 -> 31         : 3        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 0        |                                        |
           128 -> 255        : 1        |                                        |
           256 -> 511        : 19       |                                        |
           512 -> 1023       : 12       |                                        |

    operation = 'write'
         usecs               : count     distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 2        |                                        |
             4 -> 7          : 8        |****************************            |
             8 -> 15         : 1        |*                                       |
            16 -> 31         : 4        |                                        |
            32 -> 63         : 4        |                                        |

    operation = 'open'
         usecs               : count     distribution
             0 -> 1          : 636      |**********************                  |
             2 -> 3          : 22       |*                                       |
             4 -> 7          : 16       |*                                       |
             8 -> 15         : 2        |                                        |
            16 -> 31         : 1        |                                        |

    operation = 'fsync'
         usecs               : count     distribution
             0 -> 1          : 0        |                                        |
             2 -> 3          : 0        |                                        |
             4 -> 7          : 0        |                                        |
             8 -> 15         : 0        |                                        |
            16 -> 31         : 0        |                                        |
            32 -> 63         : 0        |                                        |
            64 -> 127        : 0        |                                        |
           128 -> 255        : 0        |                                        |
           256 -> 511        : 0        |                                        |
           512 -> 1023       : 0        |                                        |
          1024 -> 2047       : 0        |                                        |
          2048 -> 4095       : 0        |                                        |
          4096 -> 8191       : 1        |****************************************|

There's more to come

That was just a quick introduction to BCC. In the next one, we'll explore some of the more complicated topics, like how to access eBPF data structures, how to configure the way your eBPF program is compiled, and how to debug your programs, all using the Python front end.

An introduction to the BPF Compiler Collection

Posted Dec 23, 2017 3:38 UTC (Sat) by unixbhaskar (guest, #44758)

Thanks, Matt!

An introduction to the BPF Compiler Collection

Posted Dec 24, 2017 14:33 UTC (Sun) by flb (subscriber, #69248)

I've recently come across ply (https://wkz.github.io/ply/) which is a lightweight language for some BPF tracing tasks. It doesn't require access to kernel headers, so it cannot dissect structs for example, but in simple cases it might be enough.

An introduction to the BPF Compiler Collection

Posted Apr 21, 2019 19:58 UTC (Sun) by ncm (guest, #165)

Everyone seems to assume that eBPF program source is C. But LLVM is happy to generate code from IR produced from other languages, notably C++ and Rust. It would be unfortunate if BCC failed to make available improved ways to express eBPF programs.

People have asked me why anyone would code small program fragments like eBPF in C++. The short answer is that C++ and Rust enable better encapsulation of semantics, particularly those useful for a whole collection of eBPF program fragments. Once you find a use for eBPF in one place, you are likely to notice many other places.

BCC would be a good place to park C++ and Rust abstractions useful for any eBPF program fragment.
