bpftrace: a scriptable magnifying glass with X-ray vision for Linux
Zach Mitchell | 22 November 2024
Fun Package Fridays is a series where we ask members of the Flox team to share fun and perhaps not-so-well-known tools and utilities you can find in the Flox Catalog, powered by Nix. Today's edition comes from Flox engineer Zach Mitchell, who's just kind of a nerd about profiling.
It's inevitable that your machine behaves in an unexpected way. If software always did what you expected, sure, you'd have fewer gray hairs, but writing software would be much less interesting.
Depending on the context, you may run your program under a debugger, open the dev tools in your browser, or run perf to see where your program is spending its execution time. Allow me to add another tool to your toolbox: bpftrace.
eBPF
eBPF is a technology that lets you dynamically instrument and program the Linux kernel. A core piece of eBPF is a small bytecode interpreter embedded in the Linux kernel. Programs written and compiled for eBPF can be triggered by events in the Linux kernel, such as a process exiting, memory being allocated, a file being read, or a packet arriving from the network.
Some programs use eBPF (which stands for "extended Berkeley Packet Filter") to inspect and route every incoming network packet. As you can imagine, this bytecode interpreter needs to run pretty quickly to not affect network performance too much!
bpftrace
bpftrace leverages eBPF to allow you to see deep inside your system and ask arbitrary questions about what's happening under the hood.
Let's start out by installing bpftrace from the Flox Catalog. If you're on macOS, you're out of luck since bpftrace uses features built into the Linux kernel. Follow along anyway; bpftrace is pretty cool.
Create a new directory bpftrace_fun so we can create a Flox environment and install bpftrace.
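The steps above can be sketched as the following shell session, assuming Flox is already installed and on your PATH. Command output is omitted.

```shell
# Create a scratch directory for the environment and enter it
mkdir -p bpftrace_fun
cd bpftrace_fun

# Create a Flox environment in this directory
flox init

# Install bpftrace from the Flox Catalog into that environment
flox install bpftrace
```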
Flox prints a warning during the install because Flox environments are cross-platform by default, but Flox knows that bpftrace is only available on Linux and helpfully installs it only for Linux systems.
Let's activate the environment so we can use bpftrace:
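A minimal sketch of that step, assuming you are still inside the bpftrace_fun directory. flox activate starts a subshell with the environment's packages available; the version check afterward is only an optional sanity check.

```shell
# Enter the environment (opens a new subshell)
flox activate

# Inside the activated shell: confirm bpftrace is available
bpftrace --version
```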
Getting familiar with bpftrace
bpftrace lets you write scripts in an awk-inspired language to decide which kernel events (called "probes") you want to respond to, at which point you can extract information from those events or the system state, and print formatted output.
bpftrace also comes with a collection of useful scripts already written for you, so let's run one of those.
We'll start with cpuwalk.bt, which samples which CPUs are executing processes and provides you with a histogram of execution per CPU. This program runs until you press Ctrl-C and displays the histogram on exit. Note that bpftrace and its bundled scripts require superuser permissions to run.
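To give a sense of how little code such a tool needs, here is a minimal sketch of the sampling logic described above. It is written from that description and is not guaranteed to match the bundled cpuwalk.bt line for line; the file name cpuwalk_sketch.bt is made up for this example.

```shell
# Write a small bpftrace script to disk (hypothetical file name)
cat > cpuwalk_sketch.bt <<'EOF'
// Sample 99 times per second on every CPU.
// The /pid/ filter skips samples taken while the idle task (pid 0) runs.
profile:hz:99
/pid/
{
    // Linear histogram of CPU IDs: range 0..1000, bucket width 1
    @cpu = lhist(cpu, 0, 1000, 1);
}
EOF
```

Run it with sudo bpftrace cpuwalk_sketch.bt and press Ctrl-C; bpftrace prints the @cpu map as a histogram when it exits.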
That's pretty nifty!
Let's try another one.
The undump.bt script dumps the contents of messages being sent across Unix Domain Sockets on your system (again, press Ctrl-C to quit it):
Looks like systemd is unhappy about something on my system! Who knew?
Ok, one more.
The tcpconnect.bt script will list all new TCP connections on your machine. Run the script and open a new page in your browser; it should show up in the output:
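The core idea can be illustrated with a far simpler script than the bundled one. The sketch below only prints the PID and command name of each process that opens an outgoing TCP connection. It assumes your kernel exposes the tcp_connect function as a kprobe target, and the file name is hypothetical. The bundled tcpconnect.bt prints considerably more, such as addresses and ports.

```shell
# Write a simplified connection tracer to disk (hypothetical file name)
cat > tcpconnect_sketch.bt <<'EOF'
BEGIN
{
    // Print a header once at startup
    printf("%-8s %-16s\n", "PID", "COMM");
}

// Assumption: the kernel function tcp_connect is available to kprobes
kprobe:tcp_connect
{
    // pid and comm are bpftrace builtins for the current process
    printf("%-8d %-16s\n", pid, comm);
}
EOF
```

Run it with sudo bpftrace tcpconnect_sketch.bt, then load a page in your browser; press Ctrl-C to stop.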
There's a whole collection of useful scripts bundled with bpftrace, and it's crazy that one tool can do so many different things.
Writing your own scripts
While these scripts are useful, at some point you'll have your own ideas about what to instrument. Say, for instance, you want to track all of the fork and exec system calls that originate from a command.
In order to customize your instrumentation you'll need to write a bpftrace script. Refer to the bpftrace documentation for the exact syntax, but it looks pretty similar to awk or DTrace if you've used either of those tools before.
Here's what the execsnoop.bt script looks like:
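The bundled script's exact contents vary between bpftrace releases, so what follows is a simplified sketch in the same spirit rather than a verbatim copy. It assumes a recent bpftrace, where tracepoint arguments are accessed as args.argv; older releases spell this args->argv. The file name is hypothetical.

```shell
# Write a simplified exec tracer to disk (hypothetical file name)
cat > execsnoop_sketch.bt <<'EOF'
BEGIN
{
    printf("%-10s %-7s %s\n", "TIME(ms)", "PID", "ARGS");
}

// The wildcard matches every exec-family syscall entry tracepoint
tracepoint:syscalls:sys_enter_exec*
{
    // elapsed is nanoseconds since bpftrace started; convert to ms
    printf("%-10d %-7d ", elapsed / 1000000, pid);
    // join() prints the argv strings separated by spaces, then a newline
    // (assumption: recent bpftrace; older releases use args->argv)
    join(args.argv);
}
EOF
```

Run it with sudo bpftrace execsnoop_sketch.bt and start a few commands in another terminal to see them appear.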
This script triggers on the tracepoint:syscalls:sys_enter_exec* family of probes to print some information every time a process calls one of the exec system calls. To see which probes are available you can run the bpftrace -l command.
Tracepoint names have three colon-separated components, and you can use an asterisk as a wildcard character. In addition, there are separate probes for the start and end of a system call: tracepoint:syscalls:sys_enter_* and tracepoint:syscalls:sys_exit_*.
Let's see which probes exist for the read system call:
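A sketch of one way to ask this, passing a wildcard pattern to bpftrace -l. The pattern is quoted so that the shell doesn't try to expand the asterisk itself.

```shell
# List the syscall tracepoints for read (entry and exit)
sudo bpftrace -l 'tracepoint:syscalls:sys_*_read'
```

On a typical system the listing should include the sys_enter_read and sys_exit_read tracepoints.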
Let's dig into the read system call and see what information it makes available:
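Adding -v to the listing asks bpftrace to print each matching probe's arguments. A sketch, using the entry tracepoint for read:

```shell
# -v adds each probe's argument list to the -l listing
sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_read'
```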
From this output you can see that we get access to the file descriptor, a pointer to the buffer that will be read into, and the number of bytes that were requested to be read.
Let's make a script called bigreadsnoop.bt that reports reads larger than 100,000 bytes on this machine:
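A minimal sketch of such a script, built from the fd and count arguments of the read entry tracepoint. It assumes a recent bpftrace with the args.count syntax (older releases use args->count), and the output columns are an arbitrary choice for illustration.

```shell
# Write the script to disk
cat > bigreadsnoop.bt <<'EOF'
BEGIN
{
    printf("%-10s %-16s %-7s %-5s %s\n", "TIME(ms)", "COMM", "PID", "FD", "BYTES");
}

// Fires on entry to read(); the predicate keeps only large requests
tracepoint:syscalls:sys_enter_read
/args.count > 100000/
{
    // elapsed is nanoseconds since bpftrace started; convert to ms
    printf("%-10d %-16s %-7d %-5d %d\n",
        elapsed / 1000000, comm, pid, args.fd, args.count);
}
EOF
```

Putting the size check in a predicate rather than an if statement inside the action means the action body only runs for reads that pass the filter.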
Running bpftrace with this script on my machine gives the following output:
I'm not sure what this process is doing, but it seems to be reading quite a lot! It's trying to read 262kB every ~20ms!
This has been Fun Package Friday
Everyone has packages that they love, and bpftrace is one of mine. What are yours?