Bridging Kernel Space and AI: Building an MCP Server for Linux Scheduler Observability - Daniel Hodges, Meta

by Daniel Hodges

Day 1 · Astor Ballroom (7th Floor) · MCP Best Practices · 30 min · 4 min read

Applied AI

TL;DR

Daniel Hodges demonstrated building an MCP server for Linux scheduler observability using BPF. He walked through a five-step example with a single tracepoint and ring buffer. He then presented SCXTOP, a real MCP tool that hooks 15–20 BPF events, shards ring buffers by L3 cache or NUMA node, and exposes analysis tools for AI agents. The talk covered pitfalls like PID instability, high-frequency event loss, and permission isolation.

Building a Simple BPF MCP Server

▶ Watch (2:05)

The talk started with a five-step recipe. First, define the BPF event: the sched_switch tracepoint fires every time the scheduler switches from one task to another. Second, write the BPF program: attach to the tracepoint, reserve an entry in a ring buffer, fill in fields like PID and task name, then submit the event. Third, set up the ring buffer in userspace: use the libbpf-rs RingBufferBuilder to register a callback for incoming events. Fourth, aggregate the data: parse events in the callback and store them in a Rust HashMap. Fifth, expose the tools via MCP: create endpoints like get_scheduler_stats, enable_collection, and disable_collection. The enable method loads the BPF program; the disable method unloads it.
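Steps four and five of the recipe above can be sketched in plain Rust. This is a minimal illustration, not code from the talk: the 20-byte record layout, the `SchedCollector` type, and the `encode_event` helper are assumptions made for the example, and the BPF loading and MCP transport layers are left out. Only the tool names (get_scheduler_stats, enable_collection, disable_collection) come from the talk.

```rust
use std::collections::HashMap;

/// Hypothetical record layout (an assumption for this sketch):
/// bytes 0..4 = PID as little-endian u32, bytes 4..20 = NUL-padded task name.
const EVENT_SIZE: usize = 20;

#[derive(Debug, Default, Clone, PartialEq)]
pub struct TaskStats {
    pub comm: String,
    pub switches: u64,
}

#[derive(Default)]
pub struct SchedCollector {
    enabled: bool,
    stats: HashMap<u32, TaskStats>,
}

impl SchedCollector {
    /// Stand-in for the enable_collection tool. A real server would load
    /// and attach the BPF program here; this sketch only flips a flag.
    pub fn enable_collection(&mut self) {
        self.enabled = true;
    }

    /// Stand-in for the disable_collection tool (would unload the program).
    pub fn disable_collection(&mut self) {
        self.enabled = false;
    }

    /// Body of the ring buffer callback: parse one raw record and aggregate it.
    /// Returns false when the record is ignored (collection off or too short).
    pub fn handle_event(&mut self, data: &[u8]) -> bool {
        if !self.enabled || data.len() < EVENT_SIZE {
            return false;
        }
        let pid = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
        let raw = &data[4..EVENT_SIZE];
        let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
        let entry = self.stats.entry(pid).or_default();
        entry.comm = String::from_utf8_lossy(&raw[..end]).into_owned();
        entry.switches += 1;
        true
    }

    /// Stand-in for the get_scheduler_stats tool: busiest tasks first,
    /// ties broken by ascending PID.
    pub fn get_scheduler_stats(&self) -> Vec<(u32, TaskStats)> {
        let mut out: Vec<_> = self.stats.iter().map(|(p, s)| (*p, s.clone())).collect();
        out.sort_by(|a, b| b.1.switches.cmp(&a.1.switches).then(a.0.cmp(&b.0)));
        out
    }
}

/// Test helper that builds a record in the hypothetical layout above.
pub fn encode_event(pid: u32, comm: &str) -> Vec<u8> {
    let mut buf = vec![0u8; EVENT_SIZE];
    buf[..4].copy_from_slice(&pid.to_le_bytes());
    let n = comm.len().min(15);
    buf[4..4 + n].copy_from_slice(&comm.as_bytes()[..n]);
    buf
}
```

In a real server, `handle_event` would be the closure registered with the ring buffer, and the three public methods would be wired to MCP tool handlers.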

SCXTOP: An MCP Tool for Scheduler Observability

▶ Watch (10:14)

SCXTOP is built for sched_ext, a framework that allows scheduling logic to be written in BPF. The tool hooks 15 to 20 different BPF events. Its architecture has a kernel side and a userspace side. An event processing layer decouples the TUI, MCP, and trace modes. When collecting events on a machine with hundreds of cores, a single ring buffer drops events. SCXTOP solves this by sharding ring buffers based on L3 cache or NUMA nodes. The tool also generates Perfetto traces, which can be loaded into Google’s Perfetto UI for SQL queries and analysis.
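The sharding idea can be illustrated with a small lookup table that assigns each CPU to a ring buffer according to its topology domain. This is a sketch under stated assumptions, not SCXTOP's implementation: the `CpuTopo` record, the `ShardBy` enum, and `build_shard_table` are hypothetical names, and a real tool would obtain the topology from the system rather than take it as an argument.

```rust
use std::collections::HashMap;

/// Which topology domain gets its own ring buffer.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ShardBy {
    NumaNode,
    L3Cache,
}

/// Hypothetical per-CPU topology record (an assumption for this sketch).
#[derive(Clone, Copy, Debug)]
pub struct CpuTopo {
    pub cpu: u32,
    pub numa_node: u32,
    pub l3_id: u32,
}

/// Returns (CPU -> shard index, number of shards). Shard indices are dense
/// and assigned in order of first appearance of each domain.
pub fn build_shard_table(topo: &[CpuTopo], by: ShardBy) -> (HashMap<u32, usize>, usize) {
    let mut domain_to_shard: HashMap<u32, usize> = HashMap::new();
    let mut cpu_to_shard: HashMap<u32, usize> = HashMap::new();
    for t in topo {
        let domain = match by {
            ShardBy::NumaNode => t.numa_node,
            ShardBy::L3Cache => t.l3_id,
        };
        let next = domain_to_shard.len();
        let shard = *domain_to_shard.entry(domain).or_insert(next);
        cpu_to_shard.insert(t.cpu, shard);
    }
    let shards = domain_to_shard.len();
    (cpu_to_shard, shards)
}
```

The shard count determines how many ring buffers to create, and the per-CPU index tells the producer on a given CPU which buffer to write to. Sharding by L3 cache yields more, smaller shards than sharding by NUMA node.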

Handling High-Frequency Events and Multiple Ring Buffers

▶ Watch (16:04)

For multiple event types flowing through one ring buffer, SCXTOP uses a generic BPF event struct with a type field as the first element. User space reads the type to parse the union of event structs. High-frequency events like sched_switch can run at tens to hundreds of thousands per second. To avoid flooding the ring buffer, the tool uses per-CPU counters. A BPF map of type BPF_MAP_TYPE_PERCPU_ARRAY stores counts cheaply. User space reads per-CPU data from the map at intervals. This approach trades event context for throughput.
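Both techniques in this section have a simple userspace half, sketched below. The type tags, field layouts, and function names are assumptions chosen for illustration; only the overall shape (a leading type field that selects how to parse the rest, and per-CPU counters summed in userspace) comes from the talk. Reading the values out of a BPF_MAP_TYPE_PERCPU_ARRAY is omitted; `percpu_delta` starts from two already-fetched snapshots.

```rust
/// Hypothetical type tags; in a real tool they must match the BPF side.
const EVENT_SCHED_SWITCH: u32 = 1;
const EVENT_SCHED_WAKEUP: u32 = 2;

/// Userspace view of the union of event structs.
#[derive(Debug, PartialEq)]
pub enum Event {
    SchedSwitch { prev_pid: u32, next_pid: u32 },
    SchedWakeup { pid: u32, target_cpu: u32 },
}

/// Reads a little-endian u32 at `off`, or None if the record is too short.
fn read_u32(data: &[u8], off: usize) -> Option<u32> {
    let b = data.get(off..off + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Reads the leading type field, then parses the payload that follows it.
/// Unknown tags and truncated records yield None.
pub fn parse_event(data: &[u8]) -> Option<Event> {
    match read_u32(data, 0)? {
        EVENT_SCHED_SWITCH => Some(Event::SchedSwitch {
            prev_pid: read_u32(data, 4)?,
            next_pid: read_u32(data, 8)?,
        }),
        EVENT_SCHED_WAKEUP => Some(Event::SchedWakeup {
            pid: read_u32(data, 4)?,
            target_cpu: read_u32(data, 8)?,
        }),
        _ => None,
    }
}

/// Total number of events between two per-CPU counter snapshots
/// (one slot per CPU in each slice).
pub fn percpu_delta(prev: &[u64], cur: &[u64]) -> u64 {
    cur.iter().zip(prev).map(|(c, p)| c.saturating_sub(*p)).sum()
}
```

The counter path returns only a number per interval, which is the trade-off the talk describes: no per-event context, but no ring buffer traffic either.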

PID Stability and Dynamic Tracing

▶ Watch (19:07)

PID is not a stable identifier: a PID stored in a BPF map may no longer refer to the same task by the time it is read back. Instead, the tool casts the task struct pointer to a u64 and uses that as a stable map key. For dynamic tracing, SCXTOP uses a generic kprobe. The BPF side defines a handler that calls the bpf_get_func_ip helper (BPF_FUNC_get_func_ip) to capture the address of the probed function. User space can then attach to any kernel function at runtime, for example tcp_sendmsg, without predefining all possible trace points in BPF.
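The effect of the key choice can be shown with a small userspace aggregation. This is an illustrative sketch only: the `RunEvent` struct and both functions are hypothetical, and the events are hand-written rather than read from a BPF map.

```rust
use std::collections::HashMap;

/// Hypothetical event carrying both identifiers (an assumption for this
/// sketch): the task pointer cast to u64, and the PID at the time of the event.
#[derive(Debug, Clone, Copy)]
pub struct RunEvent {
    pub task_ptr: u64,
    pub pid: u32,
    pub runtime_ns: u64,
}

/// Keys by PID: two different tasks that carried the same PID are merged.
pub fn runtime_by_pid(events: &[RunEvent]) -> HashMap<u32, u64> {
    let mut out = HashMap::new();
    for e in events {
        *out.entry(e.pid).or_insert(0) += e.runtime_ns;
    }
    out
}

/// Keys by task pointer: each task keeps its own entry.
pub fn runtime_by_task(events: &[RunEvent]) -> HashMap<u64, u64> {
    let mut out = HashMap::new();
    for e in events {
        *out.entry(e.task_ptr).or_insert(0) += e.runtime_ns;
    }
    out
}
```

When two events share a PID but come from different tasks, `runtime_by_pid` reports a single combined entry, while `runtime_by_task` keeps them apart.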

Permissions and Security

▶ Watch (22:22)

BPF requires root or CAP_BPF. Granting that to an AI agent is risky. SCXTOP’s solution: use the trace subcommand to collect Perfetto data on the target machine (with full permissions), then transfer the trace file to a separate analysis machine that runs the MCP server without elevated privileges. This isolation keeps the privileged data collection short-lived and the analysis sandboxed. For fleet-wide monitoring versus one-off debugging, the permission model differs. The talk recommended always decoupling the collection step from the AI-facing tool.

Q&A

Is anyone doing similar work for JVM environments, like heap dumps? Hodges was not aware of direct equivalents but thought the same BPF patterns could apply to JVM thread scheduling. ▶ Watch (28:16)

What is the end-to-end workflow you use with the agent? Hodges works on BPF schedulers themselves. He feeds the scheduler source code and raw performance data into an agent to understand why one scheduler performs worse than another. ▶ Watch (28:48)

Notable Quotes

The key insight is BPF lets you add that custom instrumentation to a running kernel without writing kernel modules or even rebooting. Daniel Hodges · ▶ Watch (1:36)

We had to shard those out. So you could do sharding based on things like L3 cache, numa node, things like that. Daniel Hodges · ▶ Watch (12:06)

PID is not a stable interface. So if you’re just trying to track different events and store a PID into a BPF map and then do something based on that PID later, the PID may have changed. Daniel Hodges · ▶ Watch (19:07)

AI agents generate BPF byte code and load that into the kernel. That’s a very dangerous thing to do. Daniel Hodges · ▶ Watch (2:50)

Key Takeaways

A five-step BPF MCP server pattern works for a single tracepoint; scaling to many cores requires ring buffer sharding.
Using the task pointer, cast to a u64, as the map key avoids PID reuse issues in BPF maps.
Decouple privileged data collection from AI analysis to limit permission exposure.

About the Speaker(s)

Daniel Hodges is a software engineer on the Linux team at Meta. He has previously worked in areas such as observability, profiling, and application performance testing.
