When MCP Isn't Enough: Product Decisions Behind Scalable Agent Systems - John Su, Datadog

by John Su

Day 1 · Juilliard Complex (5th Floor) · Apps and Agents · 21 min · 2 min read

Applied AI

TL;DR

John Su, Datadog product director, shared how Bits AI scaled to 4,000 customers by making product decisions around autonomy, transparency, and feedback. The agent investigates thousands of alerts weekly using hypothesis-driven reasoning, surfaces inconclusive results, and charges only for conclusive investigations. A feedback loop with 'not quite' options drove system improvements.

Autonomy Is Not Binary

▶ Watch (5:04)

Bits AI investigates thousands of production alerts each week for 4,000 customers. The team chose a supervised autonomy model. The agent autonomously investigates incidents but requires human approval before merging a fix. This avoids making a sleepy engineer lose time. Autonomy sits on a spectrum. High-risk systems like infra or payment always need human approval. Low-risk, high-clarity tasks like tagging tickets can run fully autonomous.

Hypothesis-Driven Reasoning Beats Fixed Workflows

▶ Watch (8:36)

Fixed workflows failed when multi-service failures or missing context occurred. The team switched to adaptive reasoning. The agent generates multiple hypotheses, tests each, and validates dynamically. Engineers can step in mid-investigation to guide the agent. For example, they can say “don’t look into that service, we killed that months ago.” This mirrors how a human engineer thinks during an incident.

Surface Uncertainty, Don’t Hide It

▶ Watch (10:59)

The agent labels investigations as conclusive or inconclusive. Inconclusive means the root cause was not found. Datadog does not charge for inconclusive investigations. This built trust and encouraged usage. The hypothesis tree shows the agent’s reasoning process transparently. When GPT 5.2 caused hallucinations, the team rolled back in minutes. Being ready for model changes is critical.

Feedback Loops That Capture Partial Correctness

▶ Watch (15:18)

Initial feedback options were yes and no. Users rarely clicked either because the answer was partly correct. Datadog changed to no and not quite. “Not quite” became the most clicked option. A comment box collected reasons like “it forgot to check my logs.” Each comment became an eval point. The team used these to add missing checks. Internal engineers tested the product for a full year before GA.

Notable Quotes

“autonomy is not binary” John Su · ▶ Watch (6:33)

“we didn’t hide the uncertainty here but we actually surfaced it to the user.” John Su · ▶ Watch (11:46)

“the not quite answer was actually became the most clicked answer.” John Su · ▶ Watch (16:26)

“it started creating hallucinations everywhere.” John Su · ▶ Watch (13:09)

Key Takeaways

Autonomy is a spectrum; match it to risk and user expectations.
Hypothesis-driven reasoning outperforms fixed workflows in dynamic production environments.
Surface uncertainty transparently to build trust and enable usage-based pricing.

Filed under

Applied AI