Evaluate What You Can't See: Measure the Probabilistic Nature of MCP - Prathmesh Patel & Marcelo Jimenez Rocabado, MCPJam

by Prathmesh Patel, Marcelo Jimenez Rocabado

Day 1 · Juilliard Complex (5th Floor) · Apps and Agents · 31 min · 3 min read

Applied AI

TL;DR

MCP server developers lack visibility into agent reasoning and user intent. Patel presented the user value chain — six steps from connection to user satisfaction — and showed how server logs hide failures. He advocated for evals: nondeterministic tests that measure whether the right tool was called with correct arguments. MCPJam's sandbox environments and trace-based eval generation let teams close the feedback loop.

Why MCP Developers Focus on Wrong Solutions

▶ Watch (3:24)

The XY problem occurs when someone asks about their assumed solution (Y) instead of their actual problem (X). Patel gave two examples: breeding a faster horse instead of finding faster transport, and asking “MCP sucks, how do I build a CLI?” when the real problem is agent workflow inefficiency and ballooning costs. Focusing on Y narrows the solution space. For MCP host applications, progressive disclosure of tools or caching of frequently used tools can reduce context window usage without abandoning MCP.
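Progressive disclosure of tools can be sketched roughly as follows. This is a minimal illustration, not something shown in the talk: the `Tool` class, the `category` field, the catalog entries, and both helper functions are hypothetical. The idea is that the host first shows the agent a short category index and only expands the tools in categories the agent has opened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    category: str  # hypothetical grouping used to decide what to disclose


# Hypothetical catalog; tool names are illustrative only.
CATALOG = [
    Tool("get_project_status", "Return the status of a project", "projects"),
    Tool("list_projects", "List projects visible to the user", "projects"),
    Tool("create_task", "Create a task in a project", "tasks"),
    Tool("assign_task", "Assign a task to a teammate", "tasks"),
]


def category_index(catalog):
    """First-level listing: one entry per category instead of one per tool."""
    return sorted({tool.category for tool in catalog})


def disclosed_tools(catalog, active_categories):
    """Return only the tools in categories the agent has opened so far."""
    return [tool for tool in catalog if tool.category in active_categories]


index = category_index(CATALOG)
task_tools = disclosed_tools(CATALOG, {"tasks"})
```

With this shape, the agent's context initially holds two category names rather than four full tool descriptions, and the saving grows with the size of the catalog.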

Server Logs Lie: Identical Tool Calls, Opposite Outcomes

▶ Watch (8:50)

Two user sessions called the same MCP tool “get project status” twice, both returning 200. Server logs showed identical latency and status codes. But behind the host application, one user got what they wanted and moved on. The other received wrong projects three times and grew frustrated. Patel explained that by default, MCP servers only see what hits their endpoint. The agent’s reasoning, tool selection, and user intent are invisible. That gap hides failure.

Six Steps to User Value, One Observable

▶ Watch (14:08)

Patel introduced a user value chain that runs from connection to user satisfaction: connection, tools list success, agent discovery of your server, correct tool selection, correct arguments, correct response, agent reasoning about the response, and user satisfaction. Only the tool call step gives the MCP server developer full control and observability. User value can be lost at every other step. Most teams instrument only the tool call, which leaves the rest of the chain, user intent and agent behavior included, unmeasured.
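One way to make the chain measurable is to treat it as a funnel and count how many sessions reach each stage. The sketch below is an assumption about how that could look: the stage identifiers mirror the stages named above, while the `funnel` function and the sample sessions are invented for illustration. It also assumes you can observe the last stage each session completed, which is exactly the data the server alone does not have.

```python
# Stage identifiers follow the stages named in the summary above.
STAGES = [
    "connection",
    "tools_list_success",
    "agent_discovery",
    "correct_tool_selection",
    "correct_arguments",
    "correct_response",
    "agent_reasoning",
    "user_satisfaction",
]


def funnel(sessions):
    """Count how many sessions reached each stage.

    Each session is represented by the name of the last stage it completed.
    """
    counts = {stage: 0 for stage in STAGES}
    for last_stage in sessions:
        reached = STAGES.index(last_stage)
        for stage in STAGES[: reached + 1]:
            counts[stage] += 1
    return counts


# Invented sample data: four sessions that stopped at different stages.
sessions = [
    "user_satisfaction",
    "correct_response",
    "correct_tool_selection",
    "user_satisfaction",
]
counts = funnel(sessions)
```

Comparing adjacent counts shows where sessions drop out, for example between tool selection and argument construction.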

Evals: Nondeterministic Tests That Measure Real Value

▶ Watch (19:00)

Patel built MCP evals by gathering product managers to define workflows, then creating a “golden prompt set” of user prompts. Evals are nondeterministic tests: given a prompt, did the server deliver value? Did the right tool get called with correct arguments? He noted that existing AI product data (chatbot logs) provides a starting point, though those users are typically power users. The real goal is to build evals from production user data.
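A minimal sketch of such an eval, under stated assumptions: `EvalCase`, `ToolCall`, `run_eval`, and the stand-in `fake_agent` are hypothetical names, not MCPJam's API, and the quality bar value is arbitrary. Because the agent is nondeterministic, each golden prompt is run several times and the result is a pass rate rather than a single pass or fail.

```python
import random
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class EvalCase:
    prompt: str
    expected_tool: str
    expected_arguments: dict


def fake_agent(prompt, rng):
    """Stand-in for a real agent and host; sometimes picks the wrong tool."""
    if rng.random() < 0.8:
        return ToolCall("get_project_status", {"project": "website-redesign"})
    return ToolCall("list_projects", {})


def run_eval(case, agent, trials, rng):
    """Run one golden prompt `trials` times and return the pass rate."""
    passes = 0
    for _ in range(trials):
        call = agent(case.prompt, rng)
        # A trial passes only if both the tool and its arguments match.
        if call.name == case.expected_tool and call.arguments == case.expected_arguments:
            passes += 1
    return passes / trials


case = EvalCase(
    prompt="How is the website redesign going?",  # invented golden prompt
    expected_tool="get_project_status",
    expected_arguments={"project": "website-redesign"},
)
rate = run_eval(case, fake_agent, trials=50, rng=random.Random(0))

QUALITY_BAR = 0.7  # hypothetical threshold a team might set
meets_bar = rate >= QUALITY_BAR
```

In a real setup, `fake_agent` would be replaced by a call into an actual host and model, and exact argument matching might be relaxed to a looser comparison.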

Six Ways to Get Real User Data

▶ Watch (25:26)

Patel outlined six ways to capture the user context that MCP servers miss: add a user intent parameter to tools, use existing AI chat data, build a test environment with sandboxes, leverage MCP apps for UI instrumentation, use MCP sampling (though, by his estimate, 99% of host applications don’t support it), and use isolated sandbox environments. These feed a flywheel: capture signal, cluster into workflows, build eval tests, set a quality gate in CI/CD, ship, and observe real user interactions to restart the cycle.
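The first method, a user intent parameter, might look like the sketch below. The tool definition follows the general shape MCP uses for tools (a name, a description, and a JSON Schema under `inputSchema`), but the `user_intent` field name, its description, the handler, and the in-memory log are all assumptions made for illustration.

```python
# Hypothetical tool definition. `user_intent` is the added parameter; the
# agent fills it in, so the server sees why the tool was called.
GET_PROJECT_STATUS_TOOL = {
    "name": "get_project_status",
    "description": "Return the current status of a project.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "project": {
                "type": "string",
                "description": "Project name or ID.",
            },
            "user_intent": {
                "type": "string",
                "description": "One sentence on what the user is trying to accomplish.",
            },
        },
        "required": ["project", "user_intent"],
    },
}

intent_log = []  # in-memory stand-in for a real log sink


def handle_get_project_status(arguments):
    """Record the stated intent next to the call, then return a placeholder."""
    intent_log.append(
        {
            "tool": "get_project_status",
            "project": arguments["project"],
            "user_intent": arguments.get("user_intent", ""),
        }
    )
    # Placeholder payload; a real handler would look the project up.
    return {"project": arguments["project"], "status": "unknown"}


result = handle_get_project_status(
    {"project": "website-redesign", "user_intent": "Check if launch is on track."}
)
```

The logged intents are the kind of signal the flywheel describes: they can be clustered into workflows and turned into eval prompts. Note that the intent is written by the agent, so it reflects the agent's interpretation of the user rather than the user's own words.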

Notable Quotes

Same logs, same server, completely different user outcomes, user satisfaction on either side of the spectrum. Patel · ▶ Watch (10:44)

Sampling. 99% of host applications do not support sampling. Patel · ▶ Watch (23:45)

Building and shipping is only half the cycle. The other half is actually delivering user value. Patel · ▶ Watch (26:41)

You should not ship unless you hit your quality bar. Patel · ▶ Watch (26:16)

Key Takeaways

MCP servers lack visibility into agent reasoning and user intent by design.
Evals (nondeterministic tests) measure whether the right tool was called with correct arguments and whether the server delivered user value.
Six methods exist to get real user data, feeding a flywheel of improvement and quality gates.

About the Speaker(s)

Prathmesh Patel is leading MCPJam: an open-source developer platform helping thousands test, evaluate, and ship their MCP apps and servers. He’s a former Technical Lead at Asana who owned Asana’s MCP server, REST API, and OAuth AS. He also led Asana’s prototyping and development of…
