My MCP Server Code Works, but the Agent Fails: The Case for MCP-specific Evaluations - Calum Murray & Wesley Chun, Red Hat

by Calum Murray, Wesley Chun

Day 1 · Marquis Ballroom (9th Floor) · MCP Best Practices · 27 min · 4 min read

Applied AI

TL;DR

MCP Checker is an open-source framework that evaluates whether agents use MCP servers correctly. It sits as a proxy between agent and server, logs all MCP interactions, and runs tasks declared in YAML. A demo showed an agent failing to call tools whose names were misleading. Testing against the real-world Kubernetes MCP server found that code mode dropped the pass rate from 87.5% to 41.6% on Opus 4.6, and that agents would delete resources when an update failed instead of retrying.

The Integration Test Missing Between Agents and MCP Servers

▶ Watch (4:08)

Traditional software testing assumes deterministic code. Agent evals check full agentic systems but require changes to the agent's code. Neither validates that an agent correctly uses an MCP server’s tool semantics. Calum Murray called this the missing integration test. The Kubernetes MCP server is used by enterprises that needed a way to test across models without instrumenting each agent. Red Hat built MCP Checker to fill that gap. It sits between agent and server, logs all interactions, and runs tasks defined in YAML.

MCP Checker: A Proxy That Logs Everything, Works With Any Agent

▶ Watch (7:36)

MCP Checker places a proxy server between the agent and the MCP server. The proxy captures every tool call, response, and error. To work with any agent without code changes, it leverages the Agent Client Protocol used by IDE companies. It also includes a built-in generic LLM agent for quick testing. Tasks are defined in YAML, which keeps them declarative and language-agnostic. Each task specifies a prompt, setup steps, verification steps, and expected tool calls.
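The recording-proxy idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MCP Checker's implementation or its YAML schema: `RecordingProxy`, `ToolCallRecord`, `Task`, and `missing_tool_calls` are hypothetical names, and a plain dictionary of functions stands in for a real MCP server.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolCallRecord:
    """One logged interaction: tool name, arguments, and the outcome."""
    tool: str
    arguments: dict[str, Any]
    result: Any = None
    error: Optional[str] = None


@dataclass
class RecordingProxy:
    """Forwards tool calls to a backend and logs every call, response, and error."""
    backend: dict[str, Callable[..., Any]]  # hypothetical stand-in for an MCP server
    log: list[ToolCallRecord] = field(default_factory=list)

    def call_tool(self, tool: str, arguments: dict[str, Any]) -> Any:
        record = ToolCallRecord(tool=tool, arguments=arguments)
        self.log.append(record)  # log the attempt even if it fails
        try:
            record.result = self.backend[tool](**arguments)
        except Exception as exc:
            record.error = repr(exc)  # keep the error, then let the agent see it
            raise
        return record.result


@dataclass
class Task:
    """Hypothetical task shape mirroring the four parts named in the talk."""
    prompt: str
    setup: list[str]
    verify: list[str]
    expected_tool_calls: list[str]


def missing_tool_calls(task: Task, proxy: RecordingProxy) -> list[str]:
    """Return the expected tools that never appear in the proxy log."""
    called = {record.tool for record in proxy.log}
    return [tool for tool in task.expected_tool_calls if tool not in called]
```

Because the log lives in the proxy and not in the agent, the same check applies to any agent that routes its tool calls through it, which is the property the talk emphasizes.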

Demo: Four Text Tools With Misleading Names

▶ Watch (11:48)

The demo used a simple MCP server with four text-processing tools named process, transform, convert, and format text, with no type hints. An agent using Google Cloud Code struggled to call the right tool. All seven tasks passed the final output check, but the tool-call assertions failed because the agent never called the expected MCP tool. After the tools were renamed to uppercase, lowercase, title case, and capitalize, the agent used the correct tools. MCP schema tokens jumped from 366 to 1,600. An LLM judge verified the tool calls.
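The gap between a correct final output and a correct tool call can be shown with a small Python sketch. It assumes a hypothetical `RunResult` shape and `evaluate` helper and is not taken from the demo's code. The point is only that the two checks are scored independently.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """What a single evaluation run produced (hypothetical shape)."""
    final_output: str
    tools_called: list[str]


def evaluate(run: RunResult, expected_output: str, expected_tool: str) -> dict[str, bool]:
    """Score the final output and the tool usage as two separate checks."""
    return {
        "output_ok": run.final_output == expected_output,
        "tool_ok": expected_tool in run.tools_called,
    }


# The agent produced the right text on its own and never touched the server.
bypassed = RunResult(final_output="HELLO WORLD", tools_called=[])
# The agent produced the same text by calling the intended tool.
used_tool = RunResult(final_output="HELLO WORLD", tools_called=["uppercase"])

bypass_scores = evaluate(bypassed, "HELLO WORLD", "uppercase")
tool_scores = evaluate(used_tool, "HELLO WORLD", "uppercase")
```

Both runs pass an output-only check, and only the second passes the tool-call check. An evaluation that looked at the final text alone could not tell them apart.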

Real-World Findings: Code Mode, Token Bloat, and Unsafe Delete Behavior

▶ Watch (17:56)

Running MCP Checker on the Kubernetes MCP server for months revealed several patterns. Code mode, in which the agent writes code instead of calling tools directly, dropped the pass rate from 87.5% to 41.6% on Opus 4.6. Adding a search-API tool raised it to 91.6% but used 4x the tokens. When agents encountered an update conflict, they deleted the resource and recreated it, risking data loss. MCP Checker caught this because the task asserted that the delete tool was not called, rather than checking only the final state.
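A negative assertion of this kind might look like the following Python sketch. The tool names (`update_resource`, `get_resource`, `delete_resource`, `create_resource`) and the `replicas` state are hypothetical placeholders, not the Kubernetes MCP server's actual tool names or the talk's test data.

```python
from typing import Any


def forbidden_calls(tool_log: list[str], forbidden: set[str]) -> list[str]:
    """Return every logged call whose tool name is on the forbidden list, in order."""
    return [tool for tool in tool_log if tool in forbidden]


def check_update_task(
    tool_log: list[str],
    final_state: dict[str, Any],
    expected_state: dict[str, Any],
    forbidden: set[str],
) -> dict[str, bool]:
    """Check the end state and, separately, that no forbidden tool was used."""
    return {
        "state_ok": final_state == expected_state,
        "no_forbidden_calls": not forbidden_calls(tool_log, forbidden),
    }


expected = {"replicas": 3}

# Safe path: the update conflicts, the agent re-reads the resource and retries.
retry_log = ["update_resource", "get_resource", "update_resource"]
# Unsafe path: the agent deletes the resource and recreates it.
recreate_log = ["update_resource", "delete_resource", "create_resource"]

retry_result = check_update_task(retry_log, {"replicas": 3}, expected, {"delete_resource"})
recreate_result = check_update_task(recreate_log, {"replicas": 3}, expected, {"delete_resource"})
```

Both paths end in the same state, so `state_ok` is true for each. Only the check on the call log separates the retry from the delete-and-recreate workaround.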

Q&A

Does MCP Checker support more complex assertions than the demo? Yes, HTTP requests, scripts, and MCP-level queries are all supported. ▶ Watch (23:57)

Can MCP Checker help understand the statistical effect of changes across many runs? Limited support currently; it is a work in progress. ▶ Watch (25:53)

Notable Quotes

We’re missing that integration test to make sure that these different models and agents can actually correctly use the semantics of our MCP server. Calum Murray · ▶ Watch (7:24)

Even if you have a great MCP server, agents may not know how to use it properly. Wesley Chun · ▶ Watch (4:53)

When agents encounter this, they seem to love to delete the resource they’re trying to update and make a new one. Calum Murray · ▶ Watch (20:21)

This would not have been caught if we only asserted on the final result. Calum Murray · ▶ Watch (20:52)

Key Takeaways

MCP-specific evaluation is the missing integration test for agent-tool interactions.
MCP Checker uses a proxy to capture all MCP interactions without agent code changes.
Real-world testing revealed agents may bypass MCP servers, use unsafe workarounds, or degrade with code mode.

About the Speakers

Calum Murray is a Software Engineer at Red Hat, working on Applied AI with a focus on MCP and Agents. He also works on Serverless with the Knative community and is a CNCF ambassador.

Wesley Chun, MSCS, is a Google Developer Expert in Google Cloud and Google Workspace, author of the “Core Python” series, and co-author of “Python Web Development with Django”. He currently serves as a Technical Program Manager for AI at Red Hat.
