The Tool Abstraction Problem: Lessons Learned Building 1000+ MCP Tools

by Sam Partee

Day 1 · Broadway Ballroom South (6th Floor) · MCP Best Practices · 16 min · 3 min read

Applied AI

TL;DR

Sam Partee built over 10,000 tools, starting before MCP existed. He argues that tools and APIs serve different audiences. Calling six tools in a row carries over a 50% chance of failure. The fix: design tools around tasks and intents. Optimizing only tool descriptions yielded a 10x error reduction in nightly evals. Descriptions influence tool selection more than tool names or context.

The Tool Abstraction Problem

▶ Watch (0:02)

Sam Partee made north of 10,000 tools. He started before MCP existed, with a custom Open Execution Protocol that cared only about tools, not prompts or resources. The challenge: take JSON output from GPT-3, where JSON was not guaranteed, then parse it correctly and ensure the tool result was right. He calls this the tool abstraction problem. It applies to resources and prompts too, but differently.

APIs and Tools Serve Different Audiences

▶ Watch (1:24)

Auto-generating tools from API endpoints fails. APIs are built for other programmers, not for large language models. An LLM has to read a tool definition, understand it, and produce the right call. Partee gave a concrete example: “Find the customer who complained last week and schedule a follow-up.” With endpoint-level tools, that request needs five endpoints, and the model rarely realizes it must call get user ID first. The better abstraction is a single tool that covers the whole task, including steps such as find calendar and submit complaint. Accuracy jumps significantly.
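As a minimal sketch of that idea, the follow-up request can be exposed as one task-level function that chains the endpoint-level calls internally. Everything here is an assumption for illustration: the in-memory data, the three endpoint stand-ins (`list_complaints`, `get_customer`, `create_event`), and the tool name `schedule_complaint_follow_up` are hypothetical and do not come from the talk or from any real API.

```python
from datetime import date, timedelta

# Hypothetical in-memory stand-ins for endpoint-level API calls.
_COMPLAINTS = [
    {"customer_id": "c-17", "filed_on": date(2025, 1, 6), "text": "Late delivery"},
]
_CUSTOMERS = {"c-17": {"name": "Dana", "email": "dana@example.com"}}
_CALENDAR = []


def list_complaints(since):
    """Endpoint stand-in: complaints filed on or after `since`."""
    return [c for c in _COMPLAINTS if c["filed_on"] >= since]


def get_customer(customer_id):
    """Endpoint stand-in: look up a customer record by ID."""
    return _CUSTOMERS[customer_id]


def create_event(email, when, title):
    """Endpoint stand-in: add an event to the calendar."""
    event = {"email": email, "when": when, "title": title}
    _CALENDAR.append(event)
    return event


def schedule_complaint_follow_up(today, days_back=7, follow_up_in_days=2):
    """Find the most recent complaint from the last `days_back` days and
    schedule a follow-up with that customer.

    This is the single task-level tool the model would see; the chaining
    of the three endpoint calls happens here, not in the model.
    """
    recent = list_complaints(today - timedelta(days=days_back))
    if not recent:
        return {"status": "no_complaints_found"}
    latest = max(recent, key=lambda c: c["filed_on"])
    customer = get_customer(latest["customer_id"])
    event = create_event(
        customer["email"],
        today + timedelta(days=follow_up_in_days),
        "Follow up: " + latest["text"],
    )
    return {"status": "scheduled", "customer": customer["name"], "event": event}


if __name__ == "__main__":
    print(schedule_complaint_follow_up(date(2025, 1, 10)))
```

The model now has one call to get right instead of several, and the ordering constraint (look up the customer before scheduling) lives in ordinary code.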

Chaining Is the Hardest Problem

▶ Watch (4:01)

Every major evaluation paper since Apple’s ToolSandbox shows the same result: calling six tools means over a 50% chance of failure. Chaining makes it worse. Nestful, tool composition papers, and the Berkeley Function Calling Leaderboard all agree: chaining is the hardest thing. Partee’s solution: model tools around tasks and intents. An agent’s to-do list maps naturally onto task-oriented tools. A tool named “track this order” works better than “get user.”
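A rough way to see why six calls can cross the 50% mark is to assume each call succeeds independently with the same probability, so the chain succeeds with that probability raised to the number of calls. The independence assumption and the function names below are mine, not from the talk or the cited papers; real failures are likely correlated, so treat this as a back-of-the-envelope sketch.

```python
def chain_success(per_call_success, calls):
    """Probability that every call in a chain succeeds, assuming the
    calls succeed or fail independently of each other."""
    return per_call_success ** calls


def per_call_success_needed(target_chain_success, calls):
    """Per-call success rate required for the whole chain to reach
    `target_chain_success`, under the same independence assumption."""
    return target_chain_success ** (1.0 / calls)


if __name__ == "__main__":
    for p in (0.99, 0.95, 0.90, 0.85):
        print(f"per-call {p:.2f} -> six-call chain {chain_success(p, 6):.3f}")
    needed = per_call_success_needed(0.5, 6)
    print(f"break-even per-call success for six calls: {needed:.3f}")
```

Under this model, a six-call chain fails more often than it succeeds once per-call accuracy drops below roughly 0.89, which is why collapsing several calls into one task-level tool helps.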

Description Quality Is a 10x Lever

▶ Watch (8:58)

Arcade ran over 20,000 evals. The finding: description quality is a 10x lever on results. Descriptions influence which tool the model selects more than tool names or actual function use. Context window position matters: schemas sit near the bottom, so the description is the last thing the model processes before deciding. Keep descriptions under 600 words, start with an action verb, and write a short task-intent sentence. After optimizing only descriptions, nightly evals showed a 10x reduction in errors.
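Two of the guidelines above (the 600-word limit and the leading action verb) are mechanical enough to check automatically. The following is a minimal sketch of such a check; the `lint_description` function and the small `ACTION_VERBS` list are hypothetical and are not Arcade's tooling. A real verb list would need to be much larger.

```python
# Hypothetical, deliberately small set of verbs a description may start with.
ACTION_VERBS = {
    "find", "get", "list", "create", "send",
    "schedule", "track", "update", "delete", "search",
}
MAX_WORDS = 600  # the talk's guideline: keep descriptions under 600 words


def lint_description(description):
    """Return a list of guideline violations for one tool description."""
    words = description.split()
    if not words:
        return ["description is empty"]
    problems = []
    if len(words) >= MAX_WORDS:
        problems.append(f"{len(words)} words; keep it under {MAX_WORDS}")
    first = words[0].strip(".,:;").lower()
    if first not in ACTION_VERBS:
        problems.append(f"starts with '{words[0]}', not an action verb")
    return problems


if __name__ == "__main__":
    print(lint_description("Track the shipping status of an order for a customer."))
    print(lint_description("This tool is a wrapper for the orders endpoint."))
```

A check like this only catches form. Whether a description actually improves tool selection still has to be measured with evals.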

Progressive Discovery and the Tool Count Cliff

▶ Watch (12:41)

The tool count cliff is real. Results since Apple’s tool sandbox show that more than 20 tools confuse an agent. Progressive discovery introduces context over time, but it does not solve the whole problem. Partee’s advice: move composition logic inside the tools themselves. Make tools task- and intent-oriented functions. Chaining inside tools is the right abstraction. Descriptions remain the 10x lever, so do not write them once and forget them.
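One simple reading of progressive discovery is to show the agent only the few tools whose descriptions match the current task, instead of the full catalog. The sketch below assumes a plain keyword-overlap score; the `TOOLS` catalog and the `discover_tools` function are hypothetical, and a production system would more likely use embeddings or a proper search index.

```python
# Hypothetical catalog of task-level tools and their descriptions.
TOOLS = {
    "track_order": "Track the shipping status of a customer's order.",
    "schedule_follow_up": "Schedule a follow-up meeting with a customer who complained.",
    "refund_order": "Refund a customer's order and notify them by email.",
    "summarize_inbox": "Summarize unread email in the user's inbox.",
}


def discover_tools(task, limit=2):
    """Return up to `limit` tool names whose descriptions share the most
    words with the task, best match first. Tools with no overlap are
    never returned, which keeps the visible tool count small."""
    task_words = set(task.lower().split())
    scored = []
    for name, description in TOOLS.items():
        desc_words = {w.strip(".,'") for w in description.lower().split()}
        overlap = len(task_words & desc_words)
        if overlap:
            scored.append((overlap, name))
    # Sort by overlap (descending), then by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


if __name__ == "__main__":
    print(discover_tools("track my order"))
```

Note that the ranking depends entirely on the descriptions, which is another reason to keep iterating on them.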

Q&A

How should task-intent-based tools handle input that depends on a previous event? Enumerate the options, or break the work into subtasks or sub-agents. If a single agent requires more than 40 tools, its scope is too broad. ▶ Watch (14:06)

Notable Quotes

Don’t do it. It’s not going to work. Sam Partee · ▶ Watch (1:32)

chaining is the hardest thing. Sam Partee · ▶ Watch (4:00)

the description quality, the quality of the description and iterating on the description is what actually has a 10x lever Sam Partee · ▶ Watch (9:12)

tool count cliff is real. Sam Partee · ▶ Watch (12:41)

Key Takeaways

Auto-generating tools from API endpoints fails because LLMs need a different abstraction.
Task-intent oriented tools reduce chaining failures and improve selection accuracy.
Tool descriptions are a 10x lever on which tool the model selects; iterate on them.

About the Speaker

Sam Partee is the CTO and co-founder of Arcade.dev. Before starting Arcade, Sam led the applied AI team at Redis responsible for the vector database offering. He is an avid OSS developer and has contributed to projects like Langchain, LlamaIndex, Chapel, DeterminedAI, and others.
