The Tool Abstraction Problem

▶ Watch (0:02)

Sam Partee has built north of 10,000 tools. He started before MCP existed, with a custom Open Execution Protocol that cared only about tools, not prompts or resources. The early challenge was taking output from GPT-3, which was not guaranteed to be valid JSON, parsing it correctly, and ensuring the tool result was right. He calls this the tool abstraction problem. It applies to resources and prompts too, but differently.
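The parsing half of that problem can be sketched as follows. This is a minimal illustration, not the Open Execution Protocol itself: the function name `parse_tool_call` and the `{"tool": ..., "arguments": {...}}` shape are assumptions made for the example.

```python
import json
import re
from typing import Any, Optional


def parse_tool_call(raw: str) -> Optional[dict[str, Any]]:
    """Try to recover a {"tool": ..., "arguments": {...}} object from raw model text.

    The expected shape is a hypothetical convention chosen for this sketch.
    """
    # First attempt: the whole output is valid JSON.
    candidates = [raw]
    # Fallback: take the outermost {...} span in case the model wrapped it in prose.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        # Validate the shape before anything is executed.
        if (
            isinstance(data, dict)
            and isinstance(data.get("tool"), str)
            and isinstance(data.get("arguments"), dict)
        ):
            return data
    # Nothing usable was found; the caller decides whether to retry or give up.
    return None
```

Returning `None` instead of raising keeps the retry decision with the caller, which is one reasonable choice when malformed output is an expected case rather than an exception.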

APIs and Tools Serve Different Audiences

▶ Watch (1:24)

Auto-generating tools from API endpoints fails. APIs are built for other programmers, not for large language models. An LLM has to read the tool definitions, understand them, and produce the right call. Partee gave a concrete example: “Find the customer who complained last week and schedule a follow-up.” That request needs five endpoints, and the model rarely realizes it must call get user ID first. The better abstraction is one tool that covers the whole sequence (in Partee’s example, find calendar and submit complaint). With that change, accuracy jumps significantly.
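A task-level tool of this kind might look like the sketch below. Everything in it is hypothetical: the in-memory `CUSTOMERS`, `COMPLAINTS`, and `CALENDAR` structures stand in for the separate API endpoints, and the function name is invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical in-memory stand-ins for the endpoint-level API calls.
CUSTOMERS = {"c1": "Dana", "c2": "Lee"}
COMPLAINTS = [
    {"customer_id": "c2", "filed": date(2024, 5, 6)},
    {"customer_id": "c1", "filed": date(2024, 3, 1)},
]
CALENDAR: list[dict] = []


@dataclass
class FollowUp:
    customer_name: str
    scheduled_for: date


def schedule_followup_for_recent_complaint(
    today: date, days_back: int = 7
) -> Optional[FollowUp]:
    """Find the most recent complaint in the window and book a follow-up for the next day."""
    cutoff = today - timedelta(days=days_back)
    recent = [c for c in COMPLAINTS if cutoff <= c["filed"] <= today]
    if not recent:
        return None
    latest = max(recent, key=lambda c: c["filed"])
    # The "get user" lookup happens here, so the model never has to plan it.
    name = CUSTOMERS[latest["customer_id"]]
    slot = today + timedelta(days=1)
    # The scheduling step is also internal to the tool.
    CALENDAR.append({"customer": name, "when": slot})
    return FollowUp(name, slot)
```

The model sees one tool with one intent, while the ordering of the underlying lookups lives in ordinary code where it can be tested.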

Chaining Is the Hardest Problem

▶ Watch (4:01)

Every major evaluation paper since Apple’s ToolSandbox shows the same result: a task that requires calling six tools has over a 50% chance of failure. Chaining makes it worse. NESTFUL, the tool composition papers, and the Berkeley Function-Calling Leaderboard all agree that chaining is the hardest thing. Partee’s solution: model tools around tasks and intents. An agent’s to-do list maps naturally onto task-oriented tools. A task-level tool such as “track this order” works better than an endpoint-level “get user.”
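One way to see why failure compounds is a back-of-the-envelope calculation. The sketch below assumes each call succeeds independently with the same probability, which is a simplification introduced here and not a claim from the talk or the cited papers.

```python
def chain_success(per_call_success: float, num_calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independent calls."""
    return per_call_success ** num_calls


def break_even_success(num_calls: int) -> float:
    """Per-call success rate at which the whole chain succeeds exactly half the time."""
    return 0.5 ** (1 / num_calls)


if __name__ == "__main__":
    for p in (0.99, 0.95, 0.89):
        print(f"per-call {p:.2f} -> six-call chain {chain_success(p, 6):.3f}")
    print(f"break-even per-call rate for six calls: {break_even_success(6):.3f}")
```

Under that independence assumption, a per-call success rate below roughly 0.89 is enough to push a six-call chain under 50%.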

Description Quality Is a 10x Lever

▶ Watch (8:58)

Arcade ran over 20,000 evals. The finding: description quality is a 10x lever on returns. Descriptions influence model selection more than tool names or actual function use. Context window position matters: schemas sit near the bottom, so the description is the last thing the model processes before deciding. Keep descriptions under 600 words, start with an action verb, and write a short task-intent sentence. After optimizing only descriptions, nightly evals showed a 10x reduction in errors.
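Two of these guidelines are mechanical enough to check automatically. The sketch below is a hypothetical lint helper, not Arcade's tooling: the function name and the short verb list are assumptions, and the check says nothing about whether a description actually improves eval results.

```python
MAX_WORDS = 600  # limit quoted in the guidelines above

# Hypothetical, deliberately small verb list; extend it for your own tools.
ACTION_VERBS = {
    "get", "find", "create", "send", "schedule",
    "track", "update", "delete", "list", "search",
}


def lint_description(description: str) -> list[str]:
    """Return a list of problems found in a tool description (empty if none)."""
    words = description.split()
    if not words:
        return ["description is empty"]
    problems = []
    # "Under 600 words" means 600 or more is already too long.
    if len(words) >= MAX_WORDS:
        problems.append(f"too long: {len(words)} words (limit is under {MAX_WORDS})")
    first = words[0].strip(".,:;").lower()
    if first not in ACTION_VERBS:
        problems.append(f"does not start with a known action verb: {words[0]!r}")
    return problems
```

For example, `lint_description("Track an order and return its shipping status.")` returns an empty list, while a description that opens with "This tool" is flagged. A check like this could run alongside the evals so that description edits stay within the guidelines as they are iterated.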

Progressive Discovery and the Tool Count Cliff

▶ Watch (12:41)

The tool count cliff is real. Since Apple’s ToolSandbox, the evidence has been that more than 20 tools confuse an agent. Progressive discovery introduces context over time, but it does not solve the whole problem. Partee’s advice: move composition logic inside the tools themselves. Make tools task-intent enabled functions. Chaining inside tools is the right abstraction. Descriptions remain the 10x lever, so do not write them once and forget them.
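One possible reading of progressive discovery is sketched below: the agent asks for tools relevant to its current step and only a capped subset is placed in context. The `Tool` class, the `discover_tools` function, and the word-overlap ranking are all assumptions for illustration; a real system would likely use a stronger relevance signal.

```python
from dataclasses import dataclass

MAX_VISIBLE_TOOLS = 20  # threshold quoted in the section above


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def discover_tools(
    catalog: list[Tool], query: str, limit: int = MAX_VISIBLE_TOOLS
) -> list[Tool]:
    """Return up to `limit` tools whose name or description shares a word with the query."""
    query_words = set(query.lower().split())

    def score(tool: Tool) -> int:
        # Naive relevance: count words shared between the query and the tool text.
        text = f"{tool.name.replace('_', ' ')} {tool.description}".lower()
        return len(query_words & set(text.split()))

    ranked = sorted(catalog, key=score, reverse=True)
    # Drop tools with no overlap at all, then cap what the agent sees.
    return [t for t in ranked if score(t) > 0][:limit]
```

This only limits what is visible at each step. It does not remove the need to chain, which is why the advice above still favors moving composition inside the tools.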

Q&A

How should task intent-based tools handle input that depends on a previous event? Enumerate the options, and break the work into subtasks or sub-agents. If a single agent requires more than 40 tools, the agent’s scope is too broad. ▶ Watch (14:06)

Notable Quotes

Don’t do it. It’s not going to work. Sam Partee · ▶ Watch (1:32)

chaining is the hardest thing. Sam Partee · ▶ Watch (4:00)

the description quality, the quality of the description and iterating on the description is what actually has a 10x lever Sam Partee · ▶ Watch (9:12)

tool count cliff is real. Sam Partee · ▶ Watch (12:41)

Key Takeaways

  • Auto-generating tools from API endpoints fails because LLMs need a different abstraction.
  • Task-intent oriented tools reduce chaining failures and improve selection accuracy.
  • Tool descriptions have a 10x lever on model selection; iterate on them.

About the Speaker

Sam Partee is the CTO and co-founder of Arcade.dev. Before starting Arcade, Sam led the applied AI team at Redis responsible for the vector database offering. He is an avid OSS developer and has contributed to projects like LangChain, LlamaIndex, Chapel, DeterminedAI, and others.