Building ChatGPT Apps: Principles for a New Kind of Interface - Elliot Garreffa, Ghost Team

by Elliot Garreffa

Day 1 · Broadway Ballroom South (6th Floor) · Apps and Agents · 23 min · 4 min read

Applied AI

TL;DR

Elliot Garreffa shared six principles for designing ChatGPT and MCP apps, backed by data from 316 apps and 2,100 tool descriptions. 57% of apps use negative constraints in tool descriptions. A 60% difference exists between staging tests and live ChatGPT behavior. One in seven app invocations fails. Ghost Team's platform tracks daily app store changes.

From Fragmented Web to Intent-Based Web

▶ Watch (0:02)

Users no longer search the internet for a tool and navigate its UI. They prompt, and the tool comes to them. Garreffa called this the shift to an “intent-based web.” The ChatGPT App Store grew slowly at first, with a flat period in early February. In the last 10 days before the talk, 96 apps went live from major brands like Statista, Manpower, and Mintel. Ghost Team’s platform refreshed app store data daily and analyzed 316 apps and 2,100 tool descriptions to find patterns.

▶ Watch (4:26)

A ChatGPT app has three parts. The model acts as an orchestration layer with full context and memory. It calls public tools exposed by the MCP server, which handles authentication, security, and business logic. The widget is a sandboxed iframe that pushes private data back into the conversation. Garreffa stressed that the widget is a one-way arrow: the model cannot read what happens inside it. This protects private data for enterprises like Statista that do not want their data exposed for training.
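The one-way split described above can be illustrated with a minimal sketch. It assumes the server decides which fields the model may see and which go only to the widget. The field names and the `split_tool_result` helper are hypothetical and are not taken from the talk or from any SDK.

```python
# Hypothetical field names: a real app would define its own schema.
PRIVATE_FIELDS = {"raw_rows", "source_license"}


def split_tool_result(record: dict) -> tuple[dict, dict]:
    """Return (model_view, widget_view).

    The widget receives the full record; the model receives only
    the fields that are not marked private.
    """
    model_view = {k: v for k, v in record.items() if k not in PRIVATE_FIELDS}
    widget_view = dict(record)  # shallow copy so callers cannot mutate the input
    return model_view, widget_view


# Placeholder record, invented for illustration.
record = {
    "title": "Placeholder statistic",
    "summary": "Short description the model may quote.",
    "raw_rows": [[2022, 1.0], [2023, 2.0]],
    "source_license": "internal",
}
model_view, widget_view = split_tool_result(record)
```

Here `model_view` carries only the title and summary, while `widget_view` keeps the raw rows for rendering inside the iframe.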

Principle One: Start with Goals and Intents

▶ Watch (7:41)

Every tool flows from a defined user intent. Garreffa advised companies to accept that few people ask ChatGPT to book an Uber ride today; the payoff of building now is learning the early ecosystem. Design should orient around the intent a user types into ChatGPT, which differs from web or mobile queries. Build tool sets by reverse engineering those prompts, persona by persona. Garreffa recommended Fractal as a no-code tool that turns user requests into intents and then builds the tools around them.

Tool Descriptions Are the New System Prompts

▶ Watch (9:18)

Over 57% of apps use negative constraints inside tool descriptions. Uber’s tool description tells the model not to hallucinate a pickup or drop-off location. Statista’s stops the model from adding its own unverified research to statistics. Garreffa called tool descriptions the “new SEO.” Invocations fail at an average rate of 14%, or roughly one call in seven. Apps with fewer, better tools have higher success rates because the model has less complexity to reason through. Keep apps simple and serve a refined intent.
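As a rough illustration, a tool description with negative constraints might look like the sketch below. The tool follows the general name/description/inputSchema shape that MCP tools use, but the tool name and wording are invented for this example and are not Uber's or Statista's actual text. The `negative_constraints` helper is a crude keyword heuristic, not the method Ghost Team used for its analysis.

```python
import re

# Hypothetical tool definition; the description wording is illustrative only.
SEARCH_STATISTICS_TOOL = {
    "name": "search_statistics",
    "description": (
        "Search the statistics catalog for figures matching the user's topic. "
        "Do not add figures from your own knowledge. "
        "Never present a statistic that this tool did not return."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Phrases that usually signal a prohibition.
NEGATIVE_PATTERN = re.compile(
    r"\b(do not|don't|never|must not|avoid)\b", re.IGNORECASE
)


def negative_constraints(description: str) -> list[str]:
    """Return the sentences in a description that forbid something."""
    sentences = re.split(r"(?<=[.!?])\s+", description.strip())
    return [s for s in sentences if NEGATIVE_PATTERN.search(s)]


found = negative_constraints(SEARCH_STATISTICS_TOOL["description"])
```

For this description, `found` holds the two prohibiting sentences and skips the first sentence, which only states what the tool does.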

Let the Model Do the Work with Multi-Tool Chains

▶ Watch (11:24)

Simple single-tool apps work for most use cases. For complex ones, chain tools together and let the model decide. Garreffa showed a Statista example. A search-statistics tool returns candidate data. Instead of a scoring system picking one result, a second chart tool receives all candidates. The model, using its understanding of intent and memory, picks the most relevant statistic to display. Booking.com could use the same approach: return 50 hotels, then use context and memory to show only the three best options.
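The two-tool chain can be sketched roughly as follows, assuming a search tool that returns every candidate unranked and a chart tool that takes the full list plus the identifier the model chose. The function names, the `Statistic` type, and the catalog entries are all hypothetical placeholders, not Statista's implementation or data.

```python
from dataclasses import dataclass


@dataclass
class Statistic:
    id: str
    title: str
    value: float


# Placeholder catalog standing in for a real data source.
CATALOG = [
    Statistic("s1", "Placeholder: smartphone shipments by year", 1.0),
    Statistic("s2", "Placeholder: smartphone market share by vendor", 2.0),
    Statistic("s3", "Placeholder: laptop shipments by year", 3.0),
]


def search_statistics(query: str) -> list[Statistic]:
    """First tool: return every matching candidate, with no server-side ranking."""
    words = query.lower().split()
    return [s for s in CATALOG if any(w in s.title.lower() for w in words)]


def render_chart(candidates: list[Statistic], selected_id: str) -> dict:
    """Second tool: receives all candidates plus the one the model selected."""
    by_id = {s.id: s for s in candidates}
    if selected_id not in by_id:
        raise ValueError(f"{selected_id!r} is not among the candidates")
    chosen = by_id[selected_id]
    return {
        "title": chosen.title,
        "value": chosen.value,
        "alternatives": len(candidates) - 1,
    }


candidates = search_statistics("smartphone")
# In a real app the model would supply selected_id from conversation context.
chart = render_chart(candidates, selected_id="s2")
```

The design point is that no scoring function appears on the server side: the choice of `selected_id` is left to the model.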

Design for Conversation, Not One-Shot Prompts

▶ Watch (15:04)

Users refine requests over multiple turns. “Show me hotels in Rome for next week” becomes “make it cheaper,” then “what about the week after.” Only 7% of apps proactively design for this. Expedia’s tool description instructs the model to treat every follow-up as a new search intent. It tells the model to “call the tool again using the updated parameters” rather than answering from general knowledge. Garreffa recommended building this multi-turn handling into tool descriptions.
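A minimal sketch of this multi-turn pattern follows. Only the quoted phrase "call the tool again using the updated parameters" comes from the talk; the rest of the description text, the parameter names, the dates, and the `apply_follow_up` helper are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical description for a generic hotel search tool.
HOTEL_SEARCH_DESCRIPTION = (
    "Search hotels by city, check-in date, and optional maximum nightly price. "
    "Treat every follow-up as a new search intent: "
    "call the tool again using the updated parameters "
    "instead of answering from general knowledge."
)


def apply_follow_up(previous: dict, changes: dict) -> dict:
    """Build the next call's parameters without mutating the previous ones."""
    return {**previous, **changes}


# Turn 1: "Show me hotels in Rome for next week" (example date is arbitrary).
first = {"city": "Rome", "check_in": date(2025, 6, 2), "max_price": None}

# Turn 2: "make it cheaper" (the price cap is a made-up value).
cheaper = apply_follow_up(first, {"max_price": 150})

# Turn 3: "what about the week after"
week_after = apply_follow_up(
    cheaper, {"check_in": cheaper["check_in"] + timedelta(days=7)}
)
```

Each turn produces a complete parameter set, so every follow-up results in a fresh tool call that still carries the earlier refinements.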

Test Live in ChatGPT, Not Just Staging

▶ Watch (16:49)

Garreffa found a 60% difference between staging tests and live behavior. Free users in long conversations get degraded models, and success rates drop. Casual real-world prompts failed seven times more often than structured eval prompts. His advice: test at scale with realistic prompts aligned to the target persona. Test directly in the ChatGPT client. Ghost Team’s platform monitors apps live and runs conversational testing optimization agents to close the gap.
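One way to track this gap is to log each test run with its prompt style and compare failure rates per style. The sketch below assumes a simple list of (style, succeeded) pairs; the run log is made-up placeholder data and does not reproduce the figures from the talk.

```python
from collections import defaultdict


def failure_rates(results):
    """Compute the failure rate per prompt style.

    results: iterable of (prompt_style, succeeded) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for style, succeeded in results:
        totals[style] += 1
        if not succeeded:
            failures[style] += 1
    return {style: failures[style] / totals[style] for style in totals}


# Made-up run log: 10 structured eval prompts and 10 casual prompts.
runs = (
    [("structured", True)] * 9
    + [("structured", False)] * 1
    + [("casual", True)] * 5
    + [("casual", False)] * 5
)
rates = failure_rates(runs)
```

Comparing `rates["casual"]` with `rates["structured"]` over a large, persona-aligned prompt set gives a concrete number to watch as tool descriptions change.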

Key Takeaways

Start with user intent, not features, and build tools backward from prompts.
Treat tool descriptions as system prompts and the new SEO for discovery.
Test live in the ChatGPT client at scale; staging results differed from live behavior by 60%.
