The Black Box of MCP App Discovery

▶ Watch (2:13)

Vincent McLeese’s team analyzed 148 live ChatGPT apps across 15,000 prompt simulations and found that 14% of direct prompts failed to invoke the app. Even a direct mention of the app does not guarantee tool invocation. After submission, developers are effectively blind: they cannot see whether the LLM chooses their tools over competitors' tools or why invocation fails. McLeese proposes three foundations to fix this: live LLM monitoring, conversational testing, and continuous optimization.

Live LLM Monitoring: Track Invocation Rates in the Wild

▶ Watch (6:49)

Developers must track invocation rates per model and per user tier. The same prompt can work on ChatGPT 5.3 yet fail on a degraded model. Free users who hit usage limits are downgraded, which reduces tool-calling quality, and McLeese found that this degradation is invisible to developers. Mentioning the app does not ensure invocation either; the model may ignore the tool if its description is unclear. Monitoring reveals these gaps.
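The per-model, per-tier tracking described above can be sketched in a few lines of Python. This is a minimal illustration, not McLeese's tooling: the `InvocationLog` record, its field names, and the sample model and tier labels are all hypothetical, and it assumes you already have some way to log whether your tool was invoked for each conversation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InvocationLog:
    """One observed conversation (hypothetical record shape)."""
    model: str       # model label as reported by your own logging
    user_tier: str   # e.g. "free" or "paid" (assumed labels)
    invoked: bool    # True if the app's tool was actually called


def invocation_rates(logs):
    """Return {(model, user_tier): invocation rate between 0 and 1}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for log in logs:
        key = (log.model, log.user_tier)
        totals[key] += 1
        if log.invoked:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Made-up sample data, for illustration only.
sample_logs = [
    InvocationLog("model-a", "paid", True),
    InvocationLog("model-a", "paid", True),
    InvocationLog("model-a", "free", True),
    InvocationLog("model-a", "free", False),
    InvocationLog("model-b", "free", False),
    InvocationLog("model-b", "free", False),
]

rates = invocation_rates(sample_logs)
for (model, tier), rate in sorted(rates.items()):
    print(f"{model:8s} {tier:5s} {rate:.0%}")
```

Grouping by the (model, tier) pair rather than by model alone is what would surface a case where free-tier users on a downgraded model see a much lower invocation rate than paid users.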

Conversational Testing: How Real Users Prompt

▶ Watch (9:54)

Most teams test with golden prompts that match their tool description perfectly. Real users prompt with vague, incorrect, or mixed-language inputs. McLeese tested four personas and found a 20-point difference in invocation rate between golden prompts and non-technical users; wrong-tool invocations also increased. Developers must simulate messy conversations to see whether their app survives. He recommends creating hundreds of messy model conversations to extrapolate real usage.
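As a rough sketch of how a persona comparison like this could be scored, the Python below computes outcome rates per persona and the gap in percentage points against a golden-prompt baseline. The outcome labels, function names, and sample numbers are assumptions made for illustration; they are not the personas or data from the talk.

```python
from collections import Counter

# Assumed outcome labels for each simulated conversation.
OUTCOMES = ("correct_tool", "wrong_tool", "no_tool")


def outcome_rates(outcomes):
    """Return the share of each outcome label in a list of results."""
    if not outcomes:
        raise ValueError("need at least one simulated conversation")
    counts = Counter(outcomes)
    return {name: counts[name] / len(outcomes) for name in OUTCOMES}


def invocation_gap_points(baseline, persona):
    """Gap in correct-tool invocation rate, in percentage points."""
    base = outcome_rates(baseline)["correct_tool"]
    other = outcome_rates(persona)["correct_tool"]
    return (base - other) * 100


# Made-up results for two personas, for illustration only.
golden = ["correct_tool"] * 9 + ["no_tool"] * 1
non_technical = ["correct_tool"] * 6 + ["wrong_tool"] * 3 + ["no_tool"] * 1

gap = invocation_gap_points(golden, non_technical)
print(f"gap: {gap:.1f} points")
print(f"wrong-tool rate: {outcome_rates(non_technical)['wrong_tool']:.0%}")
```

Tracking `wrong_tool` separately from `no_tool` matters because the two failures call for different fixes: one suggests the description overlaps with another tool, the other that the model did not recognize the request at all.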

Continuous Optimization: Tool Descriptions as SEO

▶ Watch (12:10)

Stanford-led research found that nearly 100% of tools have quality defects; half fail to describe what they do. For Statista, a single word change boosted the tool success rate by nearly 6%. Tool descriptions must be clear and must differentiate the tool from competitors. Optimization is not a one-time task: models change, and so does user behavior. Developers must continuously monitor and iterate, treating tool metadata like SEO in 2005.
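One way to make this iteration loop measurable is to run the same prompt set against two versions of a tool description and compare success rates. The sketch below is an assumption about how such a comparison might look, with hypothetical function names and made-up counts. It reports both the absolute change in percentage points and the relative change, since a headline figure such as "6%" can mean either.

```python
def success_rate(successes, trials):
    """Share of trials in which the tool call succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials


def compare_variants(baseline, candidate):
    """Compare two description variants given as (successes, trials)."""
    base = success_rate(*baseline)
    cand = success_rate(*candidate)
    return {
        "baseline_rate": base,
        "candidate_rate": cand,
        # Absolute change, in percentage points.
        "lift_points": (cand - base) * 100,
        # Relative change versus the baseline (None if baseline is 0).
        "relative_change": (cand - base) / base if base else None,
    }


# Made-up counts for two description variants, for illustration only.
result = compare_variants(baseline=(140, 200), candidate=(150, 200))
print(f"lift: {result['lift_points']:.1f} points")
print(f"relative change: {result['relative_change']:.1%}")
```

With small samples a difference of a few points can be noise, so a comparison like this is more trustworthy when repeated across many prompts and rerun whenever the underlying model changes.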

Q&A

Does ChatGPT still support organic app discovery? Organic discovery was active at launch but has since been turned off. Currently users must connect the app and mention it, though OpenAI has confirmed that organic discovery is on the roadmap. ▶ Watch (17:14)

Notable Quotes

14% of the time ChatGPT apps did not invoke for their really obvious and direct prompts. Vincent McLeese · ▶ Watch (6:13)

20-point difference between a golden prompt set user and a non-technical user. Vincent McLeese · ▶ Watch (11:11)

single word increased the tool success rate by nearly 6%. Vincent McLeese · ▶ Watch (12:46)

the model loves their own training data. Vincent McLeese · ▶ Watch (8:27)

Key Takeaways

  • Use live LLM monitoring to track invocation rates across models and user tiers.
  • Test with messy, real-world prompts to close the 20-point invocation gap.
  • Continuously optimize tool descriptions; a single word can improve success by nearly 6%.

About the Speaker(s)

Vincent McLeese is the Product and Tech Lead at Ghost Team, an MCP Apps agency. He leads the development of appsdiscoverability.com, a platform developed by Ghost Team to help MCP apps become more discoverable. He is a multiple-time founder with a background in tech strategy from Accenture.