Welcome back to Vanishing Gradients! This edition dives into the critical challenges and opportunities shaping the future of AI and ML, with lessons from real-world deployments, leading experts, and practical frameworks.
Here’s what’s inside:
🤖 Why AI agents fail in production—and how to fix it
🍦 Scaling data science at Airbnb with Elena Tej Grewal
🛠️ The future of developer tools featuring modular AI systems
🔍 Debugging and evaluating LLM systems with Hamel Husain
🎙️ Ferras Hamad on scalable ML systems at DoorDash, Netflix, and more
🚀 MLOps with Databricks – learning from Maria Vechtomova and Başak Eskili
📖 Reading time: 8-10 minutes
Let’s dive in.
Why AI Agents Fail in Production—and How to Fix It 🤖
What’s the most common way AI agents fail in production? They get lost—and that failure cascades. A small error compounds into bigger problems as the agent keeps acting, creating systemic issues.
In the latest episode of the Vanishing Gradients podcast, I spoke with Alex Strick van Linschoten (ZenML), who has analyzed over 400 real-world LLM deployments. We explored:
🏗️ Why structured workflows still outperform autonomous agents in production
⚠️ How to prevent cascading failures before they escalate (sketched after this list)
🛠️ What companies like Anthropic and Amazon have learned from scaling LLM systems
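To make the cascading-failure point concrete, here’s a minimal sketch of the kind of circuit breaker a structured workflow gives you: every agent step is validated before the next one runs, and the loop halts after a bounded failure budget instead of letting errors compound. Everything here (`agent_step`, `validate`, the limits) is illustrative, not from the episode:

```python
# A minimal sketch of a bounded, validated agent loop.
# The step and validation functions are stand-ins for your own logic.

def agent_step(state: dict) -> dict:
    """Placeholder for one LLM/tool call that advances the task."""
    return {**state, "steps": state["steps"] + 1}

def validate(state: dict) -> bool:
    """Placeholder check that the agent is still on track
    (e.g., schema checks, groundedness checks)."""
    return state["steps"] <= 10

def run_agent(max_steps: int = 10, max_failures: int = 2) -> dict:
    state = {"steps": 0}
    failures = 0
    for _ in range(max_steps):  # hard cap: the agent can never run away
        state = agent_step(state)
        if not validate(state):
            failures += 1
            if failures >= max_failures:
                # Stop and escalate instead of letting the error compound.
                raise RuntimeError(f"Agent halted after {failures} failed checks")
        if state.get("done"):
            break
    return state

if __name__ == "__main__":
    print(run_agent())
```

The point is structural: the workflow, not the model, decides when to stop.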
📺 Watch the clip here:
For the full episode, tune in to the Vanishing Gradients podcast and learn how to build more reliable AI systems. You can also watch the livestream here on YouTube, or find the podcast on your app of choice.
Scaling Data Science at Airbnb—and Beyond 🍦
How do you scale a data science team to 200 over seven years while embedding analytics and machine learning into a company’s DNA? Dr. Elena Tej Grewal accomplished just that at Airbnb.
In the latest episode of High Signal, Elena shares how she built systems that powered product decisions, flagged risky users, and scaled trust in data across teams. By embedding machine learning into fraud detection and other processes, she helped create a data-driven culture that grew with the company.
🎥 Watch the clip here:
What we discussed:
🔍 How simple heuristics addressed early challenges in fraud detection (a toy example follows this list)
⚖️ Trade-offs between machine learning models and quick solutions
🍦 The surprising parallels between data science and running an ice cream shop
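As a toy illustration of the heuristics-versus-models trade-off (my example, not Elena’s actual rules), a handful of transparent rules can catch a surprising share of risky behavior before you ever train a model:

```python
# Toy rule-based risk flagger: illustrative thresholds only.

def flag_risky_user(account_age_days: int, bookings_last_24h: int,
                    payment_failures: int) -> bool:
    """Return True if simple heuristics mark the user for review."""
    if account_age_days < 1 and bookings_last_24h >= 3:
        return True  # brand-new account with a burst of activity
    if payment_failures >= 2:
        return True  # repeated failed payments
    return False

print(flag_risky_user(account_age_days=0, bookings_last_24h=5,
                      payment_failures=0))  # True
```

Rules like these are easy to audit and fast to ship; the cost is maintenance as fraudsters adapt, which is when an ML model starts to pay off.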
Today, Elena teaches data science at Yale, applying her expertise to environmental challenges, while also running a data-driven ice cream shop—proving that data principles can thrive in unconventional spaces.
For the full episode, listen to High Signal and learn from Elena’s journey of scaling data science with impact. You can also watch the episode here on YouTube, or find the podcast on your app of choice.
The Future of Developer Tools: Open-Source AI Code Assistants and Modular AI Systems 🛠️
What if your AI coding assistant let you build it your way—choosing the LLMs, workflows, and tools that best fit your needs?
Join me in conversation with Ty Dunn, co-founder of Continue, for a live-streamed episode of Vanishing Gradients, where we’ll explore how open-source AI code assistants and modular AI systems are transforming developer workflows.
📅 Date: Tuesday, January 21
🕘 Time: 3:00 PM PT
📍 Where: YouTube (Register below)
We’ll discuss:
🧩 Modularity for Developers: Building AI tools like LEGO blocks, enabling custom workflows and LLM selection (see the sketch after this list).
💻 Seamless Workflow Integration: Why AI tools must fit into environments like VS Code without disruptive changes.
⚙️ Balancing Automation and Control: The importance of human-in-the-loop systems for reliable, customizable AI assistants.
🚀 Real-World Developer Benefits: How modular systems make coding faster, more flexible, and more productive.
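To give a flavor of what LEGO-block modularity can look like in code (my sketch, not Continue’s actual API), here is the kind of interface that lets you swap LLM backends without touching the rest of the tool:

```python
# A minimal sketch of pluggable LLM backends behind one interface.
from typing import Protocol

class LLMBackend(Protocol):
    def complete(self, prompt: str) -> str: ...

class LocalModel:
    def complete(self, prompt: str) -> str:
        return f"[local] echo: {prompt}"  # stand-in for a local model call

class HostedModel:
    def complete(self, prompt: str) -> str:
        return f"[hosted] echo: {prompt}"  # stand-in for a hosted API call

def code_assistant(backend: LLMBackend, task: str) -> str:
    # The assistant logic is written once, against the interface,
    # so backends can be swapped per project, per task, or per user.
    return backend.complete(f"Write code to: {task}")

print(code_assistant(LocalModel(), "parse a CSV"))
print(code_assistant(HostedModel(), "parse a CSV"))
```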
About Ty Dunn
Ty is the co-founder of Continue, an open-source AI code assistant empowering developers to build smarter workflows. With a focus on flexibility and developer-first design, Ty ensures AI systems act as reliable partners, not black-box tools.
👉 Register for free here
Don’t miss this chance to learn how modular AI systems are shaping the future of developer tools!
Look at Your Data: Debugging, Evaluating, and Iterating on Generative AI Systems 🔍
Everyone wants to build generative AI products that deliver real business value. But here’s the catch: most systems fall short because teams don’t know where to start when things go wrong.
Join me and Hamel Husain for a live-streamed fireside chat where we’ll tackle the critical steps to debugging, evaluating, and improving generative AI systems.
📅 Date: Tuesday, January 28
🕦 Time: 4:30 PM PT
📍 Where: YouTube (Register below)
In this session, we’ll cover:
🛠️ Error Analysis: Pinpointing the biggest pain points in LLM workflows.
📊 Evaluation Frameworks: Building evaluations that align with product goals (a minimal sketch follows this list).
🔍 Curiosity-Driven Data Exploration: How to use data and traces to iterate faster.
🧱 The Foundation of Reliability: Why debugging is the cornerstone of scalable AI systems.
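To ground the evaluation-framework idea, here’s a minimal sketch of a task-specific eval loop: a small labeled set, a stand-in model function, and per-case checks tied to the product goal. Everything here (`answer_question`, the cases) is illustrative:

```python
# A minimal task-specific eval harness: swap in your real model call.

def answer_question(question: str) -> str:
    """Stand-in for an LLM call."""
    return "Paris" if "France" in question else "I don't know"

CASES = [
    {"question": "What is the capital of France?", "must_contain": "Paris"},
    {"question": "What is the capital of Atlantis?", "must_contain": "don't know"},
]

def run_evals() -> float:
    passed = 0
    for case in CASES:
        output = answer_question(case["question"])
        ok = case["must_contain"].lower() in output.lower()
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['question']!r} -> {output!r}")
    return passed / len(CASES)

print(f"pass rate: {run_evals():.0%}")
```

The pass rate becomes a number you can track across prompt and model changes, which is what turns iteration into a discipline rather than guesswork.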
About Hamel Husain
Hamel Husain (Parlance Labs, ex-GitHub, Airbnb, DataRobot) brings years of experience helping teams scale LLM-powered systems. His focus on robust error analysis and pragmatic debugging has delivered meaningful improvements for AI engineering teams.
💡 Why Attend?
This session is for anyone who has felt stuck trying to improve an LLM application. Whether you’re debugging multi-turn conversations and agentic systems or designing evaluation frameworks, you’ll leave with actionable strategies to iterate and build systems that deliver.
📺 Bonus: This conversation was originally part of my Building LLM Applications course but is now free for everyone by popular demand!
👉 Register for free here
Don’t miss this chance to sharpen your debugging and evaluation skills!
From Infrastructure to Application: Lessons in Building Scalable ML Systems 🎙️
On January 16, I hosted a fireside chat with Ferras Hamad, Machine Learning Leader at DoorDash, whose career spans Netflix, Meta, Uber, and Yahoo. We explored the challenges and opportunities in building and scaling machine learning systems that bridge infrastructure and application layers.
Key topics we discussed included:
🛠️ From Infrastructure to Business Value: How companies like Netflix and Uber approach the ML lifecycle, from foundational infrastructure to delivering outcomes that drive impact.
🔄 The Convergence of ML Tools: How platforms are evolving to support both advanced users and non-specialists.
🤖 LLMs and In-Context Learning: Where large language models fit into traditional ML systems and how they’re reshaping production environments.
🤝 Team Collaboration: Why cross-functional relationships between data scientists, engineers, and platform teams are essential for success.
🚀 Skill Sets for the Future: The rise of “full-stack” ML professionals who can bridge roles like data scientist, ML engineer, and software engineer.
🎥 Clip Highlight: Ferras shares how LLMs integrate into traditional ML systems, including:
🌟 Where LLMs excel, like tagging long-tail items
🛡️ Using traditional models as guardrails for LLM outputs (sketched after this list)
🔍 Why understanding LLM limitations is critical to integration
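Here’s a minimal sketch of that guardrail pattern, with a cheap traditional model or rule screening LLM outputs before they ship; the functions and tags are stand-ins, not Ferras’ actual stack:

```python
# Guardrail pattern: a traditional model/rule screens LLM outputs.

def llm_tag_item(item_name: str) -> str:
    """Stand-in for an LLM tagging a long-tail catalog item."""
    return "dessert" if "cake" in item_name.lower() else "unknown"

ALLOWED_TAGS = {"dessert", "entree", "beverage"}  # e.g., from a trained taxonomy

def guardrail(tag: str) -> bool:
    """Stand-in for a traditional classifier/rule validating the LLM's tag."""
    return tag in ALLOWED_TAGS

def tag_with_guardrail(item_name: str) -> str:
    tag = llm_tag_item(item_name)
    if guardrail(tag):
        return tag
    return "needs_review"  # fall back instead of trusting the LLM blindly

print(tag_with_guardrail("chocolate lava cake"))  # dessert
print(tag_with_guardrail("mystery item"))         # needs_review
```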
Watch the clip here:
For the full conversation, where we go beyond LLMs to cover lessons from Ferras’ incredible career, check out the Outerbounds Fireside Chat.
Diving into MLOps with Databricks! 🚀
I’m excited to be taking Maria Vechtomova and Başak Eskili’s course, “End-to-end MLOps with Databricks,” starting at the end of the month!
After spending so much time focused on generative AI, I’m ready to revisit the foundational, supremely valuable operations side of machine learning. This is an opportunity to learn from some of the best about how to build robust, reliable systems, and to dig deeper into Databricks, a powerful platform I haven’t yet invested enough time in.
Maria and Başak are leaders I’ve admired for their expertise and community building, and I’m thrilled to join this cohort.
If you’re interested in joining me, here’s a link for 10% off 💫
Recommended Reads: LLM Evaluation Essentials 📚
If you’re working on AI products, understanding how to evaluate large language models (LLMs) is critical. I recently shared a post highlighting the top three resources to get started, and I recommend reading them in this order:
Task-Specific LLM Evals that Do & Don’t Work by Eugene Yan. A practical breakdown of what works—and what doesn’t—when designing task-specific evaluations.
Your AI Product Needs Evals by Hamel Husain. Hamel outlines why evaluation is central to building AI systems that deliver real business value.
Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences by Shreya Shankar et al. A thought-provoking paper on aligning automated evaluation methods with human preferences.
Read these, internalize the principles, and apply them while building. Your AI products—and potentially your bank account—will thank you. 🤑
If you’ve come across other great resources on LLM evaluation, I’d love to hear about them—reply and let me know!
Thank you for reading Vanishing Gradients! I hope this edition inspires new ideas and strategies for your AI/ML journey.
As always, I’d love to hear your thoughts—what resonated, what you’d like to see more of, or anything else on your mind. Reply to this email, or connect with me on LinkedIn or Twitter.
If you’re enjoying Vanishing Gradients, consider subscribing to the Vanishing Gradients calendar on lu.ma to stay updated on livestreams, workshops, and events.
Until next time,
Hugo