Stress-Testing LLMs: Evaluation Frameworks and Real-World Agents
Practical AI Playbook: Scaling, Evaluating, Influencing Users
Welcome to Vanishing Gradients!
From scaling GPUs through evaluation and agents to nudging human behaviour, here’s a single scroll packed with the people doing the work:
• A new podcast with Samuel Colvin (Pydantic, Logfire) on the engineering grind from 85% demo to 100% production
• A conversation with Zachary Mueller (Hugging Face) on scaling training beyond Colab to clusters
• A High Signal episode with Elisabeth Costa (Behavioural Insights Team) on pairing ML with behavioural science to shift outcomes
• Upcoming livestream podcasts: decentralised-AI architectures (Jul 28) and GenAI systems that make business decisions (Aug 15)
• Free workshops on AI Decision Making Under Uncertainty (Jul 30) and evaluating AI agents (Sep 30)
• A SciPy 2025 rewind: “very metal” tutorials, a talk on escaping proof-of-concept purgatory, plus repos & slides
• A last-chance cohort for Hamel Husain & Shreya Shankar’s evaluation course (starts tomorrow)
Quick links below to what’s coming up, what just dropped, and how to plug in:
📺 Live Online Events
📩 Can’t make it? Register anyway and we’ll send the recordings.
→ Jul 28 — Decentralized AI: Data, Governance, and Personalization at the Edge (Live podcast with Katharine Jarmul & Joe Reis)
→ Aug 15 — Building GenAI Systems That Make Business Decisions (Live podcast with Thomas Wiecki)
→ Jul 30 — Workshop: AI-Powered Decision Making Under Uncertainty (Allen Downey & Chris Fonnesbeck)
→ Sep 30 — Workshop: Evaluating AI Agents—From Demos to Dependability (Ravin Kumar)
🎙 Podcasts & Recordings
→ Making AI Actually Production-Ready — Samuel Colvin (Pydantic, Logfire)
→ How Everyone Can Now Scale LLM Training — Zachary Mueller (Hugging Face)
→ Using ML + Behavioral Science to Drive Real Change — Elisabeth Costa (Behavioural Insights Team)
🎓 Courses
→ Jul 22–Aug 15 — AI Evals for Engineers & Technical PMs with Hamel Husain & Shreya Shankar (use this link for $1,000 off)
🎧 Making AI Actually Production-Ready (the Agentic Way)
“The amount of engineering it takes to move an AI demo from 85% to 100% production-ready is many multiples of what it took to build the demo.” — Samuel Colvin, Creator of Pydantic & Founder of Logfire
Samuel lays out exactly how to bridge that last-mile gap between a flashy LLM proof-of-concept and software you can trust in production:
→ Turn five quick 👍/👎 labels into a rubric that scores every live model call
→ Catch drift the moment your prompts or retrieval context change
→ Let a self-improving agent rewrite its own system prompt—and admit when it still lacks data
→ Measure success with business-level metrics, not vanity accuracy scores
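To make the first bullet concrete, here's a toy sketch of promoting a handful of 👍/👎 labels into a rubric that scores every live call. The criteria, checks, and trace format are all invented for illustration (not Samuel's actual setup); in practice each check might be an LLM-judge prompt rather than a string heuristic:

```python
# Hypothetical sketch: distill a few 👍/👎 labels into a reusable rubric,
# then score every live model call against it. Checks are illustrative
# stand-ins for what would usually be LLM-judge prompts.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # True = response passes this rubric item

# Criteria reverse-engineered from why the 👎 examples failed (invented here)
RUBRIC = [
    Criterion("cites_source", lambda r: "source:" in r.lower()),
    Criterion("no_hedging_filler", lambda r: "as an ai" not in r.lower()),
    Criterion("concise", lambda r: len(r.split()) <= 120),
]

def score(response: str) -> dict:
    """Score one live model call against every rubric criterion."""
    results = {c.name: c.check(response) for c in RUBRIC}
    results["pass_rate"] = sum(results.values()) / len(RUBRIC)
    return results

print(score("Source: docs. Pin the dependency to fix the build."))
```

The point is the shape, not the checks: once labels become named criteria, every production call gets the same scorecard your five thumbs did.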
🔗 Listen on your favourite platform:
→ Spotify
→ Apple Podcasts
→ YouTube
→ Show notes & other episodes
🎧 Scaling AI: From Colab to Clusters — A Practitioner’s Guide to Distributed Training and Inference
Training massive models used to demand OpenAI- or DeepMind-level resources.
In 2025, that gate is gone: clusters of 4090s, platforms such as Modal, and open-weight models like Llama 3 and Qwen put serious scale within reach of any motivated builder. 🛠️
Zachary Mueller (Hugging Face) walks through what it really takes to capitalise on this moment:
→ When to leave Colab—and how to avoid drowning in infrastructure costs
→ How Accelerate unifies training and inference across multiple GPUs
→ Why “data parallelism” is only the starting line, and where things break next
→ Field lessons from coaching solo devs up to research labs
→ The myths people still believe about distributed training
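If "data parallelism is only the starting line" is new to you, here's the core idea in a toy, pure-Python form (the loss, shards, and numbers are all made up): each replica computes gradients on its own data shard, the gradients are averaged (an "all-reduce"), and every replica applies the same update.

```python
# Toy data-parallelism sketch: per-replica gradients, averaged, one shared update.
# The model is y_hat = w, with a mean-squared-error loss; numbers are invented.
def local_gradient(weight: float, shard: list[float]) -> float:
    # d/dw of mean((w - y)^2) over this replica's shard
    return sum(2 * (weight - y) for y in shard) / len(shard)

def data_parallel_step(weight: float, shards: list[list[float]], lr: float = 0.1) -> float:
    grads = [local_gradient(weight, s) for s in shards]  # one grad per "GPU"
    avg = sum(grads) / len(grads)                        # the all-reduce
    return weight - lr * avg                             # identical update everywhere

shards = [[1.0, 2.0], [3.0, 4.0]]  # two "GPUs", each with its own mini-batch
w = data_parallel_step(0.0, shards)
print(w)
```

With Accelerate, `accelerator.prepare(model, optimizer, dataloader)` wraps your objects so this synchronization happens for you; Zach's episode covers what breaks once you go beyond this baseline.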
🔗 Listen on your favourite platform:
→ Spotify
→ Apple Podcasts
→ YouTube
→ Show notes & more episodes
🎧 Using ML + Behavioral Science to Drive Real Change
Most ML teams build models. Far fewer design systems that actually change outcomes.
Use ML to understand the problem, behavioral science to intervene, and ML again to test what works.
In this High Signal episode, Elisabeth Costa—Chief of Innovation & Partnerships at the Behavioural Insights Team—explains how ML and behavioral science can work together to shift behavior, not just predict it:
→ Combining ML + behavioral science to solve end-to-end problems
→ Why testing matters after the intervention, not just before it
→ RCTs, feedback loops, and real-world experimentation
→ Designing interventions, not just optimising predictions
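The "test after the intervention" point boils down to comparing randomized groups. A toy sketch with invented numbers (a real analysis would use proper inference tooling, not this back-of-envelope version):

```python
# Toy RCT readout: outcome rates in randomized control vs. treatment groups,
# with a crude normal-approximation 95% CI on the difference. Data is invented.
import math

control = {"n": 1000, "converted": 110}
treated = {"n": 1000, "converted": 140}

def rate(group: dict) -> float:
    return group["converted"] / group["n"]

uplift = rate(treated) - rate(control)
# Standard error for a difference of two independent proportions
se = math.sqrt(rate(control) * (1 - rate(control)) / control["n"]
               + rate(treated) * (1 - rate(treated)) / treated["n"])
print(uplift, uplift - 1.96 * se, uplift + 1.96 * se)  # estimate and ~95% CI
```

The episode's larger argument is about the loop around this calculation: ML to find where to intervene, behavioral science to design the intervention, and experiments like this to learn whether it worked.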
Watch, listen, or read:
→ Apple
→ Spotify
→ YouTube
→ More episodes & show notes
🎙️ Upcoming Livestream Podcast Recordings
→ Mon, Jul 28 @ 11 AM ET — Decentralized AI: Data, Governance, and Personalization at the Edge
Guests: Katharine Jarmul (privacy engineer & author) and Joe Reis (data-engineering educator; co-author of Fundamentals of Data Engineering)
→ Practical patterns for federated learning, on-device inference, and distributed governance
→ Why some teams intentionally choose tougher architectures for privacy and compliance
→ How user-controlled models unlock new personalization strategies
→ Managing divergence, coordination, and infra overhead in the real world
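For a taste of the federated-learning pattern on the agenda, here's a toy FedAvg-style sketch (clients, weights, and sizes invented): clients train locally and share only model weights, which a server averages weighted by each client's data size.

```python
# Toy federated-averaging sketch: raw data never leaves the client; only
# weights are aggregated, weighted by client dataset size. Real systems add
# secure aggregation, dropout handling, etc. All numbers are invented.
clients = [
    {"n_examples": 100, "weights": [0.2, 0.5]},
    {"n_examples": 300, "weights": [0.4, 0.1]},
]

def fed_avg(clients: list[dict]) -> list[float]:
    total = sum(c["n_examples"] for c in clients)
    dim = len(clients[0]["weights"])
    return [sum(c["n_examples"] * c["weights"][i] for c in clients) / total
            for i in range(dim)]

print(fed_avg(clients))  # global weights, dominated by the larger client
```

The episode digs into why teams accept the coordination overhead this implies: the data stays at the edge, which is the whole privacy point.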
🔗 RSVP here (if you can’t make it, register and we’ll share the recording)
→ Fri, Aug 15 @ 11 AM ET — Building GenAI Systems That Make Business Decisions
Guest: Thomas Wiecki (Founder, PyMC Labs; co-author of PyMC)
→ How LLM-generated survey responses hit ~90% accuracy—even across demographics
→ Building a closed-loop system where GenAI both generates and critiques product & ad ideas
→ Automating media-spend analytics with a structured agent at PyMC Labs
→ Where GenAI excels—and where statistical structure & validation still matter
→ Combining GenAI with probabilistic modelling in production
🔗 RSVP here (if you can’t make it, register and we’ll share the recording)
🎓 SciPy 2025 – Packed Rooms, Old Friends & a “Very Metal” Tutorial
Nothing beats catching up with the SciPy community in person—familiar faces, hallway chats, and two overflow rooms full of people hacking on LLMs together. The vibe was so good one participant called my workshop “very metal” 🤘 … then upgraded me to “the Kardashian of LLMs.” I’m taking both as wins.
Tutorial swap
→ Building with LLMs Made Simple — Eric Ma led, I TA’d (LlamaBot + Ollama in real Python workflows)
→ Building LLM-Powered Apps for DS & SWEs — I led, Eric TA’d (prompting → monitoring → full SDLC)
Conference talk
→ Escaping Proof-of-Concept Purgatory — designing reliable GenAI systems that hold up in production
Repos & slides
• Eric’s tutorial repo
• My tutorial repo + slides
• My talk slides
🎥 Videos are in post-production—I’ll share them once they drop. Until then, clone the repos, riff on the notebooks, and let me know if you make them even more metal.
📏 Last-Chance Cohort: “AI Evals for Engineers & Technical PMs” with Hamel Husain & Shreya Shankar (starts tomorrow)
2025 might be the year of agents, but it’s definitely becoming the year of evaluation. Hamel Husain and Shreya Shankar’s hands-on course completely reshaped how I consult, teach, and build AI systems that actually work—and this is the final time they’ll run it live.
What the course covers
→ Fundamentals of LLM evaluation and the pitfalls most teams miss
→ Reference-based, reference-free, and execution metrics—knowing when each matters
→ Systematic error analysis and failure-mode exploration
→ Implementing custom evaluators that align with business outcomes
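To preview the second bullet, here's an illustrative-only contrast between the two metric families (both checks are toys, not the course's actual evaluators): a reference-based metric compares output against a gold answer, while a reference-free metric checks properties of the output alone.

```python
# Toy contrast of two eval-metric families. Reference-based metrics need a
# gold answer; reference-free metrics only inspect the output itself.
def reference_based(output: str, gold: str) -> bool:
    # Exact match after normalization: the simplest reference-based metric
    return output.strip().lower() == gold.strip().lower()

def reference_free(output: str) -> bool:
    # Format/length checks need no gold answer at all
    return output.endswith(".") and len(output.split()) < 50

print(reference_based("Paris", "paris"), reference_free("The capital is Paris."))
```

Knowing when each family applies (and when you need execution metrics that actually run the output) is exactly the judgment the course drills.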
Every student also gets early access to the draft book Shreya is writing. Hamel and Shreya have kindly given my network over $1,000 off the list price.
→ Enroll before the doors close.
I’ll be in the cohort again! Hope to see you there :)
🛠️ Free, Hands-On Workshops (live & interactive)
→ Wed, Jul 30 @ 11 AM ET — AI-Powered Decision Making Under Uncertainty
With Allen Downey (Olin College / PyMC Labs) & Chris Fonnesbeck (Vanderbilt / PyMC Labs)
→ Build Bayesian intuition that survives messy, real-world data
→ Estimate probabilities with informative priors and hierarchical models
→ Stress-test a decision-support workflow: rare-event prediction, A/B comparisons, confidence communication
→ All in PyMC & Jupyter—run locally or on Colab
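If "informative priors" is the part that's fuzzy, here's the intuition in closed form with invented numbers (the workshop itself uses PyMC; a Beta-Binomial model just happens to have an analytic answer):

```python
# Toy Beta-Binomial update: an informative prior pulled toward observed data.
# Prior Beta(2, 8) says "we expect a low rate (~20%)"; data says 9/20 = 45%.
prior_a, prior_b = 2, 8
successes, failures = 9, 11

# Conjugate update: just add successes/failures to the prior counts
post_a, post_b = prior_a + successes, prior_b + failures
posterior_mean = post_a / (post_a + post_b)
print(posterior_mean)  # lands between the prior mean (0.2) and the data (0.45)
```

With messy real-world data there's no closed form, which is where PyMC's sampling (and Allen and Chris) come in.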
🔗 RSVP here (if you can’t make it, register and we’ll share the recording)
→ Tue, Sep 30 @ 7 PM ET — Evaluating AI Agents: From Demos to Dependability
With Ravin Kumar (DeepMind, Google, ex-Tesla) & me
→ Trace tool use and model reasoning end-to-end
→ Simulate edge cases, catch silent failures, iterate fast
→ Judge whether an agent chose the right tool, executed the right logic, explained the result clearly
→ Runs locally with Gemma 3 + Ollama—no cloud dependencies
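As a teaser for the third bullet, here's a hypothetical sketch of checking tool choice from an agent trace. The trace format and tool names are invented for illustration; the workshop builds real versions of checks like this:

```python
# Hypothetical agent-trace check: did the agent call the expected tool?
# The trace schema and tools below are invented for illustration.
trace = [
    {"step": "reason", "text": "User wants a currency conversion."},
    {"step": "tool_call", "tool": "calculator", "args": {"expr": "100*0.92"}},
    {"step": "answer", "text": "100 USD is about 92 EUR."},
]

def chose_tool(trace: list[dict], expected: str) -> bool:
    """True if any tool call in the trace used the expected tool."""
    calls = [s["tool"] for s in trace if s["step"] == "tool_call"]
    return expected in calls

print(chose_tool(trace, "calculator"))
```

Chaining checks like this over simulated edge cases is how silent failures stop being silent.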
🔗 RSVP here (if you can’t make it, register and we’ll share the recording)
Bring your laptop—both sessions are fully hands-on. If you’ve been meaning to sharpen your Bayesian chops or make your agents actually dependable, these two-hour sessions will pay for themselves many times over.
Want to Support Vanishing Gradients?
If you’ve been enjoying Vanishing Gradients and want to support my work, here are a few ways to do so:
🧑🏫 Join (or share) my AI course – I’m excited to be teaching Building LLM Applications for Data Scientists and Software Engineers again in November. If you or your team are working with LLMs and want to get hands-on, I’d love to have you. And if you know someone who might benefit, sharing it really helps. Early bird discount of $600 off until August 16.
📣 Spread the word – If you find this newsletter valuable, share it with a friend, colleague, or your team. More thoughtful readers = better conversations.
📅 Stay in the loop – Subscribe to the Vanishing Gradients calendar on lu.ma to get notified about livestreams, workshops, and events.
▶️ Subscribe to the YouTube channel – Get full episodes, livestreams, and AI deep dives. Subscribe here.
💡 Work with me – I help teams navigate AI, data, and ML strategy. If your company needs guidance, feel free to reach out by hitting reply.
Thanks for reading Vanishing Gradients! Subscribe for free to receive new posts and support my work.
If you’re enjoying it, consider sharing it, dropping a comment, or giving it a like—it helps more people find it.
Until next time ✌️
Hugo



