Building Reliable and Robust ML/AI Pipelines
From ML to AI Eng, Navigating the Shift to Foundation Models
Hi all! I’ve recently started a newsletter covering all things data science, ML, and AI, primarily to keep track of interesting things in the space and what I’ve been up to. This is an experiment, so please do let me know what you’d like to see here. There’s a lot to share this week, so let’s jump right in.
Building Reliable and Robust ML/AI Pipelines with Shreya Shankar
I recently did a podcast with Shreya Shankar, a researcher at UC Berkeley focusing on data management systems with a human-centered approach. Shreya's work is at the cutting edge of human-computer interaction (HCI) and AI, particularly in the realm of large language models (LLMs). Her background also includes being the first ML engineer at Viaduct, doing research engineering at Google Brain, and software engineering at Facebook.
In this episode, we dive deep into the world of LLMs and the critical challenges of building reliable AI pipelines. You can listen to the episode here or on your app of choice. You can also watch the livestream here:
This is the fifth time I’ve spoken with Shreya publicly; it has always been a joy, and I’ve learned so much each time. If you’re interested in checking out some of our previous chats, I’ve linked to them all in this Twitter thread.
From ML to AI Eng, Navigating the Shift to Foundation Models with Chip Huyen
Last week, I did a fireside chat for Outerbounds with Chip Huyen, a writer and computer scientist currently at Voltron Data, working on GPU-native data processing and open data standards (Ibis, Apache Arrow, Substrait). Previously, she built machine learning tools at NVIDIA, Snorkel AI, and Netflix.
These are the topics we covered, broadly speaking (with the YT timestamps):
Funnily enough, on the morning of the livestream, Chip had published this post on patterns she has observed in Building Generative AI Systems, so we were also able to dive deep into that. For a sneak peek, check out this clip, in which we explore:
🏗️ Common GenAI platform components
❌ AI failure types: information vs. behavior-based
🔍 RAG and context enhancement strategies
🛠️ Output improvement techniques
🔄 Fine-tuning complexities
📊 Output format considerations
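To make the RAG bullet above a little more concrete, here is a minimal sketch of the core loop: retrieve the documents most relevant to a query and prepend them to the prompt as context. Everything here is illustrative (toy term-overlap scoring instead of embeddings, no vector store, no actual LLM call) and is not taken from Chip's post:

```python
# Minimal sketch of retrieval-augmented generation (RAG):
# score documents against the query, stuff the top hits into the prompt.
# Real systems use embeddings and a vector store; this is a toy stand-in.

def score(query: str, doc: str) -> int:
    """Crude relevance: number of query terms that appear in the doc."""
    terms = set(query.lower().split())
    return sum(1 for word in set(doc.lower().split()) if word in terms)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend retrieved context to the user query before calling the LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "spaCy is an industrial-strength NLP library.",
    "Fine-tuning adapts a pretrained model to a task.",
    "RAG grounds model answers in retrieved documents.",
]
prompt = build_prompt("What does RAG do for model answers?", docs)
```

The point of the sketch is the shape of the pipeline, not the scoring function: in practice you would swap `score`/`retrieve` for embedding similarity over a document index, which is exactly the "context enhancement" knob the clip discusses.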
The NLP and AI Revolution with spaCy Creators Ines Montani and Matthew Honnibal
I’ll be recording a Vanishing Gradients livestream with Ines and Matt from spaCy and Explosion. I’m really excited about this, for several reasons. Here are a few:
their work on NLP over the years is fertile ground for thinking through how to incorporate GenAI, ML, classic NLP, and software to build robust AI systems;
their work in OSS has been inspirational for me in several regards, including how important UX and good abstraction layers are for developer tooling;
we all have a lot to learn from their journey with respect to how OSS companies can be built and maintained sustainably.
Oh, and they’re also total legends! You can register for free here.
Also, if you’re interested in some of their more recent work, check out the following:
Human-in-the-loop distillation: LLMs challenge industry workflows that need modularity, transparency and data privacy. But models don't have to be black boxes – you can distill them into better, smaller and faster components you can control and run in-house.
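The distillation idea can be sketched in a few lines. This is purely illustrative (not Explosion's actual pipeline, and all names are made up): a large model produces "silver" labels for raw text, a human corrects them, and a small, inspectable model is trained on the result and run in-house.

```python
# Sketch of human-in-the-loop distillation (illustrative only):
# 1. a large model produces "silver" labels for raw text,
# 2. a human reviews/corrects them,
# 3. a small, inspectable model is trained on the corrected labels.
from collections import Counter, defaultdict

def llm_label(text: str) -> str:
    """Stand-in for an expensive LLM call that labels text."""
    return "positive" if "great" in text or "love" in text else "negative"

raw = ["great tool", "love this library", "confusing docs", "great docs"]
silver = {text: llm_label(text) for text in raw}

# Human review step: correct any silver labels that are wrong.
silver["confusing docs"] = "negative"  # already right here; shown for shape

# "Distilled" model: per-word label counts -- cheap to run, easy to inspect.
word_votes: defaultdict = defaultdict(Counter)
for text, label in silver.items():
    for word in text.split():
        word_votes[word][label] += 1

def small_model(text: str) -> str:
    """Classify by summing label votes of the words in the text."""
    votes = Counter()
    for word in text.split():
        votes.update(word_votes[word])
    return votes.most_common(1)[0][0] if votes else "negative"
```

In a real workflow the "small model" would be something like a spaCy text classifier trained on the reviewed labels; the toy word-vote model just makes the control and transparency argument tangible: you can read off exactly why it made a prediction.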
How S&P Global is making markets more transparent with NLP, spaCy and Prodigy: Case study on real-time commodities trading insights using human-in-the-loop distillation.
The AI Revolution Will Not Be Monopolized: Open source and interoperability mean there's no monopoly to be gained in AI, and economies of scale only matter if you buy into the "one model to rule them all" approach.
Back to our roots: Ines and Matt are back to running Explosion as a smaller, independent-minded, and self-sufficient company, focusing on their core stack, spaCy and Prodigy. I honestly think it’s wonderful that they’re sharing their learnings about building companies around OSS technologies. We’re all still learning so much.
What else is up?
I’ll also be doing a livestream this week with Dan Becker and Hamel Husain, the instructors of the immensely successful course "Mastering LLMs: A Conference For Developers & Data Scientists," in which they’ll share their experiences and insights from teaching over 2,000 students. The course, which initially anticipated a few hundred participants, quickly expanded to include 10-20 guest speakers and received support from industry leaders like OpenAI, Hugging Face, Modal, and Replicate. You can find more details and register here.
You can find a lot of educational resources from their course here. I’m excited to dive deeper into all of this stuff. If you check them out, let me know any that you like!
Here’s a wonderful talk by Johno Whitaker (Answer AI) on Napkin Math(s) for Fine-Tuning:
I’ll be announcing more livestreams, events, and podcasts soon, so subscribe to the Vanishing Gradients lu.ma calendar to stay up to date. Also subscribe to our YouTube channel, where we livestream, if that’s your thing!
That’s it for now. Please let me know what you’d like to hear more of, what you’d like to hear less of, and any other ways I can make this newsletter more relevant for you,
Hugo