Discussion about this post

Pawel Jozefiak

The eval-driven approach is something I learned the hard way. I built an agent system that looked great in demos, but when I ran it overnight it accumulated subtle errors that compounded. Adding structured self-evaluation after each task improved reliability immediately.
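A minimal sketch of what "structured self-evaluation after each task" can look like: run the task, score the output against named criteria, and retry or fail loudly instead of letting a bad result flow into the next task. All names here (`run_task`, `CRITERIA`, `run_with_eval`) are hypothetical illustrations, not from any particular framework.

```python
def run_task(task: str) -> str:
    """Stand-in for an agent step; a real system would call a model here."""
    return task.upper()

# Each criterion is a named predicate over the task's output, so the
# resulting report is structured rather than a single opaque pass/fail.
CRITERIA = {
    "non_empty": lambda out: len(out.strip()) > 0,
    "no_placeholder": lambda out: "TODO" not in out,
}

def evaluate(output: str) -> dict:
    """Run every criterion and return a per-criterion pass/fail report."""
    return {name: check(output) for name, check in CRITERIA.items()}

def run_with_eval(task: str, max_retries: int = 2):
    """Run a task, self-evaluate, and retry on failure instead of
    silently carrying a bad result into the next task."""
    report = {}
    for _ in range(max_retries + 1):
        output = run_task(task)
        report = evaluate(output)
        if all(report.values()):
            return output, report
    # Surface the failure instead of letting errors compound overnight.
    raise RuntimeError(f"task failed evals after retries: {report}")
```

The key design choice is that a failed eval interrupts the loop, which is what stops small errors from compounding across tasks.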

The part teams skip: evaluating the evaluation. Your eval criteria drift over time as the system changes. Someone has to maintain the evals, not just write them once. That maintenance cost is why teams eventually drop formal evaluation and go back to vibes.
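One concrete way to "evaluate the evaluation" is to pin the eval suite against a small golden set of outputs with known verdicts, and alert whenever the suite's verdicts drift from those baselines. This is a hedged sketch under that assumption; `GOLDEN_SET`, `eval_suite`, and `check_eval_drift` are illustrative names, not an established API.

```python
# Golden set: (output sample, expected verdict from the eval suite).
# These pairs are fixed, so changes to the suite show up as drift.
GOLDEN_SET = [
    ("A complete, correct answer.", True),
    ("", False),  # empty output must keep failing
]

def eval_suite(output: str) -> bool:
    """Stand-in for the real eval suite being maintained."""
    return len(output.strip()) > 0

def check_eval_drift(suite) -> list:
    """Return the golden cases whose verdict has drifted from baseline."""
    drifted = []
    for sample, expected in GOLDEN_SET:
        if suite(sample) != expected:
            drifted.append(f"verdict changed for: {sample!r}")
    return drifted
```

Running this check on every change to the eval criteria turns eval maintenance into a routine regression test rather than a judgment call, which is cheaper than drifting back to vibes.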

