Running AI in production: drift, evals, and the boring stuff that matters
The interesting work in AI is mostly after deployment, not before. Here's what 'mostly' means.
There's a tendency among AI consultants — including, briefly, us, when we were getting our practice off the ground — to focus marketing on the build phase. The kickoff, the integration, the launch. The truth is that the build phase is the easy part. The interesting work is in operations, and most engagements that go badly go badly there.
Three operational practices separate AI deployments that age well from the ones that drift toward irrelevance. We run all three on every engagement past Tier I.
The first is evals. An eval is a test suite — but for AI behaviors. We define a set of representative inputs (real or synthesized from real conversations) and a set of acceptance criteria, and we run them on a cadence. When the model provider releases a new version. When we change a prompt. When the business changes a policy. The eval suite is the early-warning system for behavior drift, and a deployment without one is operating blind.
The second is drift monitoring. Even without model changes, AI behavior shifts subtly over time as input distributions change. The kind of customer questions you got in October look different from the ones in March. We monitor metrics that are sensitive to this — confidence distributions, handoff rates, customer-satisfaction signals — and we tune when they shift.
The third is the on-call rotation. When the AI behaves unexpectedly in production, someone needs to respond. For most of our clients, that's us — within an agreed SLA — rather than a panicked email to whoever set the system up months ago. The fact that we operate AI for many clients gives us pattern recognition that a single-engagement consultant doesn't have.
None of this is glamorous. Owners want to talk about what AI can do, not about how we keep it doing it well. But the difference between a deployment that's still earning its keep two years in and one that's quietly degrading is mostly here, in the unglamorous part.
