Tag: evals
All the articles with the tag "evals".
-
How do you know a correction held? Instrumenting an agent in production
Functional scars make a correction persist. They don't tell you whether the system as a whole is getting better. So I instrumented every session — deterministic, zero-token, never blocking — and let a monthly pass turn the evidence into mechanism changes. The first thing the data caught was me.