Evaluations Guide

Evaluation is a product discipline. Use small, curated datasets and automate regression checks to catch drift early.

Offline Evals

  • Golden datasets with unambiguous scoring criteria.
  • Heuristic or model judges are acceptable if calibrated against human-labeled examples.
  • Run on every change to prompts, tools, or model versions; a minimal harness is sketched after this list.
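
A minimal sketch of such a harness, assuming a JSONL golden set whose records carry id, prompt, and expected fields, and a generate(prompt) callable standing in for the system under test; all of these names are illustrative, not a prescribed interface:

    import json

    def exact_match(expected, actual):
        # Heuristic judge: unambiguous criteria keep scoring deterministic.
        return expected.strip().lower() == actual.strip().lower()

    def run_offline_eval(golden_path, generate):
        # generate(prompt) is the system under test; swap in the real call.
        with open(golden_path) as f:
            cases = [json.loads(line) for line in f]
        failures = [c["id"] for c in cases
                    if not exact_match(c["expected"], generate(c["prompt"]))]
        return 1 - len(failures) / len(cases), failures

Wire this into CI so it runs automatically and fails the build below an agreed pass-rate threshold.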

Online Metrics

  • Task success rate and deflection rate where applicable.
  • Latency and cost budgets with alerts (a budget check is sketched after this list).
  • User feedback signals (thumbs, edits, escalations).
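
A sketch of a budget check, assuming per-request events that carry latency_ms, cost_usd, and success fields, and an alert callable wired to your paging system; field names and budget values are illustrative:

    import statistics

    LATENCY_P95_BUDGET_MS = 2000      # illustrative SLO
    COST_PER_TASK_BUDGET_USD = 0.05   # illustrative SLO

    def check_budgets(events, alert):
        # events: one dict per request with latency_ms, cost_usd, success.
        latencies = sorted(e["latency_ms"] for e in events)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        avg_cost = statistics.mean(e["cost_usd"] for e in events)
        success_rate = statistics.mean(1.0 if e["success"] else 0.0 for e in events)
        if p95 > LATENCY_P95_BUDGET_MS:
            alert(f"latency p95 {p95}ms over budget ({LATENCY_P95_BUDGET_MS}ms)")
        if avg_cost > COST_PER_TASK_BUDGET_USD:
            alert(f"avg cost ${avg_cost:.3f} over budget (${COST_PER_TASK_BUDGET_USD})")
        return {"p95_ms": p95, "avg_cost_usd": avg_cost, "success_rate": success_rate}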

Regression Testing

Keep a handful of critical examples that must not regress. Require green checks before promotion to Pilot or Production.
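
One way to express this gate is a pytest suite over the pinned examples; the module path myapp.generate and the evals/critical.jsonl location below are assumptions, not a fixed convention:

    import json
    import pytest
    from myapp import generate  # hypothetical entry point for the system under test

    with open("evals/critical.jsonl") as f:  # pinned must-not-regress examples
        CRITICAL = [json.loads(line) for line in f]

    @pytest.mark.parametrize("case", CRITICAL, ids=lambda c: c["id"])
    def test_critical_example(case):
        # A red check here blocks promotion to Pilot or Production.
        output = generate(case["prompt"])
        assert output.strip().lower() == case["expected"].strip().lower()

Promotion tooling can then require this suite to pass before any environment change goes out.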

Note

Avoid overfitting to the eval set. Periodically refresh datasets to reflect real user behavior.
