Changes to prompts, models, or tool wiring must flow through a simple but strict lifecycle. This ensures measurable quality and predictable behavior for end users.
Stages
- Experiment: freeform iteration; logs are sampled; offline evals encouraged but optional.
- Staging: frozen prompt version, dataset defined; offline eval required.
- Pilot: gated rollout to a small cohort; online metrics monitored; rollback plan ready.
- Production: broad rollout with SLOs; changes require review and re-evaluation.
Prompt Management
- Prompts are versioned in the Registry with semantic labels and changelogs.
- Structured templates support variables, safety inserts, and system instructions.
- Approval is required before promotion to Production.
Evaluation
Offline evals use curated datasets with golden answers or heuristics. Regression tests run on each change. Online metrics (task success rate, handoff rate, latency, cost) are tracked during Pilot and Production.
Tip
Keep datasets tiny but representative. Ten great examples beat one hundred noisy ones.