A company used Promptfoo as a CI gate to catch prompt regressions, but found it insufficient as an evaluation framework on its own: the gate relied on an unvalidated GPT-4 judge that missed critical failures. The team added a separate weekly judge-validation pipeline that samples production traces, collects human labels for them, and compares those labels against the judge's scores using Cohen's kappa to detect judge drift. Structural changes to the judge's scoring (separate criteria, a citation requirement, and rubric-based scoring) raised judge-human agreement from a kappa of 0.47 to 0.68 over six weeks, reducing costly production errors and customer escalations.
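The agreement check at the core of that weekly pipeline can be sketched in a few lines of Python. This is a minimal illustration of Cohen's kappa for two raters, not the company's actual implementation: the function names, the pass/fail label set, the toy labels, and the 0.6 alert threshold are all assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Cohen's kappa between two raters over the same sampled items."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Chance agreement: for each label, the product of the two raters' marginal rates.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def judge_has_drifted(kappa, threshold=0.6):
    """Flag the judge for review when agreement falls below a chosen threshold.

    The 0.6 default is a hypothetical value, not one taken from the case study.
    """
    return kappa < threshold


# Toy data invented for illustration: human labels vs. judge labels on 8 traces.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

kappa = cohen_kappa(human, judge)
print(f"kappa={kappa:.2f} drifted={judge_has_drifted(kappa)}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a better drift signal than a simple match rate when most traces fall into one class. In practice a library routine such as scikit-learn's `cohen_kappa_score` could replace the hand-rolled function.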
Use Case
