---
title: Agent Eval Suite LangSmith
category: product
entity_type: skill
price: $15
canonical: https://forgehouse.ai/skills/agent-eval-suite-langsmith/
lang: en
hreflang_alt: https://forgehouse.ai/tr/skiller/agent-eval-suite-langsmith/
last_updated: 2026-06-20
---

# Agent Eval Suite LangSmith

> Production agent eval suite: LangSmith dataset curation + Promptfoo assertion framework +…

A production-grade evaluation suite for AI agents that combines curated golden datasets, an assertion framework, and a CI gate. It replaces subjective manual spot-checks with an automated quality gate on every change: regression, adversarial, and calibration tests must pass before an agent ships.

## Use cases
- Adding regression tests for agents before merging changes
- Building a curated golden dataset of 50+ examples per agent
- Scoring agent output with an independent LLM-as-judge rubric
- Blocking merges automatically when pass rate or calibration drops
- Red-teaming agents with prompt injection and jailbreak cases
- Measuring overconfidence with Brier score calibration

## Benefits
- Catch quality regressions before they reach users, not three weeks after
- Move from roughly 20% spot-check coverage to 100% on every change
- Replace 'looks good to me' with objective pass-rate and calibration gates
- Surface exactly which test failed and why, with full reasoning traces

## What’s included
- Golden dataset curation script with built-in PII stripping
- Assertion config (regex, JSON schema, LLM-as-judge, embedding similarity)
- CI workflow that gates merges on pass rate and calibration thresholds
- Brier score and expected calibration error computation
- Adversarial test categories (injection, jailbreak, exfiltration, hijack, overflow)
- Anti-pattern catalog covering data leakage, judge bias, and snapshot gaps
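
For readers unfamiliar with the two calibration metrics listed above, here is a minimal sketch of how they are commonly computed. It is not the suite's own code: the function names, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0..1) and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((c - o) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,  # assumed bin count, not taken from the product
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 lands in the last bin
        bins[idx].append((c, o))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / len(confidences)) * abs(accuracy - mean_conf)
    return ece
```

An agent that reports 90% confidence but is right only 75% of the time shows up as a calibration error of about 0.15 in this sketch, which is the kind of overconfidence the suite is meant to flag.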

## Who it’s for
AI engineering teams running multiple production agents who need automated, objective quality gates instead of manual review.

## How it runs
Every pull request that touches the agent has to clear the same gate: a checksummed golden dataset, four chained assertion types, and a 95% pass bar. On each run, the suite:
1. Harvests golden examples from production traces: pulls high-scoring runs, strips personal data and secrets (emails, phone numbers, ID numbers, API keys) with regex, and commits the dataset to Git with a SHA-256 checksum
2. On every pull request, verifies the dataset hash against the committed checksum first, so nobody can quietly tamper with the test set
3. Runs the eval suite in parallel (50+ examples per agent, concurrency 10) against the live production system prompt referenced by file path, never a stale copy
4. Chains four assertion types per output: regex bans on forbidden words, JSON schema validation, an independent judge model scoring a 5-point rubric, and embedding similarity against the expected answer
5. Computes pass rate, Brier score and calibration error from the results; the merge gate blocks when the pass rate falls below 95% or the Brier score rises above 0.15, because an overconfident agent is a liability
6. Posts the report as a PR comment, uploads findings to the repo security tab, and fires an instant alert when the nightly full run catches a regression
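
Steps 1 and 2 above can be sketched roughly as follows. This is an illustrative outline, not the shipped curation script: the regex patterns, the `sk-` key shape, the placeholder labels, and the function names are all assumptions, and ID-number patterns are omitted because they vary by country.

```python
import hashlib
import json
import re

# Illustrative patterns only; real PII coverage needs broader, locale-aware rules.
# Order matters: API keys are masked before the looser phone pattern runs.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),  # assumed key shape
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}


def strip_pii(text: str) -> str:
    """Replace each match with a bracketed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def dataset_checksum(examples: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not change the hash."""
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_dataset(examples: list[dict], committed_checksum: str) -> bool:
    """True only if the dataset still matches the checksum committed to Git."""
    return dataset_checksum(examples) == committed_checksum
```

Hashing a canonical encoding rather than the raw file is a design choice in this sketch: it keeps the checksum stable across harmless reformatting while still changing whenever an example's content is edited.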

## FAQ
### Does it slot into my existing CI, or do I need a separate pipeline?
It is built as a CI gate, so regression and adversarial tests run on every change inside the pipeline you already merge through. A change that fails the gate does not ship.
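
As a rough illustration of what such a gate reduces to, the sketch below turns the two thresholds described under "How it runs" (95% pass rate, 0.15 Brier score) into a pass/fail decision and an exit code. The names `evaluate_gate` and `exit_code` are hypothetical, and the actual workflow shipped with the product may be structured differently.

```python
from dataclasses import dataclass

PASS_RATE_MIN = 0.95  # gate blocks below this pass rate
BRIER_MAX = 0.15      # gate blocks above this Brier score


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def evaluate_gate(pass_rate: float, brier: float) -> GateResult:
    """Collect every threshold violation so the report can list them all."""
    reasons = []
    if pass_rate < PASS_RATE_MIN:
        reasons.append(f"pass rate {pass_rate:.1%} is below {PASS_RATE_MIN:.0%}")
    if brier > BRIER_MAX:
        reasons.append(f"Brier score {brier:.3f} is above {BRIER_MAX}")
    return GateResult(passed=not reasons, reasons=reasons)


def exit_code(pass_rate: float, brier: float) -> int:
    """0 lets the CI job succeed; 1 fails it and so blocks the merge."""
    result = evaluate_gate(pass_rate, brier)
    for reason in result.reasons:
        print(f"GATE FAILED: {reason}")
    return 0 if result.passed else 1
```

A CI step would typically end with something like `sys.exit(exit_code(pass_rate, brier))`, since most CI systems treat a non-zero exit status as a failed check.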

### Can I trust an LLM-as-judge to score another model's output?
That is exactly why calibration tests sit alongside the regression and adversarial ones, checking the judge against known-correct examples. The rubric is also independent of the agent being graded, so it is not marking its own work.
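
One way to picture that independence is a check function that receives the judge as a separate callable, next to the other three assertion types. The sketch below is plain Python rather than the product's Promptfoo configuration; the forbidden-word list, the required keys (a stand-in for a real JSON schema), the score and similarity thresholds, and the `judge` and `embed` signatures are all assumptions.

```python
import json
import math
import re
from typing import Callable, Sequence

FORBIDDEN = re.compile(r"\b(guarantee|guaranteed)\b", re.IGNORECASE)  # hypothetical ban list
REQUIRED_KEYS = {"answer": str, "confidence": float}  # stand-in for a JSON schema


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_output(
    raw: str,
    expected: str,
    judge: Callable[[str, str], int],         # independent model, returns a 1-5 score
    embed: Callable[[str], Sequence[float]],  # any embedding function
    min_judge_score: int = 4,                 # assumed threshold
    min_similarity: float = 0.8,              # assumed threshold
) -> list[str]:
    """Return the failed assertions; an empty list means the output passed."""
    failures = []
    if FORBIDDEN.search(raw):
        failures.append("regex: forbidden word")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["schema: not valid JSON"]
    if not isinstance(data, dict):
        return failures + ["schema: expected a JSON object"]
    schema_errors = [
        f"schema: '{key}' missing or not {typ.__name__}"
        for key, typ in REQUIRED_KEYS.items()
        if not isinstance(data.get(key), typ)
    ]
    if schema_errors:
        return failures + schema_errors  # judge and similarity need a valid answer
    answer = data["answer"]
    if judge(answer, expected) < min_judge_score:
        failures.append("judge: rubric score below threshold")
    if cosine(embed(answer), embed(expected)) < min_similarity:
        failures.append("similarity: too far from expected answer")
    return failures
```

Because `judge` is injected, the same function can be run against known-correct examples with a fixed expected score to check the judge itself before trusting it on agent output.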

### Does it write the golden dataset and fix the failures for me?
No. You curate the 50+ examples per agent that define correct behavior, and the suite enforces them. It flags regressions; fixing the agent remains your engineering work.

## Price
$15, one-time, no subscription. VAT included.

Related guide: [AI and LLM engineering](https://forgehouse.ai/guides/ai-llm-engineering/)
