LLM Evaluation

A comprehensive evaluation toolkit for LLM applications, combining automated metrics, human review, LLM-as-Judge scoring, and statistical A/B testing. It lets you prove that a prompt or model change is actually better, not just feels better, and catch regressions before they reach production.

$15 one-time

Prices include 20% VAT. · Forged on real agency work · one-time, no lock-in

Type: Skill
Category: AI & LLM
Delivery: Email · instant
License: One-time

Inside the run · no black box

See the actual work before you buy it.

When the score rises but users get angrier, your eval set has stopped representing production. Metrics are assembled per task, judges never grade their own model, and any metric that drops more than 5 percent against the baseline blocks the deploy.

Assembles an EvaluationSuite from the metrics that match the task: BLEU and ROUGE for generation overlap, BERTScore for semantic similarity, plus custom metrics like groundedness (NLI entailment against context), toxicity and factuality, because a single metric always has a blind spot.
Runs the model over a versioned test dataset and aggregates per-metric mean, std, min and max. The test set stays held out: it is never used to tune prompts, and 20 percent of examples rotate every quarter to prevent contamination and eval hacking.
Adds an LLM-as-Judge pass with a stronger model as referee, never the model judging itself. Pairwise comparison is preferred over pointwise scoring. The A/B order is randomized, and each pair is also scored in reverse to cancel position bias. The judge prompt carries an explicit rubric with 1 to 10 scales and requires written reasoning.
Samples 10 to 20 percent of outputs for human review with concrete annotation guidelines, then checks inter-rater agreement: Cohen's Kappa above 0.6 before human labels count, and correlation above 0.85 between human and judge scores before the judge is trusted at scale.
Compares new results against the versioned baseline with the RegressionDetector: any metric dropping more than 5 percent blocks the deployment, and A/B claims only stand when p is under 0.05 and Cohen's d is at least 0.2.
Feeds production failures back into the eval dataset so the suite keeps representing real traffic, and watches for the eval-hacking signal: score going up while user satisfaction goes down means the dataset no longer represents production.
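
The baseline comparison described above can be sketched in a few lines. This is a minimal illustration, not the product's RegressionDetector: the function name `find_regressions`, the metric names and the scores are hypothetical, and it assumes that higher scores are better and that the 5 percent threshold is relative to the baseline value.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    relative_drop: float  # 0.08 means 8 percent below the baseline


def find_regressions(baseline, current, max_drop=0.05):
    """Return every metric whose score fell more than max_drop relative to baseline.

    Assumption: higher is better for every metric, and the threshold is relative.
    """
    regressions = []
    for metric, base in baseline.items():
        if metric not in current or base <= 0:
            # Skipped in this sketch; a stricter gate might treat a missing metric as a failure.
            continue
        drop = (base - current[metric]) / base
        if drop > max_drop:
            regressions.append(Regression(metric, base, current[metric], drop))
    return regressions


# Hypothetical aggregated means from a stored baseline and a new run.
baseline = {"rouge_l": 0.50, "bertscore_f1": 0.90, "groundedness": 0.80}
current = {"rouge_l": 0.49, "bertscore_f1": 0.84, "groundedness": 0.81}

blocked = find_regressions(baseline, current)
for r in blocked:
    print(f"{r.metric}: {r.baseline:.2f} -> {r.current:.2f} ({r.relative_drop:.1%} drop)")
print("deploy blocked" if blocked else "deploy allowed")
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so the pipeline stops before deployment.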

Use cases · what happens when you plug it in

One power source. 6 lines out.

Comparing different models or prompt variants objectively
Detecting performance regressions before deployment
Validating that a prompt change is a real, measurable improvement
Building evaluation baselines and tracking quality over time
Measuring RAG retrieval quality with MRR, NDCG, and precision@K
Integrating eval suites into a CI/CD pipeline
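
For readers unfamiliar with the retrieval metrics named in the list above, here is a small sketch of their standard textbook definitions. The function names, document ids and relevance grades are made up for illustration and are not taken from the product.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k):
    """gains maps document id to graded relevance; ids not listed count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical retrieval result for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 3, "d2": 1}

p_at_3 = precision_at_k(ranked, relevant, k=3)
mrr = mean_reciprocal_rank([(ranked, relevant), (["a", "b"], {"a"})])
ndcg = ndcg_at_k(ranked, gains, k=4)
print(f"precision@3={p_at_3:.3f}  MRR={mrr:.3f}  NDCG@4={ndcg:.3f}")
```

Precision@K ignores ordering within the top K, MRR only looks at the first relevant hit, and NDCG rewards placing highly relevant documents early, which is why the three are usually reported together.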

Benefits · what you walk away with

Yours to keep.

Forever

That's what owning means.

The rented stack

ai writing tool: subscription

expired · access lost

analytics suite: subscription

expired · access lost

design platform: subscription

expired · access lost

(nothing left)

Your forge

Replace 'it feels better' with statistically significant proof
license: perpetual
Block regressions automatically before they ship to users
license: perpetual
Triangulate quality across multiple metrics to avoid blind spots
license: perpetual
Scale human-quality judgment affordably with LLM-as-Judge
license: perpetual

subscriptions expire · deeds don't

What's included · the full manifest

Everything in the box.

Automated metric implementations: BLEU, ROUGE, BERTScore, groundedness

part 01 of 06 · in the box

6 parts · one working system · ships instantly by email

Who it's for

This wasn't forged for everyone.

Not for you if you'd rather rent a tool than own one.
Not for you if you want someone else to run your stack.
Not for you if you're happy guessing.

Still here? Good.

ML engineers and AI teams who need rigorous, repeatable evaluation of LLM application quality.

then this was forged for you.

Works with

Universal by design: the skill runs with any AI assistant. It is delivered in the open Agent Skills + MCP format (native in Claude); ChatGPT, Gemini, Cursor and Copilot adapt the same files in their own way.

Claude: Native format
ChatGPT: Adapts via open standards
Gemini: Adapts via open standards
Cursor: Adapts via open standards
Copilot: Adapts via open standards

Questions · still in the air

Catch what's on your mind.

Does this lock me into LangSmith or a specific model provider?

No. The core is provider-agnostic: automated metrics like BLEU, ROUGE, BERTScore and groundedness, LLM-as-Judge patterns, and the A/B statistics. LangSmith appears as one integration for tracing and benchmark runs, not a requirement.
We already eyeball outputs before shipping. What does this add?

Statistical footing. Instead of 'it feels better,' you compare variants with t-tests, p-values, and Cohen's d effect size, and the regression detector checks new runs against versioned baselines. A quiet quality drop gets caught before deploy, not after user complaints.
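
To make the statistics in this answer concrete, the sketch below computes Cohen's d and a Welch t statistic for two sets of per-example scores. It is an illustration under stated assumptions, not the product's code: the function names and scores are hypothetical, and the p-value uses a normal approximation that is only reasonable for larger samples. With small samples, an exact t-test such as `scipy.stats.ttest_ind(a, b, equal_var=False)` is the safer choice.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(a, b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


def welch_t(a, b):
    """Welch t statistic, which does not assume equal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


def approx_p_value(t):
    """Two-sided p-value from a normal approximation (assumption: large samples)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def is_real_improvement(a, b, alpha=0.05, min_d=0.2):
    """True only if b beats a with p below alpha and an effect size of at least min_d."""
    t = welch_t(a, b)
    return t > 0 and approx_p_value(t) < alpha and cohens_d(a, b) >= min_d


# Hypothetical per-example scores for a baseline prompt and a variant.
# Note: both samples need some variance, otherwise the divisions above fail.
baseline_scores = [0.70 + 0.01 * (i % 5) for i in range(40)]
variant_scores = [0.73 + 0.01 * (i % 5) for i in range(40)]

print(f"d={cohens_d(baseline_scores, variant_scores):.2f}")
print("ship variant" if is_real_improvement(baseline_scores, variant_scores) else "keep baseline")
```

Requiring both a small p-value and a minimum effect size guards against two different mistakes: noise that looks like a win, and a statistically significant difference that is too small to matter.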
Will it tell me which prompt to write?

No. It measures whether a prompt or model change is actually better; it does not generate the change. You bring the variants, it brings the evidence, including a human annotation framework with inter-rater agreement when automated metrics are not enough.
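
Inter-rater agreement is commonly measured with Cohen's Kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The following is a minimal sketch of the standard two-rater formula; the function name and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations from two reviewers on ten outputs.
rater_1 = ["good", "good", "bad", "good", "bad", "bad", "good", "good", "bad", "good"]
rater_2 = ["good", "good", "bad", "bad", "bad", "bad", "good", "good", "bad", "good"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa={kappa:.2f}")
```

Here the raters agree on nine of ten items, but half of that agreement is expected by chance, so Kappa lands below the raw 90 percent figure.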
How is it delivered?

By email right after purchase: ready to run, downloaded instantly, no setup wait.
One-time or subscription?

A one-time purchase; no subscription or hidden fees. VAT (20%) is included.
Can I get a refund?

As a digital product, it can’t be refunded once downloaded. That’s why we show exactly what’s inside and who it’s for, right here.

Related products

Agent Eval Suite LangSmith

Brain Context Engineering

Brain Memory Hybrid Search

Claude Agent Template Library