LLM Evaluation

A comprehensive evaluation toolkit for LLM applications, combining automated metrics, human review, LLM-as-Judge scoring, and statistical A/B testing. It lets you prove that a prompt or model change is actually better rather than merely feeling better, and it catches regressions before they reach production.

$15 one-time

Prices include 20% VAT · Forged on real agency work · one-time, no lock-in

  • Type: Skill
  • Category: AI & LLM
  • Delivery: Email · instant
  • License: One-time

Inside the run · no black box

See the actual work before you buy it.

When the score rises but users get angrier, your eval set has stopped representing production. Metrics are assembled per task, judges never grade their own model, and any metric that drops more than 5 percent blocks the deploy.

  1. Assembles an EvaluationSuite from the metrics that match the task: BLEU and ROUGE for generation overlap, BERTScore for semantic similarity, plus custom metrics like groundedness (NLI entailment against context), toxicity and factuality, because a single metric always has a blind spot.
  2. Runs the model over a versioned test dataset and aggregates per-metric mean, std, min and max. The test set stays held out: it is never used to tune prompts, and 20 percent of examples rotate every quarter to prevent contamination and eval hacking.
  3. Adds an LLM-as-Judge pass with a stronger model as referee, never the model judging itself. Pairwise comparison is preferred over pointwise. The A/B order is randomized, and each pair is also scored in reverse to cancel position bias. The judge prompt carries an explicit rubric with 1-to-10 scales and requires the judge to state its reasoning.
  4. Samples 10 to 20 percent of outputs for human review with concrete annotation guidelines, then checks inter-rater agreement: Cohen's Kappa above 0.6 before human labels count, and correlation above 0.85 between human and judge scores before the judge is trusted at scale.
  5. Compares new results against the versioned baseline with the RegressionDetector: any metric dropping more than 5 percent blocks the deployment, and A/B claims only stand when p is under 0.05 and Cohen's d is at least 0.2.
  6. Feeds production failures back into the eval dataset so the suite keeps representing real traffic, and watches for the eval-hacking signal: score going up while user satisfaction goes down means the dataset no longer represents production.
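
The aggregation in step 2 and the regression gate in step 5 can be sketched roughly as follows. This is a minimal illustration, not the skill's own RegressionDetector: the names `summarize`, `detect_regressions` and `Regression` are hypothetical, and it assumes that the 5 percent threshold is a relative drop in the mean and that higher is better for every metric.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


def summarize(scores):
    """Aggregate per-example scores for one metric into mean, std, min and max."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std over the evaluated examples
        "min": min(scores),
        "max": max(scores),
    }


@dataclass
class Regression:
    metric: str
    baseline: float
    current: float
    relative_drop: float


def detect_regressions(baseline, current, threshold=0.05):
    """Return every metric whose mean fell by more than `threshold` vs. the baseline.

    Assumption: the drop is measured relative to the baseline mean and
    higher scores are better. Metrics missing from the new run, or with a
    non-positive baseline, are skipped in this sketch.
    """
    found = []
    for metric, base_mean in baseline.items():
        if metric not in current or base_mean <= 0:
            continue
        drop = (base_mean - current[metric]) / base_mean
        if drop > threshold:
            found.append(Regression(metric, base_mean, current[metric], drop))
    return found


# Illustrative numbers only: a stored baseline and per-example scores from a new run.
baseline = {"rouge_l": 0.50, "groundedness": 0.90}
run_scores = {
    "rouge_l": [0.40, 0.50, 0.48],
    "groundedness": [0.90, 0.88, 0.92],
}
current = {name: summarize(scores)["mean"] for name, scores in run_scores.items()}
blocked = detect_regressions(baseline, current)

for reg in blocked:
    print(f"BLOCK deploy: {reg.metric} dropped {reg.relative_drop:.1%}")
```

In a CI job, a non-empty `blocked` list would translate into a failing exit code, so the deploy stops before the change reaches users.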
Use cases · what happens when you plug it in

One power source. 6 lines out.

  1. Comparing different models or prompt variants objectively
  2. Detecting performance regressions before deployment
  3. Validating that a prompt change is a real, measurable improvement
  4. Building evaluation baselines and tracking quality over time
  5. Measuring RAG retrieval quality with MRR, NDCG, and precision@K
  6. Integrating eval suites into a CI/CD pipeline
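
For the RAG use case in the list above, the three retrieval metrics can be sketched as below. This is one common formulation, not necessarily the one shipped in the skill: the function names are hypothetical, precision@K divides by K even when fewer documents are returned, and NDCG uses linear gain with a log2 discount.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over a list of (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def ndcg_at_k(retrieved, gains, k):
    """NDCG@k, where `gains` maps doc id -> graded relevance (missing ids count as 0)."""
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc, 0.0) for doc in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Illustrative query results; the document ids are made up.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 3.0, "d2": 1.0}

p_at_2 = precision_at_k(retrieved, relevant, 2)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d2", "d9"], {"d2"})])
ndcg = ndcg_at_k(retrieved, gains, 4)
```

Here the first relevant document sits at rank 2, so precision@2 and the reciprocal rank for that query are both 0.5, and NDCG lands below 1.0 because the most relevant document is not ranked first.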
Benefits · what you walk away with

Yours to keep.

Forever

That's what owning means.

The rented stack

ai writing tool: subscription

expired · access lost

analytics suite: subscription

expired · access lost

design platform: subscription

expired · access lost

(nothing left)

Your forge

  1. Replace 'it feels better' with statistically significant proof

    license: perpetual
  2. Block regressions automatically before they ship to users

    license: perpetual
  3. Triangulate quality across multiple metrics to avoid blind spots

    license: perpetual
  4. Scale human-quality judgment affordably with LLM-as-Judge

    license: perpetual

subscriptions expire · deeds don't

What's included · the full manifest

Everything in the box.

Automated metric implementations: BLEU, ROUGE, BERTScore, groundedness

part 01 of 06 · in the box

6 parts · one working system · ships instantly by email

Who it's for

This wasn't forged for everyone.

  • Not for you if you'd rather rent a tool than own one.
  • Not for you if you want someone else to run your stack.
  • Not for you if you're happy guessing.
Still here? Good.

ML engineers and AI teams who need rigorous, repeatable evaluation of LLM application quality.

then this was forged for you.

Works with

Universal by design: this skill runs in any AI assistant. It is delivered in the open Agent Skills + MCP format (native in Claude); ChatGPT, Gemini, Cursor and Copilot adapt the same files in their own way.

  • Claude: native format
  • ChatGPT: adapts via open standards
  • Gemini: adapts via open standards
  • Cursor: adapts via open standards
  • Copilot: adapts via open standards
Questions · still in the air

Catch what's on your mind.

  1. Does this lock me into LangSmith or a specific model provider?

    No. The core is provider-agnostic: automated metrics like BLEU, ROUGE, BERTScore and groundedness, LLM-as-Judge patterns, and the A/B statistics. LangSmith appears as one integration for tracing and benchmark runs, not a requirement.

  2. We already eyeball outputs before shipping. What does this add?

    Statistical footing. Instead of 'it feels better,' you compare variants with t-tests, p-values, and Cohen's d effect size, and the regression detector checks new runs against versioned baselines. A quiet quality drop gets caught before deploy, not after user complaints.

  3. Will it tell me which prompt to write?

    No. It measures whether a prompt or model change is actually better; it does not generate the change. You bring the variants, it brings the evidence, including a human annotation framework with inter-rater agreement when automated metrics are not enough.

  4. How is it delivered?

    By email right after purchase: ready to run, downloaded instantly, no setup wait.

  5. One-time or subscription?

    A one-time purchase; no subscription or hidden fees. VAT (20%) is included.

  6. Can I get a refund?

    As a digital product, it can’t be refunded once downloaded. That’s why we show exactly what’s inside and who it’s for, right here.
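
The statistical comparison described in the second answer above can be sketched as follows. This is a rough illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, and a permutation test stands in for the t-test so the example needs only the Python standard library. The thresholds (p under 0.05, Cohen's d of at least 0.2) are the ones quoted on this page.

```python
import random
from statistics import mean, stdev


def cohens_d(baseline, candidate):
    """Effect size: (candidate mean - baseline mean) / pooled sample std."""
    na, nb = len(baseline), len(candidate)
    pooled_var = (
        (na - 1) * stdev(baseline) ** 2 + (nb - 1) * stdev(candidate) ** 2
    ) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("scores have zero variance; effect size is undefined")
    return (mean(candidate) - mean(baseline)) / pooled_var ** 0.5


def permutation_p_value(baseline, candidate, rounds=2000, seed=0):
    """Two-sided p-value: share of random relabelings with a gap >= the observed one."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(baseline)
    pooled = list(baseline) + list(candidate)
    observed = abs(sum(candidate) / len(candidate) - sum(baseline) / n)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:n], pooled[n:]
        if abs(sum(right) / len(right) - sum(left) / n) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of exactly 0.
    return (hits + 1) / (rounds + 1)


def is_real_improvement(baseline, candidate, alpha=0.05, min_effect=0.2):
    """True only if the candidate is better with d >= min_effect and p < alpha."""
    if cohens_d(baseline, candidate) < min_effect:
        return False
    return permutation_p_value(baseline, candidate) < alpha


# Illustrative per-example scores for two prompt variants (made-up numbers).
variant_a = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.60, 0.57]
variant_b = [0.70, 0.72, 0.69, 0.71, 0.68, 0.73, 0.70, 0.69]

print(is_real_improvement(variant_a, variant_b))
```

Requiring both conditions guards against two separate failure modes: a large but noisy gap on a handful of examples, and a statistically significant gap that is too small to matter in practice.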