A/B Test Setup

Plan, design, or implement an A/B test or experiment.

A disciplined framework for planning, designing, and analyzing A/B tests that produce statistically valid, actionable results. It enforces hypothesis-first design, pre-committed sample sizes, and the validity guardrails most teams skip, so your 'winning' variant is real, not noise.

$15 one-time

Prices include 20% VAT. · Forged on real agency work · one-time, no lock-in

  • Type: Skill
  • Category: Growth & CRO
  • Delivery: Email · instant
  • License: One-time
forgehouse, ab-test-setup

Inside the run · no black box

See the actual work before you buy it.

Most experiments die from peeking and wishful math. This one starts with a written hypothesis and a locked sample size, then earns its verdict step by step:

  1. Writes the hypothesis in a fixed frame before anything else: because of [observation], we believe [change] will cause [outcome] for [audience], measured by [metric]. No written hypothesis, no test.
  2. Calculates the required sample size up front from baseline conversion rate, minimum detectable effect, a 5% significance level (95% confidence) and 80% power, then derives the test duration from daily traffic. The test does not stop before that number is reached.
  3. Defines three metric layers: one primary metric that calls the test, secondary metrics that explain why it moved, and guardrail metrics (revenue, bounce, downstream conversion) that kill the test if they degrade.
  4. Picks the implementation path (client-side, server-side or feature flag) and walks the pre-launch checklist: variants QA'd, tracking verified, users see the same variant on return visits.
  5. Runs a sample ratio mismatch (SRM) check before reading any result: pulls the actual assignment counts per variant and runs a chi-square test against the designed split. A p-value below 0.01 means randomization or tracking is broken, and the result is not interpreted at all.
  6. Reads the outcome on three axes: sample size reached, statistically significant, practically meaningful. Then segments by device, traffic source and new/returning visitors to catch Simpson's paradox before any winner is declared, and logs the test in the learning repository so failed ideas are never re-run.
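
To make step 2 concrete, here is a minimal sketch of the up-front sizing, assuming a two-variant test on a conversion rate and the standard two-proportion z-test approximation. The function names (sample_size_per_variant, duration_days) and the example traffic figures are illustrative assumptions, not the skill's own code.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in EACH variant (two-sided two-proportion z-test).

    baseline      -- current conversion rate, e.g. 0.05 for 5%
    relative_mde  -- minimum detectable effect as a relative lift, e.g. 0.10
    alpha, power  -- defaults mirror the 5% significance / 80% power above
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    if not (0 < p1 < 1 and 0 < p2 < 1 and p1 != p2):
        raise ValueError("baseline and MDE must give two distinct rates in (0, 1)")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under H0

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant, daily_visitors, variants=2):
    """Whole days needed to fill every variant at the given daily traffic."""
    return math.ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline, 10% relative lift, 4,000 visitors/day.
    n = sample_size_per_variant(baseline=0.05, relative_mde=0.10)
    print(n, "visitors per variant,", duration_days(n, 4000), "days")
```

With those hypothetical inputs the test needs on the order of 31,000 visitors per variant, which at 4,000 visitors a day works out to 16 days. Smaller effects get expensive fast: halving the detectable lift roughly quadruples the required sample.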
Use cases · what happens when you plug it in

One power source. 6 lines out.

ab-test-setup · core

core active · 6 lines

  1. Writing a strong, falsifiable hypothesis before touching the page

  2. Calculating required sample size and test duration up front

  3. Choosing between client-side, server-side, and feature-flag implementation

  4. Catching a broken experiment with a sample ratio mismatch check

  5. Segmenting results by device and source to avoid Simpson's paradox

  6. Documenting every test into a searchable learning repository

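
The sample ratio mismatch check in item 4 can be sketched in a few lines. This is a minimal version, assuming exactly two variants, so the chi-square test has one degree of freedom and its p-value can be computed with the standard library alone. The function name srm_check and the example counts are hypothetical; the 0.01 threshold is the one quoted in the run preview.

```python
import math


def srm_check(count_a, count_b, expected_share_a=0.5, threshold=0.01):
    """Chi-square goodness-of-fit test of observed vs designed traffic split.

    Returns (chi2, p_value, is_mismatch). Two variants only (1 degree of
    freedom); for more variants a general chi-square routine is needed.
    """
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a

    chi2 = (
        (count_a - expected_a) ** 2 / expected_a
        + (count_b - expected_b) ** 2 / expected_b
    )
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


if __name__ == "__main__":
    # Hypothetical assignment counts for a test designed as a 50/50 split.
    for a, b in [(50_200, 49_800), (50_600, 49_400)]:
        chi2, p, broken = srm_check(a, b)
        verdict = "SRM: do not interpret" if broken else "split looks fine"
        print(f"{a} vs {b}: chi2={chi2:.2f}, p={p:.4f} -> {verdict}")
```

In this example a gap of 400 users out of 100,000 passes, while a gap of 1,200 is flagged. The point of running it first is that a flagged split says something upstream is broken, so no conversion numbers from that test should be read at all.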
Benefits · what you walk away with

Yours to keep.

Forever

That's what owning means.

The rented stack

ai writing tool: subscription

expired · access lost

analytics suite: subscription

expired · access lost

design platform: subscription

expired · access lost

(nothing left)

Your forge

  1. Stop shipping false positives from peeking and early stopping

    license: perpetual
  2. Spend limited test slots on the highest-impact changes via ICE scoring

    license: perpetual
  3. Trust your results because validity is verified before any winner is declared

    license: perpetual
  4. Compound learnings by spreading winning patterns across the whole site

    license: perpetual

subscriptions expire · deeds don't

What's included · the full manifest

Everything in the box.

Hypothesis structure template (observation, change, effect, audience, metric)

6 parts · one working system · ships instantly by email

Who it's for

This wasn't forged for everyone.

  • Not for you if you'd rather rent a tool than own one.
  • Not for you if you want someone else to run your stack.
  • Not for you if you're happy guessing.
Still here? Good.

Growth, CRO, and product teams that want experiments grounded in statistical rigor instead of gut feeling and premature wins.

then this was forged for you.

Works with

Universal by design: this skill runs in any AI assistant. It is delivered in the open Agent Skills + MCP format (native in Claude); ChatGPT, Gemini, Cursor and Copilot adapt the same files their own way.

  • Claude: Native format
  • ChatGPT: Adapts via open standards
  • Gemini: Adapts via open standards
  • Cursor: Adapts via open standards
  • Copilot: Adapts via open standards
Questions · still in the air

Catch what's on your mind.


  1. Does this lock me into a specific testing platform like Optimizely or VWO?

    No, it covers client-side, server-side and feature-flag implementations, so you pick the method that fits your stack. The framework sits above the tool, not inside one.

  2. I can already tell when a variant is winning, so why pre-commit a sample size?

    Calling a winner early is the easiest way to mistake noise for a result, which is what the pre-committed sample size prevents. Without it, the same test can look like a win one day and a loss the next.

  3. Will it build the variant and run the experiment for me?

    No, it plans the hypothesis, sizes the test and analyzes the outcome. Building the variant and serving traffic stay on your side.

  4. How is it delivered?

    By email right after purchase: ready to run, downloaded instantly, no setup wait.

  5. One-time or subscription?

    A one-time purchase; no subscription or hidden fees. VAT (20%) is included.

  6. Can I get a refund?

    As a digital product, it can’t be refunded once downloaded. That’s why we show exactly what’s inside and who it’s for, right here.