project · Experimentation Guardrail Agent
stack · Python · scipy · scikit-learn · Anthropic Claude · pytest
author · Maksim Silchenko

A case study in decision-aware experimentation

An A/B test can be broken and still report a clean p-value.

Companies run thousands of A/B tests; a measurable fraction are quietly invalid before anyone reads the result - a leaky feature flag, a re-used randomisation seed, an unmeasured confounder. The metric test still runs. The p-value still prints. ab-guardrail is a command-line tool that reads the part the p-value can’t see: it checks the allocation for Sample Ratio Mismatch, runs the right metric test for the data’s shape, corrects for multiple testing, and propensity-matches away confounding - then returns a verdict, not a number. Routing and statistics are a deterministic pipeline; an LLM is invoked once, at the end, to write the summary.

10⁻¹³² · SRM chi-square p-value flagged on the compromised demo - a 61/39 split that no metric test would catch.

1 · LLM call on the default path - the closing summary. Routing & statistics are deterministic Python.

36 · Pytest cases - including messy-data and Criteo-adapter suites - green in CI on Python 3.10-3.12.

Peer-reviewed methods implemented - SRM, Newcombe, Welch, BH-FDR, PSM, Rosenbaum bounds, CUPED.

01 · The question

What is the most under-checked failure mode in industry A/B testing - and why does the metric test never see it?

A standard A/B-test analysis answers one question: did the metric move? It runs a t-test, prints a p-value, and stops. But that pipeline is blind to a whole class of failures that happen before the metric is ever computed.

The headline case is Sample Ratio Mismatch (SRM). If a feature flag silently drops part of the control arm, the experiment ships at, say, 45/55 instead of 50/50. The t-test on the metric runs perfectly - and its answer is biased, because the two groups are no longer comparable. Microsoft’s experimentation team found roughly 6% of their online tests carried an SRM, with systematically distorted effect estimates.

So the project refuses the brief as stated. Instead of “is the metric significant?” it answers “is this experiment safe to interpret at all?” - and returns one of three verdicts: SAFE TO ROLL OUT, EXPERIMENT COMPROMISED, or NO SIGNIFICANT EFFECT.

[Table: four guardrails and what each one catches; a naive t-test sees none of them.]

02 · Architecture

Four stages. Deterministic code routes the data and runs the statistics; the LLM writes the closing summary.

The design rule is a hard line: the LLM never computes a number. On the default path it only writes the closing summary. In the opt-in --mode agent it also reads column names and a few sample rows, decides which column is the variant and which are covariates, and calls tools. Every p-value, CI and effect size is produced by scipy.stats and scikit-learn. Letting an LLM weigh evidence or pick a threshold is the fast track to hallucinated significance.

01 · load

Defensive ingestion & data-quality report

data_loader.py ingests the CSV the way a real e-commerce export demands: malformed rows skipped, double-logged duplicates and all-null columns dropped, dirty numeric tokens coerced. Every repair is counted into a DataQualityReport - nothing is fixed silently - and the cleaned frame is profiled into a typed DatasetProfile.

DataQualityReport · DatasetProfile · raises DataValidationError
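The repair-and-count behaviour described above can be illustrated with a minimal standard-library sketch. It is not the repo's data_loader.py: the names `load_rows` and `RepairCounts`, and the choice to treat any unparseable numeric token as missing, are assumptions made purely for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RepairCounts:
    """Hypothetical stand-in for a data-quality report."""
    malformed_rows: int = 0
    duplicate_rows: int = 0
    coerced_tokens: int = 0


def load_rows(text, numeric_cols):
    """Return (clean_rows, counts); every repair is counted, none is silent."""
    counts = RepairCounts()
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    seen, rows = set(), []
    for raw in reader:
        if len(raw) != len(header):       # wrong field count: skip and count
            counts.malformed_rows += 1
            continue
        key = tuple(raw)
        if key in seen:                   # exact double-logged duplicate: drop
            counts.duplicate_rows += 1
            continue
        seen.add(key)
        row = dict(zip(header, raw))
        for col in numeric_cols:
            try:
                row[col] = float(row[col].strip())
            except ValueError:            # dirty token: coerce to missing
                row[col] = None
                counts.coerced_tokens += 1
        rows.append(row)
    return rows, counts
```

Returning the counts alongside the rows keeps the "nothing is fixed silently" property visible at the call site: a caller cannot obtain cleaned data without also receiving the tally of what was repaired.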

02 · route

Deterministic routing - no LLM in the hot path

A hardcoded heuristic identifies the variant column, the metrics and the pre-treatment covariates. Paying LLM latency to dispatch data to three deterministic functions is poor production engineering, so the default path does not. An opt-in --mode agent uses a Claude tool-use loop instead, for unfamiliar schemas where the heuristic cannot infer roles.

agent.py · infer_schema_heuristic() · --mode agent for tool-use

03 · guardrails

Three statistical engines, fully deterministic

srm.py - chi-square goodness-of-fit. metric_tests.py - Welch + Mann-Whitney for continuous metrics, chi-square + Newcombe CI for binary ones, plus BH-FDR correction and CUPED. causal.py - PSM with common-support trimming, bootstrap SE and Rosenbaum bounds.

pure functions · typed dataclass results · 36 pytest cases

04 · report

Verdict logic & Markdown render

report.py applies a deterministic verdict rule (the LLM does not vote) and renders a Markdown report with a love-plot of pre/post covariate balance. Claude then writes a plain-English summary constrained strictly to the computed numbers.

SAFE TO ROLL OUT · EXPERIMENT COMPROMISED · NO SIGNIFICANT EFFECT

03 · The twist

Two datasets, same schema. One ships. One is quietly broken.

The repo ships two synthetic 12,000-user experiments with an identical schema. clean_experiment.csv is a properly randomised test with a real lift. compromised_experiment.csv carries two simultaneous defects: a 61/39 allocation (SRM) and treatment that systematically captures higher-value users (confounding). A naive t-test would happily report a “significant” revenue lift on the broken one. The agent does not.

clean properly randomised test

SAFE

SRM p = 0.94 · conversion +3.35pp (p ≈ 7×10⁻⁸) · PSM ATT +2.25pp, same sign

Allocation is 50.0/50.0. The lift survives propensity matching and Rosenbaum sensitivity. The verdict logic clears it to roll out - no guardrail fired.

compromised SRM + confounding

COMPROMISED

SRM p ≈ 2×10⁻¹³² · 61/39 split · naive revenue +$0.57 collapses to +$0.42 ATT

The SRM check fires at a p-value with 132 leading zeros. Under the verdict rule, SRM detected ends the analysis: the experiment is uninterpretable until the randomiser is fixed, whatever the metric says.

The model was never the problem. The randomiser was - and a metric test, run on a 61/39 split, will still hand you a number that looks like a decision.

M. Silchenko · case-study verdict

04 · The three engines

Three statistical engines, each chosen to be defensible in a stats interview.

Every method here was picked because the obvious alternative has a known weakness. A Wald interval under-covers near 0 and 1; a paired-t SE is biased under matching-with-replacement. The repo uses the corrected form and says so, with the citation, in the docstring.

i.

SRM check

srm.py · chi-square goodness-of-fit

step · ALLOCATION INTEGRITY

chi2 = Σ (Oᵢ − Eᵢ)² / Eᵢ
df   = 1
SRM  := p < 0.001

# stricter than 0.05 on purpose:
# an SRM investigation is
# expensive; don't chase noise

Catches: broken randomiser

Small-cell guard: Eᵢ ≥ 5 enforced

Verdict weight: dominant · ends analysis
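The allocation check on the card above can be sketched in a few lines. This is an illustrative version, not the repo's srm.py: the function name `srm_check` is hypothetical, and it uses only the standard library by relying on the fact that, for df = 1, the chi-square survival function equals erfc(√(χ²/2)). The source describes the real implementation as using scipy.stats.

```python
import math


def srm_check(observed, expected_share=(0.5, 0.5), alpha=0.001):
    """Chi-square goodness-of-fit on two arm counts.

    Returns (chi2, p_value, srm_flag). With two arms df = 1, so the
    p-value has the closed form erfc(sqrt(chi2 / 2)).
    """
    if len(observed) != 2 or len(expected_share) != 2:
        raise ValueError("this sketch handles exactly two arms")
    total = sum(observed)
    expected = [total * share for share in expected_share]
    if min(expected) < 5:                 # small-cell guard: E_i >= 5
        raise ValueError("expected count below 5; chi-square is unreliable")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha
```

For example, `srm_check([7320, 4680])` (an exact 61/39 split of 12,000 users) gives a statistic of 580.8 and a p-value far below the 0.001 threshold, while `srm_check([6000, 6000])` gives a statistic of 0 and a p-value of 1.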

ii.

Metric tests

metric_tests.py · shape-aware

step · EFFECT ESTIMATION

binary     -> chi-square 2×2
            + Newcombe CI (1998)
continuous -> Welch's t-test
            + Mann-Whitney U
family     -> BH-FDR correction
optional   -> CUPED variance cut

CI method: Newcombe hybrid-score

Non-parametric check: Mann-Whitney U

Multiplicity: BH-FDR · opt Bonferroni
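The Newcombe hybrid-score interval named above combines a Wilson score interval for each arm into an interval for the difference of proportions. A minimal sketch follows; the function names are hypothetical, the 95% quantile is hard-coded, and nothing here is taken from metric_tests.py.

```python
import math

Z_95 = 1.959964  # two-sided 95% standard-normal quantile


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_diff_ci(x1, n1, x2, n2, z=Z_95):
    """Hybrid-score CI for p1 - p2, built from the two Wilson intervals."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper
```

With 56/70 conversions against 48/80, the difference is 0.20 and the interval is roughly (0.052, 0.334). Unlike a Wald interval, the bounds stay inside [−1, 1] and do not collapse to zero width when a proportion sits at 0 or 1.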

iii.

Propensity matching

causal.py · 1-NN with caliper

step · CAUSAL ADJUSTMENT

p̂   = LogisticRegression(X)
match = 1-NN, caliper
        0.2·SD(logit p̂)
SE    = bootstrap (not paired-t)
Γ     = Rosenbaum bounds
trim  = common-support region

Estimand: ATT

SE: bootstrap · 500 resamples

Sensitivity: Rosenbaum Γ bounds
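The matching step on the card above can be sketched as follows, assuming propensity scores have already been estimated (the source names LogisticRegression for that). The function `match_att` is hypothetical, and computing the caliper from the pooled standard deviation of the logit is an assumption; the repo may define it differently.

```python
import math
import statistics


def logit(p):
    return math.log(p / (1 - p))


def match_att(treated, controls, caliper_sd=0.2):
    """1-NN matching with replacement on the logit of the propensity.

    treated / controls: lists of (propensity, outcome) pairs.
    Returns (att, n_matched, n_dropped).
    """
    pooled = [logit(p) for p, _ in treated + controls]
    caliper = caliper_sd * statistics.stdev(pooled)
    diffs, dropped = [], 0
    for p_t, y_t in treated:
        # nearest control on the logit scale; controls may be re-used
        p_c, y_c = min(controls, key=lambda c: abs(logit(c[0]) - logit(p_t)))
        if abs(logit(p_c) - logit(p_t)) <= caliper:
            diffs.append(y_t - y_c)
        else:
            dropped += 1                  # no control inside the caliper
    if not diffs:
        raise ValueError("no treated unit found a match within the caliper")
    return statistics.fmean(diffs), len(diffs), dropped
```

Because controls can be matched more than once, the matched differences are not independent, which is why a paired-t standard error on `diffs` would be misleading.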

The naive paired-t standard error is biased under matching with replacement because re-used control units induce dependence (Abadie & Imbens, 2006). The repo bootstraps instead, and the report prints both so the bias is visible. The four interactive panels below let you drive each engine’s core mathematics directly.

★ interactive 1 · srm detector

A 1% imbalance is invisible. At scale, it is a chi-square catastrophe.

SRM is counter-intuitive because a tiny ratio error becomes overwhelming evidence once the sample is large. Move the two sliders - total users and the observed share landing in control - and watch the chi-square goodness-of-fit p-value. The threshold for declaring an SRM is p < 0.001.

[Interactive panel: sliders for total users in the experiment (2k to 400k) and the observed share in the control arm (44% to 56%), tested against a planned 50/50 split with a chi-square test on 1 degree of freedom. At the defaults of 40,000 users and a 50.0% control share, the observed counts are 20,000 / 20,000, the chi-square statistic Σ (O−E)²/E is 0.00 and the p-value is 1.00: no SRM, and the experiment is safe to interpret.]

test: chi-square goodness-of-fit vs. a planned 50/50 split, df = 1 · threshold: SRM declared at p < 0.001 · try: hold the split at 49.5/50.5 and push N upward - a half-point error crosses the threshold past ~110k users

05 · Where the LLM fits

The LLM writes the summary. It does not run the pipeline.

A tool-use loop could put the LLM in charge of routing - letting the model decide which check to run on which column. It works, but it is the wrong call for production: paying LLM latency and cost to dispatch data to three deterministic functions is not how a production data tool should be built.

So the default pipeline mode routes deterministically - a hardcoded heuristic - and invokes the LLM exactly once, at the very end, to turn the finished results JSON into a plain-English summary. That is the one place a language model genuinely adds value. The verdict itself is decided by deterministic code; the summary may not contradict it or introduce a number that is not in the JSON.

The tool-use agent still ships, as opt-in --mode agent, for datasets whose column roles a heuristic cannot infer. Knowing when not to spend the LLM is part of the design.

load · deterministic

Defensive CSV load: malformed rows skipped, duplicates and all-null columns dropped, dirty numerics coerced. A data-quality report records every repair.

route · deterministic

A hardcoded heuristic picks the variant column, metrics, and covariates. No LLM call.

infer_schema_heuristic(profile)

guardrails · deterministic

SRM, metric tests, BH-FDR, and PSM run as a plain Python pipeline. Every number from scipy / scikit-learn.

verdict · deterministic

A priority-ordered rule set in report.py decides SAFE / COMPROMISED / NO EFFECT. The model does not vote.

summary · the one LLM call

Claude receives the finished JSON and writes a 4-6 sentence executive summary - bounded strictly to the computed numbers.

narrate(results_payload)

★ interactive 2 · confounding studio

When assignment leaks from a covariate, the naive estimate lies. PSM reads through it.

Set a true treatment effect, then dial up how strongly the variant assignment is confounded with a pre-treatment covariate (higher-value users drifting into treatment). The naive difference of means absorbs the confounding bias; the PSM ATT matches it away and recovers the truth. When the two disagree on sign, the verdict logic flags the experiment.

[Interactive panel: sliders for the true treatment effect (−2.00 to +2.00) and the confounding strength (none to severe), with the covariate → outcome link held at ×4.0. At the defaults (true effect +0.40, confounding 0.55) the confounding bias added to the naive estimate is +2.20 = 4.0 · (X̄_T − X̄_C), so the unadjusted naive Δ-mean reads +2.60 while the PSM ATT, matched on the covariate, reads +0.42. PSM agrees with the naive estimate on sign: no sign flip.]

model: Y = 4·X + β_T·T + noise, with P(treated) rising in X · naive: E[Y|T=1] − E[Y|T=0] · PSM: ATT recovered after matching on X · try: set the true effect to 0 with heavy confounding - the naive estimate invents a winner

06 · Diagnostics & robustness

Every choice that could leak signal is tested out loud.

The PSM module does not stop at a point estimate. It trims to the common-support region, bootstraps the standard error, and reports how strong an unmeasured confounder would have to be to overturn the result.

Common-support trimming

enforced

Treated units with a propensity above the control maximum - and vice versa - are dropped before matching. Outside the overlap region, no causal claim is defensible, so the agent refuses to make one.

trimmed units: reported per arm

no overlap: raises StatisticalCheckError

Bootstrap standard error

corrected

The naive paired-t SE is biased under matching with replacement - re-used controls induce dependence (Abadie & Imbens, 2006). The module bootstraps the treated units, re-matches, and reports both SEs so the bias is visible.

resamples: 500 · re-match each

CI: 2.5 / 97.5 percentile
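The treated-unit bootstrap described above can be sketched like this. It is a simplified illustration, not causal.py: the function names are hypothetical, matching is done on a single scalar score without a caliper, and the percentile indices are a simple approximation of the 2.5 / 97.5 cut-offs.

```python
import random
import statistics


def nearest_control_att(treated, controls):
    """ATT from 1-NN matching (with replacement) on a scalar score."""
    diffs = []
    for score_t, y_t in treated:
        _, y_c = min(controls, key=lambda c: abs(c[0] - score_t))
        diffs.append(y_t - y_c)
    return statistics.fmean(diffs)


def bootstrap_att(treated, controls, n_resamples=500, seed=0):
    """Resample treated units, re-match each time, return (se, (lo, hi))."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(treated, k=len(treated))   # treated-unit bootstrap
        estimates.append(nearest_control_att(sample, controls))
    estimates.sort()
    lo = estimates[int(0.025 * n_resamples)]            # approx. 2.5th percentile
    hi = estimates[int(0.975 * n_resamples) - 1]        # approx. 97.5th percentile
    return statistics.stdev(estimates), (lo, hi)
```

Fixing the seed makes the interval reproducible, which matters if the report is expected to be byte-stable in CI.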

Rosenbaum bounds

sensitivity

A Wilcoxon signed-rank sensitivity analysis: at each hidden-bias level Γ, the worst-case two-sided p-value. The critical Γ is the strength of unmeasured confounding needed to overturn the ATT’s significance.

grid: Γ = 1.0 … 3.0

reports: critical Γ
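A rough sketch of the sensitivity calculation, under stated assumptions: it uses the large-sample normal approximation to the Wilcoxon signed-rank statistic and returns a one-sided worst-case p-value, whereas the text above describes a two-sided one. The function names, the 0.1 grid step and the α = 0.05 cut-off are all hypothetical choices for illustration.

```python
import math


def signed_rank_stat(diffs):
    """Sum of ranks of |d| over pairs with d > 0 (zeros dropped, ties averaged)."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1             # average rank for a block of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    t_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    return t_plus, len(nonzero)


def rosenbaum_upper_p(diffs, gamma):
    """Worst-case one-sided p-value at hidden-bias level gamma."""
    t_plus, n = signed_rank_stat(diffs)
    p_plus = gamma / (1 + gamma)          # max chance a pair favours treatment
    mean = p_plus * n * (n + 1) / 2
    var = p_plus * (1 - p_plus) * n * (n + 1) * (2 * n + 1) / 6
    z = (t_plus - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability


def critical_gamma(diffs, alpha=0.05, grid=None):
    """Smallest gamma on the grid whose worst-case p-value exceeds alpha."""
    grid = grid or [1.0 + 0.1 * k for k in range(21)]   # 1.0 ... 3.0
    for gamma in grid:
        if rosenbaum_upper_p(diffs, gamma) > alpha:
            return gamma
    return None                           # significance survives the whole grid
```

At Γ = 1 the calculation reduces to the ordinary signed-rank test; a larger critical Γ means a stronger unmeasured confounder would be needed to explain the effect away.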

LLM kept out of the math

by design

Routing and statistics are deterministic Python. The LLM never computes a p-value, weighs evidence, or picks the verdict - that is a deterministic rule in report.py. Its one job is a closing summary bounded to the computed numbers.

verdict: deterministic · code

numbers in summary: must exist in payload

Defensive data loading

production-shaped

Real e-commerce exports carry malformed rows, double-logged duplicates, all-null columns and dirty numeric tokens. The loader repairs each defect, counts it, and surfaces a data-quality report - nothing is fixed silently.

malformed / duplicate rows: skipped · dropped · counted

dirty numerics: coerced → missing

Validated on the Criteo dataset

14M rows

A shipped adapter maps the real Criteo Uplift dataset (13.98M-row randomised ad experiment) onto the tool’s schema. On a 300k-row sample the guardrail returns SAFE TO ROLL OUT - the correct call for a genuine RCT: SRM passes, both metrics lift significantly, and PSM barely moves the estimate, the expected signature of true randomisation. The exercise also caught three real-data bugs the synthetic demos could not.

Criteo sample: 300,204 rows

SRM: χ² p ≈ 0.98 · passes

PSM ATT vs naive: 0.0117 → 0.0101 (visit)

verdict: SAFE TO ROLL OUT

★ interactive 3 · multiple-testing lab

Check enough metrics and a false winner is almost guaranteed.

Each individual test at α = 0.05 is fine. Run a family of them and the probability that at least one fires by pure chance climbs fast. Drag the number of metrics and the per-test α - the curve is the chance of a bogus “significant” result under the null. Benjamini-Hochberg and Bonferroni hold the family in check.

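[Interactive panel: sliders for the number of metrics tested in the experiment (1 to 40) and the per-test α (0.01 to 0.10), plotted against the chance of at least one false winner.]

Uncorrected family-wise error: 1 − (1 − α)ᴰ, where D is the number of metrics tested. At the default α = 0.05 the panel reads 45.96%, which corresponds to D = 12. Bonferroni holds the family-wise error ≤ 0.05, and BH-FDR holds the false-discovery rate ≤ 5%. With the raw error far above 5%, the correction is worth it.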

curve: P(at least one false positive) = 1 − (1 − α)ᴰ under the global null · flat line: Bonferroni family-wise bound · repo: BH-FDR is the default in apply_multiple_testing_correction()
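Both quantities in this panel are short enough to write out. The sketch below is illustrative only: the function names are hypothetical and are not the repo's apply_multiple_testing_correction(), and the family-wise formula assumes the tests are independent.

```python
def family_wise_error(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests, global null."""
    return 1 - (1 - alpha) ** n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: one reject flag per hypothesis, input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank            # keep the largest rank that passes
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= cutoff_rank
    return reject
```

With twelve metrics at α = 0.05, `family_wise_error(12)` is about 0.4596. For the p-values `[0.001, 0.008, 0.039, 0.041, 0.042]` followed by five clearly non-significant ones, an uncorrected 0.05 threshold would flag five metrics, while `benjamini_hochberg` keeps only the first two.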

★ interactive 4 · cuped variance reduction

A pre-experiment covariate shrinks the confidence interval - for free, no bias.

CUPED (Deng et al., WSDM 2013) subtracts the part of the outcome that a pre-experiment covariate already explains. The point estimate is unchanged in expectation; the variance shrinks to (1 − ρ²) times its original value, a reduction of ρ². Drag the correlation between the covariate and the outcome and watch the 95% interval contract.

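[Interactive panel: a slider for ρ, the correlation between the pre-experiment covariate and the outcome (0.00 to 0.95), with the 95% interval shown before and after CUPED.]

θ = Cov(Y, X) / Var(X)
Y′ = Y − θ(X − X̄)
Var(Y′) / Var(Y) = 1 − ρ²

At the default ρ = 0.60: Var(Y′) / Var(Y) = 0.64, a variance reduction of 36% (= ρ²), and a CI width of 80% of baseline (= √(1 − ρ²)). Same estimate, tighter CI: the point estimate is unchanged in expectation.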

bars: 95% confidence interval on the treatment effect, before and after CUPED · identity: Var(Y′) = Var(Y)·(1 − ρ²) · repo: apply_cuped() in metric_tests.py · exposed as --cuped
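The adjustment itself is two lines of arithmetic. The sketch below is a hypothetical `cuped_adjust`, not the repo's apply_cuped(); it estimates θ from whatever sample it is given, whereas in practice θ is usually estimated on the pooled data from both arms.

```python
import statistics


def cuped_adjust(y, x):
    """Return (adjusted outcomes, theta) with theta = Cov(Y, X) / Var(X)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / statistics.variance(x)
    # Y' = Y - theta * (X - mean(X)); centring X leaves the mean of Y unchanged
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)], theta
```

Because the covariate is centred, the mean of the adjusted outcome equals the mean of the original, and its sample variance equals the original variance times (1 − r²), where r is the sample correlation between the covariate and the outcome.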

07 · The verdict

One rule set. Applied in order. The model does not vote.

The verdict is a deterministic function in report.py. The LLM writes the summary; it never decides the outcome. The rules are applied in priority order, and the first one that matches wins.

SRM dominates everything. If the allocation fails the chi-square check, the experiment is compromised - no metric result can rescue it, because the groups are not comparable.

Below SRM sit the causal rules: a sign flip between the naive estimate and the PSM ATT, or a significant naive effect that collapses below half its size once matched and whose bootstrap CI now covers zero. Either pattern means confounding, and the verdict turns.

Only if no guardrail fires does a significant primary metric earn a SAFE TO ROLL OUT.

Rule 1 · SRM detected (χ² p < 0.001) → COMPROMISED

Rule 2 · PSM sign flip vs. naive → COMPROMISED

Rule 3 · ATT < 50% of naive, CI ∋ 0 → COMPROMISED

Rule 4 · primary metric significant → SAFE TO ROLL OUT

Rule 5 · otherwise → NO SIGNIFICANT EFFECT

clean_experiment.csv → SAFE TO ROLL OUT

compromised_experiment.csv → EXPERIMENT COMPROMISED
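The priority-ordered rule set above translates naturally into a chain of early returns. This is a sketch under assumptions, not the function in report.py: the `Evidence` container, its field names and the 0.05 significance level for the metric tests are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical container for the numbers the verdict rule reads."""
    srm_p: float
    naive_effect: float
    naive_p: float
    att: float
    att_ci: tuple        # (lower, upper) bootstrap interval for the ATT
    primary_p: float     # p-value of the primary metric


def verdict(e, srm_alpha=0.001, alpha=0.05):
    """Apply the five rules in priority order; the first match wins."""
    if e.srm_p < srm_alpha:                               # Rule 1: SRM
        return "EXPERIMENT COMPROMISED"
    if e.naive_effect * e.att < 0:                        # Rule 2: sign flip
        return "EXPERIMENT COMPROMISED"
    collapsed = abs(e.att) < 0.5 * abs(e.naive_effect)
    covers_zero = e.att_ci[0] <= 0 <= e.att_ci[1]
    if e.naive_p < alpha and collapsed and covers_zero:   # Rule 3: collapse
        return "EXPERIMENT COMPROMISED"
    if e.primary_p < alpha:                               # Rule 4
        return "SAFE TO ROLL OUT"
    return "NO SIGNIFICANT EFFECT"                        # Rule 5
```

Putting the SRM test first means a tiny metric p-value can never reach Rule 4 when the allocation is broken, which is the behaviour the compromised demo relies on.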

08 · What this project demonstrates

Decision-aware experimentation

Reframing “is the metric significant?” as “is this experiment safe to interpret?” - SRM-first, with a three-state verdict instead of a bare p-value.

Causal inference

Propensity score matching with caliper, common-support trimming, bootstrap standard errors, and Rosenbaum sensitivity bounds for hidden bias.

Statistical rigour

Newcombe hybrid-score intervals over Wald, Welch over Student, Mann-Whitney as a non-parametric check, BH-FDR for multiplicity, CUPED for variance reduction.

Pragmatic LLM engineering

The LLM is scoped to one job - the closing summary - because routing data to deterministic functions does not need a model. A tool-use agent ships opt-in for unfamiliar schemas: knowing when not to spend the LLM is the point.

Production-shaped Python

Typed dataclasses, a custom exception hierarchy, pure guardrail functions, a 36-test pytest suite, ruff lint, and an installable package with a console entry point.

Reproducibility

A deterministic no-LLM mode for byte-stable CI, GitHub Actions across Python 3.10-3.12, pinned dependencies, and a synthetic data generator shipped with the repo.

09 · Honest limits

Four caveats that travel with the tool.

Synthetic demo data

The two shipped datasets are generated, with a known ground truth, to make the SRM and confounding behaviour visible and testable. On real data the agent is only as good as the covariates it is given.

PSM is not randomisation

Propensity matching only adjusts for observed covariates. It is reported as a robustness check with Rosenbaum bounds attached - not as a substitute for a correctly randomised experiment.

Out of scope by construction

The agent does not detect novelty effects, peeking, or network/SUTVA violations between arms. Those are problems of experiment design, not post-hoc analysis.

Bootstrap, not Abadie-Imbens

The matched-pair SE is a treated-unit bootstrap, the common recipe - not the analytical Abadie-Imbens estimator. For a published causal estimate, the analytical form would be the next step.

End of case study

Read the source, the tests, and the two demo reports.

The repository contains the three guardrail engines, the deterministic routing pipeline, the opt-in tool-use agent, the verdict logic, a 36-test pytest suite running in CI on Python 3.10-3.12, the defensive CSV loader, the three demo datasets, and the Criteo adapter.

