M. Silchenko · Case Study
project · Experimentation Guardrail Agent
stack · Python · scipy · scikit-learn · Anthropic Claude · pytest
author · Maksim Silchenko
A case study in decision-aware experimentation

An A/B test can be broken
and still report a clean p-value.

Companies run thousands of A/B tests; a measurable fraction are quietly invalid before anyone reads the result - a leaky feature flag, a re-used randomisation seed, an unmeasured confounder. The metric test still runs. The p-value still prints. ab-guardrail is a command-line tool that reads the part the p-value can’t see: it checks the allocation for Sample Ratio Mismatch, runs the right metric test for the data’s shape, corrects for multiple testing, and propensity-matches away confounding - then returns a verdict, not a number. Routing and statistics are a deterministic pipeline; an LLM is invoked once, at the end, to write the summary.

10⁻¹³² · SRM chi-square p-value flagged on the compromised demo - a 61/39 split that no metric test would catch.
1 · LLM call on the default path - the closing summary. Routing & statistics are deterministic Python.
36 · Pytest cases - including messy-data and Criteo-adapter suites - green in CI on Python 3.10-3.12.
7 · Peer-reviewed methods implemented - SRM, Newcombe, Welch, BH-FDR, PSM, Rosenbaum bounds, CUPED.
01 · The question

What is the most under-checked failure mode in industry A/B testing - and why does the metric test never see it?

A standard A/B-test analysis answers one question: did the metric move? It runs a t-test, prints a p-value, and stops. But that pipeline is blind to a whole class of failures that happen before the metric is ever computed.

The headline case is Sample Ratio Mismatch (SRM). If a feature flag silently drops part of the control arm, the experiment ships at, say, 45/55 instead of 50/50. The t-test on the metric runs perfectly - and its answer is biased, because the two groups are no longer comparable. Microsoft’s experimentation team found roughly 6% of their online tests carried an SRM, with systematically distorted effect estimates.

So the project refuses the brief as stated. Instead of “is the metric significant?” it answers “is this experiment safe to interpret at all?” - and returns one of three verdicts: SAFE TO ROLL OUT, EXPERIMENT COMPROMISED, or NO SIGNIFICANT EFFECT.

Four guardrails · what each one catches
a naive t-test sees none of the rows below
CHECK | FAILURE IT CATCHES
Chi-square SRM goodness-of-fit · p < 0.001 | broken randomiser · leaky flag (the dominant silent failure)
Welch + Mann-Whitney + Newcombe CI for binary | heavy tails · boundary proportion (a Wald CI under-covers)
Benjamini-Hochberg FDR correction across metrics | metric-shopping · false winners (each test fine, the family is not)
Propensity Score Matching + bootstrap SE · Rosenbaum bounds | confounding · hidden bias (the naive Δ-mean is biased)
accent rows in the original layout = the two failures a metric test is structurally blind to
02 · Architecture

Four stages. Deterministic code routes the data and runs the statistics; the LLM only writes the summary.

The design rule is a hard line: the LLM never computes a number. On the default path it does not route either: a heuristic assigns the column roles. Only in the opt-in --mode agent does the model read column names and a few sample rows, decide which column is the variant and which are covariates, and call tools. Every p-value, CI and effect size is produced by scipy.stats and scikit-learn. Letting an LLM weigh evidence or pick a threshold is the fast track to hallucinated significance.

01 · load
Defensive ingestion & data-quality report
data_loader.py ingests the CSV the way a real e-commerce export demands: malformed rows skipped, double-logged duplicates and all-null columns dropped, dirty numeric tokens coerced. Every repair is counted into a DataQualityReport - nothing is fixed silently - and the cleaned frame is profiled into a typed DatasetProfile.
DataQualityReport · DatasetProfile · raises DataValidationError
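The loading stage can be sketched with pandas. This is a minimal illustration, not the code in data_loader.py: the function name load_defensively, the report keys and the sample CSV are all hypothetical, and the real loader also builds a typed profile that is omitted here.

```python
import io

import pandas as pd


def load_defensively(csv_text: str, numeric_cols: list[str]) -> tuple[pd.DataFrame, dict[str, int]]:
    """Clean a raw CSV export and count every repair (hypothetical helper)."""
    report: dict[str, int] = {}

    # Rows with too many fields are skipped by pandas instead of raising.
    data_lines = sum(1 for line in csv_text.splitlines() if line.strip()) - 1  # minus header
    df = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
    report["malformed_rows"] = data_lines - len(df)

    # Double-logged events show up as exact duplicate rows.
    before = len(df)
    df = df.drop_duplicates().reset_index(drop=True)
    report["duplicate_rows"] = before - len(df)

    # Columns with no values at all carry no information.
    null_cols = [c for c in df.columns if df[c].isna().all()]
    df = df.drop(columns=null_cols)
    report["all_null_columns"] = len(null_cols)

    # Dirty numeric tokens become missing values rather than crashing later.
    coerced = 0
    for col in numeric_cols:
        missing_before = int(df[col].isna().sum())
        df[col] = pd.to_numeric(df[col], errors="coerce")
        coerced += int(df[col].isna().sum()) - missing_before
    report["coerced_numeric_tokens"] = coerced
    return df, report


# Tiny made-up export: one duplicate, one over-long row, one dirty token, one empty column.
raw = """user_id,variant,revenue,notes
1,control,10.5,
2,treatment,abc,
2,treatment,abc,
3,control,7.0,,extra
4,treatment,3.25,
"""
clean, quality = load_defensively(raw, numeric_cols=["revenue"])
print(quality)
```

The point of returning the counts alongside the frame is the one the text makes: every repair is visible to the caller, so nothing is fixed silently.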
02 · route
Deterministic routing - no LLM in the hot path
A hardcoded heuristic identifies the variant column, the metrics and the pre-treatment covariates. Paying LLM latency to dispatch data to three deterministic functions is poor production engineering, so the default path does not. An opt-in --mode agent uses a Claude tool-use loop instead, for unfamiliar schemas where the heuristic cannot infer roles.
agent.py · infer_schema_heuristic() · --mode agent for tool-use
03 · guardrails
Three statistical engines, fully deterministic
srm.py - chi-square goodness-of-fit. metric_tests.py - Welch + Mann-Whitney or chi-square + Newcombe CI, plus BH-FDR correction and CUPED. causal.py - PSM with common-support trimming, bootstrap SE and Rosenbaum bounds.
pure functions · typed dataclass results · 36 pytest cases
04 · report
Verdict logic & Markdown render
report.py applies a deterministic verdict rule (the LLM does not vote) and renders a Markdown report with a love-plot of pre/post covariate balance; Claude then writes a plain-English summary constrained strictly to the computed numbers.
SAFE TO ROLL OUT · EXPERIMENT COMPROMISED · NO SIGNIFICANT EFFECT
03 · The twist

Two datasets, same schema. One ships. One is quietly broken.

The repo ships two synthetic 12,000-user experiments with an identical schema. clean_experiment.csv is a properly randomised test with a real lift. compromised_experiment.csv carries two simultaneous defects: a 61/39 allocation (SRM) and treatment that systematically captures higher-value users (confounding). A naive t-test would happily report a “significant” revenue lift on the broken one. The agent does not.

clean · properly randomised test
SAFE
SRM p = 0.94 · conversion +3.35pp (p ≈ 7×10⁻⁸) · PSM ATT +2.25pp, same sign
Allocation is 50.0/50.0. The lift survives propensity matching and Rosenbaum sensitivity. The verdict logic clears it to roll out - no guardrail fired.
compromised · SRM + confounding
COMPROMISED
SRM p ≈ 2×10⁻¹³² · 61/39 split · naive revenue +$0.57 collapses to +$0.42 ATT
The SRM check fires at a p-value on the order of 10⁻¹³². Under the verdict rule, a detected SRM ends the analysis: the experiment is uninterpretable until the randomiser is fixed, whatever the metric says.
The model was never the problem. The randomiser was - and a metric test, run on a 61/39 split, will still hand you a number that looks like a decision.
M. Silchenko · case-study verdict
04 · The three engines

Three statistical engines, each chosen to be defensible in a stats interview.

Every method here was picked because the obvious alternative has a known weakness. A Wald interval under-covers near 0 and 1; a paired-t SE is biased under matching-with-replacement. The repo uses the corrected form and says so, with the citation, in the docstring.

i.
SRM check
srm.py · chi-square goodness-of-fit
step · ALLOCATION INTEGRITY
chi2 = Σ (Oᵢ − Eᵢ)² / Eᵢ
df   = 1
SRM  := p < 0.001

# stricter than 0.05 on purpose:
# an SRM investigation is
# expensive; don't chase noise
Catches · broken randomiser
Small-cell guard · Eᵢ ≥ 5 enforced
Verdict weight · dominant · ends analysis
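A minimal sketch of the allocation check above, using scipy.stats.chisquare. The function name srm_check and its return keys are assumptions for illustration, not the signature in srm.py; the planned split is taken to be 50/50 unless stated otherwise.

```python
from scipy.stats import chisquare


def srm_check(n_control: int, n_treatment: int,
              expected_share_control: float = 0.5, alpha: float = 0.001) -> dict:
    """Chi-square goodness-of-fit of the observed allocation against the planned split."""
    total = n_control + n_treatment
    expected = [total * expected_share_control, total * (1 - expected_share_control)]

    # Small-cell guard: the chi-square approximation is unreliable below 5 expected units.
    if min(expected) < 5:
        raise ValueError("expected cell count below 5")

    stat, p_value = chisquare([n_control, n_treatment], f_exp=expected)  # df = 1 for two arms
    return {"chi2": float(stat), "p_value": float(p_value), "srm": bool(p_value < alpha)}


print(srm_check(6000, 6000))   # balanced allocation
print(srm_check(7320, 4680))   # a 61/39 split of 12,000 users
```

The strict alpha of 0.001 is the threshold stated in the card; it is a parameter here so the trade-off stays explicit.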
ii.
Metric tests
metric_tests.py · shape-aware
step · EFFECT ESTIMATION
binary     -> chi-square 2×2
            + Newcombe CI (1998)
continuous -> Welch's t-test
            + Mann-Whitney U
family     -> BH-FDR correction
optional   -> CUPED variance cut
CI method · Newcombe hybrid-score
Non-parametric check · Mann-Whitney U
Multiplicity · BH-FDR · opt Bonferroni
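The shape-aware tests can be sketched as follows. The Newcombe hybrid-score interval is built from two Wilson intervals; the continuous branch calls scipy's Welch and Mann-Whitney tests. Function names are hypothetical and the 95% level is an assumption; this is not the code in metric_tests.py.

```python
import math

import numpy as np
from scipy.stats import mannwhitneyu, norm, ttest_ind


def wilson_interval(x: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a single proportion x/n."""
    z = norm.ppf(1 - (1 - conf) / 2)
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def newcombe_diff_ci(x1: int, n1: int, x2: int, n2: int, conf: float = 0.95) -> tuple[float, float]:
    """Newcombe hybrid-score interval for p1 - p2, combining the two Wilson intervals."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, conf)
    l2, u2 = wilson_interval(x2, n2, conf)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def continuous_tests(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Welch's t-test (unequal variances) plus a Mann-Whitney U check; returns both p-values."""
    welch = ttest_ind(a, b, equal_var=False)
    mwu = mannwhitneyu(a, b, alternative="two-sided")
    return float(welch.pvalue), float(mwu.pvalue)


low, high = newcombe_diff_ci(56, 70, 48, 80)
print(f"difference CI: [{low:.4f}, {high:.4f}]")
```

Unlike a Wald interval, the Wilson limits never leave [0, 1], which is why the combined interval behaves sensibly for proportions near the boundary.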
iii.
Propensity matching
causal.py · 1-NN with caliper
step · CAUSAL ADJUSTMENT
p̂   = LogisticRegression(X)
match = 1-NN, caliper
        0.2·SD(logit p̂)
SE    = bootstrap (not paired-t)
Γ     = Rosenbaum bounds
trim  = common-support region
Estimand · ATT
SE · bootstrap · 500 resamples
Sensitivity · Rosenbaum Γ bounds

The naive paired-t standard error is biased under matching with replacement because re-used control units induce dependence (Abadie & Imbens, 2006). The repo bootstraps instead, and the report prints both so the bias is visible. The four interactive panels below let you drive each engine’s core mathematics directly.

★ interactive 1 · srm detector

A 1% imbalance is invisible. At scale, it is a chi-square catastrophe.

SRM is counter-intuitive because a tiny ratio error becomes overwhelming evidence once the sample is large. Move the two sliders - total users and the observed share landing in control - and watch the chi-square goodness-of-fit p-value. The threshold for declaring an SRM is p < 0.001.

Panel controls: total users in the experiment (slider, 2k to 400k; shown at 40,000) · observed share in control arm (slider, 44% to 56%; shown at 50.0%)
Expected (planned): 50 / 50 · observed counts: 20,000 / 20,000
Chi-square statistic, 1 degree of freedom: Σ (O−E)²/E = 0.00 · p-value = 1.00, consistent with 50/50
Readout: No SRM - the experiment is safe to interpret

test: chi-square goodness-of-fit vs. a planned 50/50 split, df = 1 · threshold: SRM declared at p < 0.001 · try: hold the split at 49.5/50.5 and push N upward - a half-point error crosses the threshold near 108k users
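Why scale matters can be made concrete. For a planned 50/50 split with observed control share s and N users, the two-cell statistic simplifies to χ² = 4N(s − 0.5)², so the smallest N that trips the threshold follows directly from the critical value. The sketch below assumes exactly that setup; users_needed_for_srm is a hypothetical helper, not part of the repo.

```python
import math

from scipy.stats import chi2


def users_needed_for_srm(control_share: float, alpha: float = 0.001) -> float:
    """Smallest total N at which a fixed split is declared an SRM against 50/50."""
    critical = chi2.isf(alpha, df=1)          # chi-square critical value, 1 degree of freedom
    deviation = abs(control_share - 0.5)
    if deviation == 0:
        return math.inf                       # a perfect split never fires
    return critical / (4 * deviation**2)      # solve 4 * N * deviation^2 = critical for N


for share in (0.45, 0.49, 0.495):
    print(f"control share {share:.3f}: SRM declared from N ≈ {users_needed_for_srm(share):,.0f}")
```

Because N enters linearly and the deviation enters squared, halving the imbalance quadruples the sample needed to detect it.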

05 · Where the LLM fits

The LLM writes the summary. It does not run the pipeline.

A tool-use loop could put the LLM in charge of routing - letting the model decide which check to run on which column. It works, but it is the wrong call for production: paying LLM latency and cost to dispatch data to three deterministic functions is not how a production data tool should be built.

So the default pipeline mode routes deterministically - a hardcoded heuristic - and invokes the LLM exactly once, at the very end, to turn the finished results JSON into a plain-English summary. That is the one place a language model genuinely adds value. The verdict itself is decided by deterministic code; the summary may not contradict it or introduce a number that is not in the JSON.

The tool-use agent still ships, as opt-in --mode agent, for datasets whose column roles a heuristic cannot infer. Knowing when not to spend the LLM is part of the design.

1
load · deterministic
Defensive CSV load: malformed rows skipped, duplicates and all-null columns dropped, dirty numerics coerced. A data-quality report records every repair.
2
route · deterministic
A hardcoded heuristic picks the variant column, metrics, and covariates. No LLM call.
infer_schema_heuristic(profile)
3
guardrails · deterministic
SRM, metric tests, BH-FDR, and PSM run as a plain Python pipeline. Every number from scipy / scikit-learn.
4
verdict · deterministic
A priority-ordered rule set in report.py decides SAFE / COMPROMISED / NO EFFECT. The model does not vote.
5
summary · the one LLM call
Claude receives the finished JSON and writes a 4-6 sentence executive summary - bounded strictly to the computed numbers.
narrate(results_payload)
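The constraint on step 5, that the summary may not introduce a number missing from the JSON, lends itself to a mechanical check. The sketch below is one hypothetical way to enforce it and is not taken from the repo: ungrounded_numbers and the payload layout are assumptions, and it matches numbers literally, so a summary that rounds a value differently from the payload would be flagged.

```python
import math
import re

# A number token not glued to a preceding word character or dot (so "4-6" yields 4 and 6).
NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def payload_numbers(node) -> set[float]:
    """Collect every number found in a nested results payload, including inside strings."""
    if isinstance(node, bool):
        return set()
    if isinstance(node, (int, float)):
        return {float(node)}
    if isinstance(node, str):
        return {float(tok) for tok in NUMBER.findall(node)}
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, (list, tuple)):
        found: set[float] = set()
        for item in node:
            found |= payload_numbers(item)
        return found
    return set()


def ungrounded_numbers(summary: str, payload: dict) -> list[float]:
    """Return numbers that appear in the summary but nowhere in the payload."""
    allowed = payload_numbers(payload)
    mentioned = [float(tok) for tok in NUMBER.findall(summary)]
    return [x for x in mentioned
            if not any(math.isclose(x, a, rel_tol=1e-9, abs_tol=1e-12) for a in allowed)]


# Illustrative payload only; the field names are made up for this example.
payload = {"verdict": "SAFE TO ROLL OUT", "srm": {"p_value": 0.94, "split": "50/50"},
           "conversion_lift_pp": 3.35}
print(ungrounded_numbers("SRM p = 0.94 and conversion rose 3.35 pp.", payload))
print(ungrounded_numbers("Conversion rose 4.1 pp.", payload))
```

A non-empty result would be grounds to reject or regenerate the summary, keeping the model's one job bounded to the computed numbers.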
★ interactive 2 · confounding studio

When assignment leaks from a covariate, the naive estimate lies. PSM reads through it.

Set a true treatment effect, then dial up how strongly the variant assignment is confounded with a pre-treatment covariate (higher-value users drifting into treatment). The naive difference of means absorbs the confounding bias; the PSM ATT matches it away and recovers the truth. When the two disagree on sign, the verdict logic flags the experiment.

Panel controls: true treatment effect (slider, −2.00 to +2.00; shown at +0.40) · confounding strength (slider, none to severe; shown at 0.55)
Covariate → outcome link held at ×4.0
Confounding bias added to the naive estimate: +2.20 = 4.0 · (X̄_T − X̄_C)
Naive Δ-mean: +2.60 (unadjusted · biased) · PSM ATT: +0.42 (matched on the covariate)
Readout: PSM agrees with naive - no sign flip · effect is robust

model: Y = 4·X + β_T·T + noise, with P(treated) rising in X · naive: E[Y|T=1] − E[Y|T=0] · PSM: ATT recovered after matching on X · try: set the true effect to 0 with heavy confounding - the naive estimate invents a winner
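The panel's model can be simulated in a few lines. This is a sketch under stated assumptions, not causal.py: a single standard-normal covariate, a logistic assignment rule whose slope plays the role of "confounding strength", and 1-NN matching with replacement inside a caliper of 0.2·SD(logit p̂). Common-support trimming and the bootstrap SE are left out, and the names simulate and psm_att are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors


def simulate(n: int, true_effect: float, confounding: float, seed: int = 0):
    """Y = 4*X + effect*T + noise, with P(treated) rising in X."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    t = rng.random(n) < 1 / (1 + np.exp(-confounding * x))   # boolean treatment flag
    y = 4.0 * x + true_effect * t + rng.normal(size=n)
    return x, t, y


def psm_att(x: np.ndarray, t: np.ndarray, y: np.ndarray, caliper_sd: float = 0.2) -> float:
    """ATT from 1-NN propensity matching (with replacement) on the logit scale."""
    features = x.reshape(-1, 1)
    ps = LogisticRegression().fit(features, t.astype(int)).predict_proba(features)[:, 1]
    logit = np.log(ps / (1 - ps))
    caliper = caliper_sd * logit.std()

    treated, control = np.flatnonzero(t), np.flatnonzero(~t)
    nn = NearestNeighbors(n_neighbors=1).fit(logit[control].reshape(-1, 1))
    dist, idx = nn.kneighbors(logit[treated].reshape(-1, 1))

    keep = dist[:, 0] <= caliper                  # drop treated units with no close control
    matched_control = control[idx[keep, 0]]
    return float(np.mean(y[treated[keep]] - y[matched_control]))


x, t, y = simulate(n=5000, true_effect=0.4, confounding=1.0)
naive = y[t].mean() - y[~t].mean()
print(f"naive Δ-mean = {naive:+.2f}, PSM ATT = {psm_att(x, t, y):+.2f}")
```

With the confounding argument set to 0 the two estimates should agree up to noise, which is the signature of true randomisation the text describes.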

06 · Diagnostics & robustness

Every choice that could leak signal is tested out loud.

The PSM module does not stop at a point estimate. It trims to the common-support region, bootstraps the standard error, and reports how strong an unmeasured confounder would have to be to overturn the result.

Common-support trimming
enforced
Treated units with a propensity above the control maximum - and vice versa - are dropped before matching. Outside the overlap region, no causal claim is defensible, so the agent refuses to make one.
trimmed units · reported per arm
no overlap · raises StatisticalCheckError
Bootstrap standard error
corrected
The naive paired-t SE is biased under matching with replacement - re-used controls induce dependence (Abadie & Imbens, 2006). The module bootstraps the treated units, re-matches, and reports both SEs so the bias is visible.
resamples · 500 · re-match each
CI · 2.5 / 97.5 percentile
Rosenbaum bounds
sensitivity
A Wilcoxon signed-rank sensitivity analysis: at each hidden-bias level Γ, it reports the worst-case two-sided p-value. The critical Γ is the strength of unmeasured confounding needed to overturn the ATT’s significance.
grid · Γ = 1.0 … 3.0
reports · critical Γ
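A sketch of the sensitivity calculation, under assumptions that differ slightly from the card: it uses the large-sample normal approximation to the Wilcoxon signed-rank statistic and returns the one-sided upper-bound p-value for a positive effect, whereas the module is described as reporting a two-sided value. The function names, the 0.1 grid step and the alpha of 0.05 are hypothetical.

```python
import numpy as np
from scipy.stats import norm, rankdata


def rosenbaum_upper_p(pair_diffs, gamma: float) -> float:
    """Worst-case one-sided p-value of the signed-rank test at hidden-bias level gamma."""
    d = np.asarray(pair_diffs, dtype=float)
    d = d[d != 0]                                  # zero differences carry no sign information
    n = d.size
    ranks = rankdata(np.abs(d))
    t_plus = ranks[d > 0].sum()                    # sum of ranks of positive differences

    # Under bias gamma, each pair favours treatment with probability at most gamma / (1 + gamma).
    p_plus = gamma / (1 + gamma)
    mean = p_plus * n * (n + 1) / 2
    var = p_plus * (1 - p_plus) * n * (n + 1) * (2 * n + 1) / 6
    return float(norm.sf((t_plus - mean) / np.sqrt(var)))


def critical_gamma(pair_diffs, alpha: float = 0.05, step: float = 0.1):
    """First gamma on the 1.0 ... 3.0 grid where significance is lost, else None."""
    for gamma in np.arange(1.0, 3.0 + step / 2, step):
        if rosenbaum_upper_p(pair_diffs, gamma) > alpha:
            return float(round(gamma, 2))
    return None


rng = np.random.default_rng(1)
diffs = rng.normal(loc=0.3, scale=1.0, size=400)   # synthetic matched-pair outcome differences
print("critical Γ:", critical_gamma(diffs))
```

At Γ = 1 the bound reduces to the ordinary signed-rank test; a critical Γ close to 1 means even mild hidden bias could explain the result away.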
LLM kept out of the math
by design
Routing and statistics are deterministic Python. The LLM never computes a p-value, weighs evidence, or picks the verdict - that is a deterministic rule in report.py. Its one job is a closing summary bounded to the computed numbers.
verdict · deterministic · code
numbers in summary · must exist in payload
Defensive data loading
production-shaped
Real e-commerce exports carry malformed rows, double-logged duplicates, all-null columns and dirty numeric tokens. The loader repairs each defect, counts it, and surfaces a data-quality report - nothing is fixed silently.
malformed / duplicate rows · skipped · dropped · counted
dirty numerics · coerced → missing
Validated on the Criteo dataset
14M rows
A shipped adapter maps the real Criteo Uplift dataset (13.98M-row randomised ad experiment) onto the tool’s schema. On a 300k-row sample the guardrail returns SAFE TO ROLL OUT - the correct call for a genuine RCT: SRM passes, both metrics lift significantly, and PSM barely moves the estimate, the expected signature of true randomisation. The exercise also caught three real-data bugs the synthetic demos could not.
Criteo sample · 300,204 rows
SRM · χ² p ≈ 0.98 · passes
PSM ATT vs naive · 0.0117 → 0.0101 (visit)
verdict · SAFE TO ROLL OUT
★ interactive 3 · multiple-testing lab

Check enough metrics and a false winner is almost guaranteed.

Each individual test at α = 0.05 is fine. Run a family of them and the probability that at least one fires by pure chance climbs fast. Drag the number of metrics and the per-test α - the curve is the chance of a bogus “significant” result under the null. Benjamini-Hochberg and Bonferroni hold the family in check.

Panel controls: metrics tested in the experiment (slider, 1 to 40; shown at 12) · per-test α (slider, 0.01 to 0.10; shown at 0.05)
Uncorrected family-wise error: 45.96% = 1 − (1 − α)^m, with m metrics
Bonferroni holds family-wise error ≤ 0.05
P(≥1 false winner), raw: 45.96% (uncorrected) · same, with BH-FDR: ≤ 5% (false-discovery rate held)
Readout: Correction worth it - raw error far above 5%

curve: P(at least one false positive) = 1 − (1 − α)^m under the global null, for m metrics · flat line: Bonferroni family-wise bound · repo: BH-FDR is the default in apply_multiple_testing_correction()
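Both quantities in this panel are short enough to compute by hand. The sketch below evaluates the uncorrected family-wise error and applies the Benjamini-Hochberg step-up rule to a made-up vector of p-values; benjamini_hochberg is a stand-in written for illustration, not the repo's apply_multiple_testing_correction().

```python
import numpy as np


def family_wise_error(alpha: float, m: int) -> float:
    """P(at least one false positive) across m independent tests under the global null."""
    return 1 - (1 - alpha) ** m


def benjamini_hochberg(p_values, q: float = 0.05) -> np.ndarray:
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m       # k-th smallest p is compared with q*k/m
    passed = p[order] <= thresholds

    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.flatnonzero(passed).max()           # largest rank that clears its threshold
        reject[order[: k + 1]] = True              # reject it and every smaller p-value
    return reject


print(f"12 metrics at α = 0.05: {family_wise_error(0.05, 12):.2%}")

# Illustrative p-values: five fall below 0.05 uncorrected, fewer survive the correction.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals).tolist())
```

The step-up rule is less conservative than Bonferroni, which compares every p-value with q/m, and that is a common reason to prefer it as the default when many metrics are tracked.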

★ interactive 4 · cuped variance reduction

A pre-experiment covariate shrinks the confidence interval - for free, no bias.

CUPED (Deng et al., WSDM 2013) subtracts the part of the outcome that a pre-experiment covariate already explained. The point estimate is unchanged in expectation; the variance is multiplied by (1 − ρ²), a reduction of ρ². Drag the correlation between the covariate and the outcome and watch the 95% interval contract.

Panel control: ρ · corr(pre-experiment covariate, outcome) (slider, 0.00 to 0.95; shown at 0.60)
θ = Cov(Y, X) / Var(X)
Y′ = Y − θ(X − X̄)
Var(Y′) / Var(Y) = 0.64 = 1 − ρ²
Variance reduction: 36% = ρ² · CI width vs. baseline: 80% = √(1 − ρ²)
Readout: Same estimate, tighter CI - point estimate unchanged in expectation

bars: 95% confidence interval on the treatment effect, before and after CUPED · identity: Var(Y′) = Var(Y)·(1 − ρ²) · repo: apply_cuped() in metric_tests.py · exposed as --cuped
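The identity can be checked numerically. The sketch below generates a synthetic outcome with correlation about 0.6 to a pre-experiment covariate and applies the adjustment Y′ = Y − θ(X − X̄); cuped_adjust is a hypothetical helper, not the repo's apply_cuped(), and in a real experiment θ would be estimated on the pooled arms.

```python
import numpy as np


def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Remove the part of y explained by the pre-experiment covariate x."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())


rng = np.random.default_rng(0)
rho = 0.6
x = rng.normal(size=50_000)                                  # pre-experiment covariate
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=50_000)  # outcome with corr(x, y) ≈ rho

y_adj = cuped_adjust(y, x)
ratio = np.var(y_adj, ddof=1) / np.var(y, ddof=1)
print(f"variance ratio = {ratio:.3f}, 1 - ρ² = {1 - rho**2:.3f}")
print(f"mean shift = {y_adj.mean() - y.mean():.2e}")
```

Because the covariate is centred before it is subtracted, the sample mean of the adjusted outcome is unchanged up to floating-point error; only the spread shrinks.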

07 · The verdict

One rule set. Applied in order. The model does not vote.

The verdict is a deterministic function in report.py. The LLM writes the summary; it never decides the outcome. The rules are applied in priority order, and the first one that matches wins.

SRM dominates everything. If the allocation fails the chi-square check, the experiment is compromised - no metric result can rescue it, because the groups are not comparable.

Below SRM sit the causal rules: a sign flip between the naive estimate and the PSM ATT, or a significant naive effect that collapses below half its size once matched and whose bootstrap CI now covers zero. Either pattern means confounding, and the verdict turns.

Only if no guardrail fires does a significant primary metric earn a SAFE TO ROLL OUT.

Rule 1 · SRM detected (χ² p < 0.001) → COMPROMISED
Rule 2 · PSM sign flip vs. naive → COMPROMISED
Rule 3 · ATT < 50% of naive, CI ∋ 0 → COMPROMISED
Rule 4 · primary metric significant → SAFE TO ROLL OUT
Rule 5 · otherwise → NO SIGNIFICANT EFFECT
clean_experiment.csv → SAFE TO ROLL OUT
compromised_experiment.csv → EXPERIMENT COMPROMISED
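The rule set reads naturally as a first-match-wins function. The following is a minimal sketch of that ordering, not the code in report.py: the GuardrailResults fields and decide_verdict are hypothetical names, and the inputs are assumed to be already computed by the guardrail stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailResults:
    """Hypothetical container for the numbers the verdict rules need."""
    srm_p_value: float
    naive_effect: float
    naive_significant: bool
    att: float
    att_ci: tuple[float, float]       # bootstrap percentile interval for the ATT
    primary_significant: bool


def decide_verdict(r: GuardrailResults, srm_alpha: float = 0.001) -> str:
    # Rule 1: a failed allocation check dominates everything else.
    if r.srm_p_value < srm_alpha:
        return "EXPERIMENT COMPROMISED"
    # Rule 2: naive estimate and matched ATT disagree on sign.
    if r.naive_effect * r.att < 0:
        return "EXPERIMENT COMPROMISED"
    # Rule 3: a significant naive effect shrinks below half and its CI covers zero.
    ci_covers_zero = r.att_ci[0] <= 0 <= r.att_ci[1]
    if r.naive_significant and abs(r.att) < 0.5 * abs(r.naive_effect) and ci_covers_zero:
        return "EXPERIMENT COMPROMISED"
    # Rule 4: no guardrail fired and the primary metric moved.
    if r.primary_significant:
        return "SAFE TO ROLL OUT"
    # Rule 5: nothing fired, nothing moved.
    return "NO SIGNIFICANT EFFECT"


# Illustrative inputs only; the interval is made up for the example.
example = GuardrailResults(srm_p_value=0.94, naive_effect=0.0335, naive_significant=True,
                           att=0.0225, att_ci=(0.010, 0.035), primary_significant=True)
print(decide_verdict(example))
```

Keeping the rules as ordered early returns makes the priority explicit: a later rule can never override an earlier one, which is the property the section relies on.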
08 · What this project demonstrates
01
Decision-aware experimentation
Reframing “is the metric significant?” as “is this experiment safe to interpret?” - SRM-first, with a three-state verdict instead of a bare p-value.
02
Causal inference
Propensity score matching with caliper, common-support trimming, bootstrap standard errors, and Rosenbaum sensitivity bounds for hidden bias.
03
Statistical rigour
Newcombe hybrid-score intervals over Wald, Welch over Student, Mann-Whitney as a non-parametric check, BH-FDR for multiplicity, CUPED for variance reduction.
04
Pragmatic LLM engineering
The LLM is scoped to one job - the closing summary - because routing data to deterministic functions does not need a model. A tool-use agent ships opt-in for unfamiliar schemas: knowing when not to spend the LLM is the point.
05
Production-shaped Python
Typed dataclasses, a custom exception hierarchy, pure guardrail functions, a 36-test pytest suite, ruff lint, and an installable package with a console entry point.
06
Reproducibility
A deterministic no-LLM mode for byte-stable CI, GitHub Actions across Python 3.10-3.12, pinned dependencies, and a synthetic data generator shipped with the repo.
09 · Honest limits

Four caveats that travel with the tool.

01
Synthetic demo data

The two shipped datasets are generated, with a known ground truth, to make the SRM and confounding behaviour visible and testable. On real data the agent is only as good as the covariates it is given.

02
PSM is not randomisation

Propensity matching only adjusts for observed covariates. It is reported as a robustness check with Rosenbaum bounds attached - not as a substitute for a correctly randomised experiment.

03
Out of scope by construction

The agent does not detect novelty effects, peeking, or network/SUTVA violations between arms. Those are problems of experiment design, not post-hoc analysis.

04
Bootstrap, not Abadie-Imbens

The matched-pair SE is a treated-unit bootstrap, the common recipe - not the analytical Abadie-Imbens estimator. For a published causal estimate, the analytical form would be the next step.

End of case study

Read the source, the tests, and the two demo reports.

The repository contains the three guardrail engines, the deterministic routing pipeline, the opt-in tool-use agent, the verdict logic, a 36-test pytest suite running in CI on Python 3.10-3.12, the defensive CSV loader, the three demo datasets, and the Criteo adapter.
