Companies run thousands of A/B tests; a measurable fraction are quietly invalid before anyone reads the result - a leaky feature flag, a re-used randomisation seed, an unmeasured confounder. The metric test still runs. The p-value still prints. ab-guardrail is a command-line tool that reads the part the p-value can’t see: it checks the allocation for Sample Ratio Mismatch, runs the right metric test for the data’s shape, corrects for multiple testing, and propensity-matches away confounding - then returns a verdict, not a number. Routing and statistics are a deterministic pipeline; an LLM is invoked once, at the end, to write the summary.
A standard A/B-test analysis answers one question: did the metric move? It runs a t-test, prints a p-value, and stops. But that pipeline is blind to a whole class of failures that happen before the metric is ever computed.
The headline case is Sample Ratio Mismatch (SRM). If a feature flag silently drops part of the control arm, the experiment ships at, say, 45/55 instead of 50/50. The t-test on the metric runs perfectly - and its answer is biased, because the two groups are no longer comparable. Microsoft’s experimentation team found roughly 6% of their online tests carried an SRM, with systematically distorted effect estimates.
So the project refuses the brief as stated. Instead of “is the metric significant?” it answers “is this experiment safe to interpret at all?” - and returns one of three verdicts: SAFE TO ROLL OUT, EXPERIMENT COMPROMISED, or NO SIGNIFICANT EFFECT.
The design rule is a hard line: the LLM never computes a number. In the opt-in agent mode it reads column names and a few sample rows, decides which column is the variant and which are covariates, and calls tools; in the default pipeline it only writes the final summary. Every p-value, CI and effect size is produced by scipy.stats and scikit-learn. Letting an LLM weigh evidence or pick a threshold is the fast track to hallucinated significance.
The repo ships two synthetic 12,000-user experiments with an identical schema. clean_experiment.csv is a properly randomised test with a real lift. compromised_experiment.csv carries two simultaneous defects: a 61/39 allocation (SRM) and treatment that systematically captures higher-value users (confounding). A naive t-test would happily report a “significant” revenue lift on the broken one. The agent does not.
Every method here was picked because the obvious alternative has a known weakness. A Wald interval under-covers near 0 and 1; a paired-t SE is biased under matching-with-replacement. The repo uses the corrected form and says so, with the citation, in the docstring.
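To make the Wald weakness concrete, here is a minimal sketch of the Newcombe (1998) hybrid-score interval for a difference of two proportions, next to the Wald interval. It uses only the standard library so it runs anywhere; the function names are illustrative and are not taken from the repo, which relies on scipy.stats for its own numbers.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, z):
    """Wilson score interval for a single proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def newcombe_diff_ci(x1, n1, x2, n2, confidence=0.95):
    """Newcombe hybrid-score interval for p1 - p2, built from two Wilson intervals."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def wald_diff_ci(x1, n1, x2, n2, confidence=0.95):
    """Textbook Wald interval for p1 - p2, shown only for comparison."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se


# Zero conversions in both arms: the Wald interval collapses to a single point,
# while the Newcombe interval keeps a sensible width.
wald = wald_diff_ci(0, 50, 0, 50)
newcombe = newcombe_diff_ci(0, 50, 0, 50)
print("Wald:", wald)
print("Newcombe:", newcombe)
```

With zero conversions in both arms the Wald standard error is exactly zero, so the interval claims perfect certainty from 50 users per arm. The score-based interval does not.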
chi2 = Σ (Oᵢ − Eᵢ)² / Eᵢ
df = 1
SRM := p < 0.001
# stricter than 0.05 on purpose: an SRM investigation is expensive; don't chase noise
binary -> chi-square 2×2 + Newcombe CI (1998)
continuous -> Welch's t-test + Mann-Whitney U
family -> BH-FDR correction
optional -> CUPED variance cut
p̂ = LogisticRegression(X)
match = 1-NN, caliper 0.2·SD(logit p̂)
SE = bootstrap (not paired-t)
Γ = Rosenbaum bounds
trim = common-support region
The naive paired-t standard error is biased under matching with replacement because re-used control units induce dependence (Abadie & Imbens, 2006). The repo bootstraps instead, and the report prints both so the bias is visible. The four interactive panels below let you drive each engine’s core mathematics directly.
SRM is counter-intuitive because a tiny ratio error becomes overwhelming evidence once the sample is large. Move the two sliders - total users and the observed share landing in control - and watch the chi-square goodness-of-fit p-value. The threshold for declaring an SRM is p < 0.001.
test chi-square goodness-of-fit vs. a planned 50/50 split, df = 1 · threshold SRM declared at p < 0.001 · try hold the split at 49/51 and push N upward - a one-point shortfall in control crosses the threshold near 27k users, and a half-point shortfall (49.5/50.5) near 108k
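The same check can be sketched in a few lines. This version is an illustration, not the repo's code: the function name and signature are assumptions. It also swaps scipy for the standard library, using the fact that with df = 1 the chi-square survival function has the closed form erfc(√(χ²/2)). With SciPy installed, scipy.stats.chisquare would be the library route.

```python
import math


def srm_check(n_control, n_treatment, planned_control_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test of the observed split against the planned one."""
    total = n_control + n_treatment
    expected = (total * planned_control_share, total * (1 - planned_control_share))
    observed = (n_control, n_treatment)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For df = 1, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# The same 49/51 split at two sample sizes.
for n_control, n_treatment in [(4_900, 5_100), (49_000, 51_000)]:
    chi2, p, is_srm = srm_check(n_control, n_treatment)
    print(f"N={n_control + n_treatment:>7}  chi2={chi2:6.2f}  p={p:.2e}  SRM={is_srm}")
```

At 10,000 users the 49/51 split gives chi2 = 4 and p ≈ 0.046, which does not clear the 0.001 bar. At 100,000 users the identical ratio gives chi2 = 40 and is flagged.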
A tool-use loop could put the LLM in charge of routing - letting the model decide which check to run on which column. It works, but it is the wrong call for production: paying LLM latency and cost to dispatch data to three deterministic functions is not how a production data tool should be built.
So the default pipeline mode routes deterministically - a hardcoded heuristic - and invokes the LLM exactly once, at the very end, to turn the finished results JSON into a plain-English summary. That is the one place a language model genuinely adds value. The verdict itself is decided by deterministic code; the summary may not contradict it or introduce a number that is not in the JSON.
The tool-use agent still ships, as opt-in --mode agent, for datasets whose column roles a heuristic cannot infer. Knowing when not to spend the LLM is part of the design.
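A deterministic router of this kind can be very small. The sketch below is hypothetical: the function names, the row format and the rule "values in {0, 1} means binary" are assumptions made for illustration, not the repo's actual heuristic. It only shows the shape of the idea, where the data's shape picks the test and no model call is involved.

```python
def route_metric(values):
    """Pick a test family from the data's shape (hypothetical heuristic)."""
    if not values:
        raise ValueError("cannot route an empty column")
    # A column whose distinct values are all 0/1 is treated as a binary outcome.
    if set(values) <= {0, 1}:
        return "chi_square_2x2"
    return "welch_t_and_mann_whitney"


def route_columns(rows, metric_cols):
    """Map every metric column to the test it should be sent to."""
    return {col: route_metric([row[col] for row in rows]) for col in metric_cols}


rows = [
    {"variant": "control", "converted": 0, "revenue": 0.0},
    {"variant": "treatment", "converted": 1, "revenue": 31.5},
    {"variant": "treatment", "converted": 0, "revenue": 4.2},
]
plan = route_columns(rows, ["converted", "revenue"])
print(plan)
```

The LLM call would sit after this step and after the statistics, receiving only the finished results.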
Set a true treatment effect, then dial up how strongly the variant assignment is confounded with a pre-treatment covariate (higher-value users drifting into treatment). The naive difference of means absorbs the confounding bias; the PSM ATT matches it away and recovers the truth. When the two disagree on sign, the verdict logic flags the experiment.
model Y = 4·X + β·T + noise, with P(treated) rising in X · naive E[Y|T=1] − E[Y|T=0] · PSM ATT recovered after matching on X · try set the true effect to 0 with heavy confounding - the naive estimate invents a winner
The PSM module does not stop at a point estimate. It trims to the common-support region, bootstraps the standard error, and reports how strong an unmeasured confounder would have to be to overturn the result.
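The panel's model is easy to reproduce offline. The sketch below simulates Y = 4·X + β·T + noise with P(treated) rising in X, then compares the naive difference of means with a 1-NN propensity-matched ATT under a 0.2·SD(logit p̂) caliper. It is a simplified stand-in, not the repo's module. The logistic fit is a hand-rolled Newton solver rather than scikit-learn, so the block has no dependencies. There is a single covariate. The bootstrap SE, Rosenbaum bounds and explicit common-support trimming are left out. All function names and the simulation parameters are assumptions.

```python
import bisect
import math
import random
import statistics


def simulate(n, beta, gamma, seed=0):
    """Y = 4*X + beta*T + noise, with P(T=1) = sigmoid(gamma * X)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        t = 1 if rng.random() < 1 / (1 + math.exp(-gamma * x)) else 0
        rows.append((x, t, 4 * x + beta * t + rng.gauss(0, 1)))
    return rows


def fit_logit(xs, ts, iters=25):
    """One-covariate logistic regression by Newton's method; returns (intercept, slope)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, t in zip(xs, ts):
            p = 1 / (1 + math.exp(-(a + b * x)))
            w = p * (1 - p)
            g0 += t - p
            g1 += (t - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def naive_diff(rows):
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def psm_att(rows, caliper_sd=0.2):
    """1-NN matching with replacement on logit(p-hat), inside a caliper."""
    a, b = fit_logit([x for x, _, _ in rows], [t for _, t, _ in rows])
    scored = [(a + b * x, t, y) for x, t, y in rows]
    caliper = caliper_sd * statistics.pstdev([s for s, _, _ in scored])
    controls = sorted((s, y) for s, t, y in scored if t == 0)
    keys = [s for s, _ in controls]
    diffs = []
    for s, t, y in scored:
        if t != 1:
            continue
        i = bisect.bisect_left(keys, s)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(keys)]
        j = min(candidates, key=lambda k: abs(keys[k] - s))
        if abs(keys[j] - s) <= caliper:  # unmatched treated units are dropped
            diffs.append(y - controls[j][1])
    return statistics.fmean(diffs), len(diffs)


TRUE_EFFECT = 2.0
rows = simulate(n=4000, beta=TRUE_EFFECT, gamma=1.5)
naive = naive_diff(rows)
att, n_matched = psm_att(rows)
print(f"true={TRUE_EFFECT}  naive={naive:.2f}  PSM ATT={att:.2f}  matched={n_matched}")
```

In this setup the naive estimate lands several units above the true effect because treated users have systematically higher X, while the matched estimate stays close to it. Treated units without a control inside the caliper are dropped, which is a crude form of the common-support trimming described above.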
Each individual test at α = 0.05 is fine. Run a family of them and the probability that at least one fires by pure chance climbs fast. Drag the number of metrics and the per-test α - the curve is the chance of a bogus “significant” result under the null. Bonferroni bounds the family-wise error rate; Benjamini-Hochberg controls the false discovery rate, which costs less power when several metrics are tested.
curve P(at least one false positive) = 1 − (1 − α)ᴰ under the global null, with D the number of independent metrics tested · flat line Bonferroni family-wise bound · repo BH-FDR is the default in apply_multiple_testing_correction()
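Both the curve and the two corrections fit in a short, dependency-free sketch. The function names here are illustrative and say nothing about the signature of apply_multiple_testing_correction(). The p-values are made-up inputs chosen to show where Bonferroni and Benjamini-Hochberg disagree.

```python
def family_wise_error(alpha, n_metrics):
    """P(at least one false positive) across independent tests under the global null."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the family size."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Step-up BH procedure: reject the k smallest p-values, k = max rank with p <= rank*q/m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


print(f"10 metrics at alpha=0.05: {family_wise_error(0.05, 10):.3f}")

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]  # illustrative
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Bonferroni rejects:", sum(bonf))
print("BH rejects:", sum(bh))
```

Ten metrics at α = 0.05 already carry about a 40% chance of at least one false alarm. On the illustrative p-values, Bonferroni keeps one discovery and BH keeps two.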
CUPED (Deng et al., WSDM 2013) subtracts the part of the outcome that a pre-experiment covariate already explained. The point estimate is unchanged in expectation; the variance shrinks to (1 − ρ²) of its original value, where ρ is the correlation between the covariate and the outcome. Drag that correlation and watch the 95% interval contract.
bars 95% confidence interval on the treatment effect, before and after CUPED · identity Var(Y′) = Var(Y)·(1 − ρ²) · repo apply_cuped() in metric_tests.py · exposed as --cuped
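The variance identity can be checked numerically. The sketch below is not the repo's apply_cuped(): the function name, return values and simulated data are assumptions. It applies the standard adjustment Y′ = Y − θ·(X − mean(X)) with θ = Cov(X, Y) / Var(X) on one simulated sample. A real analysis would estimate θ on the pooled arms and then compare the adjusted arm means.

```python
import math
import random
import statistics


def cuped_adjust(y, x):
    """Return the CUPED-adjusted outcome, theta, and the sample correlation rho."""
    n = len(y)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    theta = cov_xy / var_x
    adjusted = [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]
    rho = cov_xy / math.sqrt(var_x * var_y)
    return adjusted, theta, rho


# Simulated pre-experiment covariate and a correlated in-experiment outcome.
rng = random.Random(1)
pre = [rng.gauss(10, 2) for _ in range(5000)]
post = [0.8 * xi + rng.gauss(0, 1) for xi in pre]

adjusted, theta, rho = cuped_adjust(post, pre)
var_before = statistics.variance(post)
var_after = statistics.variance(adjusted)
print(f"rho={rho:.3f}  var before={var_before:.3f}  var after={var_after:.3f}")
print(f"predicted var after={var_before * (1 - rho**2):.3f}")
```

Because θ is estimated on the same sample, Var(Y′) = Var(Y)·(1 − ρ²) holds exactly here, up to floating-point error, and the mean of the adjusted outcome equals the mean of the original.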
The verdict is a deterministic function in report.py. The LLM writes the summary; it never decides the outcome. The rules are applied in priority order, and the first one that matches wins.
SRM dominates everything. If the allocation fails the chi-square check, the experiment is compromised - no metric result can rescue it, because the groups are not comparable.
Below SRM sit the causal rules: a sign flip between the naive estimate and the PSM ATT, or a significant naive effect that collapses below half its size once matched and whose bootstrap CI now covers zero. Either pattern means confounding, and the verdict turns.
Only if no guardrail fires does a significant primary metric earn a SAFE TO ROLL OUT.
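The priority order above can be written as a single function where the first matching rule returns. This is a sketch of the described logic, not the contents of report.py. The Results container, its field names and the α = 0.05 default are assumptions, and the source does not say whether the direction of the effect is checked before a rollout verdict, so this version does not check it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Results:
    """Hypothetical container for the numbers the engines produce."""
    srm_p: float                  # p-value of the allocation check
    naive_effect: float           # naive difference in the primary metric
    naive_p: float                # p-value of the primary metric test
    att: float                    # propensity-matched ATT
    att_ci: Tuple[float, float]   # bootstrap CI for the ATT


def verdict(r: Results, alpha: float = 0.05, srm_threshold: float = 0.001) -> str:
    # Rule 1: SRM dominates everything.
    if r.srm_p < srm_threshold:
        return "EXPERIMENT COMPROMISED"
    # Rule 2: the naive estimate and the matched ATT disagree on sign.
    if r.naive_effect * r.att < 0:
        return "EXPERIMENT COMPROMISED"
    # Rule 3: a significant naive effect that shrinks below half its size
    # once matched, with a bootstrap CI that now covers zero.
    significant = r.naive_p < alpha
    collapsed = abs(r.att) < 0.5 * abs(r.naive_effect)
    covers_zero = r.att_ci[0] <= 0 <= r.att_ci[1]
    if significant and collapsed and covers_zero:
        return "EXPERIMENT COMPROMISED"
    # Rule 4: no guardrail fired and the primary metric is significant.
    if significant:
        return "SAFE TO ROLL OUT"
    return "NO SIGNIFICANT EFFECT"


clean = Results(srm_p=0.62, naive_effect=1.8, naive_p=0.003, att=1.7, att_ci=(0.9, 2.5))
broken = Results(srm_p=1e-12, naive_effect=3.1, naive_p=1e-6, att=0.2, att_ci=(-0.6, 1.0))
print(verdict(clean))
print(verdict(broken))
```

The example values are invented. Note that the second case would also trip rule 3, but the SRM rule returns first, which matches the "first match wins" ordering.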
The two shipped datasets are generated, with a known ground truth, to make the SRM and confounding behaviour visible and testable. On real data the agent is only as good as the covariates it is given.
Propensity matching only adjusts for observed covariates. It is reported as a robustness check with Rosenbaum bounds attached - not as a substitute for a correctly randomised experiment.
The agent does not detect novelty effects, peeking, or network/SUTVA violations between arms. Those are problems of experiment design, not post-hoc analysis.
The matched-pair SE is a treated-unit bootstrap, the common recipe - not the analytical Abadie-Imbens estimator. For a published causal estimate, the analytical form would be the next step.
The repository contains the three guardrail engines, the deterministic routing pipeline, the opt-in tool-use agent, the verdict logic, a 36-test pytest suite running in CI on Python 3.10-3.12, the defensive CSV loader, the three demo datasets, and the Criteo adapter.