
project · Olist Bayesian A/B Testing | domain · causal inference on a marketplace

stack · PyMC · DuckDB · SQL · ArviZ | author · Maksim Silchenko

A case study in causal specification

What does pooled A/B testing pool away?

On a 97,277-order marketplace panel, a flat A/B test reported a hypothetical free-shipping policy as a −2 pp drop in on-time delivery and recommended killing it. A hierarchical Bayesian difference-in-differences on the same data recovered a +1.5 pp lift with 99.7% posterior probability. The headline error was not magnitude. It was sign.

99.7%

P(δ > 0) for the policy effect on on-time delivery

Product categories with their own partial-pooled posterior

Hierarchical Bayesian models: binomial, hurdle, ordered logit

−R$ 452K

Net cost-benefit envelope. The lift is real; the economics still lose.

This is a plain-HTML executive summary of an interactive case study. The full version with interactive charts requires JavaScript. Source code and the complete methodology report are in the repository: github.com/thylinao1/Olist-Bayesian-AB-Testing.

01 · The question

Would a free-shipping-above-R$ 150 policy actually lift the outcomes a marketplace team cares about?

The public Olist Brazilian E-commerce dataset comprises nine relational tables: 99,441 orders, 1.5M order-line records, one million geolocation points. There is no randomised controlled trial in the data. To stay honest, the project is framed as the inferential machinery that would be used if Olist ran this experiment. The exposure variable, (purchase_week ≥ W*) AND (items_subtotal ≥ R$ 150), is constructed from real columns; the modelling pipeline is the one that would be deployed if the assignment were real.
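The exposure rule can be sketched in a few lines. This is a minimal illustration, not the project's code: the `Order` record, its field encoding, and the cutover value `W_STAR = 40` are assumptions made for the example; only the rule itself (post-cutover AND subtotal ≥ R$ 150) comes from the text.

```python
from dataclasses import dataclass

W_STAR = 40          # hypothetical cutover week index (the text only calls it W*)
THRESHOLD = 150.0    # R$ threshold from the policy definition


@dataclass
class Order:
    order_id: str
    purchase_week: int     # weeks since the start of the panel (assumed encoding)
    items_subtotal: float  # basket value in R$, excluding freight


def exposure(order: Order) -> dict:
    """Return the two DiD indicators and their product, the synthetic treatment."""
    post = order.purchase_week >= W_STAR
    eligible = order.items_subtotal >= THRESHOLD
    return {"post": int(post), "eligible": int(eligible), "treated": int(post and eligible)}


orders = [
    Order("a", 39, 200.0),  # eligible basket, before cutover
    Order("b", 40, 149.9),  # after cutover, below threshold
    Order("c", 41, 150.0),  # after cutover, at threshold
]
flags = [exposure(o) for o in orders]
```

Keeping `post` and `eligible` as separate indicators, rather than only their product, is what later lets a difference-in-differences specification separate the policy from basket size and calendar time.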

Four outcomes were measured against the same 97,277-order panel and the same DAG-derived adjustment set {C, S, G, M} (product category, seller tier, customer state, calendar month): on-time delivery, 180-day repeat-purchase probability, conditional spend given a repeat, and the 1-to-5 customer review score. Mediators (freight, delivery time, whether a review was left) are deliberately not conditioned on, to avoid post-treatment bias.

02 · Feature pipeline

A medallion SQL stack on DuckDB. End-to-end rebuild in five seconds.

Each layer is idempotent and inspectable; python -m src.etl goes from raw CSVs to a modelling-ready panel without leaving SQL.

Layer	What it does
bronze	Nine CSVs to DuckDB. Pinned types, no joins, no transforms; lossless and reproducible.
silver	Cleaning, typo fixes, deduplication. The Olist `customer_id` grain trap is resolved by collapsing on `customer_unique_id` (one row per real person, not per shipping address), leaving 96,096 customers.
gold	`fact_orders`, `dim_customer`, `dim_seller`, built with window functions: ROW_NUMBER for first-purchase, LAG for repeat intervals, MODE WITHIN GROUP for dominant category.
analytics	Cohort retention matrix, weekly funnel, gap-and-island session reconstruction, modelling panels (naive 16,082 cells; DiD 6,689 cells), auto-generated quality-diagnostics table.
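The window-function pattern in the gold layer can be sketched as follows. This is an illustration under stated assumptions: it uses Python's built-in `sqlite3` (SQLite 3.25 or newer) as a stand-in for DuckDB, and the `orders` table, its `purchase_day` column, and the sample rows are hypothetical. `MODE WITHIN GROUP` has no SQLite equivalent and is left out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (order_id TEXT, customer_unique_id TEXT, purchase_day INTEGER)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("o1", "u1", 10), ("o2", "u1", 45), ("o3", "u2", 12), ("o4", "u1", 100)],
)

# ROW_NUMBER marks each customer's first purchase (order_seq = 1);
# LAG gives the gap in days since that customer's previous order.
rows = con.execute(
    """
    SELECT order_id,
           customer_unique_id,
           ROW_NUMBER() OVER w AS order_seq,
           purchase_day - LAG(purchase_day) OVER w AS days_since_prev
    FROM orders
    WINDOW w AS (PARTITION BY customer_unique_id ORDER BY purchase_day)
    ORDER BY customer_unique_id, purchase_day
    """
).fetchall()
```

Partitioning on `customer_unique_id` rather than `customer_id` is the point of the silver-layer fix: the repeat interval is only meaningful when all of one person's orders land in the same partition.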

03 · The plot twist

Two specifications. Two opposite answers. One of them is right.

The naive Binomial model compares treated = (post-cutover AND subtotal ≥ R$ 150) against everything else. That conflates the policy with two other things: large baskets being structurally slower to ship, and a common marketplace-wide time trend.

Specification	Result	Recommendation
Naive (flat A/B)	−2.0 pp on-time delivery; 95% CI (−2.97, −1.74); p < 0.0001	Kill the policy. Unambiguous, and confounded.
Difference-in-differences (hierarchical Bayes)	+1.5 pp; δ̄ = +0.169 logit; 94% HDI (+0.048, +0.289); P(δ̄ > 0) = 99.7%	Proceed, monitor reviews, target the categories where the per-category effect is most positive.

A saturated logit on is_on_time ~ eligible + post + eligible × post decomposes the −2 pp headline into the three components it was conflating:

β eligible  = −0.194 logit   basket-size structural slowness
β post      = −0.342 logit   marketplace-wide time trend
δ̄ policy    = +0.169 logit   the policy effect, freed of both

The frequentist DiD gives +1.35 pp (p = 0.013); the Bayesian DiD gives +1.50 pp; the methods triangulate. The naive headline was the wrong sign, masked by two larger confounders pulling the same way.
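How a positive policy effect can surface as a negative naive difference is easy to reproduce on a 2 × 2 grid. The sketch below plugs the three reported logit coefficients into the four (eligible, post) cells; the baseline log-odds `ALPHA = 2.5` and the equal cell weights are assumptions for illustration, since the summary reports neither.

```python
import math


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


ALPHA = 2.5          # hypothetical baseline log-odds, not reported in the summary
B_ELIGIBLE = -0.194  # basket-size structural slowness
B_POST = -0.342      # marketplace-wide time trend
DELTA = 0.169        # policy effect

# On-time probability in each of the four (eligible, post) cells.
p = {
    (e, t): inv_logit(ALPHA + B_ELIGIBLE * e + B_POST * t + DELTA * e * t)
    for e in (0, 1)
    for t in (0, 1)
}

# Naive comparison: treated cell versus the equal-weight pool of the other three.
others = [p[(0, 0)], p[(1, 0)], p[(0, 1)]]
naive_diff = p[(1, 1)] - sum(others) / len(others)

# Difference-in-differences on the logit scale.
did_logit = (logit(p[(1, 1)]) - logit(p[(1, 0)])) - (logit(p[(0, 1)]) - logit(p[(0, 0)]))
```

With these inputs `naive_diff` is negative while `did_logit` returns the +0.169 that was put in. The treated cell has the lowest on-time rate of the four here, so the naive sign does not depend on the assumed weights.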

04 · The three models

One pattern, three likelihoods, four outcomes.

Every model carries a varying intercept and varying treatment slope by product category, non-centered priors for sampler geometry, and partial pooling so small categories borrow strength. Only the likelihood changes to match the outcome.
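The non-centered trick is simple to state outside any sampler: draw standard-normal offsets, then shift and scale them. A minimal sketch using only the standard library follows; `sigma_delta = 0.1` and the five categories are hypothetical values, and this is not the project's PyMC code.

```python
import random


def draw_category_effects(delta_bar, sigma_delta, n_categories, rng):
    """Non-centered draw: z_C ~ Normal(0, 1), then delta_C = delta_bar + sigma_delta * z_C."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n_categories)]
    effects = [delta_bar + sigma_delta * z_c for z_c in z]
    return effects, z


rng = random.Random(0)
effects, z = draw_category_effects(
    delta_bar=0.169,   # population-mean policy effect from the text
    sigma_delta=0.1,   # hypothetical between-category spread
    n_categories=5,    # hypothetical category count
    rng=rng,
)
```

The sampler then explores the `z` offsets, whose scale does not change with `sigma_delta`, which is the usual motivation for the reparameterisation when group-level variance is small.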

i. Hierarchical Binomial DiD: on-time delivery (order-cell grain, 6,689 cells)

n_on_time ~ Binomial(n, p)
logit(p)  = α_C + β_S + γ_G + β_e·E + β_p·P + δ_C·(E × P)
δ_C ~ Normal(δ̄, σ_δ);  α_C ~ Normal(ᾱ, σ_α)

β_e basket-size = −0.194 logit; β_p time trend = −0.342 logit; δ̄ policy effect = +0.169 logit; P(δ̄ > 0) = 99.7%.
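For a single cell, the specification above reduces to a linear predictor and a Binomial log-probability. The sketch below evaluates both in plain Python. The intercept of 2.5, the zero seller and state offsets, and the 36-of-40 cell are hypothetical; the three slope values are the reported posterior means.

```python
import math


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear_predictor(alpha_c, beta_s, gamma_g, beta_e, beta_p, delta_c, eligible, post):
    """logit(p) = alpha_C + beta_S + gamma_G + beta_e*E + beta_p*P + delta_C*(E*P)."""
    return alpha_c + beta_s + gamma_g + beta_e * eligible + beta_p * post + delta_c * eligible * post


def binomial_logpmf(k: int, n: int, p: float) -> float:
    """Log-probability of k on-time orders out of n at on-time probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


# A treated cell (eligible and post-cutover) with hypothetical offsets.
eta = linear_predictor(
    alpha_c=2.5, beta_s=0.0, gamma_g=0.0,       # hypothetical intercept and offsets
    beta_e=-0.194, beta_p=-0.342, delta_c=0.169,  # reported posterior means
    eligible=1, post=1,
)
cell_loglik = binomial_logpmf(k=36, n=40, p=inv_logit(eta))  # hypothetical cell counts
```

Aggregating orders into cells is what makes the Binomial form possible: one likelihood term per cell instead of one Bernoulli term per order.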

ii. Hurdle LogNormal: retention and repeat spend (per-customer grain, 2.3% repeat rate)

y = 0 with prob 1 − θ_i;   y ~ LogNormal(μ_i, σ) if repeat
logit θ = α_C + τ_C·T;   μ = β_C + δ_C·T + γ·log(first_subtotal)

δ̄_b P(repeat) = +0.065 logit (null); δ̄_l log spend = +0.093 (× 1.097 multiplier); 0 divergences. The naive −0.5 pp retention drop was an observation-window censoring artifact, not a policy effect.
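The two-part structure of the hurdle likelihood, and the conversion of the log-spend effect into a multiplier, can be sketched as below. The function names and the example inputs are illustrative assumptions; only the +0.093 effect comes from the text.

```python
import math


def lognormal_logpdf(y: float, mu: float, sigma: float) -> float:
    """Log-density of LogNormal(mu, sigma) at y > 0."""
    return -math.log(y * sigma * math.sqrt(2.0 * math.pi)) - (math.log(y) - mu) ** 2 / (2.0 * sigma**2)


def hurdle_loglik(y: float, theta: float, mu: float, sigma: float) -> float:
    """y == 0 means no repeat purchase; y > 0 is repeat spend in R$."""
    if y == 0:
        return math.log1p(-theta)  # probability of not clearing the hurdle
    return math.log(theta) + lognormal_logpdf(y, mu, sigma)


# A log-scale effect of +0.093 multiplies conditional spend by exp(0.093).
spend_multiplier = math.exp(0.093)
```

Because the zero part and the positive part have separate linear predictors, a policy can move the probability of repeating and the size of a repeat basket independently, which is the distinction the retention and spend results rely on.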

iii. Ordered logit: review score 1–5 (per-order grain, 30k stratified sample)

y ~ OrderedLogit(φ, κ);   φ = β_C + τ_C·T + γ_S + δ_M
κ[0] = 0 (anchored);   κ[k] = κ[k−1] + HalfNormal(1)

δ̄ = −0.170 cumulative-logit; P(δ̄ < 0) = 99.7%; κ ESS after the anchoring fix 941–2195; raw effect −0.13 stars.
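The anchored cutpoint construction and the mapping from a latent score to five category probabilities can be sketched as follows. The increments, the baseline φ = 1.5, and the helper names are hypothetical; the sketch assumes the common convention P(y ≤ k) = logistic(κ_k − φ).

```python
import math
from itertools import accumulate


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cutpoints(increments):
    """kappa[0] = 0 (anchored); kappa[k] = kappa[k-1] + a positive increment."""
    return list(accumulate(increments, initial=0.0))


def category_probs(phi: float, kappa):
    """Probabilities of scores 1..len(kappa)+1 under P(y <= k) = inv_logit(kappa_k - phi)."""
    cum = [inv_logit(k - phi) for k in kappa] + [1.0]
    return [cum[0]] + [cum[i] - cum[i - 1] for i in range(1, len(cum))]


def expected_score(phi: float, kappa) -> float:
    return sum((i + 1) * p for i, p in enumerate(category_probs(phi, kappa)))


kappa = cutpoints([0.6, 0.9, 1.2])            # hypothetical positive increments
base = expected_score(1.5, kappa)             # hypothetical untreated latent score
shifted = expected_score(1.5 - 0.170, kappa)  # same order with the reported shift applied
```

Building each cutpoint as the previous one plus a positive increment guarantees ordering, and fixing the first at zero removes the location ambiguity between the cutpoints and the intercepts in φ.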

Integrated finding across the four outcomes: the policy modestly lifts on-time delivery, gives a plausible-if-uncertain conditional-spend lift, does not actually hurt retention (the apparent drop is a 180-day censoring artifact), and costs about 0.13 stars in review score. Of the four naive findings, two were outright misleading, one was misleading only in magnitude, and one was correct.

05 · Per-category heterogeneity

A flat A/B gives one number. The hierarchy gives a posterior per category.

The per-category effect δ_C behind the +1.5 pp marketplace average ranges from +3.5 pp in furniture_decor down to −0.8 pp in consoles_games. Partial pooling shrinks small-sample categories toward the global mean, so the ordering is robust to thin cells. This is the Simpson-paradox-style heterogeneity that pooled tests cannot expose, and it is the operational lever: a real platform deploys the policy where the per-category effect is most positive, not everywhere at once.
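The shrinkage behaviour can be illustrated with the textbook normal-normal approximation, in which each category estimate is a precision-weighted average of its own data and the global mean. This is a simplification for intuition, not the hierarchical model fitted in the project; `sigma_within`, `tau_between`, and the raw estimate of 0.6 are hypothetical.

```python
def shrink(estimate, n, global_mean, sigma_within, tau_between):
    """Precision-weighted compromise between a category estimate and the global mean."""
    data_precision = n / sigma_within**2
    prior_precision = 1.0 / tau_between**2
    weight = data_precision / (data_precision + prior_precision)
    return weight * estimate + (1.0 - weight) * global_mean


GLOBAL_MEAN = 0.169  # population-mean policy effect from the text

# Same raw estimate, very different sample sizes (all values hypothetical).
thin = shrink(0.6, n=20, global_mean=GLOBAL_MEAN, sigma_within=2.0, tau_between=0.1)
thick = shrink(0.6, n=2000, global_mean=GLOBAL_MEAN, sigma_within=2.0, tau_between=0.1)
```

The thin category ends up close to the global mean while the well-populated one keeps most of its own signal, which is why an extreme estimate from a handful of cells does not dominate the ranking.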

06 · Diagnostics & robustness

Every assumption the DiD rests on is tested out loud.

Test	Result
Parallel trends	Pass. Pre-cutover slope-difference d = −0.003 logit/week, p = 0.33; 46,774 orders across 58 weeks.
Bunching at R$ 150 (McCrary)	Clean. Z = −13.9: a deficit of 478 orders above the threshold, consistent with a structural retail-pricing kink rather than policy-induced bunching.
Prior sensitivity	Posterior dominates. δ̄ = +0.1699 / +0.1695 / +0.1696 under Exponential(1) / Exponential(2) / HalfNormal(1) priors; spread 0.0004 logit.
Sampler health	Hurdle and review fits: 0 divergences. Binomial DiD: 388 / 6,000 (6.5%), posterior means unaffected. R̂ = 1.00–1.01. PSIS-LOO Pareto-k 100% below 0.7.
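The point estimate behind a parallel-trends check is a difference of two pre-period slopes. The sketch below computes it with ordinary least squares on weekly logit on-time rates; the two series are synthetic, built to echo a −0.003 logit/week gap, and the standard error and p-value a full test needs are omitted.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


weeks = list(range(8))  # hypothetical pre-cutover weeks

# Synthetic weekly logit on-time rates for the two basket-size groups.
control = [2.50 - 0.010 * w for w in weeks]
eligible = [2.30 - 0.013 * w for w in weeks]

slope_diff = ols_slope(weeks, eligible) - ols_slope(weeks, control)
```

A level gap between the groups (2.50 versus 2.30 here) is harmless for difference-in-differences; only a gap in slopes threatens the design.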

07 · The business call

A positive per-customer lift. A sharply negative net envelope.

The Welch t-test on per-customer revenue is positive: +R$ 1.90 (p = 0.011). Sizing the policy with the actual freight data turns the per-customer win into a marketplace-wide loss, because the subsidy is paid on every eligible order, not only the marginal ones.

Line	Value
N post-cutover eligible orders	12,405
Average freight per eligible order	R$ 36.83
Subsidy cost	R$ 456,879
Incremental GMV	R$ 21,921
Incremental margin at 20%	R$ 4,384
Net envelope	−R$ 452,495

Break-even contribution margin would need to be roughly 2,084%. The supported recommendations at this design point are a phased rollout limited to the favourable category set, or holding the policy pending a real pilot. A marketplace-wide launch is not warranted by the envelope.
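The envelope arithmetic can be reproduced directly from the table. All inputs below are the reported figures; only the variable names are introduced here.

```python
subsidy_cost = 456_879      # R$, freight paid on every eligible post-cutover order
incremental_gmv = 21_921    # R$, incremental gross merchandise value
margin_rate = 0.20          # contribution margin used in the table

incremental_margin = incremental_gmv * margin_rate   # about R$ 4,384
net_envelope = incremental_margin - subsidy_cost     # about -R$ 452,495

# Margin rate at which incremental margin would just cover the subsidy.
break_even_margin = subsidy_cost / incremental_gmv   # about 20.84, i.e. roughly 2,084%
```

The break-even figure exceeds 100% because the subsidy is more than twenty times the incremental GMV, so no feasible margin on the incremental sales alone can recover it.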

08 · Where the policy is worth running

The categories that win on-time are also the categories that lose reviews.

Top 5 categories where the policy lifts on-time delivery most: furniture_decor (+3.54 pp), home_appliances_2 (+3.04 pp), auto (+2.98 pp), home_appliances (+2.69 pp), furniture_living_room (+2.53 pp).

Top 5 categories where the policy hurts review scores most: sports_leisure (−0.326 cum-logit), auto (−0.279), furniture_decor (−0.258), baby (−0.247), consoles_games (−0.244).

auto and furniture_decor appear on both lists: heavy, expensive items where free shipping helps most and where customer expectations rise the moment the policy makes them attractive. The first wave that survives that tension: fashion_shoes, office_furniture, watches_gifts. The four outcomes sit on three scales (logit, log, cumulative-logit) that are not commensurable, so the analysis ranks per-outcome and lets the decision-maker apply their own weights.
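Ranking per outcome and then intersecting the lists takes only a few lines. The sketch below uses the two top-5 lists quoted above; the dictionary layout and variable names are illustrative.

```python
# Per-category on-time lift in percentage points (top 5, from the text).
on_time_lift_pp = {
    "furniture_decor": 3.54,
    "home_appliances_2": 3.04,
    "auto": 2.98,
    "home_appliances": 2.69,
    "furniture_living_room": 2.53,
}

# Per-category review-score shift on the cumulative-logit scale (worst 5, from the text).
review_shift = {
    "sports_leisure": -0.326,
    "auto": -0.279,
    "furniture_decor": -0.258,
    "baby": -0.247,
    "consoles_games": -0.244,
}

# Rank each outcome on its own scale; the scales are never added together.
rank_on_time = sorted(on_time_lift_pp, key=on_time_lift_pp.get, reverse=True)
rank_review_harm = sorted(review_shift, key=review_shift.get)

# Categories that win on delivery and lose on reviews.
conflicted = sorted(set(on_time_lift_pp) & set(review_shift))
```

Keeping the two rankings separate avoids implying an exchange rate between percentage points of on-time delivery and cumulative-logit units of review score.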

09 · Honest limits

Four caveats that travel with the headline numbers.

Synthetic treatment. Olist never ran this policy; results characterise what the inferential machinery would say if it had, and do not generalise to deployment without a real pilot.

SUTVA in a marketplace. Sellers learn the policy and respond by re-pricing or by reordering logistics priority; a seller-clustered RCT is the right answer.

180-day repeat-window censoring. Late-panel customers had less than 180 days of follow-up; a Cox or Weibull accelerated-failure-time model is the textbook fix.

Bunching the data cannot show. A real R$ 150 threshold would induce basket-padding; the conditional-spend posterior here is therefore an underestimate of a live deployment.
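The 180-day censoring caveat comes down to telling "did not repeat" apart from "not observed long enough to know". A minimal sketch of that labelling follows; the panel end date and the function name are hypothetical, and this is a bookkeeping illustration rather than the survival model suggested as the fix.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=180)
PANEL_END = date(2018, 8, 31)  # hypothetical last observed date in the panel


def repeat_status(first_purchase, repeat_purchase=None):
    """Label a customer as 'repeat', 'no_repeat', or 'censored' for the 180-day outcome."""
    if repeat_purchase is not None and repeat_purchase - first_purchase <= WINDOW:
        return "repeat"
    if PANEL_END - first_purchase < WINDOW:
        return "censored"  # follow-up shorter than the window: outcome unknown
    return "no_repeat"


statuses = [
    repeat_status(date(2017, 5, 1), date(2017, 7, 1)),  # repeated within the window
    repeat_status(date(2017, 5, 1)),                    # full follow-up, no repeat
    repeat_status(date(2018, 6, 1)),                    # only about 3 months of follow-up
]
```

Counting the censored group as non-repeaters mechanically lowers the repeat rate for late-panel customers, which are the ones exposed to the policy.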

10 · What this project demonstrates

An end-to-end inferential argument behind every headline number.

Causal-inference framing (DAGs, the four elemental confounds, adjustment sets, mediator and collider hygiene). Hierarchical Bayesian modelling in PyMC (varying intercepts and slopes, non-centered priors, partial pooling, hurdle mixtures, ordered logit with anchored cutpoint identification). Quasi-experimental design (difference-in-differences with formal parallel-trends testing and McCrary-style density-continuity diagnostics). Posterior diagnostics (PSIS-LOO, Pareto-k, effective sample size, R-hat, divergence accounting, posterior-predictive checks, prior-sensitivity sweeps). Production-shaped data engineering (a medallion bronze-silver-gold-analytics pipeline on DuckDB with window-function feature derivation and a CI-tested codebase). And translating a posterior into a decision: a cost-benefit envelope built off posterior means, multi-objective per-category ranking, and uncertainty communicated to non-technical readers.

Repository · github.com/thylinao1/Olist-Bayesian-AB-Testing · medallion ETL, three PyMC model factories, the DiD specification, prior-sensitivity and parallel-trends scripts, the cost-benefit envelope generator, and a 17-test pytest suite running in CI.
