project · Olist Bayesian A/B Testing | domain · causal inference on a marketplace
stack · PyMC · DuckDB · SQL · ArviZ | author · Maksim Silchenko
A case study in causal specification
What does pooled A/B testing pool away?
On a 97,277-order marketplace panel, a flat A/B test reported a hypothetical
free-shipping policy as a −2 pp drop in on-time delivery and recommended killing it.
A hierarchical Bayesian difference-in-differences on the same data recovered a +1.5 pp lift
with 99.7% posterior probability. The headline error was not magnitude. It was sign.
99.7% · P(δ̄ > 0) for the policy effect on on-time delivery
73 · Product categories with their own partially pooled posterior
3 · Hierarchical Bayesian models: binomial, hurdle, ordered logit
−R$ 452K · Net cost-benefit envelope. The lift is real; the economics still lose.
This is an executive summary of an interactive case study. Source code and the complete
methodology report are in the repository:
github.com/thylinao1/Olist-Bayesian-AB-Testing.
01 · The question
Would a free-shipping-above-R$ 150 policy actually lift the outcomes a marketplace team cares about?
The public Olist Brazilian E-commerce dataset comprises nine relational tables: 99,441 orders,
1.5M order-line records, one million geolocation points. There is no randomised controlled
trial in the data. To stay honest, the project is framed as the inferential machinery that
would be used if Olist ran this experiment. The exposure variable,
(purchase_week ≥ W*) AND (items_subtotal ≥ R$ 150), is constructed from real
columns; the modelling pipeline is the one that would deploy if the assignment were real.
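The exposure rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the `Order` class, the integer week index, and the `cutover_week` argument (standing in for W*) are assumptions introduced here.

```python
from dataclasses import dataclass

THRESHOLD_BRL = 150.0  # free-shipping threshold named in the text


@dataclass(frozen=True)
class Order:
    purchase_week: int     # hypothetical integer week index
    items_subtotal: float  # basket subtotal in R$


def exposure_flags(order: Order, cutover_week: int) -> dict:
    """Return the two DiD indicators and their product (the exposure)."""
    post = order.purchase_week >= cutover_week
    eligible = order.items_subtotal >= THRESHOLD_BRL
    return {"post": post, "eligible": eligible, "exposed": post and eligible}
```

Keeping `post` and `eligible` as separate flags, rather than storing only their product, is what later allows a difference-in-differences contrast instead of a single treated-versus-rest comparison.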
Four outcomes were measured against the same 97,277-order panel and the same DAG-derived
adjustment set {C, S, G, M} (product category, seller tier, customer state,
calendar month): on-time delivery, 180-day repeat-purchase probability, conditional spend
given a repeat, and the 1-to-5 customer review score. Mediators (freight, delivery time,
whether a review was left) are deliberately not conditioned on, to avoid post-treatment bias.
02 · Feature pipeline
A medallion SQL stack on DuckDB. End-to-end rebuild in five seconds.
Each layer is idempotent and inspectable; python -m src.etl goes from raw CSVs to
a modelling-ready panel without leaving SQL.
| Layer | What it does |
| bronze | Nine CSVs to DuckDB. Pinned types, no joins, no transforms; lossless and reproducible. |
| silver | Cleaning, typo fixes, deduplication. The Olist customer_id grain trap is resolved by collapsing on customer_unique_id (one row per real person, not per shipping address) — 96,096 customers. |
| gold | fact_orders, dim_customer, dim_seller, built with window functions: ROW_NUMBER for first-purchase, LAG for repeat intervals, MODE WITHIN GROUP for dominant category. |
| analytics | Cohort retention matrix, weekly funnel, gap-and-island session reconstruction, modelling panels (naive 16,082 cells; DiD 6,689 cells), auto-generated quality-diagnostics table. |
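As an illustration of the gold-layer window functions, the sketch below computes an order sequence with ROW_NUMBER and a repeat interval with LAG. It uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies; the `orders` table, its columns, and the three sample rows are hypothetical simplifications, not the project's schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (customer_unique_id TEXT, order_id TEXT, purchase_date TEXT)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("c1", "o1", "2017-01-10"),
        ("c1", "o2", "2017-03-01"),
        ("c2", "o3", "2017-02-15"),
    ],
)

rows = con.execute(
    """
    SELECT
        customer_unique_id,
        order_id,
        -- order_seq = 1 marks the first purchase of each real customer
        ROW_NUMBER() OVER (
            PARTITION BY customer_unique_id ORDER BY purchase_date
        ) AS order_seq,
        -- days since the previous order; NULL for a first purchase
        CAST(
            julianday(purchase_date) - julianday(
                LAG(purchase_date) OVER (
                    PARTITION BY customer_unique_id ORDER BY purchase_date
                )
            ) AS INTEGER
        ) AS days_since_prev
    FROM orders
    ORDER BY customer_unique_id, order_seq
    """
).fetchall()
```

Partitioning on `customer_unique_id` rather than `customer_id` is the point of the silver-layer fix: the window has to follow the person, not the per-order identifier.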
03 · The plot twist
Two specifications. Two opposite answers. One of them is right.
The naive Binomial model compares treated = (post-cutover AND subtotal ≥ R$ 150)
against everything else. That conflates the policy with two other things: large baskets being
structurally slower to ship, and a common marketplace-wide time trend.
| Specification | Result | Recommendation |
| Naive (flat A/B) | −2.0 pp on-time delivery; 95% CI (−2.97, −1.74); p < 0.0001 | Kill the policy. Unambiguous — and confounded. |
| Difference-in-differences (hierarchical Bayes) | +1.5 pp; δ̄ = +0.169 logit; 94% HDI (+0.048, +0.289); P(>0) = 99.7% | Proceed, monitor reviews, target the categories where the per-category effect is most positive. |
A saturated logit on is_on_time ~ eligible + post + eligible × post
decomposes the −2 pp headline into the three components it was conflating:
| Coefficient | Estimate | What it captures |
| β_eligible | −0.194 logit | Basket-size structural slowness |
| β_post | −0.342 logit | Marketplace-wide time trend |
| δ̄ (policy) | +0.169 logit | The policy effect, freed of both |
The frequentist DiD gives +1.35 pp (p = 0.013) and the Bayesian DiD gives +1.50 pp, so the
two methods triangulate. The naive headline had the wrong sign: the positive policy effect was
masked by two larger confounders, both pulling on-time delivery down.
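A small worked example shows how the saturated logit separates the three components. The three coefficients are the ones reported above; the baseline intercept `ALPHA` is a hypothetical value chosen only to make the arithmetic concrete, and the "naive gap" is a simplified stand-in for the flat A/B contrast (treated cell against the untreated baseline cell only).

```python
import math


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


# Coefficients reported in the text (logit scale).
BETA_ELIGIBLE = -0.194
BETA_POST = -0.342
DELTA_POLICY = 0.169
ALPHA = 2.6  # hypothetical baseline intercept, NOT taken from the source


def cell_prob(eligible: int, post: int) -> float:
    """On-time probability in one of the four eligible x post cells."""
    eta = (ALPHA + BETA_ELIGIBLE * eligible + BETA_POST * post
           + DELTA_POLICY * eligible * post)
    return inv_logit(eta)


p = {(e, t): cell_prob(e, t) for e in (0, 1) for t in (0, 1)}

# Simplified naive contrast: treated cell versus the untreated baseline cell.
naive_gap_pp = 100 * (p[(1, 1)] - p[(0, 0)])

# Difference-in-differences on the logit scale recovers the interaction term.
did_logit = (logit(p[(1, 1)]) - logit(p[(1, 0)])) - (logit(p[(0, 1)]) - logit(p[(0, 0)]))

# Policy effect in percentage points: treated cell versus its own counterfactual.
counterfactual = inv_logit(ALPHA + BETA_ELIGIBLE + BETA_POST)
policy_pp = 100 * (p[(1, 1)] - counterfactual)
```

Under these assumptions the naive gap is negative while the policy effect is positive, which is the sign flip the section describes. The exact percentage-point values depend on the assumed intercept.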
04 · The three models
One pattern, three likelihoods, four outcomes.
Every model carries a varying intercept and varying treatment slope by product category,
non-centered priors for sampler geometry, and partial pooling so small categories borrow
strength. Only the likelihood changes to match the outcome.
i. Hierarchical Binomial DiD — on-time delivery (order-cell grain, 6,689 cells)
n_on_time ~ Binomial(n, p)
logit(p) = α_C + β_S + γ_G + β_e·E + β_p·P + δ_C·(E × P)
δ_C ~ Normal(δ̄, σ_δ); α_C ~ Normal(ᾱ, σ_α)
β_e (basket size) = −0.194 logit; β_p (time trend) = −0.342 logit;
δ̄ (policy effect) = +0.169 logit; P(δ̄ > 0) = 99.7%.
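The two ingredients of model i, the non-centered category slopes and the Binomial likelihood on the DiD linear predictor, can be sketched without a sampler. This is plain Python for illustration rather than the project's PyMC model; the value `sigma_delta = 0.1` and the random seed are assumptions.

```python
import math
import random


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def noncentered_slopes(delta_bar: float, sigma_delta: float, z: list) -> list:
    """delta_C = delta_bar + sigma_delta * z_C, with z_C ~ Normal(0, 1)."""
    return [delta_bar + sigma_delta * z_c for z_c in z]


def binomial_logpmf(k: int, n: int, p: float) -> float:
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def cell_loglik(k, n, alpha_c, beta_s, gamma_g, beta_e, beta_p, delta_c, E, P):
    """Log-likelihood of k on-time orders out of n in one panel cell."""
    eta = alpha_c + beta_s + gamma_g + beta_e * E + beta_p * P + delta_c * E * P
    return binomial_logpmf(k, n, inv_logit(eta))


# 73 category slopes drawn around the reported mean effect.
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(73)]
slopes = noncentered_slopes(delta_bar=0.169, sigma_delta=0.1, z=z)  # sigma assumed
```

Sampling the standardised offsets `z` and scaling them afterwards is what "non-centered" refers to: the sampler explores a unit-normal geometry even when `sigma_delta` is small.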
ii. Hurdle LogNormal — retention and repeat spend (per-customer grain, 2.3% repeat rate)
y = 0 with prob 1 − θ_i; y ~ LogNormal(μ_i, σ) if repeat
logit θ = α_C + τ_C·T; μ = β_C + δ_C·T + γ·log(first_subtotal)
Mean effect on P(repeat): δ̄_b = +0.065 logit (null). Mean effect on log spend: δ̄_l = +0.093
(a × 1.097 multiplier). 0 divergences. The naive −0.5 pp retention drop was an
observation-window censoring artifact, not a policy effect.
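The hurdle likelihood and the spend multiplier can be written out directly. The sketch below is a plain-Python rendering of the two-part density stated above; the function names are introduced here for illustration.

```python
import math


def lognormal_logpdf(y: float, mu: float, sigma: float) -> float:
    """Log-density of LogNormal(mu, sigma) at y > 0."""
    return (-math.log(y) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
            - (math.log(y) - mu) ** 2 / (2 * sigma ** 2))


def hurdle_loglik(y: float, theta: float, mu: float, sigma: float) -> float:
    """y = 0 with probability 1 - theta; otherwise theta * LogNormal(mu, sigma)."""
    if y == 0:
        return math.log1p(-theta)
    return math.log(theta) + lognormal_logpdf(y, mu, sigma)


# A +0.093 shift on the log-spend scale is a multiplicative effect on spend.
spend_multiplier = math.exp(0.093)
```

Because the spend part lives on the log scale, the coefficient is read as a ratio: exponentiating +0.093 gives the roughly 1.097 multiplier quoted above.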
iii. Ordered logit — review score 1–5 (per-order grain, 30k stratified sample)
y ~ OrderedLogit(φ, κ); φ = β_C + τ_C·T + γ_S + δ_M
κ[0] = 0 (anchored); κ[k] = κ[k−1] + HalfNormal(1)
δ̄ = −0.170 cumulative-logit; P(δ̄ < 0) = 99.7%;
κ ESS after the anchoring fix: 941–2,195; raw-scale effect −0.13 stars.
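The anchored-cutpoint construction can be illustrated in plain Python. The sketch assumes the common convention P(y ≤ k) = inv_logit(κ_k − φ); the increments and the baseline φ are hypothetical values, and only the −0.170 shift comes from the text.

```python
import math


def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def anchored_cutpoints(increments: list) -> list:
    """kappa[0] = 0; kappa[k] = kappa[k-1] + a strictly positive increment."""
    if any(d <= 0 for d in increments):
        raise ValueError("increments must be positive to keep cutpoints ordered")
    kappa = [0.0]
    for d in increments:
        kappa.append(kappa[-1] + d)
    return kappa


def ordered_logit_probs(phi: float, kappa: list) -> list:
    """P(y = 1..K) with K = len(kappa) + 1 and P(y <= k) = inv_logit(kappa_k - phi)."""
    cdf = [inv_logit(c - phi) for c in kappa] + [1.0]
    probs, prev = [], 0.0
    for c in cdf:
        probs.append(c - prev)
        prev = c
    return probs


def expected_score(probs: list) -> float:
    return sum(k * p for k, p in enumerate(probs, start=1))


kappa = anchored_cutpoints([0.8, 0.9, 1.2])         # hypothetical increments
control = ordered_logit_probs(2.0, kappa)           # hypothetical baseline phi
treated = ordered_logit_probs(2.0 - 0.170, kappa)   # shift reported in the text
```

Fixing the first cutpoint at zero removes the additive ambiguity between the cutpoints and the intercepts in φ, and building the rest from positive increments keeps them ordered by construction.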
Integrated finding across the four outcomes: the policy modestly lifts on-time delivery,
gives a plausible but uncertain conditional-spend lift, does not actually hurt retention (the
apparent drop is a 180-day censoring artifact), and costs about 0.13 stars in review score.
Of the four naive findings, two were misleading outright, one was misleading in magnitude,
and one was correct.
05 · Per-category heterogeneity
A flat A/B gives one number. The hierarchy gives a posterior per category.
The policy effect that averages +1.5 pp across the marketplace (δ̄) ranges, category by
category (δ_C), from +3.5 pp in furniture_decor down to −0.8 pp in consoles_games. Partial
pooling shrinks small-sample categories toward the global mean, so the ordering is robust to
thin cells. This is the Simpson-paradox-style heterogeneity that pooled tests cannot expose,
and it is the operational lever: a real platform deploys the policy where the per-category
effect is most positive, not everywhere at once.
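The shrinkage behaviour can be illustrated with the textbook normal-normal precision weighting. This is a simplification of what the full hierarchical posterior does, not the project's model, and every number other than the +0.169 global mean is hypothetical.

```python
def shrink(raw_effect: float, std_error: float,
           global_mean: float, between_sd: float) -> float:
    """Precision-weighted compromise between a category's raw estimate
    and the global mean (normal-normal approximation)."""
    weight = between_sd ** 2 / (between_sd ** 2 + std_error ** 2)
    return weight * raw_effect + (1 - weight) * global_mean


GLOBAL_MEAN = 0.169  # reported mean policy effect (logit)
BETWEEN_SD = 0.10    # hypothetical between-category spread

# Same raw estimate, different amounts of data behind it.
thick_category = shrink(0.40, std_error=0.05, global_mean=GLOBAL_MEAN, between_sd=BETWEEN_SD)
thin_category = shrink(0.40, std_error=0.50, global_mean=GLOBAL_MEAN, between_sd=BETWEEN_SD)
```

A category with a large standard error keeps little of its raw estimate and lands near the global mean, which is why an extreme value from a thin cell does not dominate the ranking.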
06 · Diagnostics & robustness
Every assumption the DiD rests on is tested out loud.
| Test | Result |
| Parallel trends | Pass. Pre-cutover slope-difference d = −0.003 logit/week, p = 0.33; 46,774 orders across 58 weeks. |
| Bunching at R$ 150 (McCrary) | Clean. Z = −13.9, a deficit of 478 orders above the threshold — a structural retail-pricing kink, not policy-induced bunching. |
| Prior sensitivity | Posterior dominates. δ̄ = +0.1699 / +0.1695 / +0.1696 under Exponential(1) / Exponential(2) / HalfNormal(1) priors; spread 0.0004 logit. |
| Sampler health | Hurdle and review fits: 0 divergences. Binomial DiD: 388 / 6,000 (6.5%), posterior means unaffected. R̂ = 1.00–1.01. PSIS-LOO Pareto-k 100% below 0.7. |
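Two of the figures in the table are simple arithmetic on the reported numbers and can be checked directly. The variable names below are introduced for illustration.

```python
# Posterior means of the policy effect under the three priors in the table.
delta_by_prior = {
    "Exponential(1)": 0.1699,
    "Exponential(2)": 0.1695,
    "HalfNormal(1)": 0.1696,
}
prior_spread = max(delta_by_prior.values()) - min(delta_by_prior.values())

# Share of Binomial DiD draws flagged as divergent.
divergence_rate = 388 / 6000
```

The spread comes to about 0.0004 logit and the divergence rate to about 6.5%, matching the table.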
07 · The business call
A positive per-customer lift. A sharply negative net envelope.
The Welch t-test on per-customer revenue is positive: +R$ 1.90 (p = 0.011). Sizing the policy
with the actual freight data turns the per-customer win into a marketplace-wide loss, because
the subsidy is paid on every eligible order, not only the marginal ones.
| Line | Value |
| N post-cutover eligible orders | 12,405 |
| Average freight per eligible order | R$ 36.83 |
| Subsidy cost | R$ 456,879 |
| Incremental GMV | R$ 21,921 |
| Incremental margin at 20% | R$ 4,384 |
| Net envelope | −R$ 452,495 |
Break-even contribution margin would need to be roughly 2,084%. The supported recommendations
at this design point are a phased rollout limited to the favourable category set, or holding
the policy pending a real pilot. A marketplace-wide launch is not warranted by the envelope.
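The envelope arithmetic can be reproduced from the table. The sketch uses the reported subsidy cost directly; multiplying the order count by the rounded average freight gives a figure a few reais lower, presumably because the average is rounded to two decimals.

```python
n_eligible_orders = 12_405
avg_freight_brl = 36.83          # rounded average from the table
subsidy_cost_brl = 456_879       # as reported
incremental_gmv_brl = 21_921
margin_rate = 0.20

# Sanity check on the subsidy line (differs slightly due to rounding).
approx_subsidy = n_eligible_orders * avg_freight_brl

incremental_margin = incremental_gmv_brl * margin_rate
net_envelope = incremental_margin - subsidy_cost_brl

# Margin rate at which incremental margin would equal the subsidy.
break_even_margin = subsidy_cost_brl / incremental_gmv_brl
```

The break-even ratio is about 20.84, or roughly 2,084%: the subsidy is around twenty times the incremental GMV, so no realistic margin can cover it.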
08 · Where the policy is worth running
The categories that win on-time are also the categories that lose reviews.
Top 5 categories where the policy lifts on-time delivery most: furniture_decor
(+3.54 pp), home_appliances_2 (+3.04 pp), auto (+2.98 pp),
home_appliances (+2.69 pp), furniture_living_room (+2.53 pp).
Top 5 categories where the policy hurts review scores most: sports_leisure
(−0.326 cum-logit), auto (−0.279), furniture_decor
(−0.258), baby (−0.247), consoles_games (−0.244).
auto and furniture_decor appear on both lists — heavy,
expensive items where free shipping helps most and where customer expectations rise the moment
the policy makes them attractive. The first wave that survives that tension:
fashion_shoes, office_furniture, watches_gifts. The four outcomes sit on three scales
(logit, log, cumulative-logit) that are not commensurable, so the analysis ranks per outcome
and lets the decision-maker apply their own weights.
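The overlap between the two top-5 lists can be checked mechanically. The values below are the ones quoted above; the first-wave set named in the text depends on the full per-category posteriors, which are not reproduced here.

```python
# Top 5 on-time lifts, in percentage points.
on_time_lift_pp = {
    "furniture_decor": 3.54,
    "home_appliances_2": 3.04,
    "auto": 2.98,
    "home_appliances": 2.69,
    "furniture_living_room": 2.53,
}

# Top 5 review-score penalties, on the cumulative-logit scale.
review_effect_cumlogit = {
    "sports_leisure": -0.326,
    "auto": -0.279,
    "furniture_decor": -0.258,
    "baby": -0.247,
    "consoles_games": -0.244,
}

# Categories that win on delivery but lose on reviews.
conflicted = sorted(set(on_time_lift_pp) & set(review_effect_cumlogit))

# On-time winners that do not appear among the worst review effects,
# ranked within their own outcome only.
on_time_only = [
    c for c in sorted(on_time_lift_pp, key=on_time_lift_pp.get, reverse=True)
    if c not in review_effect_cumlogit
]
```

The two dictionaries are never combined into one score, in line with the point that the scales are not commensurable; each ranking is taken within its own outcome.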
09 · Honest limits
Four caveats that travel with the headline numbers.
Synthetic treatment. Olist never ran this policy; the results characterise what the
inferential machinery would say if it had, and do not generalise to deployment without a
real pilot.
SUTVA in a marketplace. Sellers learn the policy and respond by re-pricing or by reordering
logistics priority; a seller-clustered RCT is the right answer.
180-day repeat-window censoring. Late-panel customers had fewer than 180 days of follow-up;
a Cox proportional-hazards or Weibull accelerated-failure-time model is the textbook fix.
Bunching the data cannot show. A real R$ 150 threshold would induce basket-padding, so the
conditional-spend posterior here likely underestimates what a live deployment would show.
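The censoring caveat can be made concrete with a follow-up check. The sketch only flags customers whose 180-day window is cut off by the end of the panel; it does not implement a survival model, and the panel end date used in the example is hypothetical.

```python
from datetime import date, timedelta

REPEAT_WINDOW = timedelta(days=180)


def has_full_followup(first_purchase: date, panel_end: date) -> bool:
    """True if the customer could be observed for the whole repeat window."""
    return first_purchase + REPEAT_WINDOW <= panel_end


panel_end = date(2018, 8, 31)  # hypothetical end of the observation panel
early_customer = has_full_followup(date(2018, 1, 1), panel_end)
late_customer = has_full_followup(date(2018, 6, 1), panel_end)
```

Customers flagged `False` can only ever be observed as non-repeaters within a shortened window. That mechanism is how a naive retention comparison can show a drop that is not a policy effect.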
10 · What this project demonstrates
Not just a model fit. An end-to-end inferential argument.
Causal-inference framing (DAGs, the four elemental confounds, adjustment sets, mediator and
collider hygiene). Hierarchical Bayesian modelling in PyMC (varying intercepts and slopes,
non-centered priors, partial pooling, hurdle mixtures, ordered logit with anchored cutpoint
identification). Quasi-experimental design (difference-in-differences with formal
parallel-trends testing and McCrary-style density-continuity diagnostics). Posterior
diagnostics (PSIS-LOO, Pareto-k, effective sample size, R-hat, divergence accounting,
posterior-predictive checks, prior-sensitivity sweeps). Production-shaped data engineering
(a medallion bronze-silver-gold-analytics pipeline on DuckDB with window-function feature
derivation and a CI-tested codebase). And translating a posterior into a decision: a
cost-benefit envelope built off posterior means, multi-objective per-category ranking, and
uncertainty communicated to non-technical readers.
Repository ·
github.com/thylinao1/Olist-Bayesian-AB-Testing
— medallion ETL, three PyMC model factories, the DiD specification, prior-sensitivity
and parallel-trends scripts, the cost-benefit envelope generator, and a 17-test pytest suite
running in CI.