M. Silchenko · Case Study
BCG X · PowerCo SME
project: PowerCo SME Churn · origin: BCG X · Forage · stack: scikit-learn · imbalanced-learn · scikit-survival · author: Maksim Silchenko
A case study in decision-aware modelling

One threshold change,
£15.9M in expected savings.

On a 14,606-customer PowerCo panel with a 9.7% churn rate, a textbook Random Forest at the default 0.5 threshold catches roughly 5% of churners and looks "91% accurate." Reframing churn as a £-denominated decision problem with the task brief's cost matrix collapses the operating threshold to the grid floor at t ≈ 0.01, lifts recall to ~99%, and cuts expected spend on the test fold by ~£15.9M. The threshold is sitting in a corner. That is the most interesting finding.

~£15.9M
Test-fold expected-cost reduction vs. RF at t = 0.5 (single 2015 snapshot).
~100%
Customer contact rate at the cost-optimal threshold: corner solution, not selective.
~0.99
Test-fold recall at t ≈ 0.01 (1.00 after isotonic calibration).
48
Pytest unit tests on src/ helpers, in CI on Python 3.10 and 3.11.
01 · The question

Which PowerCo SMEs should the retention team actually contact, and what is one wrong contact worth?

The PowerCo brief ships as a generic classification task: predict which SMEs will churn. The interesting move is to refuse the brief as stated. A churn probability is not an action; a decision threshold is. And a decision threshold is not free.

The project is framed as the inferential machinery a retention team would actually deploy: a £-denominated expected-cost objective, a threshold tuned on a held-out validation fold, and a test fold touched exactly once for the headline number.

Three cost levers drive the whole pipeline: CLV (cost of a missed churner), campaign cost (one retention contact), and retention rate (probability that a contacted churner is saved). The TP cell is a derived quantity, not an input. Every reported number is a function of those three.

Cost matrix · per outcome
TP = campaign − CLV × retention = £500 − £50k × 0.3 = −£14,500
Predicted action      | Actual y = 0 (stays)              | Actual y = 1 (churns)
ŷ = 0 · no-contact    | TN · £0 · correct no-contact      | FN · £50,000 · missed churner (lost CLV)
ŷ = 1 · contact       | FP · £500 · unneeded contact      | TP · −£14,500 · saved customer (net)
FN : FP cost ratio = 100 : 1
source: task brief · TP: derived from campaign − CLV × retention_rate · objective: minimise total expected cost on the validation fold
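A minimal sketch of how the four cells follow from the three levers. The class name `CostMatrix` and its field names are illustrative assumptions, not taken from the repository; the default values are the brief's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostMatrix:
    """Per-outcome costs in GBP, derived from the three levers (hypothetical names)."""
    clv: float = 50_000.0      # cost of a missed churner (FN cell)
    campaign: float = 500.0    # cost of one retention contact (FP cell)
    retention: float = 0.30    # P(save | contacted churner)

    @property
    def tn(self) -> float:
        return 0.0  # correct no-contact costs nothing

    @property
    def fn(self) -> float:
        return self.clv

    @property
    def fp(self) -> float:
        return self.campaign

    @property
    def tp(self) -> float:
        # Derived cell: pay for the contact, recover CLV with probability `retention`.
        return self.campaign - self.clv * self.retention


brief = CostMatrix()
print(f"TP cell: {brief.tp:,.0f}")
print(f"FN : FP ratio: {brief.fn / brief.fp:.0f} : 1")
```

Keeping the TP cell as a property rather than a stored field means any re-priced scenario only has to change the three inputs.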
02 · Data and feature pipeline

Pure, unit-tested feature functions. The test fold is touched once.

The shared logic lives under src/ behind a 48-test pytest suite running in CI on Python 3.10 and 3.11. The notebooks orchestrate; the library does the work. Anything that depends on the test fold raises if you call it before the very last cell.

01 · ingest
Two CSVs, one snapshot, pinned types
client_data.csv (14,606 rows, contract + consumption) joined to price_data.csv (off-peak / peak / mid-peak by period). The 2015 snapshot is the entire universe, so genuine out-of-time validation is not possible; a pseudo out-of-time split on activation date is used instead, and that limitation is named up front.
14,606 customers · 9.7% churn prevalence · 26 raw columns
02 · features
Temporal, volatility, log-transforms
tenure_days and days_to_end from contract dates against a 2016-01-01 reference. Per-row mean, max and variance over six price-change columns. Margin-per-kWh and consumption-change-rate with safe denominators. log1p on any covariate whose skewness exceeds 1.0.
build_feature_frame() · pure, unit-tested · no I/O
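The feature rules above could look roughly like this. It is a NumPy-only sketch, not the repository's build_feature_frame(): the helper names are hypothetical, and the skewness here is the plain moment estimator, which may differ slightly from whatever estimator the project uses.

```python
import numpy as np

REFERENCE_DATE = np.datetime64("2016-01-01")  # reference date stated in the case study


def days_between(start, end) -> np.ndarray:
    """Whole days from `start` to `end` (datetime64 arrays or scalars)."""
    return ((end - start) / np.timedelta64(1, "D")).astype(int)


def safe_ratio(num, den) -> np.ndarray:
    """Element-wise num / den, returning 0.0 where the denominator is zero."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.zeros_like(num)
    np.divide(num, den, out=out, where=den != 0)
    return out


def skewness(x) -> float:
    """Moment-based sample skewness m3 / m2**1.5 (0.0 for a constant column)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    m2 = np.mean(centred ** 2)
    m3 = np.mean(centred ** 3)
    return float(m3 / m2 ** 1.5) if m2 > 0 else 0.0


def log1p_if_skewed(x, threshold: float = 1.0) -> np.ndarray:
    """Apply log1p only when skewness exceeds `threshold`; assumes non-negative values."""
    x = np.asarray(x, dtype=float)
    return np.log1p(x) if skewness(x) > threshold else x


# Toy inputs (made up, not PowerCo rows).
activation = np.array(["2013-01-01", "2015-07-01"], dtype="datetime64[D]")
tenure_days = days_between(activation, REFERENCE_DATE)
margin_per_kwh = safe_ratio([10.0, 5.0], [20.0, 0.0])
consumption = log1p_if_skewed([1.0, 1.0, 1.0, 1.0, 100.0])
```

All four helpers are pure functions with no I/O, which is what makes them straightforward to unit-test.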
03 · split
60 / 20 / 20 stratified, the test fold sealed
Train, validation, test stratified on churn. Threshold selection lives entirely on the validation fold; the test fold is opened exactly once for the headline metrics. Everything else (CV, sensitivity sweeps, model selection) runs on train + validation only.
train 8,763 · val 2,921 · test 2,922 · prevalence held
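One common way to get a stratified 60 / 20 / 20 split is two chained calls to scikit-learn's train_test_split. This is a sketch on placeholder data, not necessarily how the repository does it; the churner count of 1,417 is an illustrative round-off of 9.7% and the random_state is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_customers, n_churners = 14_606, 1_417   # ~9.7% prevalence (churner count is illustrative)
y = np.zeros(n_customers, dtype=int)
y[:n_churners] = 1
X = np.arange(n_customers).reshape(-1, 1)  # placeholder feature matrix

# First carve off the sealed 20% test fold, stratified on churn.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Then split the remaining 80% into 60 / 20 (0.25 of the remainder is the validation fold).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

for name, fold in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name:>5}: n={len(fold):,}  prevalence={fold.mean():.3f}")
```

With scikit-learn's rounding rules this should give folds of 8,763, 2,921 and 2,922 rows, matching the sizes quoted above.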
04 · resample
SMOTE inside the pipeline, not before it
SMOTE is wrapped in an imblearn.pipeline.Pipeline with the classifier so synthetic samples are regenerated inside every CV fold. Resampling outside the fold leaks the held-out distribution into training, a quiet but common mistake. The test fold stays at the natural 9.7% prevalence.
imblearn.pipeline.Pipeline([(smote), (rf)]) · refit per fold
03 · The plot twist

Same model, same data. The threshold walks to the grid floor.

The default predict() uses a 0.5 cutoff. On a 9.7% churn panel the classifier almost never crosses it, so almost every churner is missed. Each missed churner costs £50,000 in lost CLV. With a £500 contact cost on the other side, the cost-optimal threshold collapses to the bottom of the 0.01-0.99 search grid and the policy contacts almost every customer in the book.
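A back-of-envelope check on why accuracy is the wrong headline at this prevalence. The sketch below scores a policy that never contacts anyone, using the test-fold size and churner count reported in this case study; it is an illustration, not output from the project's pipeline.

```python
n_customers = 2_922   # test-fold size from the case study
n_churners = 283      # test-fold churners from the case study

# "Never contact" policy: every non-churner is a TN, every churner is a FN.
tn, fp, fn, tp = n_customers - n_churners, 0, n_churners, 0

accuracy = (tn + tp) / n_customers
recall = tp / (tp + fn)
lost_clv = fn * 50_000  # FN cost from the brief's cost matrix

print(f"accuracy={accuracy:.3f}  recall={recall:.2f}  lost CLV=£{lost_clv:,}")
```

A policy with zero recall already scores about 0.90 on accuracy, so the ~0.91 of the default Random Forest says very little about how many churners it catches.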

default · what predict() at t = 0.5 gives
~5%
test-fold recall · accuracy ~0.91 · contact rate ~1% · cost dominated by FN
The headline a stakeholder reads as "91% accurate" is a model that catches roughly one in twenty churners. With CLV = £50k, that is the most expensive operating point on the curve.
cost-opt · what the £-objective gives
~99%
recall at t ≈ 0.01 · validated on a disjoint fold · contact rate ~100%
Same probabilities, different cutoff. Recall rises roughly twentyfold (from ~5% to ~99%), accuracy drops, and the policy ends up contacting essentially the whole book. The cost matrix has a corner solution and the threshold step finds it.
Figure: cost and recall vs. threshold · single sweep on the test fold
left axis: expected cost (£M, test fold), negative = net benefit · right axis: recall · source: sweep_thresholds() over the SMOTE+RF pipeline, grid step 0.01
The model was never the problem. The default threshold was, and on imbalanced churn data with this cost matrix, defaults push you to a corner solution.
M. Silchenko · case-study verdict
04 · Three increments, one decision

Three pipelines, ablated cleanly. Only one of them moves the cost.

The increments are stacked in the order a careful practitioner would try them: a baseline first, then the standard imbalance fix, then the decision-theoretic step. The cost reduction is not where the literature would expect it.

i.
Random Forest baseline
test fold · 2,922 customers
step · ACCURACY-OPTIMISED
RandomForestClassifier(
  n_estimators = 100,
  max_depth    = 20,
  min_samples_split = 5,
  random_state = 42,
)

predict()  -> threshold = 0.5
Test accuracy: ~0.91
Test recall: ~0.05
Contact rate: ~1%
Honest verdict: useless as deployed
ii.
SMOTE + Random Forest
imblearn pipeline · SMOTE per fold
step · CLASS BALANCE
Pipeline([
  ("smote", SMOTE(random_state=42)),
  ("rf",    RandomForestClassifier(...)),
])

# probabilities only; threshold
# still selected downstream
Resampling location: inside fold
Probability calibration: biased post-SMOTE
Cost reduction vs. baseline: ~£1.2M
Recall delta over threshold step: small
iii.
Cost-optimal threshold
validation fold · 1-D sweep
step · EXPECTED-COST MINIMUM
grid = np.linspace(0.01, 0.99, 99)  # 0.01, 0.02, ..., 0.99 inclusive
cost(t) = FN(t)·£50k
        + FP(t)·£500
        + TP(t)·(−£14.5k)

t* = argmin_t  cost(t)
   = 0.01  (grid floor)
t* (validation): 0.01 · boundary
Recall at t*: ~0.99
Contact rate at t*: ~100%
CV stability (5-fold): σ(t*) ≈ 0
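The one-dimensional sweep in step iii can be sketched as follows. The function name `sweep_expected_cost` is hypothetical and is not the repository's sweep_thresholds(); the toy labels and probabilities are made up so that the minimum lands on the grid floor, as it does in the case study.

```python
import numpy as np


def sweep_expected_cost(y_true, proba, grid,
                        fn_cost=50_000.0, fp_cost=500.0, tp_cost=-14_500.0):
    """Return (costs, t_star): total expected cost per threshold and its argmin."""
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    costs = np.empty(len(grid))
    for i, t in enumerate(grid):
        contact = proba >= t                  # contact everyone at or above the threshold
        tp = np.sum(contact & y_true)         # contacted churners
        fp = np.sum(contact & ~y_true)        # contacted non-churners
        fn = np.sum(~contact & y_true)        # missed churners
        costs[i] = fn * fn_cost + fp * fp_cost + tp * tp_cost
    # np.argmin returns the first minimum, so ties resolve to the lowest threshold.
    return costs, float(grid[int(np.argmin(costs))])


grid = np.round(np.linspace(0.01, 0.99, 99), 2)  # 0.01 ... 0.99 inclusive

# Toy validation fold (not PowerCo data): one churner has a very low score.
y_val = [0, 0, 0, 1, 1]
p_val = [0.10, 0.20, 0.60, 0.05, 0.70]
costs, t_star = sweep_expected_cost(y_val, p_val, grid)
print(f"t* = {t_star:.2f}, cost at t* = {costs.min():,.0f}")
```

In this toy fold a single missed churner costs more than contacting every non-churner, which is the same mechanism that drives the real threshold to the floor.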

The integrated finding: SMOTE contributes a real but modest cost reduction (~£1.2M on the test fold). The threshold step contributes the rest. The story is decision-theoretic, not resampling. A Random Survival Forest run in parallel reaches a held-out concordance of ~0.71 on time-to-churn, well above the ~0.56 of the linear Cox model it replaced; it is reported as a complement to the classifier, not the lead.

★ interactive · cost-matrix studio

Move the three numbers. Watch the recommendation move.

The whole report is a function of CLV, campaign cost and retention rate. Drag the sliders to test whether the cost-optimal threshold survives a different set of assumptions. The cost curve, the cost-optimal threshold, the operational profile, and the business envelope all recompute live.

CLV (cost of a missed churner): £50k · slider range £10k to £200k
Campaign cost (one contact): £500 · slider range £100 to £5,000
Retention rate (P(save | contact)): 30% · slider range 5% to 90%
TP cell (derived): −£14,500 · campaign − CLV × retention
Cost-optimal t*: 0.01 · grid floor · corner solution
Expected cost at t*: −£2.66M · test fold (2,922 customers)
Contact rate at t*: ~100% · 2,919 of 2,922
Churners caught: 281 / 283 · 99.3% recall on the test fold

model: SMOTE + Random Forest predicted probabilities on the held-out test fold · grid: threshold 0.01..0.99, step 0.01 · readouts: recomputed live from the confusion-matrix counts at t*

05 · Cost-parameter sensitivity

The threshold moves with the cost levers. The studio above is the live version of this.

The headline scenario fixes CLV at £50k, campaign cost at £500, and retention success at 30%. None of those is observed; all of them are operating assumptions from the original task brief.

Two directional facts the static charts on the right confirm. The cost-optimal threshold falls with CLV: more expensive churners make the optimiser more aggressive about contact, so it slides toward the floor. The cost-optimal threshold rises with campaign cost: only above roughly £1,500 per contact does universal contact stop being worthwhile. The retention-rate panel flips axes: it shows the net expected value at t*, crossing zero around 19%.

Every recommendation in this report can be re-priced by swapping the three constants, and the studio above is the live tool for doing that.

Each panel below is a one-dimensional sweep with the other two levers pinned at the headline values, computed off the actual test-fold confusion-matrix counts.

The shaded band on each threshold panel is t = 0.01 ± 0.005, the region the cost-optimum stays in for most of the sweep. The right panel flips axes to show net expected value at t* as the retention success rate is varied.

Figure: sensitivity · cost-optimal threshold (or net value) vs. lever
left, middle: cost-optimal t* swept per cost lever · band: t = 0.01 ± 0.005 around the boundary minimum · right: net expected value at t* as retention success rate varies (zero around 19%)
06 · Diagnostics & robustness

Every choice that could leak signal is tested out loud.

Resampling location, calibration after SMOTE, the survival model's held-out concordance, the boundary nature of the cost-optimal threshold, and the out-of-time split by activation date. Each is reported with the test that would expose it, not a hand-wave.

Resampling location
leak-safe
SMOTE lives inside imblearn.pipeline.Pipeline, so synthetic samples are regenerated inside every CV fold. The held-out fold's distribution never appears in training.
5-fold t*: 0.01 · floor on every fold
5-fold recall @ t*: 0.99 ± 0.01
test-fold prevalence: 9.7% · natural
Calibration after SMOTE
shipped
src/calibration.py + notebook §10 fit an isotonic regression on a disjoint holdout. Mean predicted probability drops from 0.165 (post-SMOTE, biased upward) to 0.097, matching the empirical churn rate. The business decision survives calibration.
mean ŷ · uncalibrated: 0.165
mean ŷ · isotonic: 0.097
empirical churn rate: 0.097
recall @ t* · post-calibration: 1.00
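A minimal sketch of the calibration step using scikit-learn's IsotonicRegression. The data is synthetic (scores deliberately inflated, loosely mimicking the post-SMOTE bias) and the variable names are assumptions; this is not the code in src/calibration.py.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic holdout (not PowerCo data): ~10% positives, raw scores biased upward.
y_hold = (rng.random(2_000) < 0.10).astype(int)
raw = np.clip(0.15 + 0.25 * y_hold + rng.normal(0.0, 0.10, size=2_000), 0.0, 1.0)

# Monotone map from raw score to calibrated probability, fitted on the holdout only.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw, y_hold)
calibrated = iso.predict(raw)

print(f"mean raw score:        {raw.mean():.3f}")
print(f"mean calibrated score: {calibrated.mean():.3f}")
print(f"empirical rate:        {y_hold.mean():.3f}")
```

On the data it was fitted to, isotonic regression reproduces the empirical positive rate in the mean, which is the same check the card reports. Because the map is monotone it never reverses the order of two customers (ties can appear), so a cost-optimal policy that contacts nearly everyone tends to survive calibration.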
Survival model
held-out 0.71
A Random Survival Forest on 15 curated covariates, with contract tenure as the duration and churn as the event. It reaches a concordance index of ~0.71 on a held-out fold, replacing an earlier linear Cox model that managed only ~0.56. Electricity-margin features lead its permutation importance, agreeing with the classifier.
RSF concordance · held-out: ~0.71
Cox PH concordance · replaced: ~0.56
interpretation: complement to RF · not primary
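For readers unfamiliar with the concordance index, here is a simplified pure-Python version of Harrell's C. It is a teaching sketch that skips tied event times; the project itself presumably relies on scikit-survival (which provides concordance_index_censored), and the toy data below is made up.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data (tied times are skipped).

    A pair (i, j) is comparable when subject i had the event and did so before
    time j. It is concordant when the model gave i the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable


# Toy data: tenure in arbitrary units, event = 1 means churn was observed.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]

perfect = harrell_c(times, events, risk=[4, 3, 2, 1])
inverted = harrell_c(times, events, risk=[1, 2, 3, 4])
uninformative = harrell_c(times, events, risk=[1, 1, 1, 1])
print(perfect, inverted, uninformative)
```

A value of 0.5 is what a constant risk score achieves, which puts the ~0.56 of the Cox model and the ~0.71 of the Random Survival Forest in context.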
Boundary cost-optimum
corner finding
The 0.01-step grid puts the cost minimum at its lowest evaluated point, and the cost curve is still falling there. Extending the grid below 0.01 can therefore only push the argmin closer to zero; it barely changes the cost, because at t = 0.01 the policy already contacts almost everyone. Under this cost matrix, the policy is universal contact, not selective targeting.
grid step / range: 0.01 · [0.01, 0.99]
argmin location: grid floor
contact rate at t*: ~100%
implication: operational profile, not threshold
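The corner solution can also be read straight off the cost matrix. The sketch below derives the per-customer break-even churn probability under the brief's cells; it assumes the probabilities are calibrated, and the function name is hypothetical.

```python
def break_even_probability(clv: float, campaign: float, retention: float) -> float:
    """Churn probability above which contacting is cheaper in expectation.

    Expected cost of contact:    p * (campaign - clv * retention) + (1 - p) * campaign
    Expected cost of no contact: p * clv
    Setting the two equal and solving for p gives campaign / (clv * (1 + retention)).
    """
    return campaign / (clv * (1.0 + retention))


p_star = break_even_probability(clv=50_000, campaign=500, retention=0.30)
print(f"break-even churn probability: {p_star:.4f}")
```

With the brief's values the break-even probability is about 0.0077, which sits below the 0.01 grid floor. Any customer with even a 1% churn probability is worth contacting, so the sweep has nowhere to go but the boundary.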
Out-of-time validation · split by contract activation date
drift check
A genuine out-of-time hold-out is impossible on a single 2015 snapshot, so the data is re-split by contract activation date: the model trains on the earliest ~80% of activations and is scored on the most recent ~20%, which churn materially more (~14% versus ~10%). The cost-optimal threshold stays on the grid floor, so the policy is robust to drift. The classifier's discrimination is not: out-of-time AUC falls from ~0.67 on a random split to ~0.62. The cost reduction stays large on the later cohort, but that is driven by its higher churn prevalence, not by a better model.
out-of-time t*: ~0.01 · unchanged from random split
out-of-time test AUC: ~0.62 (random split ~0.67)
later-cohort churn rate: ~14% (vs ~10% overall)
implication: retrain on a rolling window
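The activation-date re-split can be sketched in a few lines of NumPy. The function name and the toy dates are assumptions for illustration; the repository's own implementation may differ.

```python
import numpy as np


def split_by_activation(activation_dates, train_frac: float = 0.80):
    """Return (train_idx, test_idx): earliest `train_frac` of activations vs. the rest."""
    dates = np.asarray(activation_dates, dtype="datetime64[D]")
    order = np.argsort(dates, kind="stable")  # oldest activation first
    cut = int(len(order) * train_frac)
    return order[:cut], order[cut:]


# Toy activation dates (made up).
dates = np.array(
    ["2012-03-01", "2009-06-15", "2014-11-30", "2010-01-20", "2013-08-05"],
    dtype="datetime64[D]",
)
train_idx, test_idx = split_by_activation(dates)
print("train:", dates[train_idx])
print("test: ", dates[test_idx])
```

Every training activation precedes every test activation, but the churn labels are still observed at the same calendar moment, which is why the case study calls this a pseudo out-of-time check.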
★ interactive · operating-point inspector

Drag the threshold. Read off the policy.

The cost-optimal threshold lives at the grid floor under the brief's cost matrix, but a real deployment might still ship a different operating point (contact-budget constraint, ops capacity, communications policy). The slider below scans the threshold across the grid and reports the resulting policy: confusion-matrix counts, recall, precision, contact rate, and expected cost at the brief's cost matrix.

decision threshold: 0.05 · slider ticks 0.01 · 0.50 · 0.99
TN · correct no-contact: 3
FN · missed churner: 4
FP · unneeded contact: 2,636
TP · saved customer (net): 279
Recall: 98.6%
Precision: 9.6%
Contact rate: 99.7%
Expected cost: −£2.53M
Net benefit vs. RF @ 0.5: £15.77M saved

cost matrix: brief values · CLV £50k · campaign £500 · retention 30% · data: SMOTE + Random Forest predicted probabilities on the held-out test fold
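The inspector's readouts follow directly from the four confusion-matrix counts. A small sketch, with a hypothetical function name, that recomputes them for the counts shown at t = 0.05:

```python
def operating_point(tn: int, fn: int, fp: int, tp: int,
                    fn_cost: float = 50_000.0,
                    fp_cost: float = 500.0,
                    tp_cost: float = -14_500.0) -> dict:
    """Policy metrics and expected cost from confusion-matrix counts.

    Assumes at least one churner and at least one contact, so no denominator is zero.
    """
    n = tn + fn + fp + tp
    return {
        "n": n,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "contact_rate": (tp + fp) / n,
        "expected_cost": fn * fn_cost + fp * fp_cost + tp * tp_cost,
    }


# Counts displayed by the inspector at t = 0.05.
point = operating_point(tn=3, fn=4, fp=2_636, tp=279)
for key, value in point.items():
    print(f"{key:>14}: {value:,.4f}")
```

Swapping in different cost arguments re-prices the same operating point without touching the model.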

07 · The business call

A one-shot envelope, tested against temporal drift.

On the held-out test fold of 2,922 customers, the cost-tuned policy cuts expected spend by roughly £15.9M vs. the default 0.5 threshold, and tips the headline number negative (a net benefit) under the brief's cost matrix.

To test whether that survives temporal drift, the data is re-split by contract activation date: the model trains on the earliest ~80% of activations and is scored on the most recent ~20%. The cost-optimal threshold stays on the grid floor, so the policy is robust. The model's discrimination is not: out-of-time AUC falls from ~0.67 to ~0.62. The saving stays large on the later cohort only because it churns more (~14% vs. ~10%), not because the model improved.

The supported recommendations: ship the cost-tuned threshold as the operating point, treat the £15.9M as an upper bound on the single-snapshot prize, and retrain on a rolling window because discrimination visibly decays on newer cohorts.

What I would do next: a genuine forward-label out-of-time window, uplift modelling on a pilot subset, and a re-priced cost matrix once retention success is observed rather than assumed.

Test-fold customers: 2,922
Test-fold churners (~9.7%): ~283
CLV per missed churner (FN): £50,000
Contact cost per FP: £500
TP cell (derived): −£14,500
Expected cost · RF @ t = 0.5: £13.24M
Expected cost · SMOTE + RF @ 0.5: £12.08M
Expected cost · t* ≈ 0.01: −£2.66M
Net cost reduction vs. RF baseline: ~£15.9M
Out-of-time test AUC: ~0.62 · random split ~0.67
Out-of-time t*: ~0.01 · threshold holds
08 · What is actually driving churn

Acquisition channel first. Margin and consumption second. Price volatility third.

Permutation importance on the test fold against the frozen pipeline ranks the anonymised channel_sales and origin_up codes first, by a wide margin. The strongest predictor of churn in this dataset is acquisition channel, not price: which channel a customer was acquired through says more about whether they leave than any contract feature in the file.

Below the channel block sits an economic tier: net_margin, log(cons_12m), var_year_price_off_peak. SMEs that look financially stretched on the existing contract churn more, and off-peak price variance is a supporting signal. The Random Survival Forest agrees: it ranks the electricity-margin features at the top of its own permutation importance.

This is a useful but uncomfortable finding. The brief expects a price story; the data tells a channel story.

The forest plot below shows permutation importance on the held-out test fold with 95% bootstrap intervals, sorted by mean. Rows in the accent colour are the anonymised acquisition-channel codes that dominate the top of the list; the remaining rows are the economic / contract features the original brief assumed would lead.

Permutation importance is model-agnostic and avoids the bias toward high-cardinality features that impurity-based tree importances suffer from. The exact code strings are anonymised in the dataset; the labels here are truncated to fit.

Figure: permutation importance · test fold · frozen pipeline
whiskers: 95% bootstrap interval · dot: mean Δ accuracy under permutation · accent rows: anonymised acquisition channel and origin codes
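A minimal sketch of the permutation-importance computation with scikit-learn, on synthetic data where one column carries the signal and the other is noise. The percentile spread over permutation repeats is used here as a stand-in for the bootstrap intervals in the figure; that substitution is an assumption, not the project's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data (not PowerCo): column 0 determines the label, column 1 is pure noise.
X = rng.normal(size=(600, 2))
y = (X[:, 0] > 0).astype(int)
X_fit, X_eval, y_fit, y_eval = X[:400], X[400:], y[:400], y[400:]

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Shuffle one column at a time on held-out rows and record the drop in accuracy.
result = permutation_importance(
    model, X_eval, y_eval, scoring="accuracy", n_repeats=30, random_state=42
)

# result.importances has shape (n_features, n_repeats).
low, high = np.percentile(result.importances, [2.5, 97.5], axis=1)
for name, mean, lo_, hi_ in zip(["signal", "noise"], result.importances_mean, low, high):
    print(f"{name:>6}: {mean:+.3f}  [{lo_:+.3f}, {hi_:+.3f}]")
```

Scoring on held-out rows with a frozen model, as the case study does, keeps the importance estimate from rewarding features the model merely memorised.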
09 · Honest limits

Five caveats that travel with the headline numbers.

01
One snapshot, pseudo out-of-time only

The dataset is a single 2015 cross-section of PowerCo SMEs. The £15.9M cost reduction is a one-shot characterisation of the rule on this fold, not a forecast and not annualised. The out-of-time check splits on activation date, but every churn label is still observed at the same calendar moment, so it is a pseudo out-of-time test, not a substitute for a genuine forward-label window.

02
Corner-solution cost matrix

Under the brief's cost matrix the cost-optimal threshold lives on the grid floor and the policy contacts essentially the whole customer base. That is a property of the matrix (FN 100× more expensive than FP), not of the model. Before any deployment the matrix needs to be re-priced against observed contact and retention costs.

03
Assumed cost parameters

CLV £50k, campaign cost £500, retention rate 30% are operating assumptions from the original task brief. The sensitivity sweep and the Cost-Matrix Studio show directionally how the threshold responds; none of those three constants is observed in the data.

04
Missing context columns

The dataset has no service-interaction history, no competitor pricing, and no contract-change events. Price sensitivity in this report is limited to the absolute price levels visible in the 2015 snapshot; the behavioural side of churn is, by construction, out of scope.

05
Survival model carries some overfitting

The Random Survival Forest reaches ~0.71 concordance on a held-out fold, a real improvement on the ~0.56 of the linear Cox model it replaced. Its training concordance is higher still, so it carries some overfitting and is reported as a complement to the classifier, not a standalone retention model.

10 · What this project demonstrates
01
Decision-aware modelling
Framing classification as a £-denominated expected-cost objective; selecting the operating threshold by minimising cost on a disjoint validation fold; naming corner solutions when they appear.
02
Leak-safe resampling
SMOTE inside imblearn.pipeline.Pipeline so synthetic samples are regenerated per fold; natural-prevalence test fold preserved; no leakage path between resampling and CV.
03
Probability calibration
Isotonic regression on a disjoint holdout to undo SMOTE's prevalence bias; calibration result verified against the empirical churn rate.
04
Survival analysis
Random Survival Forest on a curated covariate set with contract tenure as the duration; held-out concordance ~0.71, replacing a linear Cox model that reached only ~0.56. Validated out-of-time by activation date.
05
Cross-validated robustness
5-fold stratified CV reporting mean and standard deviation of the cost-optimal threshold; one-dimensional sensitivity sweeps on every cost lever.
06
Production-shaped Python
Pure feature functions under src/, a 48-test pytest suite in CI on Python 3.10 and 3.11, dataclass result containers, and a notebook layer that is only orchestration.
End of case study

Read the full report, the notebooks, and the test suite.

The repository contains the three modelling notebooks, the pure feature and evaluation helpers under src/, the isotonic-calibration module, the cost-sensitivity sweeps, the Random Survival Forest, the out-of-time validation, and a 48-test pytest suite running in CI against Python 3.10 and 3.11.
