On a 14,606-customer PowerCo panel with a 9.7% churn rate, a textbook Random Forest at the default 0.5 threshold catches roughly 5% of churners and looks "91% accurate." Reframing churn as a £-denominated decision problem with the task brief's cost matrix collapses the operating threshold to the grid floor at t ≈ 0.01, lifts recall to ~99%, and cuts expected spend on the test fold by ~£15.9M. The threshold is sitting in a corner. That is the most interesting finding.
The PowerCo brief ships as a generic classification task: predict which SMEs will churn. The interesting move is to refuse the brief as stated. A churn probability is not an action; a decision threshold is. And a decision threshold is not free.
The project is framed as the inferential machinery a retention team would actually deploy: a £-denominated expected-cost objective, a threshold tuned on a held-out validation fold, and a test fold touched exactly once for the headline number.
Three cost levers drive the whole pipeline: CLV (cost of a missed churner), campaign cost (one retention contact), and retention rate (probability that a contacted churner is saved). The TP cell is a derived quantity, not an input. Every reported number is a function of those three.
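As a minimal sketch of how the three levers could populate the cost matrix, the snippet below derives the TP cell as the contact cost minus the expected CLV recovered. The names `CostLevers` and `cost_matrix` are illustrative and not taken from the repository, and the TP derivation is an assumption chosen because it reproduces the −£14.5k term in the cost function at the headline values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLevers:
    clv: float = 50_000.0          # cost of a missed churner (FN)
    campaign_cost: float = 500.0   # cost of one retention contact
    retention_rate: float = 0.30   # P(contacted churner is saved)


def cost_matrix(levers: CostLevers) -> dict[str, float]:
    """Per-customer cost of each confusion-matrix cell, in pounds."""
    return {
        "TN": 0.0,                   # not contacted, stays
        "FP": levers.campaign_cost,  # contacted, would have stayed anyway
        "FN": levers.clv,            # missed churner
        # Assumed derivation: pay for the contact, recover the CLV with
        # probability retention_rate. Negative means a net benefit.
        "TP": levers.campaign_cost - levers.retention_rate * levers.clv,
    }


matrix = cost_matrix(CostLevers())
```

Changing any one lever and rebuilding the matrix is enough to re-price every downstream number.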
The shared logic lives under src/ behind a 48-test pytest suite running in CI on Python 3.10 and 3.11. The notebooks orchestrate; the library does the work. Anything that depends on the test fold raises if you call it before the very last cell.
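One way such a guard could look is sketched below. `HeldOutGuard` and `HeldOutLocked` are hypothetical names, not the helpers under src/; the sketch only illustrates the idea of a test fold that raises until it is explicitly unlocked and can be read exactly once.

```python
class HeldOutLocked(RuntimeError):
    """Raised when the test fold is requested too early or a second time."""


class HeldOutGuard:
    """Holds the test fold and refuses access until explicitly unlocked."""

    def __init__(self, X_test, y_test):
        self._data = (X_test, y_test)
        self._unlocked = False
        self._consumed = False

    def unlock(self) -> None:
        # Intended to be called once, in the final notebook cell.
        self._unlocked = True

    def get(self):
        if not self._unlocked:
            raise HeldOutLocked("test fold is locked until the final evaluation")
        if self._consumed:
            raise HeldOutLocked("test fold has already been used")
        self._consumed = True
        return self._data


guard = HeldOutGuard(X_test=[[0.2], [0.9]], y_test=[0, 1])
guard.unlock()
X_test, y_test = guard.get()
```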
The default predict() uses a 0.5 cutoff. On a 9.7% churn panel the predicted probability rarely crosses it, so roughly 95% of churners are missed. Each missed churner costs £50,000 in lost CLV. With a £500 contact cost on the other side, the cost-optimal threshold collapses to the bottom of the 0.01–0.99 search grid and the policy contacts almost every customer in the book.
The increments are stacked in the order a careful practitioner would try them: a baseline first, then the standard imbalance fix, then the decision-theoretic step. Most of the cost reduction comes from the last step, not from the resampling fix the imbalance literature would point to.
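The collapse to the grid floor can be checked with a short per-customer break-even calculation. This is a sketch under two assumptions that are not stated in the brief: the predicted probabilities are calibrated, and the TP cell equals campaign cost minus retention rate times CLV. The function name `break_even_probability` is illustrative.

```python
def break_even_probability(clv: float, campaign_cost: float, retention_rate: float) -> float:
    """Churn probability above which contacting a customer lowers expected cost.

    Expected cost if contacted:     p * TP + (1 - p) * FP
    Expected cost if not contacted: p * FN
    Contacting wins when p > FP / (FP + FN - TP).
    """
    fp = campaign_cost
    fn = clv
    tp = campaign_cost - retention_rate * clv  # assumed derivation of the TP cell
    return fp / (fp + fn - tp)


p_star = break_even_probability(clv=50_000, campaign_cost=500, retention_rate=0.30)
print(round(p_star, 4))  # 0.0077
```

At the headline values the break-even probability is 500 / 65,000 ≈ 0.0077, which lies below the lowest grid point of 0.01, so a grid search has nowhere to stop but the floor.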
RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    min_samples_split=5,
    random_state=42,
)
# predict() -> threshold = 0.5
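from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # sampler-aware: SMOTE runs during fit only
Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(...)),
])
# probabilities only; threshold
# still selected downstream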
grid = np.linspace(0.01, 0.99, 99)  # 0.01, 0.02, ..., 0.99 inclusive
cost(t) = FN(t)·£50k
        + FP(t)·£500
        + TP(t)·(−£14.5k)
t* = argmin_t cost(t)
   = 0.01 (grid floor)
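The cost function above can be written as a runnable sweep. The sketch below uses synthetic probabilities as a stand-in for a validation fold (not PowerCo data), and the helper names `expected_cost` and `best_threshold` are illustrative. The grid is built with `np.linspace` so that both 0.01 and 0.99 are included.

```python
import numpy as np

FN_COST, FP_COST, TP_COST = 50_000.0, 500.0, -14_500.0  # the brief's cost matrix


def expected_cost(y_true: np.ndarray, proba: np.ndarray, t: float) -> float:
    """Total cost in pounds of contacting every customer with proba >= t."""
    contact = proba >= t
    churn = y_true == 1
    tp = np.sum(contact & churn)
    fp = np.sum(contact & ~churn)
    fn = np.sum(~contact & churn)
    return float(fn * FN_COST + fp * FP_COST + tp * TP_COST)


def best_threshold(y_true, proba, grid):
    """Grid point with the lowest total cost, plus the full cost curve."""
    costs = np.array([expected_cost(y_true, proba, t) for t in grid])
    return float(grid[np.argmin(costs)]), costs


# Synthetic stand-in for validation-fold labels and probabilities.
rng = np.random.default_rng(42)
y_val = (rng.random(3_000) < 0.097).astype(int)
proba_val = np.clip(rng.beta(1.2, 9.0, 3_000) + 0.10 * y_val, 0.0, 1.0)

grid = np.linspace(0.01, 0.99, 99)
t_star, costs = best_threshold(y_val, proba_val, grid)
```

In a setup like the one described here, `t_star` would be chosen on the validation fold and then applied unchanged to the test fold for the single headline evaluation.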
The integrated finding: SMOTE contributes a real but modest cost reduction (~£1.2M on the test fold). The threshold step contributes the rest. The story is decision-theoretic, not resampling. A Random Survival Forest run in parallel reaches a held-out concordance of ~0.71 on time-to-churn, well above the ~0.56 of the linear Cox model it replaced; it is reported as a complement to the classifier, not the lead.
The whole report is a function of CLV, campaign cost and retention rate. Drag the sliders to test whether the cost-optimal threshold survives a different set of assumptions. The cost curve, the cost-optimal threshold, the operational profile, and the business envelope all recompute live.
model: SMOTE + Random Forest predicted probabilities on the held-out test fold · grid: thresholds 0.01..0.99, step 0.01 · readouts: recomputed live from the confusion-matrix counts at t*
The headline scenario fixes CLV at £50k, campaign cost at £500, and retention success at 30%. None of those is observed; all of them are operating assumptions from the original task brief.
The static charts on the right confirm two directional facts. First, the cost-optimal threshold falls with CLV: more expensive churners make the optimiser more aggressive about contact, so it slides toward the floor. Second, the cost-optimal threshold rises with campaign cost: only above roughly £1,500 per contact does universal contact stop being worthwhile. The retention-rate panel flips axes: it shows the net expected value at t*, crossing zero around 19%.
Every recommendation in this report can be re-priced by swapping the three constants, and the studio above is the live tool for doing that.
Each panel below is a one-dimensional sweep with the other two levers pinned at the headline values, computed off the actual test-fold confusion-matrix counts.
The shaded band on each threshold panel is t = 0.01 ± 0.005, the region the cost-optimum stays in for most of the sweep. The right panel flips axes to show net expected value at t* as the retention success rate is varied.
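A one-dimensional sweep of this kind could be sketched as follows. The confusion-matrix counts are computed once per threshold, and each lever value only re-weights them. The data is synthetic, the helper names are illustrative, and the TP cell is again assumed to be campaign cost minus retention rate times CLV.

```python
import numpy as np


def confusion_counts(y_true, proba, grid):
    """TP, FP and FN counts at every threshold (one entry per grid point)."""
    contact = proba[None, :] >= grid[:, None]
    churn = (y_true == 1)[None, :]
    tp = (contact & churn).sum(axis=1)
    fp = (contact & ~churn).sum(axis=1)
    fn = (~contact & churn).sum(axis=1)
    return tp, fp, fn


def optimal_threshold(counts, grid, clv, campaign_cost, retention_rate):
    """Cost-optimal grid point for one setting of the three levers."""
    tp, fp, fn = counts
    tp_cost = campaign_cost - retention_rate * clv  # assumed TP-cell derivation
    cost = fn * clv + fp * campaign_cost + tp * tp_cost
    return float(grid[np.argmin(cost)])


# Synthetic stand-in for test-fold labels and probabilities.
rng = np.random.default_rng(0)
y = (rng.random(2_000) < 0.10).astype(int)
proba = np.clip(rng.beta(1.2, 9.0, 2_000) + 0.10 * y, 0.0, 1.0)
grid = np.linspace(0.01, 0.99, 99)
counts = confusion_counts(y, proba, grid)

# Sweep CLV with campaign cost and retention rate pinned at the headline values.
clv_sweep = {
    clv: optimal_threshold(counts, grid, clv, campaign_cost=500.0, retention_rate=0.30)
    for clv in (5_000, 20_000, 50_000, 100_000)
}
```

The same loop with `clv` pinned and `campaign_cost` or `retention_rate` varied gives the other two sweeps.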
The methodological risks examined are resampling location, calibration after SMOTE, the survival model's held-out concordance, the boundary nature of the cost-optimal threshold, and the out-of-time split by activation date. Each is reported with the test that would expose it, not a hand-wave.
The cost-optimal threshold lives at the grid floor under the brief's cost matrix, but a real deployment might still ship a different operating point (contact-budget constraint, ops capacity, communications policy). The slider below scans the threshold across the grid and reports the resulting policy: confusion-matrix counts, recall, precision, contact rate, and expected cost at the brief's cost matrix.
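The readouts listed above can all be derived from the four confusion-matrix counts at a chosen threshold. The sketch below uses a tiny hand-checkable example rather than the real test fold; `policy_profile` is an illustrative name and the default costs are the brief's values.

```python
import numpy as np


def policy_profile(y_true, proba, t, fn_cost=50_000.0, fp_cost=500.0, tp_cost=-14_500.0):
    """Counts and headline rates for the policy 'contact if proba >= t'."""
    contact = proba >= t
    churn = y_true == 1
    tp = int(np.sum(contact & churn))
    fp = int(np.sum(contact & ~churn))
    fn = int(np.sum(~contact & churn))
    tn = int(np.sum(~contact & ~churn))
    contacted = tp + fp
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        "precision": tp / contacted if contacted else float("nan"),
        "contact_rate": contacted / len(y_true),
        "expected_cost": fn * fn_cost + fp * fp_cost + tp * tp_cost,
    }


# Two churners and three non-churners.
y = np.array([1, 1, 0, 0, 0])
p = np.array([0.60, 0.04, 0.30, 0.02, 0.005])
floor = policy_profile(y, p, t=0.01)    # contacts 4 of 5, expected cost -28,000
default = policy_profile(y, p, t=0.50)  # contacts 1 of 5, expected cost +35,500
```

Even in this toy case the pattern matches the report: the low threshold buys full recall at the price of precision and a high contact rate, and the cost matrix rewards it.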
cost matrix: brief values (CLV £50k · campaign £500 · retention 30%) · data: SMOTE + Random Forest predicted probabilities on the held-out test fold
On the held-out test fold of 2,922 customers, the cost-tuned policy cuts expected spend by roughly £15.9M vs. the default 0.5 threshold, and tips the headline number negative (a net benefit) under the brief's cost matrix.
To test whether that survives temporal drift, the data is re-split by contract activation date: the model trains on the earliest ~80% of activations and is scored on the most recent ~20%. The cost-optimal threshold stays on the grid floor, so the policy is robust. The model's discrimination is not: out-of-time AUC falls from ~0.67 to ~0.62. The saving stays large on the later cohort only because it churns more (~14% vs. ~10%), not because the model improved.
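A split of this kind can be sketched in a few lines. The function name `out_of_time_split` and the example dates are illustrative; the only assumption is that each customer carries one activation date.

```python
import numpy as np


def out_of_time_split(activation_dates, train_fraction=0.8):
    """Row indices: train on the earliest activations, score on the latest."""
    order = np.argsort(activation_dates, kind="stable")
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]


dates = np.array(
    ["2012-03-01", "2009-07-15", "2013-11-30", "2010-01-05", "2011-06-20"],
    dtype="datetime64[D]",
)
train_idx, test_idx = out_of_time_split(dates)
```

Unlike a random split, no training row here activates later than any scoring row, which is what lets the check say something about drift across cohorts.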
The supported recommendations: ship the cost-tuned threshold as the operating point, treat the £15.9M as an upper bound on the single-snapshot prize, and retrain on a rolling window because discrimination visibly decays on newer cohorts.
What I would do next: a genuine forward-label out-of-time window, uplift modelling on a pilot subset, and a re-priced cost matrix once retention success is observed rather than assumed.
Permutation importance on the test fold against the frozen pipeline ranks the anonymised channel_sales and origin_up codes first, by a wide margin. The strongest predictor of churn in this dataset is acquisition channel, not price. Which channel a customer was acquired through predicts whether they leave more than any contract feature on the file.
Below the channel block sits an economic tier: net_margin, log(cons_12m), var_year_price_off_peak. SMEs that look financially stretched on the existing contract churn more, and off-peak price variance is a supporting signal. The Random Survival Forest agrees: it ranks the electricity-margin features at the top of its own permutation importance.
This is a useful but uncomfortable finding. The brief expects a price story; the data tells a channel story. Hover over any bar for the detailed reading.
The forest plot below shows permutation importance on the held-out test fold with 95% bootstrap intervals, sorted by mean. Rows in the accent colour are the anonymised acquisition-channel codes that dominate the top of the list; the remaining rows are the economic / contract features the original brief assumed would lead.
Permutation importance is model-agnostic and avoids the bias that impurity-based tree importances show toward high-cardinality features, which are typically the numeric ones. The exact code strings are anonymised in the dataset; the labels here are truncated to fit.
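The mechanics of permutation importance fit in a short hand-rolled sketch: shuffle one column at a time and record how much the score of a frozen model drops. This is not the repository's implementation (scikit-learn ships its own in `sklearn.inspection.permutation_importance`), the toy scoring function is invented for illustration, and the interval shown is a simple percentile over repeats, not the bootstrap interval used in the plot.

```python
import numpy as np


def permutation_importance(score_fn, X, y, n_repeats=20, seed=0):
    """Score drop when each column is shuffled; shape (n_features, n_repeats)."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops[j, r] = baseline - score_fn(X_perm, y)
    return drops


def accuracy(X, y):
    """Toy frozen model: predicts from column 0 only; column 1 is ignored."""
    return float(np.mean((X[:, 0] > 0.5).astype(int) == y))


rng = np.random.default_rng(1)
X = rng.random((500, 2))
y = (X[:, 0] > 0.5).astype(int)

drops = permutation_importance(accuracy, X, y)
mean_importance = drops.mean(axis=1)
ci_low, ci_high = np.percentile(drops, [2.5, 97.5], axis=1)
```

Because the toy model never reads column 1, its importance is exactly zero, while shuffling column 0 costs roughly half the accuracy.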
The dataset is a single 2015 cross-section of PowerCo SMEs. The £15.9M cost reduction is a one-shot characterisation of the rule on this fold, not a forecast and not annualised. The out-of-time check splits on activation date, but every churn label is still observed at the same calendar moment, so it is a pseudo out-of-time test, not a substitute for a genuine forward-label window.
Under the brief's cost matrix the cost-optimal threshold lives on the grid floor and the policy contacts essentially the whole customer base. That is a property of the matrix (FN 100× more expensive than FP), not of the model. Before any deployment the matrix needs to be re-priced against observed contact and retention costs.
CLV £50k, campaign cost £500, retention rate 30% are operating assumptions from the original task brief. The sensitivity sweep and the Cost-Matrix Studio show directionally how the threshold responds; none of those three constants is observed in the data.
The dataset has no service-interaction history, no competitor pricing, and no contract-change events. Price sensitivity in this report is bounded to absolute price levels visible in the 2015 snapshot; the behavioural side of churn is, by construction, out of scope.
The Random Survival Forest reaches ~0.71 concordance on a held-out fold, a real improvement on the ~0.56 of the linear Cox model it replaced. Its training concordance is higher still, so it carries some overfitting and is reported as a complement to the classifier, not a standalone retention model.
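For readers unfamiliar with the metric, concordance is the share of comparable customer pairs in which the model assigns the higher risk to the one who churns first. The sketch below is a simplified version of Harrell's C that ignores tied event times; it is illustrative only and is not how the reported ~0.71 was computed.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C for right-censored data (tied times are ignored)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable


times = [2, 5, 7, 9]           # time to churn or to censoring
events = [1, 1, 0, 1]          # 1 = churn observed, 0 = censored
risk = [0.9, 0.4, 0.6, 0.1]    # higher means expected to churn sooner
c = concordance_index(times, events, risk)  # 4 of 5 comparable pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which ~0.56 and ~0.71 should be read.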
The repository contains the three modelling notebooks, the pure feature and evaluation helpers under src/, the isotonic-calibration module, the cost-sensitivity sweeps, the Random Survival Forest, the out-of-time validation, and a 48-test pytest suite running in CI against Python 3.10 and 3.11.