project · JPMC Quantitative Research
domain · credit risk · scorecards · reinforcement learning
stack · scikit-learn · SHAP · dynamic programming · tabular Q
author · Maksim Silchenko

A case study in evaluation protocol

What survives out of sample?

Three modelling exercises against synthetic JPMC datasets. Each one carries a flattering in-sample headline and a held-out number that tells a colder truth. The contribution is the evaluation protocol, not the model.

$363K

Improvement on the held-out test fold from moving the cutoff from t = 0.5 to t* = 0.25

7

FICO buckets, DP-optimal, monotonic default rate from 66% to 3%

1.79x → 0.09x

Q-learning edge over seasonal swing, in-sample to held-out window

72

pytest tests in CI, covering every module under src/

01 · Three problems, one thread

Each project is a small test of one idea: the held-out fold is the only fold that gets to speak.

The Forage program supplies the data and the prompts. The contribution here is the evaluation protocol around each model, written so that the headline number on every report is one that a future replication could actually reproduce.

i.

Probability of default, with an honest threshold

Credit risk

Logistic regression on a restricted feature set, then a 60 / 20 / 20 split so threshold selection never touches the fold that reports profit. Bootstrap CIs and a cost-sensitivity sweep size the result.

headline · 0.7827 AUC

scope · 95% CI [0.7575, 0.8062]

ii.

Optimal quantisation by log-likelihood

FICO bucketing

Greedy vs dynamic programming on the same objective. DP wins by 13.14 log-likelihood at k = 7 and respects monotonicity. Bootstrap CIs on each chosen boundary reveal which cuts are stable.

headline · IV 0.77

scope · WoE monotone, 7 buckets

iii.

A tabular Q that memorises the training year

Gas storage RL

Seasonal price model plus a discrete-action MDP for inject / withdraw / hold. A chronological 24 / 24 split exposes the formulation's failure mode: the agent's in-sample edge does not transfer.

headline · $1.50

scope · test-window profit, agent

02 · Credit risk · the threshold twist

The model is the easy part. The decision rule is where the money is.

With an asymmetric cost matrix (good loan rejected costs loan × margin, defaulted loan approved costs loan × LGD) the operating point dominates the choice of estimator. Picking the cutoff and reporting profit on the same data quietly turns a calibration check into an in-sample optimisation.

naive default 50% cutoff

-$363K

test profit at t = 0.5 · on n = 2,000 held-out loans

The textbook half-way cutoff treats type I and type II errors symmetrically. On an asymmetric cost matrix that is the wrong default, by hundreds of thousands of dollars per 2,000-loan book.

protocol threshold picked on dev fold

$0

test profit at t* = 0.25 · 95% bootstrap CI [-$266K, +$251K]

The threshold is selected on a 20% dev fold. Profit is reported on a 20% test fold that no threshold optimiser touches. The improvement is real; the level around it is honest about its uncertainty.
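The protocol can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names, the toy data, and the threshold grid are hypothetical, the split here is a plain shuffle rather than a stratified one, and the accounting (a rejected good loan is booked as a loss of loan × margin) is an assumption read off the cost matrix described above.

```python
import random

MARGIN = 0.15  # base-regime margin quoted in the text
LGD = 0.90     # base-regime loss given default quoted in the text

def profit(pd_scores, defaulted, amounts, threshold, margin=MARGIN, lgd=LGD):
    """Book profit when every loan with predicted PD >= threshold is rejected."""
    total = 0.0
    for p, d, amt in zip(pd_scores, defaulted, amounts):
        approved = p < threshold
        if approved and d:
            total -= amt * lgd      # approved loan that defaults
        elif approved:
            total += amt * margin   # approved loan that repays
        elif not d:
            total -= amt * margin   # good loan turned away (assumed accounting)
    return total

def three_way_split(n, seed=0):
    """Shuffle indices once and cut them 60 / 20 / 20 into train, dev, test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a, b = int(0.6 * n), int(0.8 * n)
    return idx[:a], idx[a:b], idx[b:]

def pick_threshold(pd_scores, defaulted, amounts, grid):
    """Return the grid value with the highest profit on the fold passed in."""
    return max(grid, key=lambda t: profit(pd_scores, defaulted, amounts, t))

# Toy data: the scores stand in for a fitted model's predicted PDs.
rng = random.Random(1)
scores = [rng.random() for _ in range(1000)]
labels = [int(rng.random() < s) for s in scores]  # calibrated by construction
amounts = [10_000.0] * 1000

train_idx, dev_idx, test_idx = three_way_split(1000)

def take(xs, idx):
    return [xs[i] for i in idx]

grid = [i / 100 for i in range(5, 96)]

# The threshold sees only the dev fold; the test fold is scored exactly once.
t_star = pick_threshold(take(scores, dev_idx), take(labels, dev_idx),
                        take(amounts, dev_idx), grid)
test_profit = profit(take(scores, test_idx), take(labels, test_idx),
                     take(amounts, test_idx), t_star)
print(f"t* from dev fold: {t_star:.2f} | test-fold profit: {test_profit:,.0f}")
```

The design point is that `pick_threshold` never receives test-fold rows, so the reported profit cannot be inflated by the search over thresholds.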

How the threshold sweep looks.

Profit on the test fold against threshold is sharply asymmetric. To the left of t* the false-rejection cost dominates (good loans turned away). To the right, default losses accumulate fast because LGD is 0.90 of the principal.

The flat 50% cutoff sits squarely in the part of the curve where the test fold realises a $363,000 loss. Moving to t* = 0.25 does not produce a profit; it produces a break-even book. The improvement on the held-out fold is the gap between the two operating points.

Per loan improvement $181.50 · rejection rate at t* 24.3% · default rate on approved book 11.6%

The orange line is the threshold selected on the dev fold. The grey line is the textbook 50% default cutoff. Both are evaluated on the same held-out test fold so the comparison cannot be biased by sample choice.

The vertical arrow on the right measures the $363,000 gap, in dollars on the test fold, between the two operating points.

at t = 0.25, on the held-out test fold

test-fold profit on 2,000 loans

Rejection rate · 24.3%

False rejections · 290

Approved volume · $15,150,000

Default rate (approved) · 11.6%


"The model achieves the same AUC at every threshold. The operating point is what makes it a portfolio decision instead of an academic one."

03 · Sensitivity to the cost regime

Three plausible cost regimes. Three different operating points. Three different signs.

The base scenario assumes a 15% margin and a 90% loss given default. These are stand-in numbers, not estimates of a specific portfolio. A sensitivity sweep across one tighter and one looser scenario brackets the structural answer.

Under conservative costs the optimal rule still loses money on this synthetic book. Under aggressive costs it pays nearly $800K. The protocol is the same; only the cost vector changes.

The point of the sweep is not to pick a winner. It is to show how much of the headline number is the model and how much is the cost vector. On this data, most of it is the cost vector.
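A quick sanity check shows why the operating point moves with the cost vector. Assuming the model is calibrated and that a rejected good loan is booked as a loss of loan × margin (an assumption about the accounting, not something the page states outright), approving a loan with default probability p is worth (1 - p)·margin - p·LGD per dollar, while rejecting it costs (1 - p)·margin. Approval wins when p < 2·margin / (2·margin + LGD). The sketch below evaluates that break-even point; the two non-base regimes use hypothetical numbers, since the page gives values only for the base scenario.

```python
def break_even_pd(margin, lgd):
    """PD below which approving beats rejecting, under the assumptions above."""
    return 2 * margin / (2 * margin + lgd)

# Only the base regime's numbers come from the text; the others are illustrative.
REGIMES = {
    "base": (0.15, 0.90),
    "tighter (hypothetical)": (0.10, 0.95),
    "looser (hypothetical)": (0.20, 0.80),
}

for name, (margin, lgd) in REGIMES.items():
    print(f"{name:>24}: margin={margin:.2f} LGD={lgd:.2f} "
          f"break-even PD={break_even_pd(margin, lgd):.3f}")
```

For the base regime the formula gives 0.25, which lines up with the dev-selected t*. The analytic value is only a cross-check; the protocol still selects the threshold empirically on the dev fold.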

Each row is one cost regime. The bar reports the test-fold profit at t* chosen for that regime on the dev fold. Whiskers are 95% bootstrap CIs.

The conservative CI sits entirely below zero. The base CI straddles zero. The aggressive CI sits entirely above. This is the signal a portfolio owner actually needs from a credit model.

x-axis test-fold profit at t* chosen on dev fold · whiskers 95% bootstrap CI · finding sign of the result depends on the cost regime

Base · t* = 0.25 · margin 0.15 · LGD 0.90

$0 · 95% CI [-$262,650, +$252,000]

Break-even on the held-out test fold; CI straddles zero.

04 · Calibration and explanation

The raw logit is already well calibrated. The model knows what it does not know.

Expected calibration error on the test fold is 2.16%. Brier is 0.126. The reliability curve tracks the diagonal with slight underconfidence in the 0.3 to 0.5 region where the decision boundary actually sits.

Post-hoc isotonic regression with CV=5 does not improve ECE on this data; it raises it to 2.82% by absorbing CV-fold variance. The negative result is worth keeping: not every well-known fix is appropriate on every dataset.
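For reference, both metrics are short to compute. The sketch below uses equal-width bins; the bin count of 10 is an assumption, since the page does not say how ECE was binned, and the function names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean |observed default rate - mean predicted PD| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(observed - confidence)
    return ece

# Tiny worked example: each bin is off by 0.1, so ECE is 0.1 and Brier is 0.01.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
print(expected_calibration_error(probs, labels), brier_score(probs, labels))
```

Because ECE depends on the binning, comparing raw and isotonic-calibrated models is only meaningful when both are scored with the same bins on the same fold.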

ECE raw LR0.0216

ECE isotonic CV=50.0282

Brier raw LR0.1259

Brier isotonic CV=50.1262

LinearExplainer on the LR assigns FICO score the largest mean absolute SHAP value, with loan amount second. Both effects point the right way: a higher FICO score lowers predicted risk, and a larger outstanding loan raises it.
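For a linear model the attribution has a closed form when features are treated as independent: in log-odds space, the SHAP value of feature j for one row is w_j · (x_j - mean_j). The sketch below applies that formula directly instead of calling the SHAP library. The coefficients are hypothetical (only their signs for FICO and loan amount follow the text) and the features are assumed to be standardized.

```python
FEATURES = ["income", "years_employed", "fico", "loan_amt"]
WEIGHTS = [-0.2, -0.3, -1.1, 0.6]  # hypothetical log-odds coefficients

def shap_values(weights, row, means):
    """Per-feature contribution to the log-odds, relative to the background mean."""
    return [w * (x - m) for w, x, m in zip(weights, row, means)]

def mean_abs_shap(weights, rows):
    """Global importance: average |SHAP| per feature over a set of rows."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(weights))]
    totals = [0.0] * len(weights)
    for r in rows:
        for j, phi in enumerate(shap_values(weights, r, means)):
            totals[j] += abs(phi)
    return [t / n for t in totals]

rows = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]  # toy standardized rows
importance = mean_abs_shap(WEIGHTS, rows)
for name, value in sorted(zip(FEATURES, importance), key=lambda kv: -kv[1]):
    print(f"{name:>15}: {value:.2f}")
```

The contributions for a row sum to the gap between that row's log-odds and the log-odds at the background mean, which is the additivity property that makes the ranking interpretable.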

reliability curve · test fold

test fold n = 2,000 · ECE 0.0216 (raw LR) · Brier 0.1259 · slight underconfidence in 0.3 to 0.5

05 · Cohort generalisation

The operating point that wins on long-tenure customers is the wrong policy on short-tenure customers.

The Forage data has no time column, so a real out-of-time test is not possible. The next-best stress test is a monotone proxy: train on the long-tenure cohort (years_employed ≥ 3), evaluate on the short-tenure cohort (years_employed < 3).

The short-tenure cohort's base default rate is about 2.4x that of the long-tenure cohort (39.1% vs 16.5%). The model trained on long-tenure customers cannot absorb that shift through the same threshold: per-loan profit drops by $876, and the operating point degenerates to a 90% rejection rate, with 458 false rejections among the 796 rejected loans.

The general lesson: a model is not a thing, it is a thing plus the cohort it was fit on. Out-of-cohort deployment is its own modelling problem.

Train cohort · years_employed ≥ 3

Train n · 9,116

Train base default · 16.5%

Train t* · 0.21

Train profit · +$1,393,500

Test cohort · years_employed < 3

Test n · 884

Test base default · 39.1%

Test rejection rate · 90.0%

Test profit · -$639,000

y per-loan profit · protocol train on long-tenure cohort, evaluate on short-tenure cohort · the operating point does not transfer

06 · FICO score quantisation

Greedy lands fast and near the answer. Dynamic programming lands on the answer.

For a downstream scorecard the FICO range needs to be discretised into k buckets that maximise a likelihood derived from the empirical default rate in each bin. Greedy starts from quantile boundaries and improves locally. DP precomputes every interval likelihood and returns the exact solution in O(n² k).
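A minimal sketch of the dynamic program, under stated assumptions: the input is a pair of count arrays over the sorted distinct scores, the bucket likelihood is the Bernoulli log-likelihood at the bucket's empirical default rate, and interval likelihoods are read off prefix sums rather than stored in a table. The function names are illustrative and this is not the repository's bucketer.

```python
import math

def interval_ll(n, d):
    """Bernoulli log-likelihood of a bucket with n loans and d defaults at its MLE rate."""
    ll = 0.0
    if d > 0:
        ll += d * math.log(d / n)
    if n - d > 0:
        ll += (n - d) * math.log((n - d) / n)
    return ll

def dp_buckets(totals, defaults, k):
    """totals[i], defaults[i]: counts at the i-th distinct score, ascending; needs k <= len(totals).
    Returns (best total log-likelihood, cut indices) for k contiguous buckets."""
    m = len(totals)
    cum_n, cum_d = [0], [0]
    for n, d in zip(totals, defaults):
        cum_n.append(cum_n[-1] + n)
        cum_d.append(cum_d[-1] + d)

    def ll(i, j):  # bucket covering distinct scores i .. j-1
        return interval_ll(cum_n[j] - cum_n[i], cum_d[j] - cum_d[i])

    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for b in range(1, k + 1):            # number of buckets used so far
        for j in range(b, m + 1):        # first j scores are covered
            for i in range(b - 1, j):    # last bucket starts at score i
                if best[b - 1][i] == neg_inf:
                    continue
                cand = best[b - 1][i] + ll(i, j)
                if cand > best[b][j]:
                    best[b][j], back[b][j] = cand, i

    cuts, j = [], m
    for b in range(k, 0, -1):            # walk the back-pointers to recover bucket starts
        j = back[b][j]
        cuts.append(j)
    return best[k][m], sorted(cuts)[1:]  # drop the leading 0

# Toy example: two clearly different default regimes, so the single cut lands between them.
print(dp_buckets([10, 10, 10, 10], [9, 9, 1, 1], k=2))
```

The three nested loops are where the O(n² k) cost comes from. A greedy search evaluates the same objective but can stop at a local optimum, which is the kind of gap the DP closes.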

The optimal seven buckets.

The DP-optimal seven-bucket configuration is monotonic in default rate, from 66% in the lowest bucket down to 3% in the highest. Weight of evidence runs from -2.1 to +2.0 with no inversions. Information value is 0.77, comfortably above the 0.3 threshold for strong predictive power.
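Weight of evidence and information value follow directly from the per-bucket counts. The sketch below uses the convention WoE = ln(share of goods / share of bads), which matches the sign pattern described above (negative in the riskiest bucket, positive in the safest). The counts are toy numbers, not the project's buckets, and every bucket is assumed to contain at least one good and one bad loan.

```python
import math

def woe_iv(goods, bads):
    """Per-bucket weight of evidence and the total information value."""
    total_good, total_bad = sum(goods), sum(bads)
    woe, iv = [], 0.0
    for g, b in zip(goods, bads):
        share_good, share_bad = g / total_good, b / total_bad
        w = math.log(share_good / share_bad)
        woe.append(w)
        iv += (share_good - share_bad) * w
    return woe, iv

def is_monotone(values):
    """True when the sequence never reverses direction (no WoE inversions)."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)

# Toy buckets ordered from lowest to highest score.
goods = [10, 40, 90]
bads = [50, 30, 20]
woe, iv = woe_iv(goods, bads)
print([round(w, 3) for w in woe], round(iv, 3), is_monotone(woe))
```

Running the monotonicity check on every candidate k is a cheap way to catch the kind of small-sample inversion that appears when the bucket count is pushed too high.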

Pushing to ten buckets produces a 1.65% / 37.50% inversion at bucket 8 to 9 driven by a small-sample anomaly. More granularity introduces instability without improving discrimination. The 7-bucket solution is the operationally usable one.

Bootstrap CIs (50 resamples) show edge buckets are stable; middle boundaries have substantial uncertainty (CI widths of 30 to 50 score points).

LL vs k · greedy vs DP

at k = 7 DP -4229.60 vs greedy -4242.74 · gap 13.14 · greedy lands in a local optimum

seven-bucket DP solution · default rate by bucket

y empirical default rate · IV 0.77 (strong) · monotonic 7 buckets pass · 10 buckets break at bucket 8 to 9

07 · Gas storage · the formulation's failure mode

The agent memorises the training year. The seasonal swing keeps the test year.

A tabular Q-learning agent with the raw time index inside its state can encode "do action a at month m" verbatim. On the 24-month training window that is enough to beat both heuristics. On the held-out window the policy is brittle and the seasonal-swing baseline wins comfortably.

monthly henry-hub-style series · seasonal model fit

model a + b t + c t² + d sin(2π t / 12) + e cos(2π t / 12) · data Nat_Gas.csv, 48 monthly observations · split 24 / 24 chronological
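The seasonal model is linear in its five coefficients, so it can be fit by ordinary least squares. The sketch below is one way to do that, with the usual caveats: it solves the normal equations in pure Python to stay dependency-free (a NumPy least-squares call would be the more typical choice), it runs on a synthetic series rather than Nat_Gas.csv, and the function names are hypothetical.

```python
import math

def design_row(t):
    """Regressors for a + b t + c t^2 + d sin(2 pi t / 12) + e cos(2 pi t / 12)."""
    w = 2 * math.pi * t / 12
    return [1.0, t, t * t, math.sin(w), math.cos(w)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_seasonal(ts, prices):
    """Least-squares coefficients [a, b, c, d, e] via the normal equations."""
    X = [design_row(t) for t in ts]
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, prices)) for i in range(k)]
    return solve(XtX, Xty)

def predict(coef, t):
    return sum(c * x for c, x in zip(coef, design_row(t)))

# Synthetic 48-month series; fit on the first 24 months only, then score the rest.
true_coef = [10.0, 0.02, 0.0005, 0.5, 0.3]
series = [predict(true_coef, t) for t in range(48)]
coef = fit_seasonal(range(24), series[:24])
holdout_error = max(abs(predict(coef, t) - series[t]) for t in range(24, 48))
print([round(c, 4) for c in coef], holdout_error)
```

Fitting on the first 24 months and scoring on the last 24 mirrors the chronological split: the coefficients never see the held-out window.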

What the held-out window measures.

On the training window all three strategies are valued at the same prices and the same storage cost. Q-learning produces $27.10 per ten-unit position, versus $15.10 for the seasonal swing and a loss for buy-and-hold (the modest price drift does not cover carry on a passive long).

Out of sample the order reverses: the seasonal swing produces $16.00, and the trained agent collapses to $1.50. The seasonal pattern is robust across years; the agent's memorised time-action map is not.

This is the expected failure mode of the formulation, and reporting it is the result. Reasonable next steps include dropping the raw time index from the state, sliding-window evaluation, and a deep Q-network for continuous inventory.

The same numbers, laid out side by side. The lighter bar is the in-sample year; the darker bar is the held-out year. All values are per ten-unit position and include the shared $0.05 per-unit per-month carrying cost.

Illegal actions (inject at full, withdraw at empty) become no-op holds and are flagged in the info dict. The cost convention is shared between the env and the baselines so the comparison is on equal footing.
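The no-op convention is easy to show in a stripped-down environment. Everything below is a sketch under assumptions: the class name, the one-unit action size, the reward timing (cash flow at the current price, carrying cost on end-of-month inventory), and the default capacity are illustrative, and only the $0.05 storage cost and the flagged no-op behaviour come from the text.

```python
HOLD, INJECT, WITHDRAW = 0, 1, 2
STORAGE_COST = 0.05  # $ per unit per month, the shared constant quoted in the text

class StorageEnv:
    """Minimal monthly storage environment with state (month index, inventory)."""

    def __init__(self, prices, capacity=10):
        self.prices = list(prices)
        self.capacity = capacity
        self.reset()

    def reset(self):
        self.t = 0
        self.inventory = 0
        return (self.t, self.inventory)

    def step(self, action):
        price = self.prices[self.t]
        illegal = (action == INJECT and self.inventory >= self.capacity) or (
            action == WITHDRAW and self.inventory <= 0
        )
        if illegal:
            action = HOLD  # illegal moves degrade to a hold instead of raising
        reward = 0.0
        if action == INJECT:
            self.inventory += 1
            reward -= price  # buy one unit at the current price
        elif action == WITHDRAW:
            self.inventory -= 1
            reward += price  # sell one unit at the current price
        reward -= STORAGE_COST * self.inventory  # carry on what is still held
        self.t += 1
        done = self.t >= len(self.prices)
        return (self.t, self.inventory), reward, done, {"illegal": illegal}

env = StorageEnv([2.0, 3.0, 4.0], capacity=1)
print(env.step(WITHDRAW))  # withdraw at empty: flagged, treated as a hold
print(env.step(INJECT))    # legal: pays the price plus one month of carry
print(env.step(INJECT))    # inject at full: flagged, only carry is charged
```

Keeping `STORAGE_COST` as a single module-level constant is what lets the heuristic baselines import the same number the environment uses.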

per-10-unit profit · train and held-out windows

cutover month 24 of 48 · train Oct 2020 to Sep 2022 (24 mo) · test Oct 2022 to Sep 2024 (24 mo) · each RL number is Q-learning trained AND evaluated on its own window

08 · Diagnostics

Every headline number gets a confidence interval next to it.

On a 2,000-row test fold, point estimates carry sampling noise that is large compared to the differences between competing models. Reporting bootstrap CIs alongside every point estimate makes the supported model comparisons obvious and the unsupported ones hard to defend.
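A percentile bootstrap for AUC fits in a short sketch. The rank-based AUC and the percentile interval are standard; the function names, the toy data, and the choice of a percentile (rather than, say, a BCa) interval are assumptions, since the page does not say which bootstrap variant was used.

```python
import random

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(scores, labels, stat=auc, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI from resampling rows with replacement; needs both classes in the data."""
    rng = random.Random(seed)
    n = len(scores)
    draws = []
    while len(draws) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            draws.append(stat([scores[i] for i in idx], ys))
    draws.sort()
    lo = draws[int(alpha / 2 * n_resamples)]
    hi = draws[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy test fold: outcomes drawn so that higher scores default more often.
rng = random.Random(7)
scores = [rng.random() for _ in range(200)]
labels = [int(rng.random() < s) for s in scores]
point = auc(scores, labels)
lo, hi = bootstrap_ci(scores, labels, n_resamples=500)
print(f"AUC {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Passing a different `stat` (for example a profit function at a fixed threshold) reuses the same resampling loop, which keeps every headline number on the same footing.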

bootstrap 95% CIs · 2,000 resamples on the test fold

resamples 2,000 · scope test fold, n = 2,000 · each row has its own zoom centered on the point estimate

Model comparison

LR wins

Logistic regression vs random forest vs XGBoost on the restricted feature set, 5-fold CV. A paired t-test on the per-fold LR vs XGBoost AUC differences gives p = 0.0003. The simpler model wins in this regime; the tree-based models do not improve on it.
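The paired test works on per-fold differences, so both models must be scored on the same folds. The sketch below computes the t statistic and degrees of freedom only; the per-fold AUCs are hypothetical placeholders (the page reports means and spreads, not fold-level values), and the p-value would come from a Student t distribution with n - 1 degrees of freedom, for example via SciPy.

```python
import math
import statistics

def paired_t(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical per-fold AUCs, for illustration only.
lr_auc = [0.78, 0.79, 0.78, 0.77, 0.78]
xgb_auc = [0.73, 0.73, 0.73, 0.73, 0.73]
t_stat, dof = paired_t(lr_auc, xgb_auc)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

With only five folds the test has four degrees of freedom, so a small p-value reflects differences that are consistent across folds rather than a large sample.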

Logistic regression · 0.783 +/- 0.013 AUC

Random forest · 0.729 +/- 0.010 AUC

XGBoost · 0.730 +/- 0.011 AUC

LR vs XGB paired t · p = 0.0003

Data sanity

flagged · adjusted

The full Forage feature set is trivially separable: simple LR hits 1.0000 AUC and credit_lines_outstanding correlates 0.86 with default. Real credit models live in the 0.65 to 0.75 AUC range. The notebook documents the diagnosis and restricts the feature set so the modelling regime is realistic, not flattering.

Full-feature AUC · 1.0000 (too easy)

credit_lines_outstanding ρ · 0.86 with default

Restricted feature set · income · years_employed · FICO · loan_amt

Restricted-feature AUC · 0.783

Boundary stability

edges stable

Bootstrap 95% CIs on the six boundaries that separate the seven DP-optimal FICO buckets. The lowest and highest boundaries are tight; the middle three carry 30 to 50 score-point uncertainty. For production this argues for re-fitting boundaries on each scoring window, or for accepting wider middle buckets as a stability bias.

b₁ · 513 · [494, 535]

b₂ · 553 · [526, 580]

b₃ · 585 · [553, 611]

b₄ · 617 · [608, 644]

b₅ · 655 · [638, 694]

b₆ · 715 · [690, 740]

RL evaluation rigour

no leakage

Train and evaluate functions are separated. Train fits a Q-table with linear epsilon decay from 0.3 to 0.05; evaluate rolls the greedy policy on the held-out price series. Storage cost is a single shared constant across env and heuristics so no comparison can quietly use a different number on each side.
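The separation can be sketched as two functions that share one transition rule. This is an illustration, not the repository's agent: the learning rate, episode count, capacity, undiscounted return, and reward timing are assumptions, while the epsilon schedule (linear from 0.3 to 0.05), the seeded determinism, the (month, inventory) state, and the shared storage cost follow the text.

```python
import random
from collections import defaultdict

HOLD, INJECT, WITHDRAW = 0, 1, 2
ACTIONS = (HOLD, INJECT, WITHDRAW)
STORAGE_COST = 0.05  # shared with the baselines, value from the text
CAPACITY = 10        # hypothetical

def step(prices, t, inv, action):
    """One month of the storage MDP; illegal actions fall through to a hold."""
    if action == INJECT and inv < CAPACITY:
        inv, cash = inv + 1, -prices[t]
    elif action == WITHDRAW and inv > 0:
        inv, cash = inv - 1, prices[t]
    else:
        cash = 0.0
    return inv, cash - STORAGE_COST * inv

def greedy(Q, state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def train(prices, episodes=2000, alpha=0.1, gamma=1.0,
          eps_start=0.3, eps_end=0.05, seed=0):
    """Fit a Q-table on one price window; epsilon decays linearly across episodes."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for ep in range(episodes):
        eps = eps_start + (eps_end - eps_start) * ep / max(episodes - 1, 1)
        inv = 0
        for t in range(len(prices)):
            state = (t, inv)  # raw month index in the state, as in the text
            explore = rng.random() < eps
            action = rng.choice(ACTIONS) if explore else greedy(Q, state)
            inv, reward = step(prices, t, inv, action)
            if t + 1 == len(prices):
                future = 0.0  # terminal month
            else:
                future = max(Q[((t + 1, inv), a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
    return Q

def evaluate(Q, prices):
    """Roll the greedy policy once over a price window; no learning, no exploration."""
    inv, total = 0, 0.0
    for t in range(len(prices)):
        inv, reward = step(prices, t, inv, greedy(Q, (t, inv)))
        total += reward
    return total

train_prices = [1.0, 5.0]  # toy window: buy low, sell high
test_prices = [5.0, 1.0]   # held-out window with the pattern reversed
Q = train(train_prices)
print(evaluate(Q, train_prices), evaluate(Q, test_prices))
```

On the toy windows the policy learned on the training prices is replayed by month index on the reversed test prices, which is the memorisation failure in miniature: the table knows what to do in month 0, not why.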

Train window · Oct 2020 - Sep 2022

Test window · Oct 2022 - Sep 2024

Shared storage cost · $0.05 / unit / month

Test-window agent advantage vs swing · -$14.50 (collapsed)

09 · Honest limits

Four caveats that travel with every number on this page.

Synthetic data

All three datasets are synthetic, supplied by the Forage program. Headline dollar amounts inherit the data's limitations; they are not estimates of performance on a production portfolio.

No time column for credit

The loan dataset has no booking date. The model uses a stratified random split rather than out-of-time. The cohort generalisation test is a monotone proxy via years_employed, not a substitute for a real backtest.

Q-learning state contains time

The agent's state includes the raw month index, which lets it memorise the training trajectory. The held-out collapse is the expected failure mode of the formulation and is reported as the result, not patched out of the headline.

Out of scope

Real-time data integration, model monitoring, regulatory capital under Basel III, stress testing, fairness analysis, and production deployment are not addressed in this project.

10 · What this project demonstrates

Six things this case study is designed to evidence.

Evaluation protocol design

Three-way splits that quarantine the test fold from threshold selection. Chronological splits for time-series. Cohort holdouts as monotone-proxy backtests.

Decision-theoretic modelling

Asymmetric cost matrices, threshold sweeps, cost-regime sensitivity bands, profit-maximising operating points reported with held-out CIs.

Calibration and explanation

Reliability curves, ECE and Brier, isotonic post-hoc fitting, LinearExplainer SHAP for logistic models, plus the negative-result discipline to report when a fix does not help.

Combinatorial optimisation

Dynamic-programming bucketing for log-likelihood maximisation, greedy baselines for comparison, bootstrap CIs on each chosen boundary, monotonicity and information-value diagnostics.

Reinforcement learning fundamentals

Tabular Q with linear epsilon decay and seeded determinism. Separated train and evaluate functions. Shared cost conventions across the env and the heuristic baselines so comparisons are commensurable.

Engineering hygiene

72 pytest tests in CI covering every module under src/. Notebook re-execution as a smoke test. Single sources of truth for cost constants. Reproducibility taken seriously.

End of case study

Read the notebooks, the modules, and the test suite.

The repository contains three executable notebooks, the scikit-learn LR with three-way split logic, the dynamic-programming FICO bucketer, the gas-storage environment and tabular Q-learner, plus a 72-test pytest suite running in CI.

