Three modelling exercises against synthetic JPMC datasets. Each one carries a flattering in-sample headline and a held-out number that tells a colder truth. The contribution is the evaluation protocol, not the model.
The Forage program supplies the data and the prompts. The protocol around each model is written so that the headline number on every report is one a future replication could actually reproduce.
Logistic regression on a restricted feature set, then a 60 / 20 / 20 split so threshold selection never touches the fold that reports profit. Bootstrap CIs and a cost-sensitivity sweep size the result.
Greedy vs dynamic programming on the same objective. DP wins by 13.14 log-likelihood units at k = 7 and respects monotonicity. Bootstrap CIs on each chosen boundary reveal which cuts are stable.
Seasonal price model plus a discrete-action MDP for inject / withdraw / hold. A chronological 24 / 24 split exposes the formulation's failure mode: the agent's in-sample edge does not transfer.
With an asymmetric cost matrix (a rejected good loan costs loan × margin; an approved loan that defaults costs loan × LGD), the choice of operating point matters more than the choice of estimator. Picking the cutoff and reporting profit on the same data quietly turns a calibration check into an in-sample optimisation.
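The separation between selecting the cutoff and reporting profit can be sketched in a few lines. This is a minimal illustration, not the repository's code: `book_profit`, `select_threshold`, the toy folds and the 0.01 grid are all hypothetical, and profit is assumed to be margin earned on approved good loans minus LGD lost on approved defaults. The folds are assumed to come from the 60 / 20 / 20 split upstream.

```python
def book_profit(defaulted, p_default, loan, threshold, margin=0.15, lgd=0.90):
    """Profit of the approved book; a loan is approved when its predicted PD is below threshold."""
    total = 0.0
    for y, p, amount in zip(defaulted, p_default, loan):
        if p < threshold:  # approved
            total += -amount * lgd if y else amount * margin
    return total  # rejected loans contribute nothing


def select_threshold(fold, grid, **costs):
    """Grid threshold with the highest profit on the given fold (first one on ties)."""
    return max(grid, key=lambda t: book_profit(*fold, threshold=t, **costs))


# Toy folds as (defaulted, p_default, loan); hypothetical values, not the Forage data.
dev = ([0, 0, 1, 0, 1, 0], [0.05, 0.10, 0.30, 0.20, 0.60, 0.40], [1000.0] * 6)
test = ([0, 1, 0, 0, 1, 0], [0.08, 0.25, 0.15, 0.45, 0.70, 0.35], [1000.0] * 6)
grid = [i / 100 for i in range(1, 100)]

t_star = select_threshold(dev, grid)                      # chosen on the dev fold only
profit_at_t_star = book_profit(*test, threshold=t_star)   # reported on the test fold
profit_at_flat = book_profit(*test, threshold=0.50)       # textbook cutoff, same test fold
```

The test fold appears only in the last two lines, so neither reported number has been tuned against the data it is measured on.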
Profit on the test fold against threshold is sharply asymmetric. To the left of t* the false-rejection cost dominates (good loans turned away). To the right, default losses accumulate fast because LGD is 0.90 of the principal.
The flat 50% cutoff sits squarely in the part of the curve where the test fold realises a $363,000 loss. Moving to t* = 0.25 does not produce a profit; it produces a break-even book. The improvement on the held-out fold is the gap between the two operating points.
Per-loan improvement $181.50 · rejection rate at t* 24.3% · default rate on approved book 11.6%
The orange line is the threshold selected on the dev fold. The grey line is the textbook 50% default cutoff. Both are evaluated on the same held-out test fold so the comparison cannot be biased by sample choice.
The vertical arrow on the right marks the $363,000 test-fold gap between the two operating points.
The base scenario assumes a 15% margin and a 90% loss given default. These are stand-in numbers, not estimates of a specific portfolio. A sensitivity sweep across one tighter and one looser scenario brackets the structural answer.
Under conservative costs the optimal rule still loses money on this synthetic book. Under aggressive costs it pays nearly $800,000. The protocol is the same; only the cost vector changes.
The point of the sweep is not to pick a winner. It is to show how much of the headline number is the model and how much is the cost vector. On this data, most of it is the cost vector.
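A sweep of this shape might look like the sketch below. Only the base pair (15% margin, 90% LGD) comes from the text; the conservative and aggressive pairs, the toy rows and the helper name `book_profit` are hypothetical placeholders. For each regime the threshold is re-selected on the dev fold and profit is then read off the test fold.

```python
def book_profit(rows, threshold, margin, lgd):
    """Profit of the approved book for rows of (defaulted, p_default, loan)."""
    return sum(
        (-loan * lgd if defaulted else loan * margin)
        for defaulted, p, loan in rows
        if p < threshold
    )


# (margin, LGD) per regime. Only "base" is from the text; the others are hypothetical.
REGIMES = {
    "conservative": (0.10, 1.00),
    "base": (0.15, 0.90),
    "aggressive": (0.20, 0.80),
}

# Toy folds; hypothetical values, not the Forage data.
dev = [(0, 0.05, 1000.0), (0, 0.10, 1000.0), (1, 0.30, 1000.0),
       (0, 0.20, 1000.0), (1, 0.60, 1000.0), (0, 0.40, 1000.0)]
test = [(0, 0.08, 1000.0), (1, 0.25, 1000.0), (0, 0.15, 1000.0),
        (0, 0.45, 1000.0), (1, 0.70, 1000.0), (0, 0.35, 1000.0)]
grid = [i / 100 for i in range(1, 100)]

sweep = {}
for name, (margin, lgd) in REGIMES.items():
    # Same protocol in every regime: pick t* on dev, report on test.
    t_star = max(grid, key=lambda t: book_profit(dev, t, margin, lgd))
    sweep[name] = (t_star, book_profit(test, t_star, margin, lgd))
```

The model scores never change inside the loop, so any movement in the reported profit is attributable to the cost vector alone.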
Each row is one cost regime. The bar reports the test-fold profit at t* chosen for that regime on the dev fold. Whiskers are 95% bootstrap CIs.
The conservative CI sits entirely below zero. The base CI straddles zero. The aggressive CI sits entirely above. This is the signal a portfolio owner actually needs from a credit model.
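One plausible way to produce whiskers like these is a percentile bootstrap over test-fold loans with t* held fixed. The sketch below assumes that design; the function names, the resample count and the toy per-loan profits are hypothetical, and the repository may resample differently.

```python
import random


def bootstrap_ci(per_loan_profit, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for total profit; loans are resampled with replacement."""
    rng = random.Random(seed)
    n = len(per_loan_profit)
    totals = sorted(
        sum(rng.choices(per_loan_profit, k=n)) for _ in range(n_resamples)
    )
    lo = totals[int(alpha / 2 * n_resamples)]
    hi = totals[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def verdict(lo, hi):
    """Classify an interval by where it sits relative to zero profit."""
    if hi < 0:
        return "below zero"
    if lo > 0:
        return "above zero"
    return "straddles zero"


# Toy per-loan profits at a fixed threshold; hypothetical values.
per_loan = [150.0] * 8 + [-900.0]
lo, hi = bootstrap_ci(per_loan)
label = verdict(lo, hi)
```

Holding t* fixed inside the resampling loop keeps the interval a statement about test-fold sampling noise only, not about threshold selection.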
Expected calibration error on the test fold is 2.16%. Brier is 0.126. The reliability curve tracks the diagonal with slight underconfidence in the 0.3 to 0.5 region where the decision boundary actually sits.
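For reference, both metrics can be computed as follows. The sketch assumes equal-width probability bins and ten of them; the text does not state the binning used, so the exact ECE figure depends on that choice.

```python
def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted mean gap between observed default rate and mean predicted PD per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            predicted = sum(p for _, p in members) / len(members)
            ece += len(members) / n * abs(observed - predicted)
    return ece


def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)
```

A bin whose observed rate exceeds its mean prediction plots above the diagonal on the reliability curve.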
Post-hoc isotonic regression with 5-fold cross-validation does not improve ECE on this data; it raises it to 2.82% by absorbing CV-fold variance. The negative result is worth keeping: not every well-known fix is appropriate on every dataset.
LinearExplainer on the LR puts FICO score at the largest mean absolute SHAP value, with loan amount second. Both effects point the right way: higher FICO lowers risk, higher outstanding loan raises it.
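For a linear model with features treated as independent, the SHAP value of feature j in log-odds space has a closed form: the coefficient times the feature's deviation from its mean. The sketch below applies that formula directly instead of calling the `shap` library; the coefficients, the standardised toy rows and the column names `fico_score` and `loan_amt_outstanding` are hypothetical.

```python
def linear_shap(coefs, X):
    """SHAP values for a linear model under feature independence: w_j * (x_j - mean_j)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(coefs))]
    return [[w * (x - m) for w, x, m in zip(coefs, row, means)] for row in X]


def mean_abs_shap(shap_values):
    """Global importance per feature: mean absolute SHAP value across rows."""
    n = len(shap_values)
    width = len(shap_values[0])
    return [sum(abs(row[j]) for row in shap_values) / n for j in range(width)]


# Hypothetical standardised features and LR coefficients (log-odds of default).
features = ["fico_score", "loan_amt_outstanding"]
coefs = [-1.2, 0.4]
X = [[-1.0, 0.5], [0.0, -0.5], [1.0, 0.0]]

shap_values = linear_shap(coefs, X)
importance = dict(zip(features, mean_abs_shap(shap_values)))
```

A negative coefficient on FICO means an above-average score yields a negative SHAP value, which is the "higher FICO lowers risk" direction.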
The Forage data has no time column, so a real out-of-time test is not possible. The next-best stress test is a monotone proxy: train on the long-tenure cohort (years_employed ≥ 3), evaluate on the short-tenure cohort (years_employed < 3).
The short-tenure cohort has a 2.4x higher base default rate. The model trained on long-tenure customers cannot recover that through the same threshold: per-loan profit drops by $876 and the operating point breaks down to a 90% rejection rate with 458 false rejections out of 796.
The general lesson: a model is not a thing, it is a thing plus the cohort it was fit on. Out-of-cohort deployment is its own modelling problem.
For a downstream scorecard the FICO range needs to be discretised into k buckets that maximise a likelihood derived from the empirical default rate in each bin. Greedy starts from quantile boundaries and improves locally. DP precomputes every interval likelihood and returns the exact solution in O(n² k).
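The DP side can be sketched as below, assuming the objective is the binomial log-likelihood of each bucket at its empirical default rate and that the input is already aggregated per distinct score. Function names and the toy counts are hypothetical. This sketch does not enforce monotonicity as a constraint; it only maximises likelihood.

```python
import math


def interval_ll(n, k):
    """Binomial log-likelihood of k defaults in n loans at p = k / n (0 * log 0 taken as 0)."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def dp_buckets(totals, defaults, k):
    """Exact k-bucket partition of sorted score bins; returns (log-likelihood, interior cut indices)."""
    m = len(totals)
    N, D = [0], [0]  # prefix sums make any interval likelihood an O(1) lookup
    for t, d in zip(totals, defaults):
        N.append(N[-1] + t)
        D.append(D[-1] + d)
    best = [[float("-inf")] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for b in range(1, k + 1):            # number of buckets used so far
        for j in range(b, m + 1):        # right edge of the last bucket
            for i in range(b - 1, j):    # left edge of the last bucket
                cand = best[b - 1][i] + interval_ll(N[j] - N[i], D[j] - D[i])
                if cand > best[b][j]:
                    best[b][j], back[b][j] = cand, i
    cuts, j = [], m
    for b in range(k, 0, -1):            # walk the back-pointers from the right
        cuts.append(j)
        j = back[b][j]
    return best[k][m], sorted(cuts)[:-1]


# Toy counts per sorted score bin; hypothetical values.
totals = [10, 10, 10, 10]
defaults = [9, 8, 1, 2]
ll_two, cuts_two = dp_buckets(totals, defaults, 2)
```

The three nested loops are where the O(n² k) cost comes from; a greedy search evaluates far fewer partitions and can stop at a local optimum.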
The DP-optimal seven-bucket configuration is monotonic in default rate, from 66% in the lowest bucket down to 3% in the highest. Weight of evidence runs from -2.1 to +2.0 with no inversions. Information value is 0.77, comfortably above the 0.3 threshold for strong predictive power.
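Weight of evidence and information value follow from per-bucket counts. The sketch uses the sign convention implied by the numbers above, where a high-default bucket gets a negative WoE: WoE = ln(share of goods / share of bads). The counts are hypothetical, and buckets with zero goods or zero bads would need smoothing that is not shown.

```python
import math


def woe_iv(goods, bads):
    """Per-bucket weight of evidence and total information value."""
    total_good, total_bad = sum(goods), sum(bads)
    woe = [
        math.log((g / total_good) / (b / total_bad))
        for g, b in zip(goods, bads)
    ]
    iv = sum(
        (g / total_good - b / total_bad) * w
        for g, b, w in zip(goods, bads, woe)
    )
    return woe, iv


def is_monotonic(values):
    """True when the sequence never changes direction (no inversions)."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Toy counts for three buckets ordered from low to high score; hypothetical values.
goods = [10, 30, 60]
bads = [30, 15, 5]
woe, iv = woe_iv(goods, bads)
```

Each IV term is non-negative, because the share difference and the WoE of a bucket always carry the same sign.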
Pushing to ten buckets produces an inversion between buckets 8 and 9 (1.65% / 37.50%), driven by a small-sample anomaly. More granularity introduces instability without improving discrimination. The 7-bucket solution is the operationally usable one.
Bootstrap CIs (50 resamples) show edge buckets are stable; middle boundaries have substantial uncertainty (CI widths of 30 to 50 score points).
A tabular Q-learning agent with the raw time index inside its state can encode "do action a at month m" verbatim. On the 24-month training window that is enough to beat both heuristics. On the held-out window the policy is brittle and the seasonal-swing baseline wins comfortably.
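The memorisation risk is visible in the shape of the Q-table itself. Below is a minimal sketch of a tabular learner whose state is (month, inventory); the one-unit capacity, the learning rate, the toy price path and every function name are hypothetical simplifications, and leftover inventory is assumed worthless at the end of the window.

```python
import random
from collections import defaultdict

ACTIONS = ("inject", "withdraw", "hold")
CARRY = 0.05  # per unit per month, as in the text


def step(month, inventory, action, prices, capacity=1):
    """One month of trading; illegal actions fall through as holds."""
    cash = 0.0
    if action == "inject" and inventory < capacity:
        inventory, cash = inventory + 1, -prices[month]
    elif action == "withdraw" and inventory > 0:
        inventory, cash = inventory - 1, prices[month]
    return (month + 1, inventory), cash - CARRY * inventory


def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """Standard Q-learning update toward r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])


def train(prices, episodes=500, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})  # keyed by (month, inventory)
    for _ in range(episodes):
        s = (0, 0)
        while s[0] < len(prices):
            explore = rng.random() < epsilon
            a = rng.choice(ACTIONS) if explore else max(Q[s], key=Q[s].get)
            s_next, r = step(s[0], s[1], a, prices)
            q_update(Q, s, a, r, s_next, done=s_next[0] == len(prices))
            s = s_next
    return Q


Q_table = train([2.0, 2.5, 3.0, 2.4])  # hypothetical four-month price path
```

Every entry of `Q_table` is addressed by a month index, so the learned policy is a lookup from calendar position to action. Nothing in the table refers to price level or season, which is why a new window with a different path gives it nothing to generalise from.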
On the training window all three strategies are valued at the same prices and the same storage cost. Q-learning produces $27.10 per ten-unit position, versus $15.10 for the seasonal swing and a loss for buy-and-hold (the modest price drift does not cover carry on a passive long).
Out of sample the order reverses: the seasonal swing produces $16.00, and the trained agent collapses to $1.50. The seasonal pattern is robust across years; the agent's memorised time-action map is not.
This is the expected failure mode of the formulation, and reporting it is the result. Reasonable next steps include dropping the raw time index from the state, sliding-window evaluation, and a deep Q-network for continuous inventory.
The same numbers, laid out side by side. The lighter bar is the in-sample window; the darker bar is the held-out window. All values are per ten-unit position and include the shared $0.05 per-unit per-month carrying cost.
Illegal actions (inject at full, withdraw at empty) become no-op holds and are flagged in the info dict. The cost convention is shared between the env and the baselines so the comparison is on equal footing.
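The illegal-action convention can be sketched with a small Gym-style environment. The class name `StorageEnv`, the `illegal_action` info key, the one-unit capacity and the choice to charge carry on end-of-month inventory are all assumptions made for illustration, not the repository's actual interface.

```python
from dataclasses import dataclass


@dataclass
class StorageEnv:
    prices: list
    capacity: int = 1
    carry_cost: float = 0.05  # per unit per month, as in the text
    month: int = 0
    inventory: int = 0

    def step(self, action):
        """Apply one action; returns (state, reward, done, info)."""
        price = self.prices[self.month]
        illegal = (action == "inject" and self.inventory >= self.capacity) or (
            action == "withdraw" and self.inventory <= 0
        )
        if illegal:
            action = "hold"  # no-op instead of an error or a penalty
        cash = 0.0
        if action == "inject":
            self.inventory += 1
            cash = -price
        elif action == "withdraw":
            self.inventory -= 1
            cash = price
        reward = cash - self.carry_cost * self.inventory
        self.month += 1
        done = self.month >= len(self.prices)
        return (self.month, self.inventory), reward, done, {"illegal_action": illegal}


env = StorageEnv(prices=[2.0, 3.0, 4.0])
first = env.step("withdraw")   # nothing to withdraw: becomes a flagged hold
second = env.step("inject")    # legal: pays the price plus one month of carry
third = env.step("inject")     # already full: becomes a flagged hold
```

Routing the heuristic baselines through the same `step` method is one way to guarantee they pay the same carry and face the same constraints as the agent.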
On a 2,000-row test fold, point estimates carry sampling noise that is large compared to the differences between competing models. Reporting bootstrap CIs alongside every point estimate shows which model comparisons the data supports and which it does not.
All three datasets are synthetic, supplied by the Forage program. Headline dollar amounts inherit the data's limitations; they are not estimates of performance on a production portfolio.
The loan dataset has no booking date. The model uses a stratified random split rather than out-of-time. The cohort generalisation test is a monotone proxy via years_employed, not a substitute for a real backtest.
The agent's state includes the raw month index, which lets it memorise the training trajectory. The held-out collapse is the expected failure mode of the formulation and is reported as the result, not patched out of the headline.
Real-time data integration, model monitoring, regulatory capital under Basel III, stress testing, fairness analysis, and production deployment are not addressed in this project.
The repository contains three executable notebooks, the scikit-learn LR with three-way split logic, the dynamic-programming FICO bucketer, the gas-storage environment and tabular Q-learner, plus a 72-test pytest suite running in CI.