project · Sovereign Default Prediction domain · ML for sovereign credit risk stack · TensorFlow · sklearn · XGBoost · PPO+GAE author · Maksim Silchenko

A case study in modelling discipline

Does a two-tower
neural network
beat random forests
on macro defaults?

On a 117-country panel from 1990 to 2023, five classifiers are benchmarked on one leakage-free temporal split, and a from-scratch PPO agent allocates a portfolio across all 117 countries. The load-bearing work is diagnostic: establishing why a 100-tree random forest beats a Two-Tower neural network, and measuring exactly which inputs the reinforcement-learning policy responds to. Every number is reported with the test that produced it.

117

countries

default events · 1990-2023

0.828

best AUC · random forest

+12.6%

RL vs equal weight

01 · The question

Can a deep-learning architecture from recommender systems beat tree ensembles on a sparse macro panel with 78 positive labels in training?

The two-tower hypothesis: vulnerability embeddings interact with stress embeddings to predict default. Mathematically clean, intuitive, and on this data, wrong.

Sovereign default is a rare event. In the train slice (1990 to 2014), 78 country-year cells out of 2,925 are flagged as default-within-2-years. The two-tower architecture treats domestic vulnerability and global stress as separable factors that interact through an L2-normalised dot product, the way user-item embeddings do in MovieLens. The idea is appealing: some countries are structurally vulnerable, but they only default when global stress is elevated.

That hypothesis is testable. This case study runs five classifiers - logistic regression, random forest, sklearn gradient boosting, XGBoost, and the Two-Tower network - on the same temporal train/test split (1990 to 2014 vs 2015 to 2023). The Two-Tower model loses to every tree-based baseline. PCA on its learned vulnerability embedding shows 94 percent of variance in the first component: the network collapsed the 16-dimensional macro feature space into a one-dimensional risk score and called it a day.

02 · Data pipeline

World Bank macro indicators, FRED stress factors, and a hand-curated default-event database cross-referenced against four authoritative sources.

The features split cleanly into two registers: country-level fundamentals from the World Bank and global financial stress from FRED. Both pulled by python requests with retry and batch politeness. Default events are not on either API; they were curated by hand against Reinhart and Rogoff, S&P, Moody's, and the Bank of Canada / Bank of England database.

i · domestic

World Bank, 16 indicators

Debt service, current account, reserves coverage, government revenue and expenditure, inflation, unemployment, GDP per capita and growth, trade openness, FDI, broad money, domestic credit. Pulled in country batches over the v2 indicator endpoint.

117 countries · 1990 to 2023 · 16 features · 18,881 missing cells before imputation

ii · global

FRED, 6 stress series

VIX, US 10-year Treasury, broad dollar index, high-yield spread, TED spread, and yield-curve slope (10Y - 2Y). Aggregated to annual averages. TED was discontinued in 2022; the code handles missing series defensively.

annual aggregation · FRED_API_KEY env var · ~6 minutes total fetch time

iii · labels

88 default events across 63 countries

A country-year is labelled default_2y=1 if a default event lies within the next two years. Hand-curated against four authoritative sources. The decade breakdown discloses an honest data limit: 56 of 88 events fall in the 1990s, many of them carryover from the 1980s debt crisis or formal acknowledgements of pre-existing default status by newly-independent former Soviet states.

2.41% positive class · 78 train positives · 18 test positives · cleaner subset is the 32 post-2000 events

iv · split

Temporal split, train-only imputation

Train years 1990 to 2014, test years 2015 to 2023. The imputer fits on train medians only; the test slice never informs train-time statistics. The previous version of this pipeline leaked test rows into the imputer; the fix is regression-tested.

2,925 train rows · 1,053 test rows · SimpleImputer(strategy="median") · tests/test_imputation_leakage.py

03 · The plot twist

A 100-tree random forest beats the deep-learning architecture by 0.153 AUC.

0.675

Two-Tower NN · auc-roc

L2-normalised dot product over domestic and global towers, focal loss, early stopping. Built from a recommender-system pattern adapted for binary classification. Underperforms even logistic regression in some seeds.

0.828

Random forest · auc-roc

100 trees, max depth 5, class_weight balanced. Twelve lines of sklearn. Beats every other configuration on the same train/test split.

With only 78 positive cases in train and 18 in test, every reasonable architecture is downstream of the data, not the design. The Two-Tower latent space collapsed to one effective dimension because there was no signal left for the second.

This is the most useful single finding in the project: a 538,154-parameter network with 16-dimensional latent towers and focal loss does not extract more signal than a 100-tree random forest on this panel. The result is not a verdict on deep learning generally - it is a verdict on this dataset's sample size and label density. PCA on the trained vulnerability embedding confirms it: 94 percent of variance lives in PC1, and the second component carries the next 4 percent. The architectural prior never had room to express itself.

04 · Five baselines on one split

Same features, same temporal split, no leakage. Hyperparameters selected by TimeSeriesSplit cross-validation on the training years.

Logistic regression

linear baseline

P(default | x) = σ(w · x + b)
class_weight = balanced
C selected by TimeSeriesSplit CV

Random forest

tree ensemble

100 trees, max_depth selected by CV
class_weight = balanced
oversampling never applied

III

Gradient boosting

sequential trees

sklearn GradientBoosting
n_estimators, max_depth, lr by CV
no early stopping (sklearn default)

XGBoost

gradient boosting

scale_pos_weight = neg/pos ratio
eval_metric = logloss
numbers pending re-execution

Two-Tower NN

deep learning

vuln_emb · stress_emb (L2-normed)
focal_loss(gamma=2, alpha=0.75)
stratified 80/20 val split

Evaluation protocol

no peeking, no SMOTE

held-out 2015-2023 test set
bootstrap CIs on every metric
ECE per model, calibration curve

Model performance

Higher is better. Test set: 2015-2023, 1,053 rows, 18 positives.

A bootstrap 95% CI on AUC, with only 18 positive test cases, is wide enough that any pair of models with a gap under about 0.06 should be treated as inside the noise. The headline gap of 0.153 between Random Forest and Two-Tower NN survives that bar; the gap between RF and Gradient Boosting (0.035) does not. The point is not "Random Forest wins"; the point is that the data is sparse enough that architectural ambition cannot rescue you.

05 · Diagnostics & honest evaluation

Every assumption is tested out loud.

imputer leakage fixed

Original cell fit the median imputer on the full panel before the train/test split, so the median used to fill 2015-2023 rows depended on the train set and vice versa. Now fits on train-only and transforms test. Three regression tests in tests/test_imputation_leakage.py.

missing cells before fit18,881

imputer.statistics_train.median()

hyperparameter search in CV only

TimeSeriesSplit on training years (1990 to 2014), 5 folds, group-aware so each fold is a contiguous block of years rather than a random row sample. The 2015 to 2023 test set is touched exactly once.

folds5

test-set touches1

bootstrap CIs on every headline

2,000 percentile-bootstrap resamples per metric. AUC, AP, and Brier each get a 95% interval. Resamples without both classes are skipped (they would crash AUC). With 18 positives, the CIs are wide; rankings inside the overlap are inconclusive.

n_bootstraps2000

CI on AUCtypically ±0.08-0.12

PPO log-density clipped consistently

The log-density helper previously mixed clipped and unclipped log_std inside the same Gaussian density formula. The variance term used clipped log_std; the normalising constant used the raw value. Fixed: clip once into log_std_c, use it everywhere. tests/test_rl_ppo_smoke.py asserts the math is internally consistent.

clip range[-5, 2]

consistency testlp(log_std=10) == lp(log_std=2)

06 · The RL portfolio agent

A PPO agent allocates across 117 countries from observable macro fundamentals. No model predictions in the state.

State: 1,878 dimensions. Action: 117 portfolio weights. Reward: Sharpe-like ratio of net yield minus default losses minus transaction costs.

The actor and critic are textbook 256-128-64 stacks with LayerNorm and ReLU. Gaussian policy on the actor; scalar value head on the critic. GAE with lambda 0.95, clip ratio 0.2, entropy bonus 0.01, 300 training episodes. The state vector concatenates 15 domestic features for every country, 6 global features, and the current portfolio weights. The reward is computed inside the environment using stylised but documented yield, recovery, and transaction-cost formulas; coefficients follow the qualitative signs in Borensztein and Panizza (2009) for spreads and the developing-country range in Cruces and Trebesch (2013) for recovery.

actor parameters538,154

critic parameters522,753

training episodes300

discount γ0.99

GAE λ0.95

07 · What the policy actually does

The agent underweights serial defaulters and overweights stable, mid-cap sovereigns. Weight spread is narrow: 0.70% to 1.00%.

Equal-weight baseline puts 1/117 = 0.855% on every country. The policy stretches that range to 0.70 - 1.00 percent, concentrated in a few stable economies and reduced for countries with multiple historical defaults. Toggle the views below.

PPO portfolio allocations

Cells in the strip below represent all 117 countries, sorted by weight. The 5 cells highlighted in vermillion are the active selection.

The policy's correlation with historical default rate is -0.585. That is real signal. But the dispersion is narrow: the bottom country (Venezuela) gets 82 percent of the weight the top country (Guyana) gets. The agent has learned which countries are risky in the historical sense; it has not learned to rebalance dynamically.

08 · Across environments

Outperforms equal-weight in three settings. Outperforms a low-volatility heuristic by 14% in the deterministic setting.

Cumulative reward by strategy

Equal weight Low volatility RL policy

Cumulative reward across an evaluation episode. Higher is better. RL gain over equal-weight shown beside each environment label.

The gain is real, but smaller in the harder environments. Stochastic defaults (where historical labels are perturbed) drop the gain to +9 percent. Contagion (where a default in one country raises probability in regional neighbours) drops it to +5 percent. Both still positive; both signal the policy has learned country-level risk patterns rather than a single pre-baked allocation copied across environments.

09 · Sensitivity analysis

Probing the trained policy: a perturbation test maps which inputs actually move the allocation.

Most reinforcement-learning writeups stop at the reward curve. This one runs the next test: is the trained policy genuinely responsive to its inputs, and which ones?

The diagnostic perturbs each of 15 domestic macro features by 10 percent, country by country, and re-runs the trained policy. It also perturbs the portfolio-state slots and, separately, feeds a fully random state vector. The results are precise: a random state moves the action by a sum-of-absolutes of 53.49, so the policy is state-dependent at the macro level; but a 10 percent perturbation of any individual feature moves portfolio weights by 0.0000. The trained agent maps the overall shape of the world to a near-fixed allocation and is insensitive to marginal feature changes.

That is a useful, actionable result rather than a dead end. It localises the problem to the input representation: a 1,878-dimensional state of heterogeneously-scaled raw macro values fed unnormalised into a Dense(256) layer lets a few large-scale features dominate the first activation. The response is already implemented: a state-normalising PPO variant (Welford-running z-score over the state vector, smaller 128-64 actor) ships in src/rl/normaliser.py and src/rl/ppo.py. Whether it restores marginal responsiveness is an empirical question the next training run answers.

The +12.6% over equal-weight is real and holds across three environments. The sensitivity test shows that gain currently comes from a learned static allocation; the state-normalised retrain is the measured next iteration, not an afterthought.

10 · Honest limits

Four caveats that travel with the headline numbers.

i · data

~3,000 training observations and 78 positives

Deep learning was never going to outperform here. The Two-Tower result is informative about this regime, not about deep learning in general.

ii · labels

56 of 88 default events fall in the 1990s

Many are carryover from the 1980s debt crisis or formal acknowledgements of pre-existing default by newly-independent former Soviet states. The cleaner subset is the 32 post-2000 events; a re-derivation on that subset is the next obvious robustness pass.

iii · features

No yield curve, no political risk

Real sovereign analysis uses term-structure data and political-risk indices. This pipeline uses macro fundamentals only. Adding sovereign CDS spreads as a feature is the most plausible single uplift.

iv · environment

Yield, recovery, and cost coefficients are stylised

The env's spread = 0.02 + 0.0005 · debt - 0.005 · reserves formula follows the qualitative signs in published work; the coefficients are not calibrated. Sensitivity sweep in the notebook documents how the equal-weight reward responds to +/-25% shifts on each coefficient.

Note on the headline numbers

The tables and charts on this page report figures from an earlier end-to-end execution. The methodology fixes in the current revision (train-only imputation, TimeSeriesSplit CV, bootstrap 95% CIs, calibration, env / yield-model consistency, PPO log-density clipping) are implemented and exercised by 47 unit tests. They have not been re-executed end-to-end on this commit because the full pipeline exceeds the memory budget of an 8GB Apple Silicon machine. The corrected pipeline is the source of truth; a complete re-run on a larger machine is the open work item.

11 · What this project demonstrates

Six capabilities, demonstrated end-to-end.

01 · methodology

Causal hygiene under sparse data

Temporal split, train-only imputation, TimeSeriesSplit CV, bootstrap CIs, ECE per model, regression tests for leakage.

02 · modelling

Five baselines on one ruler

Logistic regression, random forest, sklearn gradient boosting, XGBoost, and a Two-Tower neural net evaluated on identical features, splits, and metrics.

03 · reinforcement learning

From-scratch PPO with GAE

Actor-critic with LayerNorm, Gaussian policy, clip ratio 0.2, entropy bonus 0.01. Sensitivity diagnostics surfaced the static-policy issue rather than hiding it.

04 · engineering

src/ package with 47 unit tests

data.py, eval.py, tune.py, rl/env.py, rl/ppo.py, rl/normaliser.py, rl/yield_model.py. CI on Python 3.10 and 3.11; notebook JSON validation; banned-phrasing grep.

05 · honesty

Limits disclosed, not buried

1990s label carryover, 8GB-machine re-execution gap, static-policy finding, narrow weight spread. Each surfaced on the same page as the headline number it qualifies.

06 · translation

Posterior to portfolio

The 16-feature macro panel becomes a 117-country allocation with a -0.585 correlation to historical default rate. The deterministic 12.6 percent gain is decomposed across stochastic and contagion environments.

Does a two-towerneural networkbeat random forestson macro defaults?