project: Sovereign Default Prediction
domain: ML for sovereign credit risk
stack: TensorFlow · sklearn · XGBoost · PPO+GAE
author: Maksim Silchenko
A case study in modelling discipline

Does a two-tower neural network beat random forests on macro defaults?

On a 117-country panel from 1990 to 2023, five classifiers are benchmarked on one leakage-free temporal split, and a from-scratch PPO agent allocates a portfolio across all 117 countries. The load-bearing work is diagnostic: establishing why a 100-tree random forest beats a Two-Tower neural network, and measuring exactly which inputs the reinforcement-learning policy responds to. Every number is reported with the test that produced it.

117 countries
88 default events · 1990-2023
0.828 best AUC · random forest
+12.6% RL vs equal weight
01 · The question

Can a deep-learning architecture from recommender systems beat tree ensembles on a sparse macro panel with 78 positive labels in training?

The two-tower hypothesis: vulnerability embeddings interact with stress embeddings to predict default. Mathematically clean, intuitive, and on this data, wrong.

Sovereign default is a rare event. In the train slice (1990 to 2014), 78 country-year cells out of 2,925 are flagged as default-within-2-years. The two-tower architecture treats domestic vulnerability and global stress as separable factors that interact through an L2-normalised dot product, the way user-item embeddings do in MovieLens. The idea is appealing: some countries are structurally vulnerable, but they only default when global stress is elevated.
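The interaction described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the project's TensorFlow model: each tower is reduced to a single linear map, and the `scale` and `bias` that turn the cosine into a probability are assumptions, since the page does not say how the dot product is mapped to an output.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0.0 else [v / norm for v in vec]

def linear_tower(features, weights):
    """Toy tower: one linear map from raw features to an embedding (weights is a list of rows)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def two_tower_probability(domestic, global_stress, w_vuln, w_stress, scale=1.0, bias=0.0):
    """Cosine similarity of the two embeddings, squashed to a probability.

    scale and bias are hypothetical; they are not taken from the project.
    """
    vuln_emb = l2_normalise(linear_tower(domestic, w_vuln))
    stress_emb = l2_normalise(linear_tower(global_stress, w_stress))
    cosine = sum(a * b for a, b in zip(vuln_emb, stress_emb))  # always in [-1, 1]
    return 1.0 / (1.0 + math.exp(-(scale * cosine + bias)))

# Toy 2-dimensional inputs with identity towers: aligned vs opposed embeddings.
eye = [[1.0, 0.0], [0.0, 1.0]]
p_aligned = two_tower_probability([2.0, 0.0], [5.0, 0.0], eye, eye)
p_opposed = two_tower_probability([2.0, 0.0], [-5.0, 0.0], eye, eye)
```

Because both embeddings are unit length, the score depends only on the angle between them. That is the property that makes the later one-dimensional collapse easy to read: if every vulnerability embedding lies along one axis, the dot product reduces to a single scalar risk score.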

That hypothesis is testable. This case study runs five classifiers - logistic regression, random forest, sklearn gradient boosting, XGBoost, and the Two-Tower network - on the same temporal train/test split (1990 to 2014 vs 2015 to 2023). The Two-Tower model loses to every tree-based baseline. PCA on its learned vulnerability embedding shows 94 percent of variance in the first component: the network collapsed the 16-dimensional macro feature space into a one-dimensional risk score and called it a day.

02 · Data pipeline

World Bank macro indicators, FRED stress factors, and a hand-curated default-event database cross-referenced against four authoritative sources.

The features split cleanly into two registers: country-level fundamentals from the World Bank and global financial stress from FRED. Both are pulled with Python requests, using retries and polite batching. Default events are not on either API; they were curated by hand against Reinhart and Rogoff, S&P, Moody's, and the Bank of Canada / Bank of England database.

i · domestic
World Bank, 16 indicators
Debt service, current account, reserves coverage, government revenue and expenditure, inflation, unemployment, GDP per capita and growth, trade openness, FDI, broad money, domestic credit. Pulled in country batches over the v2 indicator endpoint.
117 countries · 1990 to 2023 · 16 features · 18,881 missing cells before imputation
ii · global
FRED, 6 stress series
VIX, US 10-year Treasury, broad dollar index, high-yield spread, TED spread, and yield-curve slope (10Y - 2Y). Aggregated to annual averages. TED was discontinued in 2022; the code handles missing series defensively.
annual aggregation · FRED_API_KEY env var · ~6 minutes total fetch time
iii · labels
88 default events across 63 countries
A country-year is labelled default_2y=1 if a default event lies within the next two years. Hand-curated against four authoritative sources. The decade breakdown discloses an honest data limit: 56 of 88 events fall in the 1990s, many of them carryover from the 1980s debt crisis or formal acknowledgements of pre-existing default status by newly-independent former Soviet states.
2.41% positive class · 78 train positives · 18 test positives · cleaner subset is the 32 post-2000 events
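A minimal sketch of the labelling rule for one country. It assumes "within the next two years" means years t+1 and t+2 and excludes year t itself; the page does not settle that boundary, and the function name is hypothetical.

```python
def label_default_within_2y(default_years, first_year, last_year, horizon=2):
    """Map each year t to 1 if a default event falls in t+1 .. t+horizon, else 0."""
    events = set(default_years)
    return {
        year: int(any((year + k) in events for k in range(1, horizon + 1)))
        for year in range(first_year, last_year + 1)
    }

# One hypothetical country with a single default event in 2001.
labels = label_default_within_2y([2001], first_year=1998, last_year=2002)

# The class balance quoted above follows from the reported counts.
positive_rate = (78 + 18) / (2925 + 1053)
```

A single event therefore produces two positive country-years, which is why 88 events across the panel yield 96 positive cells rather than 88.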
iv · split
Temporal split, train-only imputation
Train years 1990 to 2014, test years 2015 to 2023. The imputer fits on train medians only; the test slice never informs train-time statistics. The previous version of this pipeline leaked test rows into the imputer; the fix is regression-tested.
2,925 train rows · 1,053 test rows · SimpleImputer(strategy="median") · tests/test_imputation_leakage.py
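The train-only rule can be shown without any library. The sketch below is a dependency-free stand-in for the same idea as SimpleImputer(strategy="median"), not the project's code; missing cells are represented as None and the helper names are hypothetical.

```python
from statistics import median

def fit_train_medians(train_rows):
    """Per-column medians over the non-missing training values only.

    A column with no observed training value would raise StatisticsError.
    """
    n_cols = len(train_rows[0])
    return [
        median(row[j] for row in train_rows if row[j] is not None)
        for j in range(n_cols)
    ]

def impute(rows, medians):
    """Fill missing cells with the medians learned on the training slice."""
    return [[medians[j] if v is None else v for j, v in enumerate(row)] for row in rows]

train = [[1.0, None], [3.0, 10.0], [None, 30.0]]
test = [[None, 1000.0], [500.0, None]]

medians = fit_train_medians(train)     # learned from train rows only
train_filled = impute(train, medians)
test_filled = impute(test, medians)    # test values never influence the medians
```

The extreme test values (500.0 and 1000.0) leave the fill values untouched, which is the behaviour a leakage regression test can assert.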
03 · The plot twist

A 100-tree random forest beats the deep-learning architecture by 0.153 AUC.

0.675
Two-Tower NN · auc-roc
L2-normalised dot product over domestic and global towers, focal loss, early stopping. Built from a recommender-system pattern adapted for binary classification. Underperforms even logistic regression in some seeds.
0.828
Random forest · auc-roc
100 trees, max depth 5, class_weight balanced. Twelve lines of sklearn. Beats every other configuration on the same train/test split.

With only 78 positive cases in train and 18 in test, every reasonable architecture is downstream of the data, not the design. The Two-Tower latent space collapsed to one effective dimension because there was no signal left for the second.

This is the most useful single finding in the project: a 538,154-parameter network with 16-dimensional latent towers and focal loss does not extract more signal than a 100-tree random forest on this panel. The result is not a verdict on deep learning generally - it is a verdict on this dataset's sample size and label density. PCA on the trained vulnerability embedding confirms it: 94 percent of variance lives in PC1, and the second component carries the next 4 percent. The architectural prior never had room to express itself.

04 · Five baselines on one split

Same features, same temporal split, no leakage. Hyperparameters selected by TimeSeriesSplit cross-validation on the training years.

I

Logistic regression

linear baseline
P(default | x) = σ(w · x + b)
class_weight = balanced
C selected by TimeSeriesSplit CV
II

Random forest

tree ensemble
100 trees, max_depth selected by CV
class_weight = balanced
oversampling never applied
III

Gradient boosting

sequential trees
sklearn GradientBoosting
n_estimators, max_depth, lr by CV
no early stopping (sklearn default)
IV

XGBoost

gradient boosting
scale_pos_weight = neg/pos ratio
eval_metric = logloss
numbers pending re-execution
V

Two-Tower NN

deep learning
vuln_emb · stress_emb (L2-normed)
focal_loss(gamma=2, alpha=0.75)
stratified 80/20 val split
·

Evaluation protocol

no peeking, no SMOTE
held-out 2015-2023 test set
bootstrap CIs on every metric
ECE per model, calibration curve
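For reference, the focal loss listed on the Two-Tower card can be written out directly. This sketch uses the standard binary form, -alpha_t (1 - p_t)^gamma log(p_t), and assumes alpha weights the positive class; the project's TensorFlow implementation may differ in detail.

```python
import math

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.75, eps=1e-7):
    """Mean binary focal loss; alpha weights positives, (1 - alpha) weights negatives."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)           # keep log() finite
        p_t = p if y == 1 else 1.0 - p            # probability given to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)

# A missed default costs far more than an equally wrong call on a non-default.
loss_missed_default = focal_loss([1], [0.1])
loss_false_alarm = focal_loss([0], [0.9])
```

With gamma=2, confidently correct predictions contribute almost nothing, so the gradient concentrates on the rare positives the model currently gets wrong.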

Model performance

Higher is better. Test set: 2015-2023, 1,053 rows, 18 positives.

A bootstrap 95% CI on AUC, with only 18 positive test cases, is wide enough that any pair of models with a gap under about 0.06 should be treated as inside the noise. The headline gap of 0.153 between Random Forest and Two-Tower NN survives that bar; the gap between RF and Gradient Boosting (0.035) does not. The point is not "Random Forest wins"; the point is that the data is sparse enough that architectural ambition cannot rescue you.

05 · Diagnostics & honest evaluation

Every assumption is tested out loud.

imputer leakage fixed
The original cell fit the median imputer on the full panel before the train/test split, so the medians used to fill train rows (1990-2014) depended on test-period values. The imputer now fits on train only and transforms test. Three regression tests in tests/test_imputation_leakage.py.
missing cells before fit: 18,881
imputer.statistics_: train.median()
hyperparameter search in CV only
TimeSeriesSplit on training years (1990 to 2014), 5 folds, group-aware so each fold is a contiguous block of years rather than a random row sample. The 2015 to 2023 test set is touched exactly once.
folds: 5
test-set touches: 1
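One way to get contiguous year blocks is to split the list of years rather than the rows. The sketch below is an assumption about how such folds could be built, not the project's tune.py: it applies the fold-sizing rule scikit-learn's TimeSeriesSplit uses for rows (validation block = n // (n_splits + 1)) to the sorted years.

```python
def year_block_folds(years, n_splits=5):
    """Expanding-window folds over sorted unique years.

    Returns a list of (train_years, validation_years) pairs in which every
    validation block is contiguous and strictly later than its training years.
    """
    years = sorted(set(years))
    block = len(years) // (n_splits + 1)
    if block == 0:
        raise ValueError("not enough distinct years for the requested folds")
    first_val = len(years) - n_splits * block
    return [
        (years[:start], years[start:start + block])
        for start in range(first_val, len(years), block)
    ]

folds = year_block_folds(range(1990, 2015))  # the 25 training years
```

Rows are then assigned to a fold by their year, so all 117 countries for a given year land on the same side of every split. On the 25 training years this gives a first fold trained on 1990-1994 and validated on 1995-1998, and a last fold validated on 2011-2014.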
bootstrap CIs on every headline
2,000 percentile-bootstrap resamples per metric. AUC, AP, and Brier each get a 95% interval. Resamples that do not contain both classes are skipped, because AUC is undefined on a single class. With 18 positives, the CIs are wide; rankings inside the overlap are inconclusive.
n_bootstraps: 2000
CI on AUC: typically ±0.08-0.12
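The procedure can be sketched with the standard library alone. This is an illustration of a percentile bootstrap on AUC with the single-class skip, not the project's eval.py; the function names and the toy data are hypothetical.

```python
import random

def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_bootstraps=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval; resamples with a single class are skipped."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]    # sample rows with replacement
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue                                  # AUC undefined on one class
        stats.append(auc_roc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1.0 - level) / 2.0 * (len(stats) - 1))]
    upper = stats[int((1.0 + level) / 2.0 * (len(stats) - 1))]
    return lower, upper

y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
point = auc_roc(y_toy, s_toy)
lower, upper = bootstrap_auc_ci(y_toy, s_toy, n_bootstraps=200)
```

The pairwise AUC is quadratic in the class sizes, which is acceptable at 18 positives against roughly a thousand negatives but would need a rank-based formula on larger panels.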
PPO log-density clipped consistently
The log-density helper previously mixed clipped and unclipped log_std inside the same Gaussian density formula. The variance term used clipped log_std; the normalising constant used the raw value. Fixed: clip once into log_std_c, use it everywhere. tests/test_rl_ppo_smoke.py asserts the math is internally consistent.
clip range: [-5, 2]
consistency test: lp(log_std=10) == lp(log_std=2)
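A minimal sketch of the "clip once, use everywhere" fix for a diagonal Gaussian. Only the clip range and the name log_std_c come from the description above; the function name and signature are hypothetical.

```python
import math

LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0

def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian with log_std clipped once per dimension."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        log_std_c = min(max(ls, LOG_STD_MIN), LOG_STD_MAX)
        var = math.exp(2.0 * log_std_c)
        # The same clipped value feeds both the variance term and the normaliser.
        total += -0.5 * (a - m) ** 2 / var - log_std_c - 0.5 * math.log(2.0 * math.pi)
    return total

lp_raw_10 = gaussian_log_prob([0.3], [0.1], [10.0])
lp_raw_2 = gaussian_log_prob([0.3], [0.1], [2.0])
```

Since 10 is clipped to 2 before either term is computed, the two calls agree exactly. A version that used the raw value in the normalising constant would return different numbers for them, which is the inconsistency the test guards against.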
06 · The RL portfolio agent

A PPO agent allocates across 117 countries from observable macro fundamentals. No model predictions in the state.

State: 1,878 dimensions. Action: 117 portfolio weights. Reward: Sharpe-like ratio of net yield minus default losses minus transaction costs.

The actor and critic are textbook 256-128-64 stacks with LayerNorm and ReLU. Gaussian policy on the actor; scalar value head on the critic. GAE with lambda 0.95, clip ratio 0.2, entropy bonus 0.01, 300 training episodes. The state vector concatenates 15 domestic features for every country, 6 global features, and the current portfolio weights. The reward is computed inside the environment using stylised but documented yield, recovery, and transaction-cost formulas; coefficients follow the qualitative signs in Borensztein and Panizza (2009) for spreads and the developing-country range in Cruces and Trebesch (2013) for recovery.
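The 1,878-dimensional state follows from the three blocks just described: 117 × 15 domestic features, 6 global features, and 117 current weights. A sketch of that concatenation, assuming country-major ordering of the domestic block (the ordering is not stated on this page):

```python
N_COUNTRIES, N_DOMESTIC, N_GLOBAL = 117, 15, 6

def build_state(domestic, global_stress, weights):
    """Concatenate per-country features, global factors, and current portfolio weights."""
    assert len(domestic) == N_COUNTRIES and all(len(r) == N_DOMESTIC for r in domestic)
    assert len(global_stress) == N_GLOBAL and len(weights) == N_COUNTRIES
    state = [x for row in domestic for x in row]   # 117 * 15 = 1,755 values
    state.extend(global_stress)                    # + 6
    state.extend(weights)                          # + 117
    return state

# Placeholder inputs: zeros for features, equal weights for the portfolio.
state = build_state(
    [[0.0] * N_DOMESTIC for _ in range(N_COUNTRIES)],
    [0.0] * N_GLOBAL,
    [1.0 / N_COUNTRIES] * N_COUNTRIES,
)
```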

actor parameters: 538,154
critic parameters: 522,753
training episodes: 300
discount γ: 0.99
GAE λ: 0.95
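The advantage estimator behind those two settings is a short backward recursion. The sketch below is the textbook form of GAE for a single episode, with γ and λ defaulting to the values listed above; it is not lifted from src/rl/ppo.py, and the function name is hypothetical.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one episode, computed backwards.

    values[t] is the critic's estimate at step t; last_value bootstraps the
    state after the final step (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]     # targets for the critic
    return advantages, returns

adv, ret = gae_advantages([1.0, 1.0], [0.0, 0.0])
```

Setting lam=0 reduces the estimate to the one-step TD error, and lam=1 with gamma=1 gives the plain reward-to-go minus the value baseline; 0.95 sits between the two.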
07 · What the policy actually does

The agent underweights serial defaulters and overweights stable, mid-cap sovereigns. Weight spread is narrow: 0.70% to 1.00%.

The equal-weight baseline puts 1/117 = 0.855% on every country. The policy spreads that single value into a 0.70 to 1.00 percent range, with more weight on a few stable economies and less on countries with multiple historical defaults.

Figure: PPO portfolio allocations. All 117 countries, sorted by weight.

The policy's correlation with historical default rate is -0.585. That is real signal. But the dispersion is narrow: the bottom country (Venezuela) gets 82 percent of the weight the top country (Guyana) gets. The agent has learned which countries are risky in the historical sense; it has not learned to rebalance dynamically.

08 · Across environments

Outperforms equal-weight in three settings. Outperforms a low-volatility heuristic by 14% in the deterministic setting.

Figure: Cumulative reward by strategy (equal weight, low volatility, RL policy) across an evaluation episode. Higher is better. RL gain over equal-weight is shown beside each environment label.

The gain is real, but smaller in the harder environments. Stochastic defaults (where historical labels are perturbed) drop the gain to +9 percent. Contagion (where a default in one country raises probability in regional neighbours) drops it to +5 percent. Both still positive; both signal the policy has learned country-level risk patterns rather than a single pre-baked allocation copied across environments.

09 · Sensitivity analysis

Probing the trained policy: a perturbation test maps which inputs actually move the allocation.

Most reinforcement-learning writeups stop at the reward curve. This one runs the next test: is the trained policy genuinely responsive to its inputs, and which ones?

The diagnostic perturbs each of 15 domestic macro features by 10 percent, country by country, and re-runs the trained policy. It also perturbs the portfolio-state slots and, separately, feeds a fully random state vector. The results are precise: a random state moves the action by a sum-of-absolutes of 53.49, so the policy is state-dependent at the macro level; but a 10 percent perturbation of any individual feature moves portfolio weights by 0.0000. The trained agent maps the overall shape of the world to a near-fixed allocation and is insensitive to marginal feature changes.
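The perturbation loop is simple enough to sketch. The version below assumes "10 percent" means a multiplicative bump to one state slot at a time and that the domestic block is laid out country by country; the two toy policies are hypothetical stand-ins for the trained actor.

```python
def l1_action_shift(policy, state, perturbed_state):
    """Sum of absolute differences between the allocations for two states."""
    return sum(abs(a - b) for a, b in zip(policy(state), policy(perturbed_state)))

def feature_sensitivity(policy, state, n_countries, n_features, rel_step=0.10):
    """Bump one (country, feature) slot at a time; return shifts[country][feature]."""
    shifts = []
    for c in range(n_countries):
        row = []
        for f in range(n_features):
            bumped = list(state)
            bumped[c * n_features + f] *= 1.0 + rel_step
            row.append(l1_action_shift(policy, state, bumped))
        shifts.append(row)
    return shifts

def static_policy(state):
    """Ignores its input entirely, like a memorised allocation."""
    return [0.5, 0.5]

def responsive_policy(state):
    """Weights proportional to the first feature of each of two countries."""
    raw = [state[0], state[2]]
    return [r / sum(raw) for r in raw]

toy_state = [1.0, 5.0, 1.0, 5.0]   # 2 countries x 2 features
static_shifts = feature_sensitivity(static_policy, toy_state, 2, 2)
responsive_shifts = feature_sensitivity(responsive_policy, toy_state, 2, 2)
```

A multiplicative bump leaves any slot that is exactly zero unchanged, so a policy can look insensitive to a feature simply because that feature was zero in the probe state; an additive step is the usual check for that case.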

That is a useful, actionable result rather than a dead end. It localises the problem to the input representation: a 1,878-dimensional state of heterogeneously-scaled raw macro values fed unnormalised into a Dense(256) layer lets a few large-scale features dominate the first activation. The response is already implemented: a state-normalising PPO variant (Welford-running z-score over the state vector, smaller 128-64 actor) ships in src/rl/normaliser.py and src/rl/ppo.py. Whether it restores marginal responsiveness is an empirical question the next training run answers.
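For readers unfamiliar with the technique, a Welford running z-score looks roughly like this. The class below is a generic sketch under the assumption of population variance and a small epsilon; it is not the contents of src/rl/normaliser.py, and the class name is hypothetical.

```python
import math

class RunningNormaliser:
    """Per-dimension running mean and variance (Welford), used to z-score states."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim      # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, state):
        """Fold one observed state into the running statistics."""
        self.count += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalise(self, state):
        """Return the z-scored state; before two updates only the mean is removed."""
        if self.count < 2:
            return [x - m for x, m in zip(state, self.mean)]
        return [
            (x - m) / math.sqrt(m2 / self.count + self.eps)
            for x, m, m2 in zip(state, self.mean, self.m2)
        ]

norm = RunningNormaliser(dim=2)
norm.update([1.0, 1000.0])
norm.update([3.0, 3000.0])
z = norm.normalise([3.0, 3000.0])
```

The two toy dimensions differ in scale by a factor of a thousand yet map to the same z-score, which is the property intended to stop large-magnitude macro series from dominating the first dense layer.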

The +12.6% over equal-weight is real and holds across three environments. The sensitivity test shows that gain currently comes from a learned static allocation; the state-normalised retrain is the measured next iteration, not an afterthought.

10 · Honest limits

Four caveats that travel with the headline numbers.

i · data
~3,000 training observations and 78 positives

Deep learning was never going to outperform here. The Two-Tower result is informative about this regime, not about deep learning in general.

ii · labels
56 of 88 default events fall in the 1990s

Many are carryover from the 1980s debt crisis or formal acknowledgements of pre-existing default by newly-independent former Soviet states. The cleaner subset is the 32 post-2000 events; a re-derivation on that subset is the next obvious robustness pass.

iii · features
No yield curve, no political risk

Real sovereign analysis uses term-structure data and political-risk indices. This pipeline uses macro fundamentals only. Adding sovereign CDS spreads as a feature is the most plausible single uplift.

iv · environment
Yield, recovery, and cost coefficients are stylised

The env's spread = 0.02 + 0.0005 · debt - 0.005 · reserves formula follows the qualitative signs in published work; the coefficients are not calibrated. Sensitivity sweep in the notebook documents how the equal-weight reward responds to +/-25% shifts on each coefficient.
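The formula and a ±25% coefficient sweep can be written down directly. The sketch below tracks the spread itself rather than the equal-weight reward the notebook sweep reports, and the input units in the example (debt as percent of GDP, reserves in months of imports) are assumptions, not taken from the project.

```python
BASE_COEFFS = {"intercept": 0.02, "debt": 0.0005, "reserves": -0.005}

def spread(debt, reserves, coeffs=BASE_COEFFS):
    """Stylised sovereign spread: intercept + debt term + (negative) reserves term."""
    return coeffs["intercept"] + coeffs["debt"] * debt + coeffs["reserves"] * reserves

def coefficient_sweep(debt, reserves, shift=0.25):
    """Spread after a -shift and +shift relative change to each coefficient in turn."""
    out = {}
    for name in BASE_COEFFS:
        for sign in (-1.0, 1.0):
            bumped = dict(BASE_COEFFS)
            bumped[name] = BASE_COEFFS[name] * (1.0 + sign * shift)
            out[(name, sign * shift)] = spread(debt, reserves, bumped)
    return out

# Hypothetical country: debt = 60, reserves = 4 in the assumed units.
base = spread(60.0, 4.0)
sweep = coefficient_sweep(60.0, 4.0)
```

For this hypothetical country the base spread is 0.02 + 0.03 - 0.02 = 0.03, and raising the debt coefficient by 25% moves it to 0.0375, so the debt term carries most of the sensitivity at these input levels.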

Note on the headline numbers
The tables and charts on this page report figures from an earlier end-to-end execution. The methodology fixes in the current revision (train-only imputation, TimeSeriesSplit CV, bootstrap 95% CIs, calibration, env / yield-model consistency, PPO log-density clipping) are implemented and exercised by 47 unit tests. They have not been re-executed end-to-end on this commit because the full pipeline exceeds the memory budget of an 8GB Apple Silicon machine. The corrected pipeline is the source of truth; a complete re-run on a larger machine is the open work item.
11 · What this project demonstrates

Six capabilities, demonstrated end-to-end.

01 · methodology
Causal hygiene under sparse data

Temporal split, train-only imputation, TimeSeriesSplit CV, bootstrap CIs, ECE per model, regression tests for leakage.

02 · modelling
Five baselines on one ruler

Logistic regression, random forest, sklearn gradient boosting, XGBoost, and a Two-Tower neural net evaluated on identical features, splits, and metrics.

03 · reinforcement learning
From-scratch PPO with GAE

Actor-critic with LayerNorm, Gaussian policy, clip ratio 0.2, entropy bonus 0.01. Sensitivity diagnostics surfaced the static-policy issue rather than hiding it.

04 · engineering
src/ package with 47 unit tests

data.py, eval.py, tune.py, rl/env.py, rl/ppo.py, rl/normaliser.py, rl/yield_model.py. CI on Python 3.10 and 3.11; notebook JSON validation; banned-phrasing grep.

05 · honesty
Limits disclosed, not buried

1990s label carryover, 8GB-machine re-execution gap, static-policy finding, narrow weight spread. Each surfaced on the same page as the headline number it qualifies.

06 · translation
Posterior to portfolio

The 16-feature macro panel becomes a 117-country allocation with a -0.585 correlation to historical default rate. The deterministic 12.6 percent gain is decomposed across stochastic and contagion environments.

Read the full report, the notebook, and the src/ package.

Open the repository ↗