On a 117-country panel from 1990 to 2023, five classifiers are benchmarked on one leakage-free temporal split, and a from-scratch PPO agent allocates a portfolio across all 117 countries. The load-bearing work is diagnostic: establishing why a 100-tree random forest beats a Two-Tower neural network, and measuring exactly which inputs the reinforcement-learning policy responds to. Every number is reported with the test that produced it.
The two-tower hypothesis: vulnerability embeddings interact with stress embeddings to predict default. Mathematically clean, intuitive, and on this data, wrong.
Sovereign default is a rare event. In the train slice (1990 to 2014), 78 country-year cells out of 2,925 are flagged as default-within-2-years. The two-tower architecture treats domestic vulnerability and global stress as separable factors that interact through an L2-normalised dot product, the way user-item embeddings do in MovieLens. The idea is appealing: some countries are structurally vulnerable, but they only default when global stress is elevated.
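The interaction described above can be sketched in a few lines. This is a minimal illustration, not the project's network: the embeddings are hand-written stand-ins for tower outputs, and the `scale` factor and sigmoid mapping from cosine similarity to a probability are assumptions, since the text only says the towers meet in an L2-normalised dot product.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def two_tower_score(vuln_emb, stress_emb, scale=5.0):
    """Sigmoid of the scaled cosine similarity between the two embeddings.

    `scale` is a hypothetical temperature; the text does not say how the
    dot product is mapped to a probability.
    """
    v = l2_normalise(vuln_emb)
    s = l2_normalise(stress_emb)
    cosine = sum(a * b for a, b in zip(v, s))  # lies in [-1, 1] after normalisation
    return 1.0 / (1.0 + math.exp(-scale * cosine))

# Toy embeddings (illustrative values only, not learned weights).
fragile = [0.9, 0.1, 0.4]
calm_world = [-0.8, 0.2, -0.3]
stressed_world = [0.7, 0.0, 0.5]

print(two_tower_score(fragile, calm_world))      # low: embeddings point apart
print(two_tower_score(fragile, stressed_world))  # high: embeddings align
```

The same vulnerable country scores low in a calm world and high in a stressed one, which is the separable-factors story the architecture encodes.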
That hypothesis is testable. This case study runs five classifiers - logistic regression, random forest, sklearn gradient boosting, XGBoost, and the Two-Tower network - on the same temporal train/test split (1990 to 2014 vs 2015 to 2023). The Two-Tower model loses to every tree-based baseline. PCA on its learned vulnerability embedding shows 94 percent of variance in the first component: the network collapsed the 16-dimensional macro feature space into a one-dimensional risk score and called it a day.
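A leakage-free temporal split of this kind can be sketched as follows. The row layout, the `debt_gdp` field name, and the choice of median imputation are assumptions for illustration; the text only specifies the 2014/2015 boundary and that imputation is fitted on the train slice.

```python
from statistics import median

def temporal_split(rows, last_train_year=2014):
    """Split country-year rows by year so no test-period row reaches training."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test

def fit_median_imputer(train_rows, feature):
    """Learn the fill value from the training slice only."""
    observed = [r[feature] for r in train_rows if r[feature] is not None]
    return median(observed)

def impute(rows, feature, fill_value):
    """Return copies of rows with missing values replaced by the train-derived fill."""
    return [
        {**r, feature: fill_value if r[feature] is None else r[feature]}
        for r in rows
    ]

# Hypothetical rows; country codes and values are placeholders.
rows = [
    {"country": "AAA", "year": 2013, "debt_gdp": 40.0},
    {"country": "AAA", "year": 2014, "debt_gdp": None},
    {"country": "AAA", "year": 2015, "debt_gdp": 90.0},
    {"country": "BBB", "year": 2014, "debt_gdp": 60.0},
    {"country": "BBB", "year": 2016, "debt_gdp": None},
]
train, test = temporal_split(rows)
fill = fit_median_imputer(train, "debt_gdp")  # uses 40 and 60 only; the 2015 value is never seen
train = impute(train, "debt_gdp", fill)
test = impute(test, "debt_gdp", fill)
```

Fitting the fill value before touching the test rows is what keeps the 2015 to 2023 period out of every training-time statistic.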
The features split cleanly into two registers: country-level fundamentals from the World Bank and global financial stress from FRED. Both are pulled with Python requests, using retries and polite batching. Default events are not available from either API; they were curated by hand against Reinhart and Rogoff, S&P, Moody's, and the Bank of Canada / Bank of England database.
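Retry with polite batching might look like the sketch below. The helper names, batch size, pause, and backoff schedule are all hypothetical, and `fetch` is a placeholder for whatever function wraps the real HTTP call; no World Bank or FRED endpoint is reproduced here.

```python
import time

def fetch_with_retry(fetch, batch, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(batch), retrying with exponential backoff on any exception.

    A real client would catch a narrower error type, such as
    requests.RequestException, instead of bare Exception.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(batch)
        except Exception:
            if attempt == retries:
                raise
            sleep(backoff * 2 ** attempt)

def fetch_all(fetch, items, batch_size=20, pause=0.5, sleep=time.sleep):
    """Fetch items in fixed-size batches, pausing between requests."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(fetch_with_retry(fetch, batch, sleep=sleep))
        sleep(pause)  # politeness delay between batches
    return results

# Offline stand-in for a real HTTP call, so the sketch runs without network access.
def fake_fetch(batch):
    return [{"country": code} for code in batch]

records = fetch_all(fake_fetch, ["ARG", "BRA", "CHL"], batch_size=2, pause=0.0)
```

Passing `sleep` as a parameter keeps the helpers testable without real delays.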
With only 78 positive cases in train and 18 in test, every reasonable architecture is downstream of the data, not the design. The Two-Tower latent space collapsed to one effective dimension because there was no signal left for the second.
This is the most useful single finding in the project: a 538,154-parameter network with 16-dimensional latent towers and focal loss does not extract more signal than a 100-tree random forest on this panel. The result is not a verdict on deep learning generally - it is a verdict on this dataset's sample size and label density. PCA on the trained vulnerability embedding confirms it: 94 percent of variance lives in PC1, and the second component carries the next 4 percent. The architectural prior never had room to express itself.
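The collapse check can be illustrated without any numerical library: the share of variance on PC1 is the top eigenvalue of the covariance matrix divided by its trace. The sketch below uses power iteration on synthetic four-dimensional "embeddings" driven by one latent factor; the data, dimensions, and noise level are invented for illustration and are not the project's embeddings.

```python
import random

def covariance_matrix(rows):
    """Sample covariance of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def pc1_variance_share(rows, iters=200):
    """Fraction of total variance on the first principal component.

    Power iteration finds the top eigenvalue of the covariance matrix;
    the trace gives the sum of all eigenvalues.
    """
    cov = covariance_matrix(rows)
    d = len(cov)
    vec = [1.0 / d ** 0.5] * d  # fixed start vector; adequate for this demo
    for _ in range(iters):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    top = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d))
    return top / sum(cov[i][i] for i in range(d))

# Synthetic embeddings: one latent "risk" factor plus small isotropic noise.
rng = random.Random(0)
direction = [0.5, -0.3, 0.8, 0.1]
embeddings = []
for _ in range(300):
    risk = rng.gauss(0, 1)
    embeddings.append([risk * w + rng.gauss(0, 0.05) for w in direction])

share = pc1_variance_share(embeddings)
print(f"PC1 share of variance: {share:.3f}")
```

A share close to 1 means the embedding is effectively a scalar score, which is the pattern the 94 percent figure describes.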
Logistic regression: P(default | x) = σ(w · x + b); class_weight = balanced; C selected by TimeSeriesSplit CV.
Random forest: 100 trees; max_depth selected by CV; class_weight = balanced; oversampling never applied.
Gradient boosting: sklearn GradientBoosting; n_estimators, max_depth, lr by CV; no early stopping (sklearn default).
XGBoost: scale_pos_weight = neg/pos ratio; eval_metric = logloss; numbers pending re-execution.
Two-Tower network: vuln_emb · stress_emb (L2-normed); focal_loss(gamma=2, alpha=0.75); stratified 80/20 val split.
Evaluation: held-out 2015-2023 test set; bootstrap CIs on every metric; ECE per model, calibration curve.
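The imbalance settings listed above reduce to simple arithmetic on the train counts (78 positives in 2,925 rows). The sketch below works them through and adds a per-example focal loss. The focal-loss form follows the common convention where alpha weights the positive class; whether the project uses exactly this convention is an assumption.

```python
import math

n_total, n_pos = 2925, 78          # train slice counts from the text
n_neg = n_total - n_pos            # 2847

# scikit-learn's class_weight="balanced": n_samples / (n_classes * class_count)
w_pos = n_total / (2 * n_pos)      # 18.75
w_neg = n_total / (2 * n_neg)      # about 0.514

# scale_pos_weight as the negative/positive ratio
scale_pos_weight = n_neg / n_pos   # 36.5

def focal_loss(p, y, gamma=2.0, alpha=0.75, eps=1e-7):
    """Binary focal loss for one example.

    p: predicted probability of default; y: 1 for default, 0 otherwise.
    alpha weights the positive class and (1 - alpha) the negative class;
    (1 - p_t) ** gamma shrinks the loss on examples already classified well.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

print(w_pos, round(w_neg, 3), scale_pos_weight)
print(focal_loss(0.9, 1), focal_loss(0.6, 1))  # confident hit costs far less
```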
A bootstrap 95% CI on AUC, with only 18 positive test cases, is wide enough that any pair of models with a gap under about 0.06 should be treated as inside the noise. The headline gap of 0.153 between Random Forest and Two-Tower NN survives that bar; the gap between RF and Gradient Boosting (0.035) does not. The point is not "Random Forest wins"; the point is that the data is sparse enough that architectural ambition cannot rescue you.
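A percentile bootstrap on AUC can be sketched in plain Python as below. The scores are synthetic and the negative count of 282 is a toy value chosen for speed; only the 18 positives mirror the text. The project's own resampling scheme (for example, whether it stratifies by class) is not stated, so treat this as one reasonable variant.

```python
import random

def auc(scores, labels):
    """ROC AUC as the probability a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for AUC, resampling test rows with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes in the resample
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Synthetic test set: 18 positives, 282 negatives, modestly separated scores.
rng = random.Random(1)
labels = [1] * 18 + [0] * 282
scores = [rng.gauss(1.0, 1.0) if y else rng.gauss(0.0, 1.0) for y in labels]

point = auc(scores, labels)
lo, hi = bootstrap_auc_ci(scores, labels, n_boot=500)
print(f"AUC {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With so few positives the interval is wide, which is why small AUC gaps between models are treated as noise.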
State: 1,878 dimensions. Action: 117 portfolio weights. Reward: Sharpe-like ratio of net yield minus default losses minus transaction costs.
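One way to read that reward is sketched below. The function names, the cost rate, and the exact way yield, default loss, and turnover are combined are hypothetical; the environment's actual formulas are not reproduced in the text.

```python
from statistics import mean, pstdev

def step_return(weights, yields, defaulted, recovery, prev_weights, cost_rate=0.001):
    """One-period net portfolio return (all inputs are per-country lists).

    Carry from performing countries, minus loss-given-default on defaulted
    ones, minus a cost proportional to turnover.
    """
    carry = sum(w * y for w, y, d in zip(weights, yields, defaulted) if not d)
    losses = sum(w * (1.0 - r) for w, r, d in zip(weights, recovery, defaulted) if d)
    turnover = sum(abs(w - p) for w, p in zip(weights, prev_weights))
    return carry - losses - cost_rate * turnover

def sharpe_like(returns, eps=1e-8):
    """Mean return divided by its standard deviation (eps avoids division by zero)."""
    return mean(returns) / (pstdev(returns) + eps)

# Two-country toy step: the second country defaults with 40 percent recovery.
r = step_return(
    weights=[0.5, 0.5],
    yields=[0.04, 0.10],
    defaulted=[False, True],
    recovery=[0.4, 0.4],
    prev_weights=[0.5, 0.5],
)
print(r)  # carry of 0.02 minus a 0.30 default loss
```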
The actor and critic are textbook 256-128-64 stacks with LayerNorm and ReLU. Gaussian policy on the actor; scalar value head on the critic. GAE with lambda 0.95, clip ratio 0.2, entropy bonus 0.01, 300 training episodes. The state vector concatenates 15 domestic features for every country, 6 global features, and the current portfolio weights. The reward is computed inside the environment using stylised but documented yield, recovery, and transaction-cost formulas; coefficients follow the qualitative signs in Borensztein and Panizza (2009) for spreads and the developing-country range in Cruces and Trebesch (2013) for recovery.
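The state size and two of the PPO ingredients named above can be checked in a short sketch. The lambda of 0.95 and clip ratio of 0.2 come from the text; the discount `gamma=0.99` is an assumption, as the text does not state it, and these functions are generic textbook forms rather than the project's code.

```python
N_COUNTRIES, N_DOMESTIC, N_GLOBAL = 117, 15, 6
# Domestic features for every country, global features, current portfolio weights.
STATE_DIM = N_COUNTRIES * N_DOMESTIC + N_GLOBAL + N_COUNTRIES  # 1755 + 6 + 117 = 1878

def gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one episode.

    values[t] is the critic's estimate at step t; last_value bootstraps the
    state after the final step (0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO's per-sample objective: the more pessimistic of the raw and clipped terms."""
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)

print(STATE_DIM)
print(gae([1.0, 1.0], [0.0, 0.0]))
print(clipped_surrogate(1.5, 1.0), clipped_surrogate(0.5, -1.0))
```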
The equal-weight baseline puts 1/117 = 0.855% on every country. The policy stretches that to a range of 0.70 to 1.00 percent, with weight concentrated in a few stable economies and reduced for countries with multiple historical defaults.
The policy's correlation with historical default rate is -0.585. That is real signal. But the dispersion is narrow: the bottom country (Venezuela) gets 82 percent of the weight the top country (Guyana) gets. The agent has learned which countries are risky in the historical sense; it has not learned to rebalance dynamically.
The gain over equal weight is real, but smaller in the harder environments: +12.6 percent in the deterministic environment, +9 percent with stochastic defaults (where historical labels are perturbed), and +5 percent with contagion (where a default in one country raises the probability in regional neighbours). Both harder cases stay positive, which is read as a sign that the policy has learned country-level risk patterns rather than a single pre-baked allocation copied across environments.
Most reinforcement-learning writeups stop at the reward curve. This one runs the next test: is the trained policy genuinely responsive to its inputs, and which ones?
The diagnostic perturbs each of 15 domestic macro features by 10 percent, country by country, and re-runs the trained policy. It also perturbs the portfolio-state slots and, separately, feeds a fully random state vector. The results are precise: a random state moves the action by a sum-of-absolutes of 53.49, so the policy is state-dependent at the macro level; but a 10 percent perturbation of any individual feature moves portfolio weights by 0.0000. The trained agent maps the overall shape of the world to a near-fixed allocation and is insensitive to marginal feature changes.
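The shape of that diagnostic can be sketched as follows. The state layout (country-major blocks of domestic features at the start of the vector) and the thresholded `toy_policy` are assumptions chosen to reproduce the qualitative pattern; the real diagnostic runs against the trained actor.

```python
def action_shift(policy, state, perturbed):
    """Sum of absolute differences between the actions for two states."""
    return sum(abs(a - b) for a, b in zip(policy(state), policy(perturbed)))

def feature_sensitivity(policy, state, n_countries, n_features, rel=0.10):
    """Perturb one domestic feature at a time by `rel` and record the action shift.

    Assumes the state starts with n_countries blocks of n_features values.
    Returns {(country_index, feature_index): shift}.
    """
    shifts = {}
    for c in range(n_countries):
        for f in range(n_features):
            i = c * n_features + f
            bumped = list(state)
            bumped[i] = state[i] * (1.0 + rel)
            shifts[(c, f)] = action_shift(policy, state, bumped)
    return shifts

def toy_policy(state, n_features=2, threshold=100.0):
    """Stand-in policy: one output per country, 1.0 if its block sums above threshold."""
    blocks = [state[i:i + n_features] for i in range(0, len(state), n_features)]
    return [1.0 if sum(b) > threshold else 0.0 for b in blocks]

state = [10.0, 20.0, 30.0, 40.0]  # two countries, two features each
shifts = feature_sensitivity(toy_policy, state, n_countries=2, n_features=2)
far_state = [500.0, 1.0, 2.0, 3.0]  # a very different state

print(max(shifts.values()))                        # no response to 10 percent bumps
print(action_shift(toy_policy, state, far_state))  # but a response to a large change
```

The toy policy responds to a wholesale change of state yet ignores every marginal bump, the same signature the diagnostic found in the trained agent.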
That is a useful, actionable result rather than a dead end. It localises the problem to the input representation: a 1,878-dimensional state of heterogeneously-scaled raw macro values fed unnormalised into a Dense(256) layer lets a few large-scale features dominate the first activation. The response is already implemented: a state-normalising PPO variant (Welford-running z-score over the state vector, smaller 128-64 actor) ships in src/rl/normaliser.py and src/rl/ppo.py. Whether it restores marginal responsiveness is an empirical question the next training run answers.
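A Welford-style running z-score can be sketched as below. This is a generic illustration of the technique, not the contents of src/rl/normaliser.py; the class name, the use of population variance, and the zero output before two observations are all assumptions.

```python
class RunningNormaliser:
    """Per-dimension running mean and variance (Welford) with z-score output."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, state):
        """Fold one state vector into the running statistics."""
        self.count += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalise(self, state):
        """Return the z-scored state using the statistics seen so far."""
        if self.count < 2:
            return [0.0] * len(state)  # not enough data for a variance yet
        return [
            (x - m) / ((s / self.count) ** 0.5 + self.eps)
            for x, m, s in zip(state, self.mean, self.m2)
        ]

# Two features on very different scales end up on the same footing.
norm = RunningNormaliser(dim=2)
for row in ([1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0], [4.0, 4000.0]):
    norm.update(row)
z = norm.normalise([4.0, 4000.0])
print(z)
```

After normalisation both features contribute equally to the first layer's input, which is the property the retrain is meant to test.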
The +12.6% over equal-weight is real and holds across three environments. The sensitivity test shows that gain currently comes from a learned static allocation; the state-normalised retrain is the measured next iteration, not an afterthought.
With 78 positive cases in train and 18 in test, deep learning was never going to outperform here. The Two-Tower result is informative about this regime, not about deep learning in general.
Many of the 1990s default labels are carryover from the 1980s debt crisis or formal acknowledgements of pre-existing default by newly-independent former Soviet states. The cleaner subset is the 32 post-2000 events; a re-derivation on that subset is the next obvious robustness pass.
Real sovereign analysis uses term-structure data and political-risk indices. This pipeline uses macro fundamentals only. Adding sovereign CDS spreads as a feature is the most plausible single uplift.
The environment's spread formula, spread = 0.02 + 0.0005 · debt - 0.005 · reserves, follows the qualitative signs in published work; the coefficients are not calibrated. A sensitivity sweep in the notebook documents how the equal-weight reward responds to +/-25% shifts in each coefficient.
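The formula and a one-at-a-time coefficient sweep can be written out as below. The input values (debt of 60, reserves of 4) are hypothetical, the units of `debt` and `reserves` are not stated in the text, and this sweep reports the spread itself rather than the equal-weight reward that the notebook tracks.

```python
def spread(debt, reserves, base=0.02, k_debt=0.0005, k_res=0.005):
    """Stylised sovereign spread: rises with debt, falls with reserves."""
    return base + k_debt * debt - k_res * reserves

def coefficient_sweep(debt, reserves, shift=0.25):
    """Spread under a +/-shift change to each coefficient, one at a time."""
    defaults = {"base": 0.02, "k_debt": 0.0005, "k_res": 0.005}
    out = {}
    for name, value in defaults.items():
        for sign in (-1, 1):
            kwargs = {name: value * (1 + sign * shift)}
            out[(name, sign * shift)] = spread(debt, reserves, **kwargs)
    return out

baseline = spread(60, 4)  # 0.02 + 0.03 - 0.02
sweep = coefficient_sweep(60, 4)
for key, value in sweep.items():
    print(key, round(value, 4))
```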
Temporal split, train-only imputation, TimeSeriesSplit CV, bootstrap CIs, ECE per model, regression tests for leakage.
Logistic regression, random forest, sklearn gradient boosting, XGBoost, and a Two-Tower neural net evaluated on identical features, splits, and metrics.
Actor-critic with LayerNorm, Gaussian policy, clip ratio 0.2, entropy bonus 0.01. Sensitivity diagnostics surfaced the static-policy issue rather than hiding it.
data.py, eval.py, tune.py, rl/env.py, rl/ppo.py, rl/normaliser.py, rl/yield_model.py. CI on Python 3.10 and 3.11; notebook JSON validation; banned-phrasing grep.
1990s label carryover, 8GB-machine re-execution gap, static-policy finding, narrow weight spread. Each surfaced on the same page as the headline number it qualifies.
The 16-feature macro panel becomes a 117-country allocation with a -0.585 correlation to historical default rate. The deterministic 12.6 percent gain is decomposed across stochastic and contagion environments.