TradePool — a self-improving trading coding-agent (Laguna XS.2 LoRA)

Poolside × Prime Intellect Research Hackathon — Foundations track.

A LoRA adapter for poolside/Laguna-XS.2, trained with reinforcement learning so the model becomes a coding agent that writes causal crypto trading-strategy functions, scored by a leak-proof out-of-sample backtest.

The idea in one line

Trading discipline that normally lives as prompt text (a memory file of rules) is turned into adapter weights by rewarding disciplined, profitable behaviour on held-out market data. The verifier is the backtest.

How it works

Environment (verifiers, v0 SingleTurnEnv, pushed to stimulir/trade-pool): the agent is given a Base-chain token's in-sample price history + a library of causal indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write def strategy(features, position) -> target_position.
Verifier / reward — the strategy runs bar-by-bar over a held-out window (lookahead is structurally impossible; the function never sees future bars), scored by a weighted rubric:
- OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) · sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
- Hard gates → reward 0: invalid code, lookahead, NaN equity, do-nothing strategies.
Training — Prime Hosted RL (GRPO), poolside/Laguna-XS.2, 50 steps, batch 128, rollouts_per_example=8, enable_thinking=false. FREE hosted Laguna run.
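
The environment and verifier described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code shipped in trade_pool/. Only the strategy(features, position) signature, the rubric weights, and the hard gates come from the description. The feature name zscore, the helpers run_causal_backtest and weighted_reward, the long-only clamp, the fee model, and the idea that each component score is already normalised to [0, 1] are hypothetical.

```python
import math

# Rubric weights as listed above; they sum to 1.0.
WEIGHTS = {
    "sharpe": 0.40,
    "beats_buy_and_hold": 0.20,
    "drawdown": 0.15,
    "exposure": 0.10,
    "cost": 0.05,
    "valid_and_trades": 0.10,
}


def strategy(features, position):
    """Example agent output: mean reversion on a z-score (feature name is assumed)."""
    z = features["zscore"]
    if z < -1.0:
        return 1.0   # go fully long when price is stretched below its mean
    if z > 1.0:
        return 0.0   # go flat when price is stretched above its mean
    return position  # otherwise hold the current position


def run_causal_backtest(strategy_fn, feature_rows, returns, fee=0.001):
    """Step bar by bar: the position chosen on bar t earns the return of bar t+1."""
    equity, position, trades = [1.0], 0.0, 0
    for t in range(len(feature_rows) - 1):
        # The strategy only ever receives the features of the current bar.
        target = float(strategy_fn(feature_rows[t], position))
        target = max(0.0, min(1.0, target))  # long-only clamp (assumption)
        turnover = abs(target - position)
        if turnover > 0:
            trades += 1
        position = target
        equity.append(equity[-1] * (1.0 + position * returns[t + 1] - fee * turnover))
    return equity, trades


def weighted_reward(components, equity, trades):
    """Apply two of the hard gates, then take the weighted sum of component scores."""
    if trades == 0 or any(math.isnan(x) for x in equity):
        return 0.0  # do-nothing strategy or NaN equity
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

In this sketch the no-lookahead property comes from the loop itself: the strategy is called with one row at a time, and the position it returns is only applied to the following bar's return. The invalid-code and lookahead gates are not modelled here.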

Results

RL raised total reward on the training environment from about 0.15 to a peak of about 0.42 by step 13, after which it settled in the 0.34–0.41 range:

Stage	Total reward
step ~0 (baseline)	~0.15
step ~8	0.19
step ~11	0.28
step ~13 (peak)	~0.42
step ~50 (final)	~0.34–0.41

Every rubric component improved together (not single-metric gaming): reward_valid 0.30 → ~0.70 (the model writes valid trading code far more often), reward_sharpe 0.10 → 0.33, and drawdown, exposure, and cost all up. A held-out-symbol eval of base Laguna, run before training, scored reward_valid 0.75 / reward_sharpe 0.45, which suggests the env sits in a healthy trainable band.

The novel contribution: closing the self-improvement loop

Weights channel: each RL iteration warm-starts from the prior adapter (checkpoint_id) — genuine parametric continuation.
Curriculum channel: a reflection step reads the prior adapter's out-of-sample eval, shifts the next run's objective (sharpe → min-drawdown → balanced), and focuses on the weakest symbols, so the agent's own results drive its next curriculum.
Falsifiable proof ("memory is the adapter"): the discipline block (distilled from 618 real prior trading decisions) can be stripped from the prompt (use_seed_principles=false); if the trained adapter stays disciplined, the rules now live in the weights, not the prompt.
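
The curriculum channel above could look roughly like the sketch below. It is an assumption-laden illustration, not the project's reflection step: the function name reflect, the per-symbol reward dictionary, the n_focus parameter, and the rule of advancing one objective per iteration are all hypothetical. Only the objective order (sharpe, min-drawdown, balanced) and the focus on the weakest symbols come from the text.

```python
# Objective schedule in the order given above (identifier spelling is assumed).
OBJECTIVES = ["sharpe", "min_drawdown", "balanced"]


def reflect(eval_by_symbol, current_objective, n_focus=2):
    """Choose the next run's objective and focus symbols from the prior eval."""
    # Advance one step through the schedule, staying on the last objective.
    idx = OBJECTIVES.index(current_objective)
    next_objective = OBJECTIVES[min(idx + 1, len(OBJECTIVES) - 1)]
    # Focus the next run on the symbols with the lowest out-of-sample reward.
    weakest = sorted(eval_by_symbol, key=eval_by_symbol.get)[:n_focus]
    return {"objective": next_objective, "focus_symbols": weakest}
```

A call such as reflect({"AAA": 0.4, "BBB": 0.1, "CCC": 0.3}, "sharpe") would select min_drawdown as the next objective and BBB and CCC as the focus symbols (the symbol names are placeholders).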

Files

trade_pool/ — the full verifiers environment (features, causal backtester, executor, rubric, data) — installable, builds to a wheel, bundles its own OHLCV tape.
adapter/ — the trained LoRA adapter weights for poolside/Laguna-XS.2.
configs/ — the RL training config(s).
reward_curve.txt, eval_*.json — training + eval metrics.

Reproduce

prime env push --path ./trade_pool --visibility PRIVATE     # -> <you>/trade-pool
prime eval run <you>/trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
prime train run configs/iter_1.toml                          # FREE hosted Laguna RL
prime deployments create <adapter_id>                        # serve the adapter

Built at the Poolside London hackathon, 29–30 May 2026. Team: TradePool (Tosin Dairo).
