Slipstream: a benchmark for mid-flight project forecasting, and a small agent that blends the best methods

Published June 15, 2026

1. Summary
Key results at a glance (at 40% complete)
2. Data
2.1 Where the data comes from
2.2 Filling the gaps with simulation
2.3 How the data is split (and why there is no cheating)
3. Evaluation setup
3.1 The question, the snapshots, the metrics
3.2 The control group and the baselines
3.3 The families we compared, and what each is good (and bad) at
3.4 The gap: nobody is good across all horizons
4. Teacher bake-off
5. Distillation: shrinking the agent for the edge
5.1 Generating and filtering the training traces
5.2 The five student models and how they are trained
5.3 Before vs after (at 40% complete)
6. Limitations and honest caveats
7. Future work
8. Conclusion
Appendix A: open releases
References
1. Summary

Big projects rarely finish exactly on budget or on time. The useful question, half-way through, is: given how a project has gone so far, what will it finally cost, and when will it actually finish?

Two numbers capture that:

EAC (Estimate at Completion) - the project's total final cost.
Finish period - the time period in which the project completes.

We built a benchmark that asks 37 forecasting methods exactly this question, on 107 real projects the methods never saw during training, at four points in each project's life (25%, 40%, 60% and 75% complete). We score every method the same way, so they are directly comparable.

The headline finding: no single method is good across a project's whole life.

Earned Schedule - the long-standing industry-standard formula - is excellent on short projects, but its finish forecast degrades badly on long ones (at 40% complete, median error grows from 0.35 periods on short projects to 5.37 periods on very-long ones).
Machine-learning "reference-class" models are the mirror image: steady and robust on long projects, but weaker than Earned Schedule early on.
Time-series foundation models are the weakest of all on long projects.

Each family wins only one or two of the short / medium / long horizons, never all three. That gap is exactly what an extra layer of intelligence can close. We add a small agent that, for each project, writes and runs code to call every forecasting tool, reasons about which to trust for this situation (lean on the ML reference-class for long-horizon timing; keep cost anchored on Earned Schedule), reconciles them into one answer, and states the rule it used. That agent is the only method that is strong across all horizons - it keeps Earned Schedule's early accuracy (0.35 periods on short) and inherits the ML robustness on long (1.30 periods on very-long, versus Earned Schedule's 5.37).

Finally, we distil that agent's reasoning into small open models (1-4 billion parameters). The best of them match the large teacher and the best classical baseline on accuracy, while being small enough to run on-device and air-gapped - which matters because much project data (defence, critical infrastructure, government) cannot leave its environment.

Everything - simulating missing data, generating the teacher's reasoning traces, fine-tuning and evaluation - ran on Modal.

Key results at a glance (at 40% complete)

Method	What it is	Cost error (median)	Finish error (median)
Earned Schedule	industry-standard formula	2.37%	1.0 period
Best ML (TabPFN, zero-shot)	learned reference class	3.0%	0.93 periods
Best time-series FM (Chronos)	general forecasting AI	3.93%	2.13 periods
Agent (DeepSeek V4 teacher)	reconciles the above	2.4%	0.6 periods
Distilled 4B student (Nemotron)	the agent, shrunk	2.37%	0.61 periods

(All values: data/curated/bench.json, stage 0.40.)

2. Data

2.1 Where the data comes from

The raw projects come from the Operations Research & Scheduling (OR&S) group at Ghent University (Mario Vanhoucke), a widely used public collection of project-scheduling datasets (https://www.projectmanagement.ugent.be/research/data; DATA.md §2). Seven libraries are used: DSLIB, RCPLIB, MPLIB, ASLIB, MMLIB, MSLIB and SSLIB.

There is one crucial distinction (DATA.md §2; pipeline/parsers/):

Only DSLIB contains real outcomes - actual recorded cost and progress over time (the planned value PV, earned value EV and actual cost AC curves) for 117 of its 231 projects (parsers/dslib.py). These are real projects that actually ran.
The other six libraries are network structures only - they describe the tasks, durations and dependencies of a project, but were never executed, so they have no real cost/progress history.

Jargon, in plain terms. EVM (Earned Value Management) tracks three running totals: PV = what you planned to have spent by now, EV = the budgeted value of work actually done, and AC = what you actually spent. From these you derive everything else (e.g. are we over budget? behind schedule?). BAC = Budget at Completion, the total planned cost.
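As a minimal sketch of what "derive everything else" means, the standard EVM variances and indices follow directly from the three running totals. The container, the function name and the example figures are illustrative assumptions, not code or data from the Slipstream pipeline.

```python
from dataclasses import dataclass


@dataclass
class EvmSnapshot:
    """Cumulative EVM totals at one reporting period (illustrative container)."""
    pv: float   # planned value: budgeted cost of work scheduled so far
    ev: float   # earned value: budgeted cost of work actually done
    ac: float   # actual cost: what has really been spent
    bac: float  # budget at completion: total planned cost


def evm_indicators(s: EvmSnapshot) -> dict:
    """Standard EVM variances and indices derived from PV, EV and AC."""
    return {
        "cost_variance": s.ev - s.ac,      # negative means over budget
        "schedule_variance": s.ev - s.pv,  # negative means behind schedule
        "cpi": s.ev / s.ac,                # cost efficiency (< 1 is over budget)
        "spi": s.ev / s.pv,                # schedule efficiency (< 1 is behind)
        "pct_complete": s.ev / s.bac,      # share of the budgeted work done
    }


# Made-up numbers: 400 planned, 360 earned, 450 spent, on a budget of 1,000.
snap = EvmSnapshot(pv=400.0, ev=360.0, ac=450.0, bac=1000.0)
ind = evm_indicators(snap)
print(ind)
```

In this made-up case the project is both over budget (CPI 0.8) and behind schedule (SPI 0.9) at 36% complete.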

2.2 Filling the gaps with simulation

To train forecasting models you need many examples with known outcomes, but only 117 real ones exist. So we simulate the missing outcomes: take the real network structures (which we have in abundance), schedule them, then run a Monte-Carlo "execution" to produce realistic PV/EV/AC curves (pipeline/simulate.py; DATA.md §4-5).

Each simulated project is assigned a behavioural regime - controlled (33%), typical (49%) or troubled (18%) - so the corpus contains calm projects and messy ones in realistic proportions (simulate.py). The simulator was tuned not for visual realism but for usefulness: we ran an A/B/C comparison and kept the configuration ("config C": behavioural regimes, a particular cost mechanism switched off) that produced models which transferred best to the real test projects (DATA.md §4). The simulated distribution of overruns was checked against the real DSLIB projects and matches closely (e.g. real final cost-efficiency 10th/50th/90th percentiles 0.82 / 0.98 / 1.20 vs simulated 0.82 / 1.00 / 1.15; DATA.md §"calibration").

This produced 3,632 simulated project trajectories from 1,734 source networks (data/curated/train_sim.jsonl; DATA.md §"counts").
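The regime mix and the percentile check can be sketched as follows. Only the three regime proportions and the corpus size come from the text above; the sampling code, the function names and the seed are assumptions, and the real simulate.py may assign regimes quite differently.

```python
import random
import statistics

# Regime mix quoted in the text; the sampling mechanism is an assumption.
REGIME_WEIGHTS = {"controlled": 0.33, "typical": 0.49, "troubled": 0.18}


def assign_regimes(n: int, seed: int = 0) -> list:
    """Draw one behavioural regime per simulated project."""
    rng = random.Random(seed)
    names = list(REGIME_WEIGHTS)
    weights = list(REGIME_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


def p10_p50_p90(values: list) -> tuple:
    """10th, 50th and 90th percentiles, e.g. of final cost-efficiency."""
    deciles = statistics.quantiles(values, n=10)  # nine cut points
    return deciles[0], deciles[4], deciles[8]


regimes = assign_regimes(3632)
share = {name: regimes.count(name) / len(regimes) for name in REGIME_WEIGHTS}
print(share)
```

Running `p10_p50_p90` once over the real DSLIB projects and once over the simulated corpus would give the kind of side-by-side triple quoted above.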

2.3 How the data is split (and why there is no cheating)

File	What it is	Used for
`train_sim.jsonl` (3,632)	simulated projects	train the ML + foundation models; seed the teacher's reasoning traces
`test_real.jsonl` (117)	real DSLIB projects	the held-out benchmark - never trained on
`sft_phasec.jsonl` (367)	teacher reasoning traces over simulated projects	fine-tune the small models

The benchmark is scored on the 107 real projects that have at least four reporting periods (enough to forecast from a mid-point); the other 10 are too short to forecast (pipeline/eval/harness.py, bench.json: n_test=117, n_eval=107). The 107 scored projects are stratified by length: short 43, medium 27, long 25, very-long 12.

The integrity rule is structural: a real DSLIB project that has an outcome is a test item and is never used as a training source; everything trained on is simulated. The produced manifest records zero overlap between training networks and test projects (DATA.md §"contamination"). All 367 distillation traces are simulation-only (every record id begins sim::), so the small models are never shown a real test project even indirectly (slipstream-phasec-realdata memory; build_sft.py).
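A check of this kind could look like the sketch below. The `sim::` prefix is taken from the text; the function name, the argument names and the example ids are hypothetical, and the actual manifest check may work differently.

```python
def check_no_leakage(train_network_ids, test_project_ids, sft_record_ids):
    """Return a list of problems; an empty list means the split looks clean."""
    problems = []

    # Rule 1: no training source network may also be a real test project.
    overlap = set(train_network_ids) & set(test_project_ids)
    if overlap:
        problems.append(f"train/test overlap: {sorted(overlap)}")

    # Rule 2: every distillation trace must come from a simulated project.
    non_sim = [rid for rid in sft_record_ids if not rid.startswith("sim::")]
    if non_sim:
        problems.append(f"non-simulated SFT records: {non_sim}")

    return problems


# Hypothetical ids, purely for illustration.
clean = check_no_leakage(
    train_network_ids=["net-001", "net-002"],
    test_project_ids=["real-A", "real-B"],
    sft_record_ids=["sim::net-001::0", "sim::net-002::3"],
)
leaky = check_no_leakage(
    train_network_ids=["net-001", "real-A"],
    test_project_ids=["real-A", "real-B"],
    sft_record_ids=["sim::net-001::0", "real-B::0"],
)
print(clean, leaky)
```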

3. Evaluation setup

3.1 The question, the snapshots, the metrics

For each test project we reveal only the first part of its history (25%, 40%, 60% or 75% of the way through) and ask the method for its EAC and finish period. 40% is the primary reporting point - far enough in to have signal, early enough to be useful (harness.py; EVAL.md). We measure three things (pipeline/eval/harness.py):

Cost error (eac_ape_med): the median, across projects, of the percentage gap between the forecast final cost and the true final cost. Lower is better.
Finish error (finish_err_med): the median gap, in periods, between the forecast finish and the true finish. Lower is better.
Valid rate (valid_rate): the fraction of projects for which the method returned a usable forecast at all.

We use the median (the "typical" project) rather than the average so a single wild project cannot dominate the score.
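The three metrics can be sketched in a few lines. The key names mirror those quoted above, but the function itself is an illustration rather than the harness code; in particular, excluding unusable forecasts from the medians (so that they only lower valid_rate) is an assumption.

```python
import math
import statistics


def score(forecasts, truths):
    """forecasts: list of (eac, finish) or None; truths: list of (eac, finish)."""
    cost_errs, finish_errs = [], []
    for fc, (true_eac, true_finish) in zip(forecasts, truths):
        if fc is None or not all(math.isfinite(v) for v in fc):
            continue  # unusable forecast: only counts against valid_rate
        eac, finish = fc
        cost_errs.append(abs(eac - true_eac) / true_eac * 100.0)
        finish_errs.append(abs(finish - true_finish))
    nan = float("nan")
    return {
        "eac_ape_med": statistics.median(cost_errs) if cost_errs else nan,
        "finish_err_med": statistics.median(finish_errs) if finish_errs else nan,
        "valid_rate": len(cost_errs) / len(truths),
    }


# Made-up example: one wild forecast and one missing forecast.
truths = [(100.0, 10.0), (200.0, 20.0), (300.0, 30.0), (400.0, 40.0)]
forecasts = [(110.0, 11.0), (190.0, 20.0), (900.0, 45.0), None]
result = score(forecasts, truths)
print(result)
```

In this made-up example the per-project cost errors are 10%, 5% and 200%. The median stays at 10%, whereas the mean would be about 72%, which is the "single wild project" effect the median is meant to suppress.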

3.2 The control group and the baselines

Null control (naive_dist): ignores the specific project and just guesses the average outcome of all projects. Any method that is genuinely using project-specific signal should beat this (pipeline/eval/ml_modal.py). It scores 3.32% / 0.71 periods at 40%.
Simple baselines (naive): "assume the plan holds" (naive_onplan, 4.44% / 1.0), "carry on at the last rate" (last_value), and a straight-line extrapolation (linear).
Industry standard (earned_schedule): Earned Schedule (Lipke, 2003) is the established project-controls method. It is a simple closed-form calculation over the PV/EV/AC series: the earned schedule (ES) is the time at which the plan's PV first equals today's EV; schedule efficiency is ES ÷ elapsed time and cost efficiency is EV ÷ AC; then finish = planned duration ÷ schedule efficiency, and cost = BAC ÷ cost efficiency (pipeline/eval/formulas.py). It needs no special data or assurance effort beyond the EVM numbers a project already has. At 40% it is the best classical method on cost (2.37%).
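For readers who have not met Earned Schedule before, here is a minimal sketch of the textbook calculation with linear interpolation between planned periods. The function name and the example series are hypothetical, and the code in pipeline/eval/formulas.py may handle edge cases differently; the sketch assumes cumulative, non-decreasing series with some value already earned and some cost already incurred.

```python
def earned_schedule_forecast(pv, ev, ac, bac, planned_duration):
    """Forecast (finish_period, eac) from cumulative PV/EV/AC series.

    pv: planned value for every planned period 1..planned_duration
    ev, ac: earned value and actual cost for the periods observed so far
    """
    at = len(ev)  # actual time: number of periods elapsed
    ev_now, ac_now = ev[-1], ac[-1]
    if ev_now <= 0 or ac_now <= 0:
        raise ValueError("need some earned value and actual cost to forecast")

    # c = number of whole planned periods whose PV has been fully earned.
    c = 0
    while c < len(pv) and pv[c] <= ev_now:
        c += 1

    if c >= len(pv):
        es = float(len(pv))  # all planned value has been earned
    else:
        pv_prev = pv[c - 1] if c > 0 else 0.0
        # Interpolate within the period in which PV crosses today's EV.
        es = c + (ev_now - pv_prev) / (pv[c] - pv_prev)

    spi_t = es / at          # schedule efficiency in time units
    cpi = ev_now / ac_now    # cost efficiency
    return planned_duration / spi_t, bac / cpi


# Made-up five-period plan, observed for three periods.
pv = [100.0, 200.0, 300.0, 400.0, 500.0]
ev = [80.0, 150.0, 250.0]
ac = [90.0, 180.0, 300.0]
finish, eac = earned_schedule_forecast(pv, ev, ac, bac=500.0, planned_duration=5)
print(finish, eac)  # roughly 6.0 periods and an EAC of roughly 600
```

Here the work earned after three periods was planned to take 2.5, so schedule efficiency is about 0.83 and the five-period plan stretches to about six periods. Cost efficiency is also about 0.83, which lifts the 500 budget to about 600.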

3.3 The families we compared, and what each is good (and bad) at

We tried six families of methods (37 in total, bench.json):

Classical EVM formulas (6): Earned Schedule, evm_cpi_spi, xsm (an exponential-smoothing variant; Batselier & Vanhoucke, 2017), growth_curve (an S-curve fit), exp_smoothing, logistic. Standout: Earned Schedule and xsm. Weakness: their finish forecast collapses on long projects.
Machine learning on engineered features (12): gradient-boosted trees (CatBoost, LightGBM, XGBoost, HistGBM), Random Forest, ridge/SVR/MLP, and TabPFN (a pre-trained tabular model). These are trained on the simulated projects and predict the real ones zero-shot (no real labels). Standout: TabPFN, LightGBM, HistGBM. Weakness: weaker than Earned Schedule on short projects.
Time-series foundation models (3): TimesFM (Das et al., 2024) and Chronos (Ansari et al., 2024) forecast the raw cost/progress curve directly. Standout: none decisive. Weakness: the worst long-horizon errors of any family; adding a planned-value side input made Chronos worse, not better (a clean negative result: chronos_2_cov 4.43% vs chronos_2 3.93%).
Naive baselines: as described in §3.2.
Null control: as described in §3.2.
Agent (the new approach, §1 and §4).

A note on "realcv". Some ML rows are labelled _realcv. These are not deployable results: they are trained on the real test projects' own answers via cross-validation, as an upper-bound "ceiling" (ml_modal.py). The honest, deployable figure is the zero-shot, simulation-trained model. We never present _realcv as a real-world result.

3.4 The gap: nobody is good across all horizons

This is the central observation. Below is the finish error (periods) at 40% complete, broken down by project length (all values from bench.json by_strata):

Method (family)	short	medium	long	very-long	pattern
Earned Schedule (classical)	0.35	1.00	2.67	5.37	great early, collapses late
TabPFN (ML)	0.98	0.46	0.72	0.89	robust late, weak early
CatBoost (ML)	0.95	0.29	0.61	1.31	robust late, weak early
TimesFM (foundation)	0.59	2.79	6.65	12.45	worst on long
Agent (DeepSeek V4)	0.35	0.50	0.84	1.30	strong across all

Read it as: Earned Schedule wins short but is ~15× worse by very-long; the ML models win medium/long/very-long but lose short; the foundation models only hold up on short. No non-agent method is good at short and medium and long. The agent is the only row that is - it matches Earned Schedule early and stays near the ML models late. (This is the "blend" chart in the accompanying interactive presentation.)

4. Teacher bake-off

The agent needs a strong "teacher" model to define the reasoning we later distil. We compared two reasoning models head-to-head through the identical agent harness (pipeline/agent/bakeoff_modal.py; results in data/curated/agent_bakeoff.json):

Candidate teacher	Cost error	Finish error	Valid rate
DeepSeek V4 flash	0.79%	1.23 periods	1.00
DeepSeek V4 pro	0.83%	1.19 periods	1.00

Two earlier candidates - GLM-5.1 and an NVIDIA Nemotron-Ultra - were dropped before the committed run on access/rate-limit grounds (AGENT.md), not on quality.

We chose DeepSeek V4 flash as the teacher. On this head-to-head the two DeepSeek models were practically indistinguishable (0.79% vs 0.83% cost, 1.23 vs 1.19 periods finish, on a sample too small to separate them), and flash is the cheaper of the two on list price, which matters because trace generation runs the teacher over hundreds of projects (AGENT.md).

Honest caveats. This bake-off is deliberately small - 4 projects (one per length band), so it is a sanity check for model selection, not a precise accuracy comparison. And our tooling did not capture a dollar cost for these brand-new model IDs (it logged 0.0), so "cheaper" is a published-price argument, not a measured spend. Run over the full 107-project benchmark, the chosen teacher scores 2.4% cost / 0.6 periods finish at 40% - the best all-round result in the benchmark.

How the agent works. It acts through a single tool, run_python(code=...): each turn it writes a short piece of Python that calls the forecasting tools (earned_schedule(), ml_predict(), timesfm(), etc.), inspects the numbers, and finally calls submit(finish, eac) (pipeline/agent/loop.py, forecast_tools.py). This "structured code action" design (building on CodeAct; Wang et al., 2024) is reliable even for small models and makes every step auditable. Its calibrated rule, stated in plain terms in the prompt: short and steady → trust Earned Schedule; long and slipping → trust the ML reference class for timing; keep cost on BAC ÷ CPI throughout, nudged by ML only when it strongly signals a larger overrun. This routing is what produces the "strong-across-all-horizons" behaviour in §3.4.
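To make the routing rule concrete, here is a deliberately simplified sketch of it as a plain function. The real agent reasons about each project in natural language rather than applying fixed cut-offs, so everything here is an assumption: the function name, the three thresholds, the fallback to Earned Schedule for cases the rule does not name, and the choice to "nudge" cost by averaging.

```python
def reconcile(es_finish, es_eac, ml_finish, ml_eac, planned_duration, spi_t,
              long_threshold=20, slip_threshold=0.9, overrun_margin=1.10):
    """Blend tool outputs into one (finish, eac, rule) answer.

    All three thresholds are hypothetical placeholders, not calibrated values.
    """
    long_and_slipping = (planned_duration >= long_threshold
                         and spi_t < slip_threshold)
    if long_and_slipping:
        finish = ml_finish
        rule = "long and slipping: ML reference class for timing"
    else:
        finish = es_finish
        rule = "short or steady: Earned Schedule for timing"

    eac = es_eac  # cost stays anchored on BAC / CPI
    if ml_eac > es_eac * overrun_margin:
        # Nudge towards ML only when it signals a clearly larger overrun.
        eac = (es_eac + ml_eac) / 2
        rule += "; cost nudged towards ML"
    return finish, eac, rule


# Made-up tool outputs for a short, steady project and a long, slipping one.
short_case = reconcile(12.0, 1000.0, 14.0, 1020.0, planned_duration=10, spi_t=0.98)
long_case = reconcile(45.0, 1000.0, 38.0, 1300.0, planned_duration=36, spi_t=0.8)
print(short_case)
print(long_case)
```

Returning the rule alongside the numbers mirrors the agent's habit of stating which rule it used, which is what makes each forecast auditable.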

5. Distillation: shrinking the agent for the edge

A large cloud teacher is not deployable where much project data lives (offline, air-gapped). So we teach small open models to copy the teacher's reasoning.

5.1 Generating and filtering the training traces

We run the teacher over the simulated projects and record its full reasoning trace for each: the system instructions, the project, and every [reasoning → run_python → result] turn until it submits (bakeoff_modal.py, AGENT.md). We then keep only high-quality traces (pipeline/agent/select_traces.py): the forecast must be accurate (within 6% on cost and 2 periods on finish), error-free, concise (≤ 8 turns), and have a sensible amount of reasoning. 367 traces survive into data/curated/sft_phasec.jsonl.
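The filter can be sketched as a single predicate. The 6%, 2-period and 8-turn limits come from the text; the record fields, the inclusive comparisons and the character bounds standing in for "a sensible amount of reasoning" are assumptions, and select_traces.py may define them differently.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical summary of one teacher trace."""
    cost_ape_pct: float   # % gap between forecast and true final cost
    finish_err: float     # gap in periods between forecast and true finish
    n_turns: int          # number of reasoning/tool turns before submit
    had_error: bool       # True if any tool call raised an error
    reasoning_chars: int  # total length of the reasoning text


def keep_trace(t, min_chars=200, max_chars=8000):
    """min_chars and max_chars are placeholders, not the pipeline's values."""
    return (
        t.cost_ape_pct <= 6.0
        and t.finish_err <= 2.0
        and not t.had_error
        and t.n_turns <= 8
        and min_chars <= t.reasoning_chars <= max_chars
    )


# Made-up traces: one good, one inaccurate, one too long-winded.
traces = [
    Trace(2.1, 1.0, 5, False, 1500),
    Trace(9.4, 0.5, 4, False, 1200),
    Trace(1.0, 0.0, 11, False, 3000),
]
kept = [t for t in traces if keep_trace(t)]
print(len(kept))
```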

5.2 The five student models and how they are trained

Student	Size	Notes
MiniCPM5-1B	~1B	the flagship small model
Qwen3.5-2B	~2B	linear-attention hybrid
Qwen3.5-4B	~4B	linear-attention hybrid
Gemma-E2B	~2B effective	Google Gemma
Nemotron-3-Nano-4B	~4B	NVIDIA Mamba-hybrid

Each is fine-tuned with LoRA (Hu et al., 2022) - a lightweight method that trains a small set of extra weights rather than the whole model - using plain transformers + TRL + PEFT, with the loss applied only to the model's own reasoning and tool calls (pipeline/agent/distill_modal.py). Crucially, every student is scored through the same agent harness as the teacher and the baselines, so the comparison is apples-to-apples.
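Applying the loss "only to the model's own reasoning and tool calls" usually comes down to label masking. The sketch below shows the idea in plain Python, with no training library: tokens from the system prompt, the project description and the tool results get the label -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The role names and token ids are made up, and distill_modal.py may build its masks differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """segments: list of (role, token_ids) in conversation order.

    Returns (input_ids, labels). Only assistant tokens keep their ids as
    labels; system, user and tool-result tokens are masked out of the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Made-up token ids for one short trace.
segments = [
    ("system", [1, 2, 3]),
    ("user", [4, 5]),
    ("assistant", [6, 7, 8]),  # reasoning plus a run_python call
    ("tool", [9, 10]),         # tool result: seen as context, not learned
    ("assistant", [11, 12]),   # final submit call
]
input_ids, labels = build_labels(segments)
print(labels)
```

The student still sees the tool results as context, but is never trained to generate them, only the reasoning and calls that come before and after.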

5.3 Before vs after (at 40% complete)

Student	Valid rate (base → sft)	Cost error (base → sft)
MiniCPM5-1B	0.019 → 0.991	50.5% → 2.69%
Gemma-E2B	0.664 → 0.991	3.21% → 2.31%
Nemotron-3-Nano-4B	0.972 → 1.000	2.69% → 2.37%
Qwen3.5-4B	1.000 → 1.000	2.78% → 2.91%
Qwen3.5-2B	0.159 → 0.131	(unreliable)

(All values: bench.json, stage 0.40. Reference: teacher 2.4%, Earned Schedule 2.37%.)

The story is clear:

Off the shelf, small models cannot do the task. MiniCPM5-1B produces a usable forecast under 2% of the time (valid 0.019).
After distillation, the best reach parity with the large teacher (2.4%) and the best classical baseline (2.37%): Gemma-E2B sft at 2.31% (valid 0.991) and Nemotron-4B sft at 2.37% (valid 1.000). MiniCPM5-1B sft goes from useless to 0.991 valid at 2.69% - the most dramatic gain, in the smallest model.
One model did not take: Qwen3.5-2B stayed unreliable (valid ~0.13). Our reading is that, at 2B with only 367 examples, it could not reliably learn the tool-calling format; this is our interpretation, not a measured cause.

Takeaway: a 1-4B model, distilled from a strong teacher, can match that teacher and the industry-standard baseline on this task - small enough to forecast at the edge.

6. Limitations and honest caveats

Cost is largely "solved" by Earned Schedule. The agent matches it on cost rather than beating it; the agent's genuine win is on the schedule dimension, plus auditability and edge-deployability.
The teacher bake-off was tiny (4 projects) and the cost comparison was list-price, not measured (§4).
_realcv ML numbers are an in-domain ceiling, not deployable (§3.3).
107 real projects, one simulator, one teacher. A modest test set; results are bounded by the simulator's fidelity (carefully calibrated, but still synthetic training data) and by a single teacher policy.
Small models still have a floor (Qwen3.5-2B, §5.3).
A couple of documentation figures predate the final results refresh; this report uses the values in bench.json as the source of truth.

7. Future work

More real outcome data; reinforcement learning on the agent (not just imitation of the teacher); better small-model training for the models that struggled; and a wider panel of teachers. On the science side, the schedule-blending result suggests an explicit, learnable router between the "inside view" (Earned Schedule, the project's own past) and the "outside view" (the reference-class ML model), in the spirit of reference-class forecasting (Flyvbjerg, 2006; Kahneman & Tversky, 1979).

8. Conclusion

On real projects, the long-standing Earned Schedule formula and modern ML reference-class models have complementary strengths - early-horizon vs late-horizon - and neither dominates. A small agentic layer that reconciles them, choosing the right tool per project and explaining its choice, is the only method strong across the whole project lifecycle. That agent distils into 1-4B open models that match it, making accurate, auditable, on-device project forecasting practical even in air-gapped settings.

Appendix A: open releases

Everything is released on the Build Small Hackathon organisation on Hugging Face. The distillation dataset is CC-BY-4.0; each fine-tune inherits its base model's licence.

Write-up / article: https://huggingface.co/blog/build-small-hackathon/slipstream
Interactive Space: https://huggingface.co/spaces/build-small-hackathon/slipstream
Distillation dataset (the reasoning trajectories): https://huggingface.co/datasets/build-small-hackathon/slipstream-evm-sft
MiniCPM5-1B agent: https://huggingface.co/build-small-hackathon/slipstream-minicpm5-1b-evm
Nemotron-3-Nano 4B agent: https://huggingface.co/build-small-hackathon/slipstream-nemotron3-nano-4b-evm
Gemma-E2B agent: https://huggingface.co/build-small-hackathon/slipstream-gemma4-e2b-evm
Social post (hackathon requirement): https://x.com/NZXW63TF/status/2066647669540360315

The Space's final slide runs the agentic layer live (deployed on Modal); every run cold-starts a GPU, so a forecast takes roughly 5-7 minutes and the methods stream in as they finish.

References

Lipke, W. (2003). Schedule is Different. The Measurable News. (Earned Schedule.)
Batselier, J., & Vanhoucke, M. (2017). Improving project forecast accuracy by integrating earned value management with exponential smoothing and reference class forecasting. Int. J. Project Management. (The xsm method.)
Flyvbjerg, B. (2006). From Nobel Prize to Project Management: Getting Risks Right. Project Management Journal. (Reference-class forecasting.)
Kahneman, D., & Tversky, A. (1979). Intuitive prediction: biases and corrective procedures. (Inside vs outside view.)
Vanhoucke, M., et al. OR&S project-scheduling datasets, Ghent University. https://www.projectmanagement.ugent.be/research/data. Includes PSPLIB (Kolisch & Sprecher, 1997) and MMLIB (Van Peteghem & Vanhoucke, 2014).
Wang, X., et al. (2024). Executable Code Actions Elicit Better LLM Agents (CodeAct). ICML.
Hu, E., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
Hollmann, N., et al. TabPFN: a transformer that solves small tabular problems. (Tabular foundation model.)
Das, A., et al. (2024). A decoder-only foundation model for time-series forecasting (TimesFM). Google.
Ansari, A. F., et al. (2024). Chronos: Learning the Language of Time Series. Amazon.
Model providers: DeepSeek (V4), OpenBMB (MiniCPM5), Alibaba/Qwen (Qwen3.5), Google DeepMind (Gemma), NVIDIA (Nemotron-3-Nano).

Internal sources cited inline: docs/DATA.md, docs/EVAL.md, docs/AGENT.md, data/curated/bench.json, data/curated/agent_bakeoff.json, and pipeline/ modules.
