Slipstream: a benchmark for mid-flight project forecasting, and a small agent that blends the best methods
1. Summary
Big projects rarely finish exactly on budget or on time. The useful question, half-way through, is: given how a project has gone so far, what will it finally cost, and when will it actually finish?
Two numbers capture that:
- EAC (Estimate at Completion) - the project's total final cost.
- Finish period - the time-period the project completes in.
We built a benchmark that asks 37 forecasting methods exactly this question, on 107 real projects the methods never saw during training, at four points in each project's life (25%, 40%, 60% and 75% complete). We score every method the same way, so they are directly comparable.
The headline finding: no single method is good across a project's whole life.
- Earned Schedule - the long-standing industry-standard formula - is excellent on short projects but its finish forecast degrades badly on long ones (median error grows from 0.35 periods to 5.4 periods as projects get longer).
- Machine-learning "reference-class" models are the mirror image: steady and robust on long projects, but weaker than Earned Schedule early on.
- Time-series foundation models are the weakest of all on long projects.
Each family wins only one or two of the short / medium / long horizons, never all three. That gap is exactly what an extra layer of intelligence can close. We add a small agent that, for each project, writes and runs code to call every forecasting tool, reasons about which to trust for this situation (lean on the ML reference-class for long-horizon timing; keep cost anchored on Earned Schedule), reconciles them into one answer, and states the rule it used. That agent is the only method that is strong across all horizons - it keeps Earned Schedule's early accuracy (0.35 periods on short) and inherits the ML robustness on long (1.30 periods on very-long, versus Earned Schedule's 5.37).
Finally, we distil that agent's reasoning into small open models (1-4 billion parameters). The best of them match the large teacher and the best classical baseline on accuracy, while being small enough to run on-device and air-gapped - which matters because much project data (defence, critical infrastructure, government) cannot leave its environment.
Everything - simulating missing data, generating the teacher's reasoning traces, fine-tuning and evaluation - ran on Modal.
Key results at a glance (at 40% complete)
| Method | What it is | Cost error (median) | Finish error (median) |
|---|---|---|---|
| Earned Schedule | industry-standard formula | 2.37% | 1.0 period |
| Best ML (TabPFN, zero-shot) | learned reference class | 3.0% | 0.93 periods |
| Best time-series FM (Chronos) | general forecasting AI | 3.93% | 2.13 periods |
| Agent (DeepSeek V4 teacher) | reconciles the above | 2.4% | 0.6 periods |
| Distilled 4B student (Nemotron) | the agent, shrunk | 2.37% | 0.61 periods |
(All values: data/curated/bench.json, stage 0.40.)
2. Data
2.1 Where the data comes from
The raw projects come from the Operations Research & Scheduling (OR&S) group at Ghent University
(Mario Vanhoucke), a widely used public collection of project-scheduling datasets
(https://www.projectmanagement.ugent.be/research/data; DATA.md §2). Seven libraries are used:
DSLIB, RCPLIB, MPLIB, ASLIB, MMLIB, MSLIB and SSLIB.
There is one crucial distinction (DATA.md §2; pipeline/parsers/):
- Only DSLIB contains real outcomes - actual recorded cost and progress over time (the planned
value PV, earned value EV and actual cost AC curves) for 117 of its 231 projects
(parsers/dslib.py). These are real projects that actually ran.
- The other six libraries are network structures only - they describe the tasks, durations and dependencies of a project, but were never executed, so they have no real cost/progress history.
Jargon, in plain terms. EVM (Earned Value Management) tracks three running totals: PV = what you planned to have spent by now, EV = the budgeted value of work actually done, and AC = what you actually spent. From these you derive everything else (e.g. are we over budget? behind schedule?). BAC = Budget at Completion, the total planned cost.
2.2 Filling the gaps with simulation
To train forecasting models you need many examples with known outcomes, but only 117 real ones
exist. So we simulate the missing outcomes: take the real network structures (which we have in
abundance), schedule them, then run a Monte-Carlo "execution" to produce realistic PV/EV/AC curves
(pipeline/simulate.py; DATA.md §4-5).
Each simulated project is assigned a behavioural regime - controlled (33%), typical (49%) or
troubled (18%) - so the corpus contains calm projects and messy ones in realistic proportions
(simulate.py). The simulator was tuned not for visual realism but for usefulness: we ran an
A/B/C comparison and kept the configuration ("config C": behavioural regimes, a particular cost
mechanism switched off) that produced models which transferred best to the real test projects
(DATA.md §4). The simulated distribution of overruns was checked against the real DSLIB projects and
matches closely (e.g. real final cost-efficiency 10th/50th/90th percentiles 0.82 / 0.98 / 1.20 vs
simulated 0.82 / 1.00 / 1.15; DATA.md §"calibration").
This produced 3,632 simulated project trajectories from 1,734 source networks
(data/curated/train_sim.jsonl; DATA.md §"counts").
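As an illustration of the regime mix only, the assignment step might look like the sketch below. The helper name `assign_regimes`, the seeding and the use of `random.choices` are assumptions made for this example; the real logic lives in pipeline/simulate.py and is not reproduced here.

```python
import random
from collections import Counter

# Regime shares quoted in the text: controlled 33%, typical 49%, troubled 18%.
REGIME_SHARES = {"controlled": 0.33, "typical": 0.49, "troubled": 0.18}


def assign_regimes(n_projects: int, seed: int = 0) -> list[str]:
    """Draw one behavioural regime per simulated project (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the same corpus can be regenerated
    names = list(REGIME_SHARES)
    weights = list(REGIME_SHARES.values())
    return rng.choices(names, weights=weights, k=n_projects)


# 3,632 is the number of simulated trajectories reported above.
regimes = assign_regimes(3632)
shares = {name: count / len(regimes) for name, count in Counter(regimes).items()}
```

With a corpus of this size the sampled shares land close to the target proportions, so no separate rebalancing step is needed in this sketch.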
2.3 How the data is split (and why there is no cheating)
| File | What it is | Used for |
|---|---|---|
| train_sim.jsonl (3,632) | simulated projects | train the ML + foundation models; seed the teacher's reasoning traces |
| test_real.jsonl (117) | real DSLIB projects | the held-out benchmark - never trained on |
| sft_phasec.jsonl (367) | teacher reasoning traces over simulated projects | fine-tune the small models |
The benchmark is scored on the 107 real projects that have at least four reporting periods (enough
to forecast from a mid-point); the other 10 are too short to forecast (pipeline/eval/harness.py,
bench.json: n_test=117, n_eval=107). They are stratified by length: short 43, medium 27, long 25,
very-long 12.
The integrity rule is structural: a real DSLIB project that has an outcome is a test item and is
never used as a training source; everything trained on is simulated. The produced manifest records
zero overlap between training networks and test projects (DATA.md §"contamination"). All 367
distillation traces are simulation-only (every record id begins sim::), so the small models are never
shown a real test project even indirectly (slipstream-phasec-realdata memory; build_sft.py).
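The two integrity conditions above (zero train/test overlap, and simulation-only distillation traces) are simple enough to check mechanically. The following is a minimal sketch, not the project's own check: the function name and the example ids are made up, and only the sim:: prefix comes from the text.

```python
def check_no_leakage(train_ids, test_ids, sft_ids):
    """Return a list of problems; an empty list means the split looks clean.

    Hypothetical helper: the id lists are assumed to have been read from the
    three JSONL files already.
    """
    problems = []

    # Condition 1: no id may appear in both the training and the test set.
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        problems.append(f"{len(overlap)} ids appear in both train and test")

    # Condition 2: every distillation trace must come from a simulated project.
    non_sim = [i for i in sft_ids if not i.startswith("sim::")]
    if non_sim:
        problems.append(f"{len(non_sim)} distillation traces are not simulation-only")

    return problems


# Made-up ids, for illustration only.
clean = check_no_leakage(["sim::net-001", "sim::net-002"], ["real-001"], ["sim::net-001"])
leaky = check_no_leakage(["sim::net-001", "real-001"], ["real-001"], ["real-001"])
```

Returning a list of problems rather than raising on the first one lets a build script report every violation in a single pass.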
3. Evaluation setup
3.1 The question, the snapshots, the metrics
For each test project we reveal only the first part of its history (25%, 40%, 60% or 75% of the way
through) and ask the method for its EAC and finish period. 40% is the primary reporting point - far
enough in to have signal, early enough to be useful (harness.py; EVAL.md). We measure three things
(pipeline/eval/harness.py):
- Cost error (eac_ape_med): the median, across projects, of the percentage gap between the forecast final cost and the true final cost. Lower is better.
- Finish error (finish_err_med): the median gap, in periods, between the forecast finish and the true finish. Lower is better.
- Valid rate (valid_rate): the fraction of projects for which the method returned a usable forecast at all.
We use the median (the "typical" project) rather than the average so a single wild project cannot dominate the score.
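To make the three metrics concrete, here is a minimal sketch of how they could be computed. It is not the code in pipeline/eval/harness.py. It assumes that gaps are absolute values, that a failed forecast is represented as None, and that medians are taken over valid forecasts only; the function name and the toy numbers are invented for the example.

```python
from statistics import median


def score(forecasts, truths):
    """Compute the three benchmark metrics for one method (illustrative only).

    forecasts: project id -> (eac, finish), or None if the method failed.
    truths:    project id -> (true_eac, true_finish).
    """
    cost_errors, finish_errors = [], []
    for pid, (true_eac, true_finish) in truths.items():
        forecast = forecasts.get(pid)
        if forecast is None:
            continue  # counts against valid_rate, excluded from the medians
        eac, finish = forecast
        cost_errors.append(abs(eac - true_eac) / true_eac * 100)  # percent
        finish_errors.append(abs(finish - true_finish))           # periods
    return {
        "eac_ape_med": median(cost_errors) if cost_errors else None,
        "finish_err_med": median(finish_errors) if finish_errors else None,
        "valid_rate": len(cost_errors) / len(truths),
    }


# Toy data: three projects, one failed forecast.
truths = {"a": (100.0, 10), "b": (200.0, 20), "c": (50.0, 8)}
forecasts = {"a": (110.0, 11), "b": (190.0, 20), "c": None}
result = score(forecasts, truths)
```

In the toy data the cost errors are 10% and 5%, so a single extreme project would shift a mean far more than it shifts the median.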
3.2 The control group and the baselines
- Null control (naive_dist): ignores the specific project and just guesses the average outcome of all projects. Any method that is genuinely using project-specific signal should beat this (pipeline/eval/ml_modal.py). It scores 3.32% / 0.71 periods at 40%.
- Simple baselines (naive): "assume the plan holds" (naive_onplan, 4.44% / 1.0), "carry on at the last rate" (last_value), and a straight-line extrapolation (linear).
- Industry standard (earned_schedule): Earned Schedule (Lipke, 2003) is the established project-controls method. It is a simple closed-form formula over the PV/EV/AC series - the time at which the plan's PV first equals today's EV, then finish = planned-duration ÷ schedule-efficiency, and cost = BAC ÷ cost-efficiency (pipeline/eval/formulas.py). It needs no special data or assurance effort beyond the EVM numbers a project already has. At 40% it is the best classical method on cost (2.37%).
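The Earned Schedule calculation described in the last bullet can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of pipeline/eval/formulas.py: it assumes cumulative per-period series, linear interpolation inside a period, schedule-efficiency defined as earned schedule ÷ elapsed periods, and cost-efficiency as EV ÷ AC. The function names and numbers are invented.

```python
def earned_schedule(pv, ev_now):
    """Time (in periods) at which the cumulative plan first reaches ev_now.

    pv[i] is the cumulative planned value at the end of period i + 1.
    Linear interpolation within a period is an assumption of this sketch.
    """
    prev = 0.0
    for i, planned in enumerate(pv):
        if planned >= ev_now:
            span = planned - prev
            frac = (ev_now - prev) / span if span > 0 else 0.0
            return i + frac
        prev = planned
    return float(len(pv))  # earned value already exceeds the whole plan


def es_forecast(pv, ev, ac, bac):
    """Finish and EAC from the PV/EV/AC series observed so far."""
    elapsed = len(ev)                  # periods observed so far
    es = earned_schedule(pv, ev[-1])
    if es <= 0 or ev[-1] <= 0 or ac[-1] <= 0:
        raise ValueError("no measurable progress yet")
    spi_t = es / elapsed               # schedule-efficiency
    cpi = ev[-1] / ac[-1]              # cost-efficiency
    return {"finish": len(pv) / spi_t, "eac": bac / cpi}


# Toy project: 5 planned periods, budget 100, observed for 3 periods.
pv = [10, 30, 60, 90, 100]
out = es_forecast(pv, ev=[8, 24, 45], ac=[10, 28, 50], bac=100)
```

In the toy project, three periods of work have earned what the plan expected after 2.5 periods, so the five-period plan stretches to about six periods, and a cost-efficiency of 0.9 lifts the forecast cost from 100 to about 111.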
3.3 The families we compared, and what each is good (and bad) at
We tried six families of methods (37 in total, bench.json):
- Classical EVM formulas (6): Earned Schedule, evm_cpi_spi, xsm (an exponential-smoothing variant; Batselier & Vanhoucke, 2017), growth_curve (an S-curve fit), exp_smoothing, logistic. Standout: Earned Schedule and xsm. Weakness: their finish forecast collapses on long projects.
- Machine learning on engineered features (12): gradient-boosted trees (CatBoost, LightGBM, XGBoost, HistGBM), Random Forest, ridge/SVR/MLP, and TabPFN (a pre-trained tabular model). These are trained on the simulated projects and predict the real ones zero-shot (no real labels). Standout: TabPFN, LightGBM, HistGBM. Weakness: weaker than Earned Schedule on short projects.
- Time-series foundation models (3): TimesFM (Das et al., 2024) and Chronos (Ansari et al., 2024) forecast the raw cost/progress curve directly. Standout: none decisive. Weakness: the worst long-horizon errors of any family; adding a planned-value side input made Chronos worse, not better (a clean negative result: chronos_2_cov 4.43% vs chronos_2 3.93%).
- Naive: as above (§3.2).
- Control: as above (§3.2).
- Agent (the new approach, §1 and §4).
A note on "realcv". Some ML rows are labelled _realcv. These are not deployable results: they are trained on the real test projects' own answers via cross-validation, as an upper-bound "ceiling" (ml_modal.py). The honest, deployable figure is the zero-shot, simulation-trained model. We never present _realcv as a real-world result.
3.4 The gap: nobody is good across all horizons
This is the central observation. Below is the finish error (periods) at 40% complete, broken down
by project length (all values from bench.json by_strata):
| Method (family) | short | medium | long | very-long | pattern |
|---|---|---|---|---|---|
| Earned Schedule (classical) | 0.35 | 1.00 | 2.67 | 5.37 | great early, collapses late |
| TabPFN (ML) | 0.98 | 0.46 | 0.72 | 0.89 | robust late, weak early |
| CatBoost (ML) | 0.95 | 0.29 | 0.61 | 1.31 | robust late, weak early |
| TimesFM (foundation) | 0.59 | 2.79 | 6.65 | 12.45 | worst on long |
| Agent (DeepSeek V4) | 0.35 | 0.50 | 0.84 | 1.30 | strong across all |
Read it as: Earned Schedule wins short but is ~15× worse by very-long; the ML models win medium/long/very-long but lose short; the foundation models only hold up on short. No non-agent method is good at short and medium and long. The agent is the only row that is - it matches Earned Schedule early and stays near the ML models late. (This is the "blend" chart in the accompanying interactive presentation.)
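The arithmetic behind that reading can be reproduced directly from the table. The sketch below copies the table's values and computes two things: how much each method degrades from short to very-long, and which non-agent method has the lowest error in each band. The variable names are illustrative.

```python
# Finish error (periods) at 40% complete, copied from the table above.
FINISH_ERR = {
    "Earned Schedule": {"short": 0.35, "medium": 1.00, "long": 2.67, "very-long": 5.37},
    "TabPFN":          {"short": 0.98, "medium": 0.46, "long": 0.72, "very-long": 0.89},
    "CatBoost":        {"short": 0.95, "medium": 0.29, "long": 0.61, "very-long": 1.31},
    "TimesFM":         {"short": 0.59, "medium": 2.79, "long": 6.65, "very-long": 12.45},
    "Agent":           {"short": 0.35, "medium": 0.50, "long": 0.84, "very-long": 1.30},
}
BANDS = ["short", "medium", "long", "very-long"]

# How many times worse each method is on very-long than on short projects.
degradation = {
    method: round(row["very-long"] / row["short"], 1)
    for method, row in FINISH_ERR.items()
}

# Lowest-error non-agent method in each length band.
baselines = {m: r for m, r in FINISH_ERR.items() if m != "Agent"}
best_baseline = {
    band: min(baselines, key=lambda m: baselines[m][band]) for band in BANDS
}
```

Here `degradation["Earned Schedule"]` comes out at 15.3, which is the "~15×" quoted above, and `best_baseline` names a different method for short projects than for the longer bands.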
4. Teacher bake-off
The agent needs a strong "teacher" model to define the reasoning we later distil. We compared two
reasoning models head-to-head through the identical agent harness (pipeline/agent/bakeoff_modal.py;
results in data/curated/agent_bakeoff.json):
| Candidate teacher | Cost error | Finish error | Valid rate |
|---|---|---|---|
| DeepSeek V4 flash | 0.79% | 1.23 periods | 1.00 |
| DeepSeek V4 pro | 0.83% | 1.19 periods | 1.00 |
Two earlier candidates - GLM-5.1 and an NVIDIA Nemotron-Ultra - were dropped before the committed run on access/rate-limit grounds (AGENT.md), not on quality.
We chose DeepSeek V4 flash as the teacher. On this head-to-head the two DeepSeek models were indistinguishable in practice, and flash is the cheaper of the two on list price, which matters because trace generation runs the teacher over hundreds of projects (AGENT.md).
Honest caveats. This bake-off is deliberately small - 4 projects (one per length band), so it is a sanity check for model selection, not a precise accuracy comparison. And our tooling did not capture a dollar cost for these brand-new model IDs (it logged 0.0), so "cheaper" is a published-price argument, not a measured spend. Run over the full 107-project benchmark, the chosen teacher scores 2.4% cost / 0.6 periods finish at 40% - the best all-round result in the benchmark.
How the agent works. It acts through a single tool, run_python(code=...): each turn it writes a
short piece of Python that calls the forecasting tools (earned_schedule(), ml_predict(),
timesfm(), etc.), inspects the numbers, and finally calls submit(finish, eac)
(pipeline/agent/loop.py, forecast_tools.py). This "structured code action" design (building on
CodeAct; Wang et al., 2024) is reliable even for small models and makes every step auditable. Its
calibrated rule, stated in plain terms in the prompt: short and steady → trust Earned Schedule;
long and slipping → trust the ML reference class for timing; keep cost on BAC ÷ CPI throughout, nudged
by ML only when it strongly signals a larger overrun. This routing is what produces the
"strong-across-all-horizons" behaviour in §3.4.
5. Distillation: shrinking the agent for the edge
A large cloud teacher is not deployable where much project data lives (offline, air-gapped). So we teach small open models to copy the teacher's reasoning.
5.1 Generating and filtering the training traces
We run the teacher over the simulated projects and record its full reasoning trace for each: the
system instructions, the project, and every [reasoning → run_python → result] turn until it submits
(bakeoff_modal.py, AGENT.md). We then keep only high-quality traces
(pipeline/agent/select_traces.py): the forecast must be accurate (within 6% on cost and 2 periods on
finish), error-free, concise (≤ 8 turns), and have a sensible amount of reasoning. 367 traces
survive into data/curated/sft_phasec.jsonl.
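The selection criteria translate into a simple predicate. The sketch below is not pipeline/agent/select_traces.py. The cost, finish and turn limits come from the text, while the field names and the reasoning-length bounds (standing in for "a sensible amount of reasoning") are assumptions.

```python
MAX_COST_APE = 6.0    # percent, from the text
MAX_FINISH_ERR = 2.0  # periods, from the text
MAX_TURNS = 8         # from the text
# Hypothetical bounds for "a sensible amount of reasoning".
MIN_REASONING_CHARS = 200
MAX_REASONING_CHARS = 6000


def keep_trace(trace: dict) -> bool:
    """Return True if a teacher trace is good enough to train on."""
    return (
        trace["cost_ape"] <= MAX_COST_APE
        and trace["finish_err"] <= MAX_FINISH_ERR
        and not trace["had_error"]
        and trace["n_turns"] <= MAX_TURNS
        and MIN_REASONING_CHARS <= trace["reasoning_chars"] <= MAX_REASONING_CHARS
    )


# Made-up traces, for illustration only.
traces = [
    {"id": "sim::a", "cost_ape": 2.1, "finish_err": 0.5, "had_error": False, "n_turns": 4, "reasoning_chars": 900},
    {"id": "sim::b", "cost_ape": 9.0, "finish_err": 0.5, "had_error": False, "n_turns": 4, "reasoning_chars": 900},
    {"id": "sim::c", "cost_ape": 2.1, "finish_err": 0.5, "had_error": False, "n_turns": 11, "reasoning_chars": 900},
]
kept = [t["id"] for t in traces if keep_trace(t)]
```

Filtering on outcome accuracy means the students imitate only episodes where the teacher's reasoning led to a good forecast.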
5.2 The five student models and how they are trained
| Student | Size | Notes |
|---|---|---|
| MiniCPM5-1B | ~1B | the flagship small model |
| Qwen3.5-2B | ~2B | linear-attention hybrid |
| Qwen3.5-4B | ~4B | linear-attention hybrid |
| Gemma-E2B | ~2B effective | Google Gemma |
| Nemotron-3-Nano-4B | ~4B | NVIDIA Mamba-hybrid |
Each is fine-tuned with LoRA (Hu et al., 2022) - a lightweight method that trains a small set of
extra weights rather than the whole model - using plain transformers + TRL + PEFT, with the loss
applied only to the model's own reasoning and tool calls (pipeline/agent/distill_modal.py).
Crucially, every student is scored through the same agent harness as the teacher and the baselines,
so the comparison is apples-to-apples.
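Applying the loss "only to the model's own reasoning and tool calls" is usually done by masking labels. The sketch below shows the idea in plain Python rather than reproducing distill_modal.py. It relies on the convention that label -100 is ignored by the cross-entropy loss in PyTorch and Hugging Face training code; the role names and token ids are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_labels(segments):
    """Flatten a trace into (input_ids, labels) with non-assistant tokens masked.

    segments: list of (role, token_ids) pairs in conversation order.
    Only tokens the model itself produced (role == "assistant") keep their
    label; system, user and tool-result tokens are masked out.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Made-up token ids: instructions, project, reasoning + tool call, result, submit.
segments = [
    ("system", [1, 2]),
    ("user", [3]),
    ("assistant", [4, 5]),
    ("tool", [6]),
    ("assistant", [7]),
]
input_ids, labels = build_labels(segments)
```

The model still sees the tool results as context, but it is never trained to predict them, so the gradient comes only from the reasoning and tool calls it has to produce at inference time.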
5.3 Before vs after (at 40% complete)
| Student | Valid rate (base → sft) | Cost error (base → sft) |
|---|---|---|
| MiniCPM5-1B | 0.019 → 0.991 | 50.5% → 2.69% |
| Gemma-E2B | 0.664 → 0.991 | 3.21% → 2.31% |
| Nemotron-3-Nano-4B | 0.972 → 1.000 | 2.69% → 2.37% |
| Qwen3.5-4B | 1.000 → 1.000 | 2.78% → 2.91% |
| Qwen3.5-2B | 0.159 → 0.131 | (unreliable) |
(All values: bench.json, stage 0.40. Reference: teacher 2.4%, Earned Schedule 2.37%.)
The story is clear:
- Off the shelf, small models cannot do the task. MiniCPM5-1B produces a usable forecast under 2% of the time (valid 0.019).
- After distillation, the best reach parity with the large teacher (2.4%) and the best classical baseline (2.37%): Gemma-E2B sft at 2.31% (valid 0.991) and Nemotron-4B sft at 2.37% (valid 1.000). MiniCPM5-1B sft goes from useless to 0.991 valid at 2.69% - the most dramatic gain, in the smallest model.
- One model did not take: Qwen3.5-2B stayed unreliable (valid ~0.13). Our reading is that, at 2B with only 367 examples, it could not reliably learn the tool-calling format; this is our interpretation, not a measured cause.
Takeaway: a 1-4B model, distilled from a strong teacher, can match that teacher and the industry-standard baseline on this task - small enough to forecast at the edge.
6. Limitations and honest caveats
- Cost is largely "solved" by Earned Schedule. The agent matches it on cost rather than beating it; the agent's genuine win is on the schedule dimension, plus auditability and edge-deployability.
- The teacher bake-off was tiny (4 projects) and the cost comparison was list-price, not measured (§4).
- _realcv ML numbers are an in-domain ceiling, not deployable (§3.3).
- 107 real projects, one simulator, one teacher. A modest test set; results are bounded by the simulator's fidelity (carefully calibrated, but still synthetic training data) and by a single teacher policy.
- Small models still have a floor (Qwen3.5-2B, §5.3).
- A couple of documentation figures predate the final results refresh; this report uses the values in bench.json as the source of truth.
7. Future work
More real outcome data; reinforcement learning on the agent (not just imitation of the teacher); better small-model training for the models that struggled; and a wider panel of teachers. On the science side, the schedule-blending result suggests an explicit, learnable router between the "inside view" (Earned Schedule, the project's own past) and the "outside view" (the reference-class ML model) - a distinction taken from reference-class forecasting (Flyvbjerg, 2006; Kahneman & Tversky, 1979).
8. Conclusion
On real projects, the long-standing Earned Schedule formula and modern ML reference-class models have complementary strengths - early-horizon vs late-horizon - and neither dominates. A small agentic layer that reconciles them, choosing the right tool per project and explaining its choice, is the only method strong across the whole project lifecycle. That agent distils into 1-4B open models that match it, making accurate, auditable, on-device project forecasting practical even in air-gapped settings.
Appendix A: open releases
Everything is released on the Build Small Hackathon organisation on Hugging Face. The distillation dataset is CC-BY-4.0; each fine-tune inherits its base model's licence.
- Write-up / article: https://huggingface.co/blog/build-small-hackathon/slipstream
- Interactive Space: https://huggingface.co/spaces/build-small-hackathon/slipstream
- Distillation dataset (the reasoning trajectories): https://huggingface.co/datasets/build-small-hackathon/slipstream-evm-sft
- MiniCPM5-1B agent: https://huggingface.co/build-small-hackathon/slipstream-minicpm5-1b-evm
- Nemotron-3-Nano 4B agent: https://huggingface.co/build-small-hackathon/slipstream-nemotron3-nano-4b-evm
- Gemma-E2B agent: https://huggingface.co/build-small-hackathon/slipstream-gemma4-e2b-evm
- Social post (hackathon requirement): https://x.com/NZXW63TF/status/2066647669540360315
The Space's final slide runs the agentic layer live (deployed on Modal); every run cold-starts a GPU, so a forecast takes roughly 5-7 minutes and the methods stream in as they finish.
References
- Lipke, W. (2003). Schedule is Different. The Measurable News. (Earned Schedule.)
- Batselier, J., & Vanhoucke, M. (2017). Improving project forecast accuracy by integrating earned value management with exponential smoothing and reference class forecasting. Int. J. Project Management. (The xsm method.)
- Flyvbjerg, B. (2006). From Nobel Prize to Project Management: Getting Risks Right. Project Management Journal. (Reference-class forecasting.)
- Kahneman, D., & Tversky, A. (1979). Intuitive prediction: biases and corrective procedures. (Inside vs outside view.)
- Vanhoucke, M., et al. OR&S project-scheduling datasets, Ghent University. https://www.projectmanagement.ugent.be/research/data. Includes PSPLIB (Kolisch & Sprecher, 1997) and MMLIB (Van Peteghem & Vanhoucke, 2014).
- Wang, X., et al. (2024). Executable Code Actions Elicit Better LLM Agents (CodeAct). ICML.
- Hu, E., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
- Hollmann, N., et al. TabPFN: a transformer that solves small tabular problems. (Tabular foundation model.)
- Das, A., et al. (2024). A decoder-only foundation model for time-series forecasting (TimesFM). Google.
- Ansari, A. F., et al. (2024). Chronos: Learning the Language of Time Series. Amazon.
- Model providers: DeepSeek (V4), OpenBMB (MiniCPM5), Alibaba/Qwen (Qwen3.5), Google DeepMind (Gemma), NVIDIA (Nemotron-3-Nano).
Internal sources cited inline: docs/DATA.md, docs/EVAL.md, docs/AGENT.md,
data/curated/bench.json, data/curated/agent_bakeoff.json, and pipeline/ modules.