Willow Alpha

An early-stage version of Forge-1V

Small language model research by North ML.

Overview

Willow Alpha is an early-stage base model checkpoint in the Forge-1V model line.

This model is experimental and should be treated as a research checkpoint, not a polished assistant model. It is useful for testing the architecture, pretraining quality, tokenizer behavior, and evaluation pipelines, and as a starting point for future SFT/RLHF work.

Model Details

Field	Value
Model name	Willow Alpha
Project	Forge-1V
Organization	North ML
Model type	Causal Language Model
Language	English
License	MIT
Status	Early-stage / Alpha

Evaluation Results

All benchmarks below were run in 0-shot mode.

Benchmark	Metric	Score	Runtime
HellaSwag	acc_norm	26.71%	318.67s
PIQA	acc_norm	53.86%	38.85s
WinoGrande	acc	50.67%	23.73s
BoolQ	acc	40.21%	144.80s
ARC-Easy	acc_norm	34.68%	51.41s
ARC-Challenge	acc_norm	25.60%	37.69s
OpenBookQA	acc_norm	25.00%	21.14s
CommonsenseQA	acc	20.31%	27.66s
LAMBADA	acc	0.23%	96.28s
BLiMP	acc	59.23%	354.79s
MMLU	acc	23.89%	388.62s
WikiText-2	word_perplexity	12524.42	182.89s
WikiText-2	byte_perplexity	5.84	181.42s
SciQ	acc_norm	35.60%	87.15s
COPA	acc	64.00%	17.21s
RACE	acc	23.16%	334.70s
SWAG	acc_norm	29.13%	252.00s
TruthfulQA MC2	acc	48.74%	126.29s
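
To put the accuracy scores in context, it helps to compare each one with the score expected from uniform random guessing. The sketch below does that for the multiple-choice tasks in the table. The chance levels are assumptions based on the usual number of answer options per task (for example four for MMLU and two for PIQA); they are not stated in this model card, a small number of ARC items have three or five options, and BoolQ's majority-class baseline is higher than the 50% used here.

```python
# Scores copied from the evaluation table (percent).
SCORES = {
    "HellaSwag": 26.71, "PIQA": 53.86, "WinoGrande": 50.67, "BoolQ": 40.21,
    "ARC-Easy": 34.68, "ARC-Challenge": 25.60, "OpenBookQA": 25.00,
    "CommonsenseQA": 20.31, "BLiMP": 59.23, "MMLU": 23.89, "SciQ": 35.60,
    "COPA": 64.00, "RACE": 23.16, "SWAG": 29.13,
}

# ASSUMPTION: typical number of answer options per item for each task.
NUM_CHOICES = {
    "HellaSwag": 4, "PIQA": 2, "WinoGrande": 2, "BoolQ": 2,
    "ARC-Easy": 4, "ARC-Challenge": 4, "OpenBookQA": 4,
    "CommonsenseQA": 5, "BLiMP": 2, "MMLU": 4, "SciQ": 4,
    "COPA": 2, "RACE": 4, "SWAG": 4,
}


def margin_over_chance(scores, num_choices):
    """Return score minus uniform-guessing accuracy, in percentage points."""
    margins = {}
    for task, score in scores.items():
        chance = 100.0 / num_choices[task]
        margins[task] = round(score - chance, 2)
    return margins


MARGINS = margin_over_chance(SCORES, NUM_CHOICES)

if __name__ == "__main__":
    # Print tasks from largest to smallest margin over chance.
    for task, margin in sorted(MARGINS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{task:15s} {margin:+6.2f} points vs. chance")
```

A negative margin means the score is below what random guessing would be expected to achieve under these assumptions.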

Evaluation Summary

Category	Result
Number of completed benchmark runs	18
Successful runs	18
Failed runs	0
Best accuracy-style score	COPA — 64.00%
Best language-structure score	BLiMP — 59.23%
MMLU score	23.89%
WikiText-2 byte perplexity	5.84
WikiText-2 word perplexity	12524.42
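
The two WikiText-2 figures look very different, but they can describe the same total log-likelihood normalized by different unit counts (words versus bytes). As a sanity check, and assuming both numbers come from the same evaluation pass with the standard definition perplexity = exp(-total log-likelihood / unit count), the ratio of their logarithms gives the implied average bytes per word, and the byte perplexity converts directly to bits per byte. This is a sketch of that arithmetic, not a description of how North ML computed the metrics.

```python
import math

WORD_PERPLEXITY = 12524.42  # from the evaluation table
BYTE_PERPLEXITY = 5.84      # from the evaluation table


def implied_bytes_per_word(word_ppl: float, byte_ppl: float) -> float:
    """ln(word_ppl) / ln(byte_ppl) equals n_bytes / n_words when both
    perplexities are derived from the same total log-likelihood."""
    return math.log(word_ppl) / math.log(byte_ppl)


def bits_per_byte(byte_ppl: float) -> float:
    """Convert byte-level perplexity to bits per byte."""
    return math.log2(byte_ppl)


if __name__ == "__main__":
    print(f"implied bytes per word: {implied_bytes_per_word(WORD_PERPLEXITY, BYTE_PERPLEXITY):.2f}")
    print(f"bits per byte: {bits_per_byte(BYTE_PERPLEXITY):.2f}")
```

With the values above this gives roughly 5.35 bytes per word and about 2.55 bits per byte. A little over five bytes per word is plausible for English text when spaces are counted, so the two reported numbers are mutually consistent.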

Notes

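Willow Alpha is still at a very early stage. Several results are at or near chance level, especially on knowledge-heavy and long-context tasks: MMLU (23.89%) and RACE (23.16%) both fall under the 25% expected from random guessing on four-option questions, and LAMBADA accuracy is 0.23%.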

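The strongest early signals are listed below. Note that COPA, BLiMP, PIQA, and WinoGrande are two-choice tasks with a 50% chance baseline, so the PIQA margin is small and the WinoGrande result is effectively at chance: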

COPA: 64.00%
BLiMP: 59.23%
PIQA: 53.86%
WinoGrande: 50.67%
TruthfulQA MC2: 48.74%

The weakest areas are:

LAMBADA: 0.23%
WikiText-2 word perplexity: 12524.42
CommonsenseQA: 20.31%
MMLU: 23.89%
RACE: 23.16%

These results suggest the model has picked up some grammatical signal (BLiMP) and a weak causal-reasoning signal (COPA), but it still needs substantially more pretraining, higher-quality data, and post-training before it could be useful as a general assistant.
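
Whether a score counts as "above chance" also depends on how many items the benchmark contains. The sketch below computes a z-score against the chance level for three of the two-choice tasks, using the binomial standard error under the chance hypothesis. The item counts are assumptions (the sizes of the commonly used evaluation splits: 100 for COPA, 1,838 for PIQA, 1,267 for WinoGrande); this model card does not state which splits were used.

```python
import math

# task: (accuracy from the table, chance level, ASSUMED number of evaluation items)
TASKS = {
    "COPA": (0.6400, 0.5, 100),
    "PIQA": (0.5386, 0.5, 1838),
    "WinoGrande": (0.5067, 0.5, 1267),
}


def z_vs_chance(acc: float, chance: float, n: int) -> float:
    """Z-score of an observed accuracy against a chance-level null hypothesis."""
    standard_error = math.sqrt(chance * (1.0 - chance) / n)
    return (acc - chance) / standard_error


if __name__ == "__main__":
    for task, (acc, chance, n) in TASKS.items():
        print(f"{task:11s} z = {z_vs_chance(acc, chance, n):.2f}")
```

Under these assumptions COPA and PIQA sit roughly three standard errors above chance, while WinoGrande is well within one standard error of it.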

Intended Use

Willow Alpha is intended for:

Research
Benchmarking
Pretraining experiments
Fine-tuning experiments
Small language model development
Forge-1V pipeline testing

It is not yet recommended for production use.

Limitations

This model may:

Produce incorrect information
Fail basic reasoning tasks
Struggle with factual knowledge
Generate repetitive or low-quality text
Perform poorly on long-context tasks
Require additional supervised fine-tuning

Citation

@misc{willow-alpha,
  title = {Willow Alpha},
  author = {North ML},
  year = {2026},
  note = {Early-stage Forge-1V checkpoint}
}

Weights format: Safetensors
Model size: 0.3B params
Tensor type: F32
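
For planning purposes, the raw weight footprint can be estimated from the parameter count and tensor type listed above. This is a rough sketch: 0.3B is a rounded figure, and the estimate covers the weights only, not activations, optimizer state, or any KV cache.

```python
# Bytes used per parameter for a few common tensor types.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}

# ASSUMPTION: 0.3B taken at face value; the exact parameter count is not given.
APPROX_PARAMS = 0.3e9


def weight_footprint_gib(num_params: float, dtype: str = "F32") -> float:
    """Approximate size of the model weights in GiB for a given tensor type."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("F32", "BF16"):
        print(f"{dtype}: {weight_footprint_gib(APPROX_PARAMS, dtype):.2f} GiB")
```

This works out to about 1.1 GiB in F32, and about half that if the weights are cast to a 16-bit type.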