[Nah] = can't fill that section in right now.

Dillionv2

Summary

Task: Text-Generation
Total training time: 35 hours
Inputs: text
Outputs: text
Params: ~1.3M
Final Loss: 3.078
Important Benchmark Scores:
   1. ARC Easy - 29.63%
   2. BLiMP - 64.96%
   3. HellaSwag - 27.37%
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)

Description

Dillionv2 is the second-generation model in the Dillion SLM family. It improves on v1 on every benchmark reported below except ARC Easy.

What changed

Dillion (v1)	Dillionv2	Why
9B token count	24B token count	More tokens allow the model to see more patterns, improving almost everything.
FineWeb-edu dataset	9-source dataset	FineWeb-edu is edu-filtered and pretty narrow in style. 9 sources allow the model to see more patterns, styles, and non-educational text, improving semantics.
72 hidden size	96 hidden size	72 was too narrow; 96 lets the model capture more complex patterns.
12 num layers	9 num layers	To stay in the parameter budget.
288 intermediate size	288 intermediate size	No change.
3 number of heads	3 number of heads	No change.
3076 vocab size	2564 vocab size	To free up parameters.
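SGD optimizer	AdamW optimizer	AdamW's per-parameter adaptive step sizes and decoupled weight decay generally train transformers faster and more stably than plain SGD.
Cosine scheduler	WSD scheduler	WSD (warmup-stable-decay) gives a better final loss.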
Qwen3.5 architecture	Qwen3.5 architecture	No change.
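
The hidden-size and vocab-size rows interact: a token embedding table holds vocab_size x hidden_size weights, so widening the model makes every vocabulary entry more expensive. Here is a quick check of that arithmetic using only the numbers in the table. The helper name is ours, and the sketch counts a single embedding matrix, ignoring whether the output head is tied to it.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Weights in one token-embedding matrix (vocab_size x hidden_size)."""
    return vocab_size * hidden_size

v1 = embedding_params(3076, 72)            # Dillion (v1)
v2 = embedding_params(2564, 96)            # Dillionv2
v2_old_vocab = embedding_params(3076, 96)  # hypothetical: v2 width with the v1 vocab

print(f"v1 embedding:                {v1:,}")
print(f"v2 embedding:                {v2:,}")
print(f"v2 width with v1 vocabulary: {v2_old_vocab:,}")
print(f"saved by shrinking vocab:    {v2_old_vocab - v2:,}")
```

At hidden size 96, the smaller vocabulary saves 49,152 weights, close to 4% of a ~1.3M-parameter budget.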

Training

We trained Dillionv2 for one epoch over 24B tokens. Training took a combined 35 hours on an RTX 2060 and two Kaggle T4s, with a batch size of 384 and 2 gradient accumulation steps.
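
For readers unfamiliar with the WSD (warmup-stable-decay) scheduler listed in the change table, here is a minimal sketch of its shape. The step counts, peak learning rate, and the linear decay are illustrative assumptions, not the values used to train Dillionv2.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Warmup-stable-decay: linear warmup, flat plateau, linear decay to min_lr."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate
        return peak_lr
    # Decay: move linearly from peak_lr toward min_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress

# Hypothetical settings, chosen only to show the three phases
schedule = [
    wsd_lr(s, total_steps=1000, peak_lr=1e-3, warmup_steps=100, decay_steps=200)
    for s in range(1000)
]
print(schedule[0], schedule[500], schedule[999])
```

In PyTorch, a function like this is usually wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier relative to the optimizer's base learning rate instead of the absolute value.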

Dataset

The dataset is 34B tokens (we only use the first 24B) and 146GB in total:

FineWeb-edu (35GB): Educational-filtered Common Crawl
DCLM-Edu (20GB): Educational-filtered webtext
The Pile Deduped (20GB): Broad, diverse 23-source dataset
FineWeb-HQ (20GB): Knowledge-filtered Webtext
FineMath (13GB): Math-filtered Common Crawl
Cosmopedia-v2 (7GB): Synthetic textbooks
Wikipedia (5GB): you better know what this is
NpSetPython-Edu (3.5GB): normalized Python code
Misc (600MB): LessWrong + HF configs + HF dataset/model cards
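
To see how the mixture is weighted, the sizes listed above can be turned into shares. This is a rough sketch: it uses the rounded per-source sizes from the list, so the shares are relative to their sum rather than to the 146GB headline figure, and share by bytes is only a proxy for share by tokens.

```python
# Per-source sizes in GB, copied from the list above
sources_gb = {
    "FineWeb-edu": 35,
    "DCLM-Edu": 20,
    "The Pile Deduped": 20,
    "FineWeb-HQ": 20,
    "FineMath": 13,
    "Cosmopedia-v2": 7,
    "Wikipedia": 5,
    "NpSetPython-Edu": 3.5,
    "Misc": 0.6,
}

listed_total = sum(sources_gb.values())
shares = {name: size / listed_total for name, size in sources_gb.items()}

# Print sources from largest to smallest share
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} {share:6.1%}")
```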

Training results

The final loss was 3.078, which corresponds to a perplexity of about 21.71 (exp(3.078)).
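
Perplexity is the exponential of the mean cross-entropy loss. Here is a small sketch of the conversion, assuming the reported loss is the mean per-token cross-entropy in nats (the PyTorch default). The helper names are ours.

```python
import math

def perplexity(mean_ce_loss_nats: float) -> float:
    """Perplexity = exp(mean per-token cross-entropy in nats)."""
    return math.exp(mean_ce_loss_nats)

def bits_per_token(mean_ce_loss_nats: float) -> float:
    """Same loss expressed in bits instead of nats."""
    return mean_ce_loss_nats / math.log(2)

final_loss = 3.078
print(f"perplexity     : {perplexity(final_loss):.2f}")
print(f"bits per token : {bits_per_token(final_loss):.2f}")
```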

Benchmarks

Benchmark	Dillion	Dillionv2
BLiMP	62.94%	64.96%
ARC Easy (Norm)	31.36%	29.63%
PiQA (Norm)	53.10%	53.16%
SWAG (Norm)	30.36%	32.07%
HellaSwag (Norm)	26.65%	27.37%
ArithMark	24.80%	27.00%
AVG	38.20%	39.03%
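
The AVG row is the unweighted mean of the six benchmark scores. A quick sketch that recomputes it from the table and lists where v2 falls behind v1:

```python
# (Dillion v1, Dillionv2) scores in percent, copied from the table above
scores = {
    "BLiMP": (62.94, 64.96),
    "ARC Easy (Norm)": (31.36, 29.63),
    "PiQA (Norm)": (53.10, 53.16),
    "SWAG (Norm)": (30.36, 32.07),
    "HellaSwag (Norm)": (26.65, 27.37),
    "ArithMark": (24.80, 27.00),
}

v1_avg = sum(v1 for v1, _ in scores.values()) / len(scores)
v2_avg = sum(v2 for _, v2 in scores.values()) / len(scores)
regressions = [name for name, (v1, v2) in scores.items() if v2 < v1]

print(f"v1 average : {v1_avg:.2f}%")
print(f"v2 average : {v2_avg:.2f}%")
print(f"regressions: {regressions}")
```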

Dillionv2 shows stronger performance than v1 on every benchmark except ARC Easy. For a comprehensive comparison across many small models, including this one and my others, see AxiomicLab's Open SLM Leaderboard.

Generations

[Nah]

Use Cases

Educational research, learning, etc.
Fine-tuning for downstream use
Deployment on edge devices
Or just for fun

Limitations

Doesn't have any!! No!! It does not.. alright fine..

Cannot chat, code, reason, or answer factually
Short context
Output should never be treated as factual

Inference

#!/usr/bin/env python3
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR      = "Harley-ml/Dillionv2-1.3M"
TOKENIZER_PATH = MODEL_DIR

# --- Generation settings ---
PROMPT             = "The"
MAX_NEW_TOKENS     = 362
TEMPERATURE        = 0.6
TOP_P              = 0.95
TOP_K              = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE          = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------

def load_tokenizer(path_or_repo: str):
    p = Path(path_or_repo)

    # Case 1: explicit local tokenizer.json file
    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    # Case 2: local directory or HF repo ID
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)

    # Ensure required special tokens exist
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"

    tok.padding_side = "left"
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {len(tokenizer)}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)

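# If load_tokenizer() had to add special tokens, grow the embedding table to
# match; otherwise the new token ids would index past the end of the embeddings.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))

model.eval()
model.to(device)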

# Safer inference for cache-related issues
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)

    inputs.pop("token_type_ids", None)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )

    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)
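
The generation settings at the top of the script (TEMPERATURE, TOP_K, TOP_P) each reshape the next-token distribution before sampling. The toy sketch below illustrates the idea on a five-token vocabulary in plain Python. It is a simplified illustration, not the transformers implementation, and the logits are made up.

```python
import math

def filter_probs(logits, temperature=0.6, top_k=30, top_p=0.95):
    """Temperature-scale the logits, keep at most top_k tokens, then keep the
    smallest prefix of those whose cumulative probability reaches top_p."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: rank tokens by probability and keep the k most likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p (nucleus): stop once the kept tokens cover top_p of the mass
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving tokens form a distribution to sample from
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up values for illustration
print(filter_probs(toy_logits))
print(filter_probs(toy_logits, top_k=2))
```

Lower TEMPERATURE, TOP_K, or TOP_P values all narrow the set of candidate tokens, which tends to make a model this small more repetitive but less erratic.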

License

MIT License. Read the license file here.

Citation
