[Nah] = can't fill that section in right now.

Dillionv2

Summary

Task: Text-Generation
Total training time: 35 hours
Inputs: text
Outputs: text
Params: ~1.3M
Final Loss: 3.078
Important Benchmark Scores:
   1. ARC Easy - 29.63%
   2. BLiMP - 64.96%
   3. HellaSwag - 27.37%
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)

Description

Dillionv2 is the second-generation model in the Dillion SLM family. It improves on v1 on every reported benchmark except ARC Easy.

What changed

| Dillion (v1) | Dillionv2 | Why |
| --- | --- | --- |
| 9B token count | 24B token count | More tokens allow the model to see more patterns, improving almost everything. |
| FineWeb-edu dataset | 9-source dataset | FineWeb-edu is edu-filtered and pretty narrow in style. 9 sources allow the model to see more patterns, styles, and non-educational text, improving semantics. |
| 72 hidden size | 96 hidden size | 72 was too narrow. 96 would allow the model to capture more complex patterns. |
| 12 num layers | 9 num layers | To stay in the parameter budget. |
| 288 intermediate size | 288 intermediate size | No change. |
| 3 number of heads | 3 number of heads | No change. |
| 3076 vocab size | 2564 vocab size | To free up parameters. |
| SGD optimizer | AdamW optimizer | AdamW is the modern choice and much better than SGD. |
| Cosine scheduler | WSD scheduler | WSD gives a better final loss. |
| Qwen3.5 architecture | Qwen3.5 architecture | No change. |
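
To illustrate the scheduler change: a warmup-stable-decay (WSD) schedule ramps the learning rate up, holds it constant for most of training, then decays it in a final phase. The sketch below is a minimal pure-Python version. The function name `wsd_lr`, the peak rate, and the warmup and decay fractions are illustrative assumptions, not the values used to train Dillionv2, and the linear decay shape is only one common choice.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 1e-3,
           warmup_frac: float = 0.05, decay_frac: float = 0.2,
           min_lr: float = 0.0) -> float:
    """Warmup-stable-decay learning rate at a given step (hypothetical defaults)."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup phase: linear ramp up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate
        return peak_lr
    # Decay phase: linear ramp from peak_lr down toward min_lr
    progress = (step - decay_start) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * min(1.0, progress)


if __name__ == "__main__":
    total = 1000
    for s in (0, 49, 500, 800, 900, 999):
        print(s, round(wsd_lr(s, total), 6))
```

With PyTorch, a function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR` by setting the optimizer's learning rate to the peak value and returning a multiplier (`peak_lr=1.0`) from the lambda.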

Training

We trained Dillionv2 for one epoch on 24B tokens, taking a combined total of 35 hours on an RTX 2060 and two Kaggle T4s. We used a batch size of 384 with 2 gradient accumulation steps.
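
As a rough sanity check on these numbers, the sketch below works out the effective batch size and the average throughput. It assumes the 35 hours is total wall-clock time summed across all runs and that the batch size of 384 counts sequences per forward pass. Neither is stated explicitly, so treat the results as estimates.

```python
TOKENS_TRAINED = 24e9     # from the text: one epoch over 24B tokens
TRAINING_HOURS = 35       # from the text: combined total across GPUs
BATCH_SIZE = 384          # assumed: sequences per forward/backward pass
GRAD_ACCUM_STEPS = 2      # from the text

# Sequences contributing to each optimizer update
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

# Average tokens processed per second over the whole run
tokens_per_second = TOKENS_TRAINED / (TRAINING_HOURS * 3600)

print(f"Effective batch size : {effective_batch} sequences")
print(f"Average throughput   : {tokens_per_second:,.0f} tokens/s")
```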

Dataset

The dataset is 34B tokens (we only use the first 24B) and 146GB in total:

  1. FineWeb-edu (35GB): Educational-filtered Common Crawl
  2. DCLM-Edu (20GB): Educational-filtered webtext
  3. The Pile Deduped (20GB): Broad, diverse 23-source dataset
  4. FineWeb-HQ (20GB): Knowledge-filtered Webtext
  5. FineMath (13GB): Math-filtered Common Crawl
  6. Cosmopedia-v2 (7GB): Synthetic textbooks
  7. Wikipedia (5GB): you better know what this is
  8. NpSetPython-Edu (3.5GB): normalized Python code
  9. Misc (600MB): LessWrong + HF configs + HF dataset/model cards
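
For a quick view of the mix, the sketch below turns the listed sizes into percentage shares. The dictionary only restates the list above. The per-source figures sum to about 124GB, so the shares are relative to the listed sizes rather than the 146GB total, and they describe the full dataset by size on disk, not the first 24B tokens actually used.

```python
# Source sizes in GB, copied from the list above
SOURCES_GB = {
    "FineWeb-edu": 35,
    "DCLM-Edu": 20,
    "The Pile Deduped": 20,
    "FineWeb-HQ": 20,
    "FineMath": 13,
    "Cosmopedia-v2": 7,
    "Wikipedia": 5,
    "NpSetPython-Edu": 3.5,
    "Misc": 0.6,
}

total_gb = sum(SOURCES_GB.values())

# Share of each source as a percentage of the listed total
shares = {name: 100 * size / total_gb for name, size in SOURCES_GB.items()}

for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {pct:5.1f}%")
print(f"{'Total listed':<18} {total_gb:.1f} GB")
```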

Training results

The final loss ended at 3.078, which corresponds to a perplexity of about 21.71 (exp(3.078), taking the loss to be in nats).
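
Perplexity is the exponential of the mean cross-entropy loss. The snippet below shows the conversion, assuming the reported loss is in nats (natural log), which is what PyTorch's cross-entropy returns by default. The helper name `perplexity_from_loss` is illustrative.

```python
import math


def perplexity_from_loss(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)


final_loss = 3.078  # final training loss reported in this card
print(f"Perplexity: {perplexity_from_loss(final_loss):.2f}")
```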

Benchmarks

| Benchmark | Dillion | Dillionv2 |
| --- | --- | --- |
| BLiMP | 62.94% | 64.96% |
| ARC Easy (Norm) | 31.36% | 29.63% |
| PiQA (Norm) | 53.10% | 53.16% |
| SWAG (Norm) | 30.36% | 32.07% |
| HellaSwag (Norm) | 26.65% | 27.37% |
| ArithMark | 24.80% | 27.00% |
| AVG | 38.20% | 39.03% |

Dillionv2 shows stronger performance than v1 on every benchmark except ARC Easy. For a comprehensive comparison among many small models, including this one and our others, go to AxiomicLab's Open SLM Leaderboard.
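
The AVG row appears to be the unweighted mean of the six benchmark scores. The sketch below recomputes it from the table values as a check and prints the per-benchmark change. The dictionary layout and the `mean` helper are for illustration only.

```python
# Scores (%) copied from the benchmark table: (Dillion v1, Dillionv2)
SCORES = {
    "BLiMP": (62.94, 64.96),
    "ARC Easy (Norm)": (31.36, 29.63),
    "PiQA (Norm)": (53.10, 53.16),
    "SWAG (Norm)": (30.36, 32.07),
    "HellaSwag (Norm)": (26.65, 27.37),
    "ArithMark": (24.80, 27.00),
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_v1 = mean(v1 for v1, _ in SCORES.values())
avg_v2 = mean(v2 for _, v2 in SCORES.values())

print(f"Dillion   AVG: {avg_v1:.2f}%")
print(f"Dillionv2 AVG: {avg_v2:.2f}%")

# Per-benchmark change from v1 to v2, in percentage points
for name, (v1, v2) in SCORES.items():
    print(f"{name:<18} {v2 - v1:+.2f}")
```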

Generations

[Nah]

Use Cases

  1. Educational research, learning, etc.
  2. Fine-tuning for downstream use
  3. Deployment on edge devices
  4. Or just for fun

Limitations

Doesn't have any!! No!! It does not.. alright fine..

  1. Cannot chat, code, reason, or answer factually
  2. Short context
  3. Generated text should never be treated as factual

Inference

#!/usr/bin/env python3
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR      = "Harley-ml/Dillionv2-1.3M"
TOKENIZER_PATH = MODEL_DIR

# --- Generation settings ---
PROMPT             = "The"
MAX_NEW_TOKENS     = 362
TEMPERATURE        = 0.6
TOP_P              = 0.95
TOP_K              = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE          = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------

def load_tokenizer(path_or_repo: str):
    p = Path(path_or_repo)

    # Case 1: explicit local tokenizer.json file
    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    # Case 2: local directory or HF repo ID
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)

    # Ensure required special tokens exist
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"

    tok.padding_side = "left"
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {len(tokenizer)}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)

model.eval()
model.to(device)

# Safer inference for cache-related issues
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)

    inputs.pop("token_type_ids", None)

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )

    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)

License

MIT License. Read the license file here.

Citation

