[Nah] = can't fill that section in right now.
Dillionv2
Summary
Task: Text-Generation
Total training time: 35 hours
Inputs: text
Outputs: text
Params: ~1.3M
Final Loss: 3.078
Important Benchmark Scores:
1. ARC Easy - 29.63%
2. BLiMP - 64.96%
3. HellaSwag - 27.37%
Framework: PyTorch, transformers
Author: Paul Courneya (Harley-ml)
Description
Dillionv2 is the second-generation model in our Dillion SLM family. It improves on v1 in every benchmark we report except ARC Easy.
What changed
| Dillion (v1) | Dillionv2 | Why |
|---|---|---|
| 9B token count | 24B token count | More tokens allow the model to see more patterns, improving almost everything. |
| FineWeb-edu dataset | 9-source dataset | FineWeb-edu is edu-filtered and pretty narrow in style. 9 sources allow the model to see more patterns, styles, and non-educational text, improving semantics. |
| 72 hidden size | 96 hidden size | 72 was too narrow. 96 would allow the model to capture more complex patterns. |
| 12 num layers | 9 num layers | To stay in the parameter budget. |
| 288 intermediate size | 288 intermediate size | No change. |
| 3 number of heads | 3 number of heads | No change. |
| 3076 vocab size | 2564 vocab size | To free up parameters. |
| SGD optimizer | AdamW optimizer | AdamW is the modern choice and much better than SGD. |
| Cosine scheduler | WSD scheduler | WSD gives a better final loss. |
| Qwen3.5 architecture | Qwen3.5 architecture | No change. |
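The WSD (warmup-stable-decay) scheduler named in the table can be sketched as a plain function of the step index. This is a minimal illustration, not the training code used for Dillionv2: the function name `wsd_lr`, the peak learning rate, the warmup and decay lengths, and the linear decay shape are all assumptions chosen for the example.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_steps=100,
           decay_steps=200, min_lr=0.0):
    """Learning rate at `step` under a warmup-stable-decay schedule.

    All default values are hypothetical and only serve the illustration.
    """
    # Warmup: ramp linearly from near zero up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stable: hold peak_lr until the decay phase begins.
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr

    # Decay: go linearly from peak_lr down to min_lr (linear is an assumption).
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 99, 500, 900, 1000):
        print(s, wsd_lr(s, total_steps=1000))
```

Unlike a cosine schedule, the learning rate stays at its peak for most of the run and only drops in the final phase.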
Training
We trained Dillionv2 for one epoch on 24B tokens, for a combined total of 35 hours across an RTX 2060 and two Kaggle T4s. The batch size was 384 with 2 gradient accumulation steps.
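As a rough sanity check on these numbers, the sketch below derives the average throughput and the effective batch size. It assumes that the 35 hours are wall-clock time for the whole run and that 384 is the per-step batch before accumulation; neither is stated explicitly above, so treat the results as estimates.

```python
# Figures taken from the training description.
TOKENS_TRAINED = 24e9      # 24B tokens, one epoch
TRAINING_HOURS = 35        # combined total across all GPUs
BATCH_SIZE = 384           # assumed to be the per-step batch
GRAD_ACCUM_STEPS = 2

# Average tokens processed per second over the whole run.
tokens_per_second = TOKENS_TRAINED / (TRAINING_HOURS * 3600)

# Batch seen by each optimizer update, if 384 is the pre-accumulation batch.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

print(f"Average throughput : {tokens_per_second:,.0f} tokens/s")
print(f"Effective batch    : {effective_batch}")
```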
Dataset
The full dataset is 34B tokens and 146GB in total; we train on only the first 24B tokens:
- FineWeb-edu (35GB): Educational-filtered Common Crawl
- DCLM-Edu (20GB): Educational-filtered webtext
- The Pile Deduped (20GB): Broad, diverse 23-source dataset
- FineWeb-HQ (20GB): Knowledge-filtered Webtext
- FineMath (13GB): Math-filtered Common Crawl
- Cosmopedia-v2 (7GB): Synthetic textbooks
- Wikipedia (5GB): you better know what this is
- NpSetPython-Edu (3.5GB): normalized Python code
- Misc (600MB): LessWrong + HF configs + HF dataset/model cards
Training results
The final loss was 3.078, which corresponds to a perplexity of about 21.71 (exp(3.078)).
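Perplexity can be recomputed from the reported loss as the exponential of the loss. The sketch below assumes the loss is the mean per-token cross-entropy measured in nats, which is the PyTorch default but is not stated explicitly here.

```python
import math

FINAL_LOSS = 3.078  # reported final loss, assumed to be in nats

# Perplexity is exp(mean cross-entropy) when the loss uses the natural log.
perplexity = math.exp(FINAL_LOSS)

print(f"Perplexity: {perplexity:.3f}")
```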
Benchmarks
| Benchmark | Dillion | Dillionv2 |
|---|---|---|
| BLiMP | 62.94% | 64.96% |
| ARC Easy (Norm) | 31.36% | 29.63% |
| PiQA (Norm) | 53.10% | 53.16% |
| SWAG (Norm) | 30.36% | 32.07% |
| HellaSwag (Norm) | 26.65% | 27.37% |
| ArithMark | 24.80% | 27.00% |
| AVG | 38.20% | 39.03% |
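The AVG row is the unweighted mean of the six benchmark scores above it. A short check, using only the numbers from the table:

```python
from statistics import mean

# (Dillion v1, Dillionv2) scores in percent, copied from the table.
scores = {
    "BLiMP": (62.94, 64.96),
    "ARC Easy (Norm)": (31.36, 29.63),
    "PiQA (Norm)": (53.10, 53.16),
    "SWAG (Norm)": (30.36, 32.07),
    "HellaSwag (Norm)": (26.65, 27.37),
    "ArithMark": (24.80, 27.00),
}

v1_avg = mean(v1 for v1, _ in scores.values())
v2_avg = mean(v2 for _, v2 in scores.values())

print(f"Dillion   AVG: {v1_avg:.2f}%")
print(f"Dillionv2 AVG: {v2_avg:.2f}%")
```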
Dillionv2 shows stronger performance than v1 on every benchmark except ARC Easy. For a comprehensive comparison across many small models, including this one and my others, see AxiomicLab's Open SLM Leaderboard.
Generations
[Nah]
Use Cases
- Educational research, learning, etc.
- Fine-tuning for downstream use
- Deployment on edge devices
- Or just for fun
Limitations
Doesn't have any!! No!! It does not.. alright fine..
- Cannot chat, code, reason, or answer factually
- Short context
- Output is never factually reliable
Inference
#!/usr/bin/env python3
# =============================================================================
# Inference
# =============================================================================
MODEL_DIR = "Harley-ml/Dillionv2-1.3M"
TOKENIZER_PATH = MODEL_DIR
# --- Generation settings ---
PROMPT = "The"
MAX_NEW_TOKENS = 362
TEMPERATURE = 0.6
TOP_P = 0.95
TOP_K = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE = True
# =============================================================================
import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
)
# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------
device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")
# ---------------------------------------------------------------------------
# Tokenizer
# ---------------------------------------------------------------------------
def load_tokenizer(path_or_repo: str):
    p = Path(path_or_repo)
    # Case 1: explicit local tokenizer.json file
    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    # Case 2: local directory or HF repo ID
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)
    # Ensure required special tokens exist
    if tok.bos_token is None:
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"
    # Left padding keeps the prompt adjacent to the generated tokens
    tok.padding_side = "left"
    return tok
print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f" Vocab size : {len(tokenizer)}")
print(f" BOS : {tokenizer.bos_token!r}")
print(f" EOS : {tokenizer.eos_token!r}")
print(f" PAD : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)
# Safer inference for cache-related issues
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False
total_params = sum(p.numel() for p in model.parameters())
print(f" Parameters : {total_params:,}")
# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------
def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    # Prepend BOS manually because special tokens are not added below
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt
    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )
    # Sampling parameters only apply when do_sample is True
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k
    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens so only the continuation is returned
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)
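To give a feel for what the `TEMPERATURE`, `TOP_K`, and `TOP_P` settings in the script do, here is a toy, pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is a simplified illustration and not the `transformers` implementation; the function name `filter_probs` and the example logits are made up for the demonstration, and the repetition penalty is left out.

```python
import math


def filter_probs(logits, temperature=0.6, top_k=30, top_p=0.95):
    """Toy temperature -> top-k -> top-p filtering.

    Returns {token_index: probability} over the surviving tokens,
    renormalized so the probabilities sum to 1.
    """
    # Temperature: divide logits before softmax; values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: walk down the ranking until cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Hypothetical logits for a 4-token vocabulary.
toy = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(toy)
```

In this toy call only the two most likely tokens survive, so sampling would pick between them. Lower `top_p` or `top_k` values narrow the candidate set further, and a lower temperature concentrates more probability on the top token.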
License
MIT License. Read the license file here.
Citation