Er-Tiny

Summary

Task: Text-Generation
Total training time: 38 hours
Inputs: text
Outputs: text
Params: 1.33M
Final Loss: 3.024 (validation)
Important Benchmark Scores:
   1. ARC Easy - 30.13%
   2. BLiMP - 66.20%
   3. HellaSwag - 27.58%
   4. ArithMark-2.0 - 27.12%
Framework: PyTorch, transformers
Authors: Paul Courneya, Jonathan Ly

Description

‘Er-Tiny’ is a 1.33M-parameter Small Language Model trained on 34.8B tokens from a nine-source dataset. Its name, “Er,” is the reverse of “Re,” the prefix of Re:Zero – Starting Life in Another World, the light novel series that inspired the organization’s name.

Model Details

Architecture: Qwen3.5
Hidden Size: 112
Number of Layers: 7
Intermediate Size: 311 (about a 2.78x expansion)
Number of Attention Heads: 4
Number of KV Heads: 1
Head Dim: 28
Vocab Size: 2564
Max Position Embeddings: 384
Total Parameters: 1.33M
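
The figures above can be cross-checked with a little arithmetic. The sketch below derives the attention width, the number of query heads per KV head, the MLP expansion factor, and the size of the token-embedding table. Variable names are illustrative, and the embedding share assumes one vocab-by-hidden table counted once toward the rounded 1.33M total; whether the output head shares that table is not stated in this card.

```python
# Values copied from the Model Details list above.
hidden_size = 112
intermediate_size = 311
num_attention_heads = 4
num_kv_heads = 1
head_dim = 28
vocab_size = 2564
total_params = 1.33e6  # rounded figure from the card

# Query heads times head dim gives the width of the attention projection.
attention_width = num_attention_heads * head_dim
# With a single KV head, all query heads share one key/value projection.
queries_per_kv_head = num_attention_heads // num_kv_heads
mlp_expansion = intermediate_size / hidden_size
# Assumption: a single embedding table of shape (vocab_size, hidden_size).
embedding_params = vocab_size * hidden_size
embedding_share = embedding_params / total_params

print(f"attention width     : {attention_width}")
print(f"queries per KV head : {queries_per_kv_head}")
print(f"MLP expansion       : {mlp_expansion:.2f}x")
print(f"embedding params    : {embedding_params:,} ({embedding_share:.1%} of total)")
```

At this scale the embedding table alone accounts for roughly a fifth of the parameter budget, which is one reason a small vocabulary matters for a model this size.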

Training

Dataset

Source	Bytes (GB)	Share (%)	What it is
FineWeb-edu	35.0	28.2%	Educational-filtered Common Crawl
DCLM-Edu	20.0	16.1%	Educational-filtered webtext
The Pile Deduped	20.0	16.1%	Broad, diverse 23-source dataset
FineWeb-HQ	20.0	16.1%	Knowledge-filtered webtext
FineMath	13.0	10.5%	Math-filtered Common Crawl
Cosmopedia-v2	7.0	5.6%	Synthetic textbooks
Wikipedia	5.0	4.0%	Wikipedia articles
NpSetPython-Edu	3.5	2.8%	Normalized Python code
Misc	0.6	0.5%	LessWrong + HF configs + HF dataset/model cards
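
The share column can be reproduced from the sizes alone. The sketch below recomputes each source's percentage of the total; the dictionary name is illustrative, and the values are copied from the table.

```python
# Sizes in GB, copied from the dataset table above.
sizes_gb = {
    "FineWeb-edu": 35.0,
    "DCLM-Edu": 20.0,
    "The Pile Deduped": 20.0,
    "FineWeb-HQ": 20.0,
    "FineMath": 13.0,
    "Cosmopedia-v2": 7.0,
    "Wikipedia": 5.0,
    "NpSetPython-Edu": 3.5,
    "Misc": 0.6,
}

total_gb = sum(sizes_gb.values())
# Each share is the source size as a percentage of the whole mix.
shares = {name: 100 * gb / total_gb for name, gb in sizes_gb.items()}

print(f"Total: {total_gb:.1f} GB")
for name, pct in shares.items():
    print(f"{name:<18}{sizes_gb[name]:>6.1f} GB {pct:>6.1f}%")
```

The printed percentages add up to 99.9% rather than 100% when rounded to one decimal, which is ordinary rounding drift.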

Training Details

Maximum Learning Rate: 3.3e-3
Minimum Learning Rate: 0
Number of Epochs: 1
Sequence Length: 384
Batch Size: 384
Eval Split Ratio: 0.0025
Gradient Accumulation Steps: 2
Gradient Checkpointing: True
Gradient Clipping: 1.0
Torch Compile: True
Torch Compile Mode: max-autotune-no-cudagraphs
AdamW Betas: (0.9, 0.95)
WSD Warmup Ratio: 0.02
WSD Stable Ratio: 0.73
WSD Decay Ratio: 0.25
DType: bfloat16
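
The three WSD ratios split training into warmup (2%), stable (73%), and decay (25%) phases. Below is a minimal sketch of such a schedule using the learning rates listed above. The function name `wsd_lr` is hypothetical, and the linear warmup and linear decay shapes are assumptions: the card does not say which curves the training run used.

```python
def wsd_lr(
    step: int,
    total_steps: int,
    max_lr: float = 3.3e-3,
    min_lr: float = 0.0,
    warmup_ratio: float = 0.02,
    stable_ratio: float = 0.73,
) -> float:
    """Learning rate at a 0-indexed optimizer step of a warmup-stable-decay schedule."""
    warmup_steps = round(total_steps * warmup_ratio)
    stable_end = warmup_steps + round(total_steps * stable_ratio)

    if step < warmup_steps:
        # Assumed shape: linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step < stable_end:
        # Stable phase: hold the peak learning rate.
        return max_lr
    # Assumed shape: linear decay from max_lr to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - stable_end)
    progress = min(1.0, (step - stable_end) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


# Print a few points of a 1000-step schedule.
for s in (0, 19, 500, 750, 875, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

With 1000 steps, the warmup covers steps 0 to 19, the plateau runs to step 749, and the remaining 250 steps decay toward the minimum learning rate of 0.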

Final Eval and Train Loss

Train: 3.023
Val: 3.024
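
To put these losses in context, they can be converted to perplexity and compared with the loss of a uniform guess over the vocabulary. The sketch assumes the reported values are mean per-token cross-entropy in nats, which is the PyTorch default but is not stated in this card.

```python
import math

train_loss = 3.023
val_loss = 3.024
vocab_size = 2564  # from Model Details

# Assumption: losses are mean per-token cross-entropy measured in nats.
train_perplexity = math.exp(train_loss)
val_perplexity = math.exp(val_loss)
# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)
val_bits_per_token = val_loss / math.log(2)

print(f"train perplexity : {train_perplexity:.2f}")
print(f"val perplexity   : {val_perplexity:.2f}")
print(f"uniform loss     : {uniform_loss:.3f} nats")
print(f"val bits/token   : {val_bits_per_token:.3f}")
```

The near-identical train and validation losses are what a single-epoch run would be expected to produce, since no example is seen twice.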

Hardware

GPU: Two NVIDIA RTX 5070s (used for training; by @LyJonathon)
CPU: AMD Ryzen 5 2600 (used for tokenization; by @Harley-ml)
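
A rough throughput estimate follows from the token count and training time given earlier in this card. It assumes the 38 hours is wall-clock time for the full 34.8B-token pass on both GPUs and excludes tokenization; none of that is confirmed here, so treat the result as an order-of-magnitude figure.

```python
total_tokens = 34.8e9   # from the Description
training_hours = 38     # from the Summary
num_gpus = 2            # from the Hardware list

# Assumption: the 38 hours covers exactly one pass over all 34.8B tokens.
tokens_per_second = total_tokens / (training_hours * 3600)
tokens_per_second_per_gpu = tokens_per_second / num_gpus

print(f"overall : {tokens_per_second:,.0f} tokens/s")
print(f"per GPU : {tokens_per_second_per_gpu:,.0f} tokens/s")
```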

Benchmark scores

Task	Value
BLiMP	66.20%
ARC Easy	30.13%
HellaSwag	27.58%
PiQA	52.39%
SciQ	57.50%
SWAG	32.50%
Winogrande	49.80%
ArithMark-2.0	27.12%

For a comparison with other small language models like this one, go here.
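
Raw accuracies are easier to read next to a random-guess baseline. The sketch below subtracts a nominal chance level from each score. The chance values are assumptions based on the usual format of each benchmark (two options for BLiMP, PiQA, and Winogrande; four for HellaSwag, SciQ, and SWAG; mostly four for ARC Easy) and do not come from this card. ArithMark-2.0 is left out because its format is not described here.

```python
# Scores copied from the benchmark table above (percent).
scores = {
    "BLiMP": 66.20,
    "ARC Easy": 30.13,
    "HellaSwag": 27.58,
    "PiQA": 52.39,
    "SciQ": 57.50,
    "SWAG": 32.50,
    "Winogrande": 49.80,
    "ArithMark-2.0": 27.12,
}

# Assumed random-guess accuracy per task (percent); not taken from the card.
chance = {
    "BLiMP": 50.0,
    "ARC Easy": 25.0,
    "HellaSwag": 25.0,
    "PiQA": 50.0,
    "SciQ": 25.0,
    "SWAG": 25.0,
    "Winogrande": 50.0,
}

# Margin over chance, only for tasks with a known baseline.
margins = {task: scores[task] - base for task, base in chance.items()}

for task, score in scores.items():
    if task in margins:
        print(f"{task:<14}{score:>7.2f}%  ({margins[task]:+.2f} vs. chance)")
    else:
        print(f"{task:<14}{score:>7.2f}%  (no baseline assumed)")
```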

Use Cases

Educational work and research
Fine-tuning for downstream use
Deployment on edge devices
Or just for fun.

Limitations

Cannot chat, reason, code, or answer questions
Output is almost always factually incorrect
No long-context handling
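
The long-context limit ties back to the 384-position window listed under Model Details: the prompt and the generated tokens together should fit inside it. A minimal sketch of a budget check is shown below; `clamp_new_tokens` is a hypothetical helper, not part of the released code.

```python
MAX_POSITIONS = 384  # Max Position Embeddings from Model Details


def clamp_new_tokens(prompt_len: int, requested: int, max_positions: int = MAX_POSITIONS) -> int:
    """Return how many new tokens fit after a prompt of prompt_len tokens."""
    if prompt_len >= max_positions:
        raise ValueError(
            f"Prompt has {prompt_len} tokens, but the window holds only {max_positions}."
        )
    # Never request more tokens than the space left in the window.
    return min(requested, max_positions - prompt_len)


print(clamp_new_tokens(prompt_len=5, requested=256))
print(clamp_new_tokens(prompt_len=200, requested=256))
```

A short prompt leaves the requested 256 tokens untouched, while a 200-token prompt caps generation at the 184 positions that remain.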

License

Before using, distributing, selling, or modifying this software, you must read the license here.

Inference

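#!/usr/bin/env python3
"""Minimal text-generation script for Er-Tiny-1.3M."""

from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerFast

# Hugging Face repo id or local directory holding the model and tokenizer.
MODEL_DIR = "fromziro/Er-Tiny-1.3M"
TOKENIZER_PATH = MODEL_DIR

# Default prompt and sampling settings used by generate().
PROMPT = "Artificial intelligence is"
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7
TOP_P = 0.95
TOP_K = 30
REPETITION_PENALTY = 1.2
DO_SAMPLE = True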

device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

def load_tokenizer(path_or_repo: str):
    """Load a tokenizer from a tokenizer .json file, a local directory, or a Hub repo id."""
    p = Path(path_or_repo)

    if p.exists() and p.is_file() and p.suffix.lower() == ".json":
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    else:
        tok = AutoTokenizer.from_pretrained(path_or_repo, use_fast=True)

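    # Fallbacks for tokenizers that ship without special tokens. A token added
    # here gets a new id, so the model's embeddings would need resizing to use it.
    if tok.bos_token is None: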
        tok.add_special_tokens({"bos_token": "<|bos|>"})
    if tok.eos_token is None:
        tok.add_special_tokens({"eos_token": "<|eos|>"})
    if tok.unk_token is None:
        tok.add_special_tokens({"unk_token": "<|unk|>"})
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token if tok.eos_token is not None else "<|pad|>"

    tok.padding_side = "left"
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {len(tokenizer)}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)

model.eval()
model.to(device)
model.config.use_cache = False
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = False

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
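) -> str:
    """Generate a continuation of `prompt` and return only the new text."""
    # Prepend BOS by hand; add_special_tokens=False below stops the tokenizer from adding its own.
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)

    # Some tokenizers emit token_type_ids, which generate() may reject for causal LMs.
    inputs.pop("token_type_ids", None)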

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=False,
    )

    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Slice off the prompt tokens so only the continuation is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)

Copyright

Copyright (c) 2026 FromZero  
Copyright (c) 2026 Paul Courneya
Copyright (c) 2026 Jonathan Ly

Citation

@misc{er-tiny-1.3m,
  title     = {Er-Tiny-1.3M},
  author    = {FromZero},
  year      = {2026},
  url       = {https://huggingface.co/fromziro/Er-Tiny-1.3M}
}
