Model Card for syaffers/tiny-random-storywriter-base

A full finetune of hmellor/tiny-random-LlamaForCausalLM (~1M parameters, hidden size 16, 2 layers) on TinyStories. This is a hyperparameter tuning sandbox — the goal was to see how far aggressive LR scaling and larger effective batch sizes could push training loss on a deliberately tiny LLaMA model.

Model Details

Model Description

This model card covers a series of full finetuning runs on a near-randomly-initialised LLaMA variant. The base model has a hidden size of 16 and only 2 transformer layers, making it trainable on a single consumer GPU in minutes. Multiple runs were conducted sweeping learning rate, effective batch size, warmup steps, and sequence length. The best checkpoint achieves a final training loss of 4.18 (PPL 65), down from a baseline of 7.67 (PPL 2133) — a 33× reduction in perplexity through hyperparameter tuning alone.

Developed by: Syafiq Kamarul Azman
Funded by [optional]: [More Information Needed]
Shared by [optional]: Syafiq Kamarul Azman
Model type: Causal language model (LlamaForCausalLM)
Language(s) (NLP): English
License: MIT
Finetuned from model: hmellor/tiny-random-LlamaForCausalLM

Model Sources [optional]

Repository: https://github.com/syaffers/tiny-random-fft
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]

Uses

Direct Use

Load any checkpoint subfolder and generate text. The model produces English prose in the style of children's short stories, consistent with its training data. Output quality is limited by the model's tiny capacity.

Downstream Use [optional]

This model is not intended for downstream use in production systems. It may be useful as a fast-iteration target for testing training pipelines, tokenizer integrations, or generation code.

Out-of-Scope Use

This model should not be used for any task requiring factual accuracy, reasoning, instruction following, or real-world language understanding. It has ~1M parameters and was trained solely on synthetic children's stories.

Bias, Risks, and Limitations

The training data (TinyStories) is a synthetically generated dataset of simple English children's stories. The model will reflect the vocabulary, topics, and narrative patterns of that dataset exclusively. It has no knowledge of the world beyond those stories and will produce incoherent output for any other input distribution.

Recommendations

This model is a research/learning artefact. Do not deploy it in any application where output quality or safety matters.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "syaffers/tiny-random-storywriter-base"
# Checkpoints live in per-run subfolders of the repo.
subfolder = "bs-512-lr-16e3-wu1600"  # best checkpoint

# Load the tokenizer and weights from the chosen subfolder.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder, torch_dtype=torch.bfloat16
)

# Sample up to 100 new tokens from a story-style prompt.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

TinyStories, a dataset of synthetically generated short stories for children, streamed from the train split. Each story is tokenised, an EOS token is appended, and the resulting token stream is packed into fixed-length chunks.

Training Procedure

Preprocessing [optional]

Text is tokenised with the base model's tokenizer, EOS is appended per story, and tokens are packed into contiguous chunks of seq_len tokens. No padding is used.
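The packing step described above can be sketched as follows. This is a minimal illustration, not the repository's code: `pack_stories` is a hypothetical name, the token IDs are toy values, and dropping the final partial chunk is an assumption.

```python
from typing import Iterable, Iterator, List


def pack_stories(
    tokenised_stories: Iterable[List[int]], eos_id: int, seq_len: int
) -> Iterator[List[int]]:
    """Append EOS to each story and yield contiguous chunks of exactly seq_len tokens."""
    buffer: List[int] = []
    for ids in tokenised_stories:
        buffer.extend(ids)
        buffer.append(eos_id)  # EOS marks the story boundary inside the stream
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Assumption: a trailing partial chunk is dropped, so no padding is needed.


# Toy example: three "stories" of token IDs, EOS id 0, chunks of 4 tokens.
stories = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(pack_stories(stories, eos_id=0, seq_len=4))
print(chunks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Because chunks are cut from one continuous stream, a single chunk can span the end of one story and the start of the next.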

Training Hyperparameters

Training regime: bf16 non-mixed precision
Optimizer: AdamW
LR schedule: Linear warmup → cosine decay to 0
Grad clip: 1.0
Steps: up to 5,000 (some runs stopped early)

See the Runs Summary below for per-run hyperparameters.
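For reference, a linear-warmup, cosine-decay-to-zero schedule can be written as a small function. This is a sketch under assumptions: `lr_at_step` is a hypothetical name, the step indexing is illustrative, and the training script may use a library scheduler instead.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp linearly from 0 to peak_lr over the warmup window.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Settings of the best run (peak LR 16e-3, 1,600 warmup steps), assuming a 5,000-step horizon.
for step in (0, 800, 1600, 3300, 5000):
    print(step, lr_at_step(step, peak_lr=16e-3, warmup_steps=1600, total_steps=5000))
```

With these settings the rate reaches its peak at step 1,600, falls to half the peak midway through the decay phase (step 3,300), and reaches 0 at step 5,000.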

Speeds, Sizes, Times [optional]

Trained on RTX 5050 (8 GB VRAM, CUDA 13.2)
Each run takes roughly 15–30 minutes depending on step count
Checkpoint size: ~4 MB per subfolder (model.safetensors)

Evaluation

Testing Data, Factors & Metrics

Testing Data

No held-out evaluation set was used. Loss was measured on the training stream only.

Factors

Runs were compared by: learning rate, effective batch size (via gradient accumulation), warmup steps, and sequence length.

Metrics

Training loss (cross-entropy)
Perplexity (exp of loss)
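The relationship between the two metrics is a one-liner. The sketch below (the `perplexity` helper is illustrative) also shows a useful sanity check: a model that predicts uniformly over a 32,000-token vocabulary has a loss of ln(32,000), which is close to the step-1 loss reported for the best run.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


# Loss of a uniform prediction over the 32,000-token vocabulary.
uniform_loss = math.log(32_000)
print(f"uniform loss = {uniform_loss:.2f}")  # uniform loss = 10.37

# Perplexity recomputed from the rounded final loss of the best run.
print(f"PPL at loss 4.18 = {perplexity(4.18):.1f}")  # PPL at loss 4.18 = 65.4
```

Values recomputed from the two-decimal losses can differ from the tabulated perplexities by around a percent, presumably because the tables were derived from unrounded losses.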

Results

Runs Summary

All runs trained on seq_len=256 unless noted. The best final loss (4.18) is from bs-512-lr-16e3-wu1600.

Run	LR	Effective batch size	Warmup steps	Final loss	Final PPL
baseline	3e-4	64	200	7.67	2,133
lr-1e3-sched	1e-3	64	200	6.19	489
bs-256	1e-3	256	200	5.93	377
bs-512-lr-1e3	1e-3	512	200	5.98	394
bs-512-lr-2e3	2e-3	512	200	5.03	153
bs-512-lr-4e3-wu400	4e-3	512	400	4.43	84
bs-512-lr-8e3-wu800	8e-3	512	800	4.23	68
bs-512-lr-16e3-wu1600	16e-3	512	1600	4.18	65
bs-512-lr-16e3-wu1600-seq512	16e-3	512	1600	4.20†	66†

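†The seq512 run used seq_len=512 and was cut short at ~1,958 steps, so its loss and PPL are the last recorded values rather than a full-length result.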

Loss milestones — best run (bs-512-lr-16e3-wu1600)

Step	Loss	PPL
1	10.38	32,092
250	6.14	463
500	5.14	171
1,000	4.52	92
1,500	4.31	75
2,000	4.18	65

Summary

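Learning rate is the dominant lever. At an effective batch size of 512, raising the LR from 1e-3 to 16e-3 cut final perplexity from 394 to 65, and the best run is a 33× reduction relative to the baseline (LR 3e-4, batch 64). Warmup was scaled proportionally with LR (200 steps at 2e-3 up to 1,600 at 16e-3) to avoid early instability. Batch size mattered less: at LR 1e-3, raising the effective batch from 64 to 256 lowered final loss from 6.19 to 5.93, but going to 512 gave no further gain (5.98).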
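The comparisons in this summary can be recomputed from the Runs Summary table. The snippet below is a convenience sketch: the values are copied from the table, the seq512 run is left out because it was cut short, and the variable names are illustrative.

```python
# (run, peak LR, effective batch size, warmup steps, final PPL), from the Runs Summary table.
RUNS = [
    ("baseline", 3e-4, 64, 200, 2133),
    ("lr-1e3-sched", 1e-3, 64, 200, 489),
    ("bs-256", 1e-3, 256, 200, 377),
    ("bs-512-lr-1e3", 1e-3, 512, 200, 394),
    ("bs-512-lr-2e3", 2e-3, 512, 200, 153),
    ("bs-512-lr-4e3-wu400", 4e-3, 512, 400, 84),
    ("bs-512-lr-8e3-wu800", 8e-3, 512, 800, 68),
    ("bs-512-lr-16e3-wu1600", 16e-3, 512, 1600, 65),
]

baseline_ppl = RUNS[0][4]
# How many times lower each run's final perplexity is than the baseline's.
ppl_ratio = {name: baseline_ppl / ppl for name, _, _, _, ppl in RUNS}
# Warmup steps per unit of learning rate, to show where warmup tracks LR.
warmup_per_lr = {name: round(warmup / lr) for name, lr, _, warmup, _ in RUNS}

for name, lr, batch, warmup, ppl in RUNS:
    print(f"{name:24s} lr={lr:<7g} batch={batch:<4d} warmup={warmup:<5d} "
          f"PPL={ppl:<5d} ({ppl_ratio[name]:.1f}x lower than baseline)")
```

From 2e-3 upward the warmup-to-LR ratio is constant at 100,000 steps per unit of LR, which is what proportional scaling means here.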

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019), "Quantifying the Carbon Emissions of Machine Learning" (arXiv:1910.09700).

Hardware Type: NVIDIA RTX 5050 (8 GB VRAM)
Hours used: [More Information Needed]
Cloud Provider: Local (no cloud)
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

LlamaForCausalLM with the following configuration:

Property	Value
Hidden size	16
Layers	2
Attention heads	4
KV heads	4
Intermediate size	64
Head dim	64
Vocab size	32,000
Max position embeddings	8,192
Activation	SiLU
Positional encoding	RoPE (θ=10,000)
dtype	bfloat16
Parameters	~1M

Objective: standard next-token prediction (cross-entropy loss).
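The ~1M parameter figure can be roughly checked from the configuration table. The estimate below is a sketch under stated assumptions: input and output embeddings are not tied, linear layers have no bias, attention projection widths are heads × head dim, and the head dim is taken at face value from the table. `llama_param_count` is a hypothetical helper, not a library function.

```python
def llama_param_count(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    intermediate_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Estimate the parameter count of a LLaMA-style decoder without bias terms."""
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    q_proj = hidden_size * num_heads * head_dim
    k_proj = hidden_size * num_kv_heads * head_dim
    v_proj = hidden_size * num_kv_heads * head_dim
    o_proj = num_heads * head_dim * hidden_size
    mlp = 3 * hidden_size * intermediate_size  # gate, up and down projections
    norms = 2 * hidden_size  # two RMSNorm weight vectors per layer
    per_layer = q_proj + k_proj + v_proj + o_proj + mlp + norms
    final_norm = hidden_size
    return embed + lm_head + num_layers * per_layer + final_norm


total = llama_param_count(
    vocab_size=32_000, hidden_size=16, num_layers=2, num_heads=4,
    num_kv_heads=4, head_dim=64, intermediate_size=64,
)
print(f"{total:,}")  # 1,062,992
```

Under these assumptions the two vocabulary-sized matrices account for 1,024,000 of the parameters, so the transformer layers themselves contribute under 4% of the total.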

Compute Infrastructure

Single consumer GPU, local workstation.

Hardware

NVIDIA RTX 5050, 8 GB VRAM, CUDA 13.2

Software

Python (uv-managed environment)
PyTorch
HuggingFace Transformers 5.9.0
HuggingFace Datasets (streaming)

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

Effective batch size: batch_size × grad_accum. Runs using gradient accumulation (×8 or ×16) maintain the same GPU memory footprint as bs=64 while computing gradients over a larger token count per update.
PPL: Perplexity, computed as exp(loss). Lower is better.
Warmup: The first N steps during which the learning rate ramps linearly from 0 to peak LR before cosine decay begins.
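As a worked example of the effective batch size definition, the arithmetic below computes tokens per optimizer update. The pairing of a per-device batch of 64 with 8 accumulation steps is an assumption chosen to reach 512, and the helper names are illustrative.

```python
def effective_batch_size(micro_batch: int, grad_accum: int) -> int:
    """Sequences contributing to one optimizer update."""
    return micro_batch * grad_accum


def tokens_per_update(effective_batch: int, seq_len: int) -> int:
    """Tokens seen per optimizer update when every sequence is fully packed."""
    return effective_batch * seq_len


micro_batch = 64  # assumption: per-step batch held in GPU memory
grad_accum = 8    # assumption: accumulation steps per optimizer update

effective = effective_batch_size(micro_batch, grad_accum)
tokens = tokens_per_update(effective, seq_len=256)
print(effective, tokens)  # 512 131072
```

Because sequences are packed without padding, every position in a batch is a real token, so at seq_len=256 an effective batch of 512 corresponds to 131,072 tokens per update.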

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

Syafiq Kamarul Azman

Model Card Contact

[More Information Needed]
