Model Card for syaffers/tiny-random-storywriter-base

A full finetune of hmellor/tiny-random-LlamaForCausalLM (~1M parameters, hidden size 16, 2 layers) on TinyStories. This is a hyperparameter tuning sandbox — the goal was to see how far aggressive LR scaling and larger effective batch sizes could push training loss on a deliberately tiny LLaMA model.

Model Details

Model Description

This model card covers a series of full finetuning runs on a near-randomly-initialised LLaMA variant. The base model has a hidden size of 16 and only 2 transformer layers, making it trainable on a single consumer GPU in minutes. Multiple runs were conducted sweeping learning rate, effective batch size, warmup steps, and sequence length. The best checkpoint achieves a final training loss of 4.18 (PPL 65), down from a baseline of 7.67 (PPL 2133) — a 33× reduction in perplexity through hyperparameter tuning alone.

  • Developed by: Syafiq Kamarul Azman
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: Syafiq Kamarul Azman
  • Model type: Causal language model (LlamaForCausalLM)
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: hmellor/tiny-random-LlamaForCausalLM

Model Sources [optional]

Uses

Direct Use

Load any checkpoint subfolder and generate text. The model produces English prose in the style of children's short stories, consistent with its training data. Output quality is limited by the model's tiny capacity.

Downstream Use [optional]

This model is not intended for downstream use in production systems. It may be useful as a fast-iteration target for testing training pipelines, tokenizer integrations, or generation code.

Out-of-Scope Use

This model should not be used for any task requiring factual accuracy, reasoning, instruction following, or real-world language understanding. It has ~1M parameters and was trained solely on synthetic children's stories.

Bias, Risks, and Limitations

The training data (TinyStories) is a synthetically generated dataset of simple English children's stories. The model will reflect the vocabulary, topics, and narrative patterns of that dataset exclusively. It has no knowledge of the world beyond those stories and will produce incoherent output for any other input distribution.

Recommendations

This model is a research/learning artefact. Do not deploy it in any application where output quality or safety matters.

How to Get Started with the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "syaffers/tiny-random-storywriter-base"
# Each run is stored in its own subfolder of the repository; this is the best checkpoint.
subfolder = "bs-512-lr-16e3-wu1600"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder, torch_dtype=torch.bfloat16
)

# Sample up to 100 new tokens from a story-style prompt.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

TinyStories — a dataset of synthetically generated short stories for children, used in streaming mode (train split). Tokens are packed into fixed-length chunks with an EOS token appended after each story.

Training Procedure

Preprocessing [optional]

Text is tokenised with the base model's tokenizer, EOS is appended per story, and tokens are packed into contiguous chunks of seq_len tokens. No padding is used.
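
The packing step described above can be sketched roughly as follows. This is a minimal illustration, not the training script itself: the function name `pack_stories`, the toy token ids, and the choice to drop an incomplete trailing chunk are all assumptions.

```python
from typing import Iterable, Iterator, List


def pack_stories(
    tokenized_stories: Iterable[List[int]],
    eos_id: int,
    seq_len: int,
) -> Iterator[List[int]]:
    """Yield fixed-length chunks from a stream of tokenised stories.

    An EOS token is appended after each story, and the resulting token
    stream is cut into contiguous chunks of exactly seq_len tokens.
    Assumption: a trailing partial chunk is dropped, so no padding is needed.
    """
    buffer: List[int] = []
    for story in tokenized_stories:
        buffer.extend(story)
        buffer.append(eos_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


# Toy example with made-up token ids and eos_id=0.
stories = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = list(pack_stories(stories, eos_id=0, seq_len=4))
print(chunks)
```

Because chunks are cut from one continuous stream, a single chunk can span the end of one story and the start of the next, with the EOS token marking the boundary.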

Training Hyperparameters

  • Training regime: bf16 non-mixed precision
  • Optimizer: AdamW
  • LR schedule: Linear warmup → cosine decay to 0
  • Grad clip: 1.0
  • Steps: up to 5,000 (some runs stopped early)

See the Runs Summary below for per-run hyperparameters.
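
The schedule listed above (linear warmup, then cosine decay to 0) can be written as a small function. This is a sketch under assumptions: the name `lr_at_step` is hypothetical, and `total_steps=5000` is taken from the "up to 5,000" figure, which may not match a run that stopped early.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Settings of the bs-512-lr-16e3-wu1600 run; total_steps is an assumption.
peak_lr, warmup, total = 16e-3, 1600, 5000
for step in (0, 800, 1600, 3300, 5000):
    print(step, lr_at_step(step, peak_lr, warmup, total))
```

Transformers provides `get_cosine_schedule_with_warmup`, which produces the same shape; whether the training script used it is not stated here.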

Speeds, Sizes, Times [optional]

  • Trained on RTX 5050 (8 GB VRAM, CUDA 13.2)
  • Each run takes roughly 15–30 minutes depending on step count
  • Checkpoint size: ~4 MB per subfolder (model.safetensors)

Evaluation

Testing Data, Factors & Metrics

Testing Data

No held-out evaluation set was used. Loss was measured on the training stream only.

Factors

Runs were compared by: learning rate, effective batch size (via gradient accumulation), warmup steps, and sequence length.

Metrics

  • Training loss (cross-entropy)
  • Perplexity (exp of loss)
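
As a quick check of how the two metrics relate, the snippet below converts the reported baseline and best losses to perplexity. The helper name `perplexity` is illustrative. Because the inputs are the rounded losses from this card, the results can differ slightly from the reported PPL values, which were presumably computed from unrounded losses.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


baseline_ppl = perplexity(7.67)  # baseline run
best_ppl = perplexity(4.18)      # best run
ratio = baseline_ppl / best_ppl
print(round(best_ppl), round(ratio, 1))
```

The ratio depends only on the loss difference, since exp(7.67) / exp(4.18) = exp(3.49).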

Results

Runs Summary

All runs trained on seq_len=256 unless noted. Best final loss in bold.

| Run | LR | Effective batch size | Warmup steps | Final loss | Final PPL |
|---|---|---|---|---|---|
| baseline | 3e-4 | 64 | 200 | 7.67 | 2,133 |
| lr-1e3-sched | 1e-3 | 64 | 200 | 6.19 | 489 |
| bs-256 | 1e-3 | 256 | 200 | 5.93 | 377 |
| bs-512-lr-1e3 | 1e-3 | 512 | 200 | 5.98 | 394 |
| bs-512-lr-2e3 | 2e-3 | 512 | 200 | 5.03 | 153 |
| bs-512-lr-4e3-wu400 | 4e-3 | 512 | 400 | 4.43 | 84 |
| bs-512-lr-8e3-wu800 | 8e-3 | 512 | 800 | 4.23 | 68 |
| bs-512-lr-16e3-wu1600 | 16e-3 | 512 | 1600 | **4.18** | 65 |
| bs-512-lr-16e3-wu1600-seq512 | 16e-3 | 512 | 1600 | 4.20† | 66† |

†The seq512 run was cut short at ~1,958 steps.

Loss milestones — best run (bs-512-lr-16e3-wu1600)

| Step | Loss | PPL |
|---|---|---|
| 1 | 10.38 | 32,092 |
| 250 | 6.14 | 463 |
| 500 | 5.14 | 171 |
| 1,000 | 4.52 | 92 |
| 1,500 | 4.31 | 75 |
| 2,000 | 4.18 | 65 |
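
The step-1 loss is a useful sanity check. A model that assigns equal probability to every token has a cross-entropy of ln(vocab size), so a loss near that value at step 1 suggests the starting weights carry almost no information about the data. The snippet uses the vocabulary size of 32,000 from the architecture table in this card.

```python
import math

vocab_size = 32_000
# Cross-entropy (in nats) of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.2f}")
```

This lands within about 0.01 of the reported step-1 loss of 10.38.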

Summary

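Learning rate was the dominant lever in this sweep: moving from the baseline (LR 3e-4, effective batch 64) to LR 16e-3 at an effective batch of 512 reduced training perplexity by about 33× (2,133 to 65). Warmup was scaled in proportion to LR (200 steps at 2e-3 up to 1,600 steps at 16e-3) to avoid early instability. At LR 1e-3, raising the effective batch from 64 to 256 lowered the final loss from 6.19 to 5.93, while going further to 512 gave no additional gain (5.98); the remaining improvement at batch 512 came from the higher learning rates.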
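
One way to read "proportional warmup" is that the learning rate climbs by the same amount per step during warmup, however high the peak is. The sketch below checks this for the four runs from 2e-3 upward, using the LR and warmup values from the Runs Summary. The `runs` dictionary is only an illustrative data structure.

```python
# (peak LR, warmup steps) per run, copied from the Runs Summary.
runs = {
    "bs-512-lr-2e3": (2e-3, 200),
    "bs-512-lr-4e3-wu400": (4e-3, 400),
    "bs-512-lr-8e3-wu800": (8e-3, 800),
    "bs-512-lr-16e3-wu1600": (16e-3, 1600),
}

# LR increase per optimizer step during the linear warmup phase.
slopes = {name: peak_lr / warmup for name, (peak_lr, warmup) in runs.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:.1e} per step")
```

All four runs ramp at roughly 1e-5 per step, so a higher peak LR was reached over a longer ramp, not a steeper one.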

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA RTX 5050 (8 GB VRAM)
  • Hours used: [More Information Needed]
  • Cloud Provider: Local (no cloud)
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

LlamaForCausalLM with the following configuration:

| Property | Value |
|---|---|
| Hidden size | 16 |
| Layers | 2 |
| Attention heads | 4 |
| KV heads | 4 |
| Intermediate size | 64 |
| Head dim | 64 |
| Vocab size | 32,000 |
| Max position embeddings | 8,192 |
| Activation | SiLU |
| Positional encoding | RoPE (θ=10,000) |
| dtype | bfloat16 |
| Parameters | ~1M |

Objective: standard next-token prediction (cross-entropy loss).
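
The ~1M parameter figure can be roughly reproduced from the configuration values above. The sketch assumes a standard LLaMA-style layout: no bias terms, a gated MLP with three projections, two RMSNorm weights per layer plus a final norm, and an output head that is not tied to the input embedding. None of these assumptions is confirmed by this card, and `llama_param_count` is a hypothetical helper.

```python
def llama_param_count(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_heads: int,
    num_kv_heads: int,
    head_dim: int,
    intermediate_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Count weights in a LLaMA-style decoder, assuming no bias terms."""
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    q_proj = hidden_size * num_heads * head_dim
    kv_proj = 2 * hidden_size * num_kv_heads * head_dim  # k_proj and v_proj
    o_proj = num_heads * head_dim * hidden_size
    mlp = 3 * hidden_size * intermediate_size  # gate, up, and down projections
    norms = 2 * hidden_size  # two RMSNorm weight vectors per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = hidden_size
    return embed + lm_head + num_layers * per_layer + final_norm


total = llama_param_count(
    vocab_size=32_000,
    hidden_size=16,
    num_layers=2,
    num_heads=4,
    num_kv_heads=4,
    head_dim=64,
    intermediate_size=64,
)
print(f"{total:,}")
```

Under these assumptions, the embedding and output head account for 1,024,000 of the weights, and the two transformer layers contribute fewer than 40,000.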

Compute Infrastructure

Single consumer GPU, local workstation.

Hardware

NVIDIA RTX 5050, 8 GB VRAM, CUDA 13.2

Software

  • Python (uv-managed environment)
  • PyTorch
  • HuggingFace Transformers 5.9.0
  • HuggingFace Datasets (streaming)

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

  • Effective batch size: batch_size × grad_accum. Runs using gradient accumulation (for example, 64 × 8 = 512) maintain the same GPU memory footprint as bs=64 while computing each update over a larger number of tokens.
  • PPL: Perplexity, computed as exp(loss). Lower is better.
  • Warmup: The first N steps during which the learning rate ramps linearly from 0 to peak LR before cosine decay begins.
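
To illustrate the effective batch size entry, the toy example below shows that averaging gradients over several micro-batches gives the same gradient as one large batch. It uses a one-parameter least-squares model in plain Python, not the actual training loop. The data, the model, and every name in it are made up for illustration.

```python
from typing import List, Tuple


def grad_mse(w: float, batch: List[Tuple[float, float]]) -> float:
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up data and starting weight.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]
w = 0.5

# Gradient computed over the full batch of 8 examples at once.
full_grad = grad_mse(w, data)

# Same gradient, accumulated over 4 micro-batches of 2 examples each.
micro_batch_size = 2
grad_accum = len(data) // micro_batch_size
accumulated = 0.0
for k in range(grad_accum):
    micro = data[k * micro_batch_size:(k + 1) * micro_batch_size]
    accumulated += grad_mse(w, micro) / grad_accum

effective_batch = micro_batch_size * grad_accum
print(effective_batch, full_grad, accumulated)

# Tokens per optimizer update at effective batch 512 and seq_len 256.
tokens_per_update = 512 * 256
print(tokens_per_update)
```

Only one micro-batch is held in memory at a time, which is why accumulation raises the effective batch without raising peak memory use.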

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

Syafiq Kamarul Azman

Model Card Contact

[More Information Needed]
