Model Card for syaffers/tiny-random-storywriter-base
A full finetune of hmellor/tiny-random-LlamaForCausalLM (~1M parameters, hidden size 16, 2 layers) on TinyStories. This is a hyperparameter tuning sandbox — the goal was to see how far aggressive LR scaling and larger effective batch sizes could push training loss on a deliberately tiny LLaMA model.
Model Details
Model Description
This model card covers a series of full finetuning runs on a near-randomly-initialised LLaMA variant. The base model has a hidden size of 16 and only 2 transformer layers, making it trainable on a single consumer GPU in minutes. Multiple runs were conducted sweeping learning rate, effective batch size, warmup steps, and sequence length. The best checkpoint achieves a final training loss of 4.18 (PPL 65), down from a baseline of 7.67 (PPL 2133) — a 33× reduction in perplexity through hyperparameter tuning alone.
- Developed by: Syafiq Kamarul Azman
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: Syafiq Kamarul Azman
- Model type: Causal language model (LlamaForCausalLM)
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: hmellor/tiny-random-LlamaForCausalLM
Model Sources [optional]
- Repository: https://github.com/syaffers/tiny-random-fft
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
Load any checkpoint subfolder and generate text. The model produces English prose in the style of children's short stories, consistent with its training data. Output quality is limited by the model's tiny capacity.
Downstream Use [optional]
This model is not intended for downstream use in production systems. It may be useful as a fast-iteration target for testing training pipelines, tokenizer integrations, or generation code.
Out-of-Scope Use
This model should not be used for any task requiring factual accuracy, reasoning, instruction following, or real-world language understanding. It has ~1M parameters and was trained solely on synthetic children's stories.
Bias, Risks, and Limitations
The training data (TinyStories) is a synthetically generated dataset of simple English children's stories. The model will reflect the vocabulary, topics, and narrative patterns of that dataset exclusively. It has no knowledge of the world beyond those stories and will produce incoherent output for any other input distribution.
Recommendations
This model is a research/learning artefact. Do not deploy it in any application where output quality or safety matters.
How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "syaffers/tiny-random-storywriter-base"
subfolder = "bs-512-lr-16e3-wu1600"  # best checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = AutoModelForCausalLM.from_pretrained(
    model_id, subfolder=subfolder, torch_dtype=torch.bfloat16
)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
Training Data
TinyStories — a dataset of synthetically generated short stories for children, used in streaming mode (train split). Tokens are packed into fixed-length chunks with an EOS token appended after each story.
Training Procedure
Preprocessing [optional]
Text is tokenised with the base model's tokenizer, EOS is appended per story, and tokens are packed into contiguous chunks of seq_len tokens. No padding is used.
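The packing step described above can be sketched in plain Python. This is a minimal illustration, not the training script itself: the function name `pack_stories`, the toy token ids, and `eos_id=2` are hypothetical, and dropping the trailing partial chunk is an assumption the card does not confirm.

```python
from typing import Iterable, Iterator, List


def pack_stories(
    tokenised_stories: Iterable[List[int]],
    seq_len: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Yield contiguous chunks of exactly seq_len token ids, with no padding."""
    buffer: List[int] = []
    for ids in tokenised_stories:
        buffer.extend(ids)
        buffer.append(eos_id)  # EOS marks the end of each story
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Assumption: a trailing partial chunk is dropped rather than padded.


# Toy token ids (hypothetical), with a tiny seq_len for readability.
stories = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = list(pack_stories(stories, seq_len=4, eos_id=2))
print(chunks)
```

Because stories are concatenated before chunking, a chunk can begin or end in the middle of a story; the EOS token is what tells the model where one story stops and the next starts.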
Training Hyperparameters
- Training regime: bf16 non-mixed precision
- Optimizer: AdamW
- LR schedule: Linear warmup → cosine decay to 0
- Grad clip: 1.0
- Steps: up to 5,000 (some runs stopped early)
See the Runs Summary below for per-run hyperparameters.
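As a rough sketch of the schedule listed above, linear warmup followed by cosine decay to 0 can be written as a function of the step index. The name `lr_at_step` is hypothetical, and the exact step convention of the scheduler used in training is not stated in this card, so the boundaries here are an assumption.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Values taken from the best run: peak LR 16e-3, 1,600 warmup steps,
# and the 5,000-step budget listed above.
halfway_warmup = lr_at_step(800, 16e-3, 1600, 5000)
at_peak = lr_at_step(1600, 16e-3, 1600, 5000)
at_end = lr_at_step(5000, 16e-3, 1600, 5000)
```

Under these assumptions the rate is half the peak midway through warmup, reaches the peak exactly when warmup ends, and is 0 at the final step.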
Speeds, Sizes, Times [optional]
- Trained on RTX 5050 (8 GB VRAM, CUDA 13.2)
- Each run takes roughly 15–30 minutes depending on step count
- Checkpoint size: ~4 MB per subfolder (model.safetensors)
Evaluation
Testing Data, Factors & Metrics
Testing Data
No held-out evaluation set was used. Loss was measured on the training stream only.
Factors
Runs were compared by: learning rate, effective batch size (via gradient accumulation), warmup steps, and sequence length.
Metrics
- Training loss (cross-entropy)
- Perplexity (exp of loss)
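The relationship between the two metrics can be checked with a few lines of Python. The helper name `perplexity` is illustrative. Because the losses in this card are rounded to two decimals, perplexities recomputed from them can differ slightly from the tabulated values.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


baseline_ppl = perplexity(7.67)  # baseline run, rounded loss
best_ppl = perplexity(4.18)      # best run, rounded loss

# The ratio of two perplexities depends only on the loss difference:
# exp(7.67) / exp(4.18) == exp(7.67 - 4.18)
ratio = baseline_ppl / best_ppl
print(f"baseline PPL ~ {baseline_ppl:.0f}, best PPL ~ {best_ppl:.0f}, ratio ~ {ratio:.1f}x")
```

A loss difference of about 3.49 nats therefore corresponds to the roughly 33× perplexity reduction quoted in this card.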
Results
Runs Summary
All runs used seq_len=256 unless noted. The best final loss (4.18) comes from bs-512-lr-16e3-wu1600.
| Run | LR | Effective batch size | Warmup steps | Final loss | Final PPL |
|---|---|---|---|---|---|
| baseline | 3e-4 | 64 | 200 | 7.67 | 2,133 |
| lr-1e3-sched | 1e-3 | 64 | 200 | 6.19 | 489 |
| bs-256 | 1e-3 | 256 | 200 | 5.93 | 377 |
| bs-512-lr-1e3 | 1e-3 | 512 | 200 | 5.98 | 394 |
| bs-512-lr-2e3 | 2e-3 | 512 | 200 | 5.03 | 153 |
| bs-512-lr-4e3-wu400 | 4e-3 | 512 | 400 | 4.43 | 84 |
| bs-512-lr-8e3-wu800 | 8e-3 | 512 | 800 | 4.23 | 68 |
| bs-512-lr-16e3-wu1600 | 16e-3 | 512 | 1600 | 4.18 | 65 |
| bs-512-lr-16e3-wu1600-seq512 | 16e-3 | 512 | 1600 | 4.20† | 66† |
†seq512 run was cut short at ~1,958 steps.
Loss milestones — best run (bs-512-lr-16e3-wu1600)
| Step | Loss | PPL |
|---|---|---|
| 1 | 10.38 | 32,092 |
| 250 | 6.14 | 463 |
| 500 | 5.14 | 171 |
| 1,000 | 4.52 | 92 |
| 1,500 | 4.31 | 75 |
| 2,000 | 4.18 | 65 |
Summary
Learning rate was the dominant lever: raising it from 3e-4 to 16e-3, with longer warmup and an effective batch of 512, cut final perplexity from 2,133 to 65, roughly 33×. From 4e-3 upward, warmup steps were scaled in proportion to LR to avoid early instability. Effective batch size mattered less: at LR 1e-3, going from 64 to 256 lowered final loss from 6.19 to 5.93, while 512 at the same LR gave no further gain (5.98).
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA RTX 5050 (8 GB VRAM)
- Hours used: [More Information Needed]
- Cloud Provider: Local (no cloud)
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
LlamaForCausalLM with the following configuration:
| Property | Value |
|---|---|
| Hidden size | 16 |
| Layers | 2 |
| Attention heads | 4 |
| KV heads | 4 |
| Intermediate size | 64 |
| Head dim | 64 |
| Vocab size | 32,000 |
| Max position embeddings | 8,192 |
| Activation | SiLU |
| Positional encoding | RoPE (θ=10,000) |
| dtype | bfloat16 |
| Parameters | ~1M |
Objective: standard next-token prediction (cross-entropy loss).
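The "~1M" figure can be sanity-checked from the configuration table. The sketch below is a back-of-the-envelope count, not an inspection of the checkpoint. It assumes the usual LLaMA layer layout (q/k/v/o projections, a gated MLP with three matrices, two RMSNorm weights per layer, one final norm), no bias terms, and untied input and output embeddings. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(
    hidden: int,
    layers: int,
    heads: int,
    kv_heads: int,
    head_dim: int,
    intermediate: int,
    vocab: int,
    tie_embeddings: bool = False,  # assumption: embeddings are not tied
) -> int:
    """Approximate parameter count for a LLaMA-style decoder without biases."""
    embed = vocab * hidden
    attn = hidden * heads * head_dim          # q_proj
    attn += 2 * hidden * kv_heads * head_dim  # k_proj and v_proj
    attn += heads * head_dim * hidden         # o_proj
    mlp = 3 * hidden * intermediate           # gate, up, and down projections
    norms = 2 * hidden                        # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    final_norm = hidden
    lm_head = 0 if tie_embeddings else vocab * hidden
    return embed + layers * per_layer + final_norm + lm_head


# Values copied from the configuration table above.
total = llama_param_count(
    hidden=16, layers=2, heads=4, kv_heads=4,
    head_dim=64, intermediate=64, vocab=32_000,
)
print(f"{total:,} parameters")
```

Under these assumptions nearly all parameters sit in the two 32,000 × 16 embedding matrices; the two transformer layers together contribute under 40,000.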
Compute Infrastructure
Single consumer GPU, local workstation.
Hardware
NVIDIA RTX 5050, 8 GB VRAM, CUDA 13.2
Software
- Python (uv-managed environment)
- PyTorch
- HuggingFace Transformers 5.9.0
- HuggingFace Datasets (streaming)
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
- Effective batch size: batch_size × grad_accum. Runs using gradient accumulation (×8 or ×16) maintain the same GPU memory footprint as bs=64 while computing gradients over a larger token count per update.
- PPL: Perplexity, computed as exp(loss). Lower is better.
- Warmup: The first N steps during which the learning rate ramps linearly from 0 to peak LR before cosine decay begins.
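To illustrate why gradient accumulation yields the same update as one large batch, here is a toy pure-Python example. The one-parameter model, the data, and the function names are all hypothetical, and this is not the training loop used for these runs. It averages gradients over equal-sized micro-batches and compares the result with the gradient of the full batch.

```python
from typing import List


def mse_grad(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: List[float], ys: List[float], micro_batch_size: int) -> float:
    """Average per-micro-batch gradients, as gradient accumulation does."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Dividing each micro-batch gradient by the number of accumulation
        # steps keeps the result equal to the full-batch mean gradient.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)                               # one batch of 16
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)  # 8 micro-batches of 2
effective_batch = 2 * 8                                    # batch_size × grad_accum
```

Only one micro-batch is processed at a time, so peak memory follows the micro-batch size while the optimizer step sees the gradient of the whole effective batch. By the same arithmetic, a micro-batch of 64 accumulated 8 times gives an effective batch of 512.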
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
Syafiq Kamarul Azman
Model Card Contact
[More Information Needed]