
Model Card: TinyLLM (5M Parameters)

📌 Model Overview

TinyLLM is a custom, lightweight Transformer-based causal language model trained entirely from scratch. With approximately 5 million parameters, it is designed to be efficient and accessible for education, rapid prototyping, and deployment on edge devices.

The model was trained in two distinct hardware environments (Dual T4 GPUs and TPUs) to experiment with different accelerator optimizations, and the weights are provided for both training runs.

🏗️ Architecture Specifications

The model uses a standard autoregressive Transformer decoder architecture with the following hyperparameters:

| Parameter | Value |
|---|---|
| Total Parameters | ~5 Million |
| Vocab Size | 50,257 (GPT-2 Tokenizer) |
| Hidden Size (d_model) | 128 |
| Number of Layers | 4 |
| Number of Attention Heads | 4 |
| Max Sequence Length | 512 |
| Activation Function | GELU |
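
As a rough cross-check, the parameter count can be worked out from the hyperparameters above. The sketch below assumes the layer layout of the PyTorch class in the usage section (a 4x feed-forward width, biases on the attention and feed-forward projections, two LayerNorms per layer, and an output head tied to the token embedding). The function name `count_parameters` is illustrative and not part of the repository.

```python
def count_parameters(vocab_size=50257, d_model=128, n_layers=4,
                     max_seq_len=512, ffn_mult=4, tie_embeddings=True):
    """Estimate parameter counts for a GPT-style model built from
    nn.TransformerEncoderLayer blocks (assumed layout, see lead-in)."""
    d_ff = ffn_mult * d_model
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    # Q, K, V and output projections, each d_model x d_model plus a bias
    attn = 4 * d_model * d_model + 4 * d_model
    # Two linear layers (d_model -> d_ff -> d_model) with biases
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two LayerNorms per layer, each with a weight and a bias vector
    norms = 2 * 2 * d_model
    layers = n_layers * (attn + ffn + norms)
    # A tied output head reuses the token embedding matrix
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return {
        "embeddings": token_emb + pos_emb,
        "transformer_layers": layers,
        "lm_head": lm_head,
        "total": token_emb + pos_emb + layers + lm_head,
    }


tied = count_parameters()
untied = count_parameters(tie_embeddings=False)
print(tied["total"], untied["total"])
```

Under these assumptions the embedding table dominates: the four Transformer layers hold about 0.8 million parameters, while the tied total comes to about 7.3 million (about 13.7 million if the output head is stored separately). This is somewhat above the rounded ~5 million headline figure; the card does not say how that figure was counted.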

⚙️ Training Details

  • Dataset: The model was trained on a clean subset (first 100,000 rows) of the roneneldan/TinyStories dataset.
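  • Total Training Tokens: ~50 million tokens (100k examples × 512 max length = 51.2 million token positions). This is an upper bound if shorter stories are padded rather than packed.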
  • Optimizer: AdamW
  • Learning Rate: 5e-4
  • Batch Size: 16
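
The token figure above follows from fixing every example to 512 positions. A minimal sketch of that preparation step is shown below, assuming stories are truncated or right-padded to the maximum length and that the GPT-2 end-of-text id (50256) doubles as the padding id, as in the tokenizer setup in the usage section. The helper names are hypothetical; the card does not document how `train.py` actually batches its data.

```python
MAX_LEN = 512
PAD_ID = 50256  # GPT-2 end-of-text token id, reused here as the pad id (assumption)


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Cut a token id list to max_len, or right-pad it up to max_len."""
    ids = list(token_ids[:max_len])
    return ids + [pad_id] * (max_len - len(ids))


def count_tokens(examples, max_len=MAX_LEN):
    """Return (positions, real_tokens) for a list of token id lists.

    positions counts every slot after padding; real_tokens counts only
    the non-padding tokens that survive truncation.
    """
    positions = len(examples) * max_len
    real_tokens = sum(min(len(e), max_len) for e in examples)
    return positions, real_tokens


# 100,000 examples at 512 positions each
print(100_000 * MAX_LEN)
```

The gap between `positions` and `real_tokens` is the reason the ~50 million figure is best read as a ceiling rather than an exact count.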

💻 Hardware Configuration

This model was trained separately on two different hardware setups to ensure cross-platform compatibility:

  1. Dual GPU Setup: Trained on 2x NVIDIA T4 GPUs utilizing nn.DataParallel and Automatic Mixed Precision (AMP) with Float16 for memory efficiency and speed.
  2. TPU Setup: Trained on Google TPU using PyTorch XLA (torch_xla). The TPU environment was forced to use BFloat16 (XLA_USE_BF16='1') for maximum speed and memory savings.

🛠️ Train from Scratch (train.py)

If you wish to train this model from scratch, the repository includes a train.py script. You can use this script to replicate the training process or experiment with your own datasets.

Hardware Requirements & Specs (Default Support)

The train.py script is designed to support two major hardware configurations out of the box. You can run it on either setup depending on your hardware availability:

  • Option 1: Google TPU v5e

    • Optimizations: Fully optimized with PyTorch XLA (torch_xla).
    • Precision: Utilizes BFloat16 mixed-precision training (via XLA_USE_BF16=1) for maximum speed and minimal memory consumption.
    • Best For: Fast training and handling larger batch sizes efficiently.
  • Option 2: 2x NVIDIA T4 GPUs (Dual GPU Setup)

    • Optimizations: Uses nn.DataParallel to distribute the workload across both GPUs.
    • Precision: Uses Automatic Mixed Precision (AMP) in Float16 with loss scaling to reduce memory use, which helps avoid Out of Memory (OOM) errors on the primary GPU.
    • Best For: Running on easily accessible cloud environments like Kaggle or Google Colab Pro.
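
One way a script could choose between the two options at startup is sketched below. This is a hypothetical helper, not taken from `train.py`: it treats an installed `torch_xla` package as a sign that a TPU is intended, which is a simplification, and otherwise falls back to CUDA or CPU. The optional arguments exist only so the decision logic can be exercised without the hardware.

```python
import importlib.util


def pick_backend(has_xla=None, n_gpus=None):
    """Choose a training configuration from the available hardware."""
    if has_xla is None:
        # Assumption: torch_xla being importable means a TPU run is intended
        has_xla = importlib.util.find_spec("torch_xla") is not None
    if n_gpus is None:
        import torch
        n_gpus = torch.cuda.device_count()
    if has_xla:
        return {"backend": "tpu", "precision": "bfloat16", "data_parallel": False}
    if n_gpus > 0:
        # More than one GPU: wrap the model in nn.DataParallel
        return {"backend": "cuda", "precision": "float16",
                "data_parallel": n_gpus > 1}
    return {"backend": "cpu", "precision": "float32", "data_parallel": False}
```

For the dual T4 setup, `pick_backend(has_xla=False, n_gpus=2)` selects Float16 with data parallelism enabled.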

📦 Available Model Files

Depending on your preference and inference engine (such as llama.cpp or native PyTorch), you can download the specific weights from the files section:

  • tiny_custom_model.gguf: The quantized/GGUF version of the model trained on the 2x T4 GPU setup. Best for CPU inference or llama.cpp.
  • tiny_custom_model_tpu.gguf: The quantized/GGUF version of the model trained on the Google TPU setup.
  • tiny_model_t4.pth / .safetensors: The raw PyTorch and Safetensors weights for the GPU-trained version.
  • tiny_model_tpu.pth / .safetensors: The raw PyTorch and Safetensors weights for the TPU-trained version.
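
Either raw weight format can be read into a plain state dict before being passed to `load_state_dict`. The helper below is a sketch with a hypothetical name, `load_state_dict_file`. It assumes each file holds a bare state dict, and it strips a leading `module.` from parameter names in case the GPU checkpoint was saved from an `nn.DataParallel` wrapper; the card does not say whether that prefix is present.

```python
import torch


def load_state_dict_file(path):
    """Load a .pth or .safetensors checkpoint as a CPU state dict."""
    if path.endswith(".safetensors"):
        # Requires the optional safetensors package
        from safetensors.torch import load_file
        state = load_file(path, device="cpu")
    else:
        state = torch.load(path, map_location="cpu")
    prefix = "module."  # added to every key by nn.DataParallel
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


# Usage with the model class from the usage section:
# model.load_state_dict(load_state_dict_file("tiny_model_t4.pth"))
```

If a Safetensors file stores only one copy of the tied embedding and output matrices, `load_state_dict(..., strict=False)` may be needed; that depends on how the file was exported.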

🚀 How to Use (PyTorch)

import torch
import torch.nn as nn
from transformers import AutoTokenizer

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token 

# Define Model Architecture (TinyLLM)
class TinyLLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=128, n_heads=4, n_layers=4, max_seq_len=512):
        super().__init__()
        self.d_model = d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        
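        # A decoder-only (GPT-style) block is an encoder layer plus a causal mask,
        # so nn.TransformerEncoderLayer is used rather than nn.TransformerDecoderLayer.
        decoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, activation="gelu"
        )
        self.transformer = nn.TransformerEncoder(decoder_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the input embedding and the output projection share one matrix.
        self.token_emb.weight = self.lm_head.weight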

    def forward(self, x):
        seq_len = x.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        positions = torch.arange(0, seq_len, dtype=torch.long, device=x.device).unsqueeze(0)
        x = self.token_emb(x) + self.pos_emb(positions)
        x = self.transformer(x, mask=mask, is_causal=True)
        return self.lm_head(x)

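# Load Weights
model = TinyLLM()
# map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load('tiny_model_t4.pth', map_location='cpu'))  # Or use the TPU .pth file
model.eval()

# Text generation requires a custom loop: this barebones architecture has no built-in generate() method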
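
Because the class defines no `generate()` method, text has to be produced with a hand-written loop. The following is a minimal sketch of one, supporting greedy decoding (`temperature=0`) and temperature or top-k sampling. The function name and its parameters are illustrative and not part of the repository; it assumes a batch size of 1 and a model that returns logits of shape `[batch, seq_len, vocab_size]`, as the class above does.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, max_seq_len=512,
             temperature=1.0, top_k=None, eos_token_id=None):
    """Extend input_ids (shape [1, T]) one token at a time."""
    model.eval()
    for _ in range(max_new_tokens):
        # The positional embedding table only covers max_seq_len positions
        context = input_ids[:, -max_seq_len:]
        logits = model(context)[:, -1, :]  # logits for the next token
        if temperature == 0:
            next_id = logits.argmax(dim=-1, keepdim=True)  # greedy decoding
        else:
            logits = logits / temperature
            if top_k is not None:
                k = min(top_k, logits.size(-1))
                kth_value = torch.topk(logits, k).values[:, -1:]
                # Mask everything below the k-th largest logit
                logits = logits.masked_fill(logits < kth_value, float("-inf"))
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id], dim=1)
        if eos_token_id is not None and next_id.item() == eos_token_id:
            break
    return input_ids


# Usage with the tokenizer and model loaded above:
# prompt = tokenizer("Once upon a time", return_tensors="pt").input_ids
# output = generate(model, prompt, max_new_tokens=100, temperature=0.8,
#                   top_k=40, eos_token_id=tokenizer.eos_token_id)
# print(tokenizer.decode(output[0]))
```

The loop re-runs the full context at every step, which is simple but quadratic in sequence length; for a model this small that is usually acceptable.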