Model Card: TinyLLM (5M Parameters)
Model Overview
TinyLLM is a custom, lightweight Transformer-based causal language model trained entirely from scratch. With approximately 5 Million parameters, it is designed to be highly efficient and accessible for educational purposes, rapid prototyping, and deployment on edge devices.
The model was trained in two distinct hardware environments (Dual T4 GPUs and TPUs) to experiment with different accelerator optimizations, and the weights are provided for both training runs.
Architecture Specifications
The model uses a standard autoregressive Transformer decoder architecture with the following hyperparameters:
| Parameter | Value |
|---|---|
| Total Parameters | ~5 Million |
| Vocab Size | 50,257 (GPT-2 Tokenizer) |
| Hidden Size (d_model) | 128 |
| Number of Layers | 4 |
| Number of Attention Heads | 4 |
| Max Sequence Length | 512 |
| Activation Function | GELU |
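The table can be cross-checked with a little arithmetic. The sketch below is an estimate, not an official figure: it assumes the layer layout of the PyTorch definition given later in this card (tied token embedding and output head, learned position embeddings, a 4 × d_model feed-forward width, biases on every linear layer, and two LayerNorms per layer). The function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab_size=50257, d_model=128, n_layers=4,
                     max_seq_len=512, ffn_mult=4):
    """Estimate trainable parameters for the configuration in the table."""
    d_ff = ffn_mult * d_model

    # Token embedding (shared with the output head, so counted once)
    # plus the learned position embedding.
    embeddings = vocab_size * d_model + max_seq_len * d_model

    # Q, K, V and output projections, each d_model x d_model with a bias.
    attention = 4 * d_model * d_model + 4 * d_model

    # Two linear layers (d_model -> d_ff -> d_model) with biases.
    feed_forward = 2 * d_model * d_ff + d_ff + d_model

    # Two LayerNorms per layer, each with a weight and a bias vector.
    layer_norms = 2 * 2 * d_model

    per_layer = attention + feed_forward + layer_norms
    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "total": embeddings + n_layers * per_layer,
    }


if __name__ == "__main__":
    for name, value in count_parameters().items():
        print(f"{name:>10}: {value:,}")
```

Under these assumptions the total comes to roughly 7.3 million, dominated by the 50,257 × 128 embedding table, so the headline parameter figure is best treated as approximate.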
Training Details
- Dataset: The model was trained on a clean subset (first 100,000 rows) of the roneneldan/TinyStories dataset.
- Total Training Tokens: ~50 Million tokens (100k examples × 512 max length).
- Optimizer: AdamW
- Learning Rate: 5e-4
- Batch Size: 16
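The token figure above follows from simple arithmetic, sketched below. Two assumptions are made here that the card does not state: every example is padded or truncated to exactly 512 tokens, and the batch size of 16 is the global batch size. The helper name `training_budget` is hypothetical.

```python
import math


def training_budget(num_examples=100_000, max_seq_len=512, batch_size=16):
    """Upper bounds for one pass over the training subset."""
    return {
        # Every example counted at full length, so this is a ceiling;
        # stories shorter than max_seq_len contribute fewer real tokens.
        "max_tokens_per_epoch": num_examples * max_seq_len,
        "steps_per_epoch": math.ceil(num_examples / batch_size),
        "max_tokens_per_step": batch_size * max_seq_len,
    }


if __name__ == "__main__":
    for name, value in training_budget().items():
        print(f"{name}: {value:,}")
```

The product 100,000 × 512 is 51.2 million, which matches the rounded ~50 Million figure when read as an upper bound.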
Hardware Configuration
This model was trained separately on two different hardware setups to ensure cross-platform compatibility:
- Dual GPU Setup: Trained on 2x NVIDIA T4 GPUs utilizing `nn.DataParallel` and Automatic Mixed Precision (AMP) with `Float16` for memory efficiency and speed.
- TPU Setup: Trained on Google TPU using PyTorch XLA (`torch_xla`). The TPU environment was forced to use `BFloat16` (`XLA_USE_BF16='1'`) for maximum speed and memory savings.
Train from Scratch (train.py)
If you wish to train this model from scratch, the repository includes a train.py script. You can use this script to replicate the training process or experiment with your own datasets.
Hardware Requirements & Specs (Default Support)
The train.py script is designed to support two major hardware configurations out of the box. You can run it on either setup depending on your hardware availability:
Option 1: Google TPU v5e
- Optimizations: Fully optimized with PyTorch XLA (`torch_xla`).
- Precision: Utilizes `BFloat16` mixed-precision training (via `XLA_USE_BF16=1`) for maximum speed and minimal memory consumption.
- Best For: Fast training and handling larger batch sizes efficiently.

Option 2: 2x NVIDIA T4 GPUs (Dual GPU Setup)
- Optimizations: Uses `nn.DataParallel` to distribute the workload across both GPUs.
- Precision: Implements Automatic Mixed Precision (AMP) with `Float16` and gradient scaling, which lowers memory use and helps avoid Out of Memory (OOM) errors on the primary GPU.
- Best For: Running on easily accessible cloud environments like Kaggle or Google Colab Pro.
Available Model Files
Depending on your preference and inference engine (such as llama.cpp or native PyTorch), you can download the specific weights from the files section:
- `tiny_custom_model.gguf`: The quantized/GGUF version of the model trained on the 2x T4 GPU setup. Best for CPU inference or `llama.cpp`.
- `tiny_custom_model_tpu.gguf`: The quantized/GGUF version of the model trained on the Google TPU setup.
- `tiny_model_t4.pth` / `.safetensors`: The raw PyTorch and Safetensors weights for the GPU-trained version.
- `tiny_model_tpu.pth` / `.safetensors`: The raw PyTorch and Safetensors weights for the TPU-trained version.
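One way to fetch a specific file programmatically is `hf_hub_download` from the `huggingface_hub` package. The sketch below rests on assumptions: the repository id is taken to be `Nebulixlabs/Nutral-v2-Tiny`, the `.safetensors` names are inferred from the `.pth` names in the list above, and the helpers `weight_filename` and `fetch_weights` are hypothetical.

```python
REPO_ID = "Nebulixlabs/Nutral-v2-Tiny"  # assumed repository id

# (training run, format) -> file name. The .safetensors names are inferred
# from the .pth names and may not match the repository exactly.
WEIGHT_FILES = {
    ("t4", "gguf"): "tiny_custom_model.gguf",
    ("tpu", "gguf"): "tiny_custom_model_tpu.gguf",
    ("t4", "pth"): "tiny_model_t4.pth",
    ("tpu", "pth"): "tiny_model_tpu.pth",
    ("t4", "safetensors"): "tiny_model_t4.safetensors",
    ("tpu", "safetensors"): "tiny_model_tpu.safetensors",
}


def weight_filename(run="t4", fmt="gguf"):
    """Look up the file name for a training run and weight format."""
    try:
        return WEIGHT_FILES[(run, fmt)]
    except KeyError:
        raise ValueError(f"unknown combination: run={run!r}, fmt={fmt!r}")


def fetch_weights(run="t4", fmt="gguf"):
    """Download one file from the Hub and return its local cache path."""
    # Imported lazily so the lookup above works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=weight_filename(run, fmt))
```

For example, `fetch_weights("tpu", "pth")` would return a local path that can be passed to `torch.load`, provided the file exists under that name.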
How to Use (PyTorch)
    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer

    # Load Tokenizer
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    # Define Model Architecture (TinyLLM)
    class TinyLLM(nn.Module):
        def __init__(self, vocab_size=50257, d_model=128, n_heads=4, n_layers=4, max_seq_len=512):
            super().__init__()
            self.d_model = d_model
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_seq_len, d_model)
            # Encoder layers plus a causal mask behave as a decoder-only stack
            decoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
                batch_first=True, activation="gelu"
            )
            self.transformer = nn.TransformerEncoder(decoder_layer, num_layers=n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            # Tie the input embedding and output projection weights
            self.token_emb.weight = self.lm_head.weight

        def forward(self, x):
            seq_len = x.size(1)
            mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
            positions = torch.arange(0, seq_len, dtype=torch.long, device=x.device).unsqueeze(0)
            x = self.token_emb(x) + self.pos_emb(positions)
            x = self.transformer(x, mask=mask, is_causal=True)
            return self.lm_head(x)

    # Load Weights (map_location lets GPU- or TPU-trained weights load on a CPU-only machine)
    model = TinyLLM()
    model.load_state_dict(torch.load('tiny_model_t4.pth', map_location="cpu"))  # Or use the TPU .pth file
    model.eval()

    # Generate Text Example (Requires custom generation loop for this barebones architecture)
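The custom generation loop mentioned above could look like the following sketch. It is not shipped with the repository: the function name `generate` and its parameters are assumptions, and it only expects a model that maps a `[1, T]` tensor of token ids to `[1, T, vocab]` logits, as the TinyLLM class above does.

```python
import torch


@torch.no_grad()
def generate(model, input_ids, max_new_tokens=50, temperature=1.0, max_seq_len=512):
    """Autoregressively extend input_ids (shape [1, T]) one token at a time."""
    model.eval()
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens so position indices stay in range.
        context = input_ids[:, -max_seq_len:]
        logits = model(context)[:, -1, :]  # logits for the next token only
        if temperature <= 0:
            # Greedy decoding: always pick the most likely token.
            next_id = logits.argmax(dim=-1, keepdim=True)
        else:
            # Sample from the temperature-scaled distribution.
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id], dim=1)
    return input_ids


# Usage with the tokenizer and model loaded above:
# ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids
# out = generate(model, ids, max_new_tokens=40, temperature=0.8)
# print(tokenizer.decode(out[0]))
```

The loop recomputes the full context at every step, which is simple but quadratic in sequence length; for a model this small and a 512-token window that cost is usually acceptable.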