Instructions for using Nebulixlabs/Nutral-v2-Tiny with libraries and local apps.

Libraries

How to use Nebulixlabs/Nutral-v2-Tiny with llama-cpp-python:

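# Llama.from_pretrained downloads from the Hub, so huggingface-hub is needed too.
# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached locally) and load it.
llm = Llama.from_pretrained(
	repo_id="Nebulixlabs/Nutral-v2-Tiny",
	filename="tiny_custom_model.gguf",
)

# Run a plain text completion; echo=True includes the prompt in the returned text.
output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
# The result is an OpenAI-style completion dict; print only the generated text.
print(output["choices"][0]["text"])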


llama.cpp

How to use Nebulixlabs/Nutral-v2-Tiny with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Nebulixlabs/Nutral-v2-Tiny
# Run inference directly in the terminal:
llama-cli -hf Nebulixlabs/Nutral-v2-Tiny

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Nebulixlabs/Nutral-v2-Tiny
# Run inference directly in the terminal:
llama-cli -hf Nebulixlabs/Nutral-v2-Tiny

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Nebulixlabs/Nutral-v2-Tiny
# Run inference directly in the terminal:
./llama-cli -hf Nebulixlabs/Nutral-v2-Tiny

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Nebulixlabs/Nutral-v2-Tiny
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Nebulixlabs/Nutral-v2-Tiny

Use Docker

docker model run hf.co/Nebulixlabs/Nutral-v2-Tiny


vLLM

How to use Nebulixlabs/Nutral-v2-Tiny with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Nebulixlabs/Nutral-v2-Tiny"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Nebulixlabs/Nutral-v2-Tiny",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
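
The same request can also be issued from Python. The sketch below mirrors the curl call using only the standard library. The helper names `build_completion_request` and `complete` are illustrative, and the code assumes the server started in the previous step is listening on `localhost:8000`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # assumed local server address
MODEL_ID = "Nebulixlabs/Nutral-v2-Tiny"


def build_completion_request(prompt, max_tokens=512, temperature=0.5, url=BASE_URL):
    """Build (but do not send) a POST request matching the curl example."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires a running server):
#   print(complete("Once upon a time,"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network traffic occurs.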

Use Docker

docker model run hf.co/Nebulixlabs/Nutral-v2-Tiny

Ollama
How to use Nebulixlabs/Nutral-v2-Tiny with Ollama:
```
ollama run hf.co/Nebulixlabs/Nutral-v2-Tiny
```

Unsloth Studio

How to use Nebulixlabs/Nutral-v2-Tiny with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Nebulixlabs/Nutral-v2-Tiny to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Nebulixlabs/Nutral-v2-Tiny to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Nebulixlabs/Nutral-v2-Tiny to start chatting

Docker Model Runner
How to use Nebulixlabs/Nutral-v2-Tiny with Docker Model Runner:
```
docker model run hf.co/Nebulixlabs/Nutral-v2-Tiny
```

Lemonade

How to use Nebulixlabs/Nutral-v2-Tiny with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Nebulixlabs/Nutral-v2-Tiny

Run and chat with the model

lemonade run user.Nutral-v2-Tiny-{{QUANT_TAG}}

List all available models

lemonade list

Model Card: TinyLLM (5M Parameters)

📌 Model Overview

TinyLLM is a custom, lightweight Transformer-based causal language model trained entirely from scratch. With approximately 5 million parameters, it is small enough for educational use, rapid prototyping, and deployment on edge devices.

The model was trained in two distinct hardware environments (Dual T4 GPUs and TPUs) to experiment with different accelerator optimizations, and the weights are provided for both training runs.

🏗️ Architecture Specifications

The model uses a standard autoregressive Transformer decoder architecture with the following hyperparameters:

| Parameter | Value |
| --- | --- |
| Total Parameters | ~5 Million |
| Vocab Size | 50,257 (GPT-2 Tokenizer) |
| Hidden Size (`d_model`) | 128 |
| Number of Layers | 4 |
| Number of Attention Heads | 4 |
| Max Sequence Length | 512 |
| Activation Function | GELU |
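
As a rough cross-check of these hyperparameters, the sketch below counts parameters by hand. It assumes the layer layout of the PyTorch class in the "How to Use (PyTorch)" section: learned positional embeddings, `nn.TransformerEncoderLayer` blocks with biases and two LayerNorms each, a feed-forward width of 4 × `d_model`, and an output projection tied to the token embedding. The function name `count_parameters` is illustrative. Under these assumptions the total comes to roughly 7.3 million with tied embeddings, or roughly 13.7 million if the output projection is counted separately (the latter matches the 13.7M figure in the page's GGUF metadata), so the table's ~5 million figure may rest on a different counting convention.

```python
def count_parameters(vocab_size=50257, d_model=128, n_layers=4,
                     max_seq_len=512, ff_mult=4, tie_embeddings=True):
    """Count trainable parameters for the layout assumed in the text above."""
    d_ff = ff_mult * d_model
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    # Self-attention: fused QKV projection plus output projection, with biases.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: two linear layers, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + feed_forward + layer_norms
    # A tied output head reuses the token embedding matrix and adds nothing.
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * per_layer + lm_head


print(f"{count_parameters(tie_embeddings=True):,}")   # 7,291,520
print(f"{count_parameters(tie_embeddings=False):,}")  # 13,724,416
```

Most of the total sits in the 50,257 × 128 embedding matrix; the four Transformer blocks together contribute under 0.8 million parameters.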

⚙️ Training Details

- Dataset: A clean subset (the first 100,000 rows) of the roneneldan/TinyStories dataset.
- Total Training Tokens: ~50 million (100,000 examples × 512 max length = 51.2 million token positions; this is an upper bound, since any story shorter than 512 tokens contributes fewer real tokens).
- Optimizer: AdamW
- Learning Rate: 5e-4
- Batch Size: 16
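
The figures above imply a simple training budget, worked out in the sketch below. It assumes that 16 is the global batch size and that every example occupies a full 512-token window; neither is stated explicitly, and the number of epochs is not given, so only per-epoch quantities are computed. The function name `training_budget` is illustrative.

```python
import math


def training_budget(num_examples=100_000, max_seq_len=512, batch_size=16):
    """Derive per-epoch quantities from the stated training settings."""
    return {
        # Optimizer steps needed to see every example once.
        "steps_per_epoch": math.ceil(num_examples / batch_size),
        # Upper bound on tokens per epoch (every example filled to max length).
        "max_token_positions": num_examples * max_seq_len,
        # Token positions processed in a single optimizer step.
        "token_positions_per_step": batch_size * max_seq_len,
    }


budget = training_budget()
print(budget["steps_per_epoch"])           # 6250
print(budget["max_token_positions"])       # 51200000
print(budget["token_positions_per_step"])  # 8192
```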

💻 Hardware Configuration

This model was trained separately on two different hardware setups to ensure cross-platform compatibility:

- Dual GPU Setup: Trained on 2x NVIDIA T4 GPUs utilizing nn.DataParallel and Automatic Mixed Precision (AMP) with Float16 for memory efficiency and speed.
- TPU Setup: Trained on Google TPU using PyTorch XLA (torch_xla). The TPU environment was forced to use BFloat16 (XLA_USE_BF16='1') for maximum speed and memory savings.

🛠️ Train from Scratch (`train.py`)

If you wish to train this model from scratch, the repository includes a train.py script. You can use this script to replicate the training process or experiment with your own datasets.

Hardware Requirements & Specs (Default Support)

The train.py script is designed to support two major hardware configurations out of the box. You can run it on either setup depending on your hardware availability:

Option 1: Google TPU v5e
- Optimizations: Fully optimized with PyTorch XLA (torch_xla).
- Precision: Utilizes BFloat16 mixed-precision training (via XLA_USE_BF16=1) for maximum speed and minimal memory consumption.
- Best For: Fast training and handling larger batch sizes efficiently.
Option 2: 2x NVIDIA T4 GPUs (Dual GPU Setup)
- Optimizations: Uses nn.DataParallel to distribute the workload across both GPUs.
- Precision: Uses Automatic Mixed Precision (AMP) with Float16 and loss scaling, which lowers memory use and helps avoid out-of-memory (OOM) errors on the primary GPU.
- Best For: Running on easily accessible cloud environments like Kaggle or Google Colab Pro.
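
The two options above amount to a small decision table. The sketch below is one way such a selection could look; it does not reproduce the actual logic in train.py, which is not shown here. The function names and the CPU fallback are assumptions added for illustration, and detecting an installed `torch_xla` package is only a proxy for having a TPU attached.

```python
import importlib.util


def detect_backend():
    """Guess the accelerator backend from what is installed and visible."""
    # Assumption: an installed torch_xla package signals a TPU environment.
    if importlib.util.find_spec("torch_xla") is not None:
        return "tpu"
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"  # hypothetical fallback, not one of the documented options


def select_config(backend, gpu_count=0):
    """Map a backend name to the precision settings described above."""
    if backend == "tpu":
        # BFloat16 is enabled through the XLA_USE_BF16 environment variable.
        return {"dtype": "bfloat16", "env": {"XLA_USE_BF16": "1"},
                "data_parallel": False, "grad_scaler": False}
    if backend == "cuda":
        # Float16 AMP; nn.DataParallel only helps with more than one GPU.
        return {"dtype": "float16", "env": {},
                "data_parallel": gpu_count > 1, "grad_scaler": True}
    return {"dtype": "float32", "env": {},
            "data_parallel": False, "grad_scaler": False}


print(select_config("cuda", gpu_count=2))
```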

📦 Available Model Files

Depending on your preference and inference engine (such as llama.cpp or native PyTorch), you can download the specific weights from the files section:

- tiny_custom_model.gguf: The quantized/GGUF version of the model trained on the 2x T4 GPU setup. Best for CPU inference or llama.cpp.
- tiny_custom_model_tpu.gguf: The quantized/GGUF version of the model trained on the Google TPU setup.
- tiny_model_t4.pth / .safetensors: The raw PyTorch and Safetensors weights for the GPU-trained version.
- tiny_model_tpu.pth / .safetensors: The raw PyTorch and Safetensors weights for the TPU-trained version.
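
Choosing a file comes down to two questions: which training run, and which format. The sketch below encodes that lookup. The helper name `weight_filename` is illustrative, and the `.safetensors` names are an assumption read from the "`.pth` / `.safetensors`" shorthand above, so check them against the files section before relying on them.

```python
REPO_ID = "Nebulixlabs/Nutral-v2-Tiny"

# (training hardware, file format) -> file name in the repository.
WEIGHT_FILES = {
    ("t4", "gguf"): "tiny_custom_model.gguf",
    ("tpu", "gguf"): "tiny_custom_model_tpu.gguf",
    ("t4", "pth"): "tiny_model_t4.pth",
    ("tpu", "pth"): "tiny_model_tpu.pth",
    ("t4", "safetensors"): "tiny_model_t4.safetensors",    # assumed name
    ("tpu", "safetensors"): "tiny_model_tpu.safetensors",  # assumed name
}


def weight_filename(hardware, fmt):
    """Return the file name for a training run and weight format."""
    try:
        return WEIGHT_FILES[(hardware, fmt)]
    except KeyError:
        raise ValueError(f"no weights listed for {hardware!r} / {fmt!r}") from None


print(weight_filename("t4", "gguf"))  # tiny_custom_model.gguf

# Downloading a chosen file (needs network access and huggingface_hub):
#   from huggingface_hub import hf_hub_download
#   path = hf_hub_download(repo_id=REPO_ID, filename=weight_filename("t4", "pth"))
```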

🚀 How to Use (PyTorch)

import torch
import torch.nn as nn
from transformers import AutoTokenizer

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token 

# Define Model Architecture (TinyLLM)
class TinyLLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=128, n_heads=4, n_layers=4, max_seq_len=512):
        super().__init__()
        self.d_model = d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        
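        # An encoder layer has no cross-attention; combined with the causal mask
        # applied in forward(), the stack behaves as a decoder-only Transformer.
        decoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True, activation="gelu"
        )
        self.transformer = nn.TransformerEncoder(decoder_layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the input embedding and output projection share one matrix.
        self.token_emb.weight = self.lm_head.weight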

    def forward(self, x):
        seq_len = x.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        positions = torch.arange(0, seq_len, dtype=torch.long, device=x.device).unsqueeze(0)
        x = self.token_emb(x) + self.pos_emb(positions)
        x = self.transformer(x, mask=mask, is_causal=True)
        return self.lm_head(x)

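# Load Weights
model = TinyLLM()
# map_location="cpu" lets the checkpoint load on a machine without the training accelerator.
state_dict = torch.load("tiny_model_t4.pth", map_location="cpu")  # Or use the TPU .pth file
model.load_state_dict(state_dict)
model.eval()

# Text generation requires a custom loop; this bare-bones module has no generate() method.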
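
Since no generation loop is provided, the sketch below shows one possible shape for it. It is not the author's implementation. To stay framework-independent it is written in plain Python against a hypothetical `next_logits` callable that takes a list of token ids and returns the logits for the next token. The names `sample_next` and `generate` and the default settings are assumptions chosen for illustration.

```python
import math
import random


def sample_next(logits, temperature=0.0, top_k=None, rng=random):
    """Pick the next token id from a list of logits."""
    if temperature <= 0:
        # Greedy decoding: take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)
    # Rank token ids by logit and optionally keep only the top_k candidates.
    candidates = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Temperature-scaled softmax weights (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens=50, max_seq_len=512,
             temperature=0.0, top_k=None, eos_id=None, seed=0):
    """Autoregressively extend prompt_ids one token at a time."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The positional embedding only covers max_seq_len positions,
        # so feed the model the most recent window of tokens.
        context = ids[-max_seq_len:]
        token = sample_next(next_logits(context), temperature, top_k, rng)
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids


# Hypothetical wiring for the TinyLLM instance and tokenizer defined above:
#
#   def next_logits(context_ids):
#       with torch.no_grad():
#           return model(torch.tensor([context_ids]))[0, -1].tolist()
#
#   prompt_ids = tokenizer.encode("Once upon a time,")
#   ids = generate(next_logits, prompt_ids, max_new_tokens=100,
#                  temperature=0.8, top_k=50, eos_id=tokenizer.eos_token_id)
#   print(tokenizer.decode(ids))
```

This loop re-runs the full context at every step, which is simple but does redundant work; a key-value cache would avoid that at the cost of changing the model's forward pass.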


GGUF

Model size: 13.7M params
Architecture: tiny_custom_llm
Quantization variants: could not be determined

Model tree for Nebulixlabs/Nutral-v2-Tiny

Base model: Nebulixlabs/Nutral-v1-Tiny
Quantized (1): this model
