gpt-oss-2.5B-A1.3B

A tiny version of unsloth/gpt-oss-20b-BF16 designed for testing and development purposes.

Model Details

Base Model: unsloth/gpt-oss-20b-BF16
Architecture: GPT-OSS (Mixture-of-Experts)
Total Parameters: 2.5B
Activated Parameters: ~1.3B (4 out of 8 experts active per token)

Architecture Configuration

Parameter	Original Model	Tiny Model
Number of Layers	24	6
Layer Types	Alternating sliding_attention/full_attention	Alternating sliding_attention/full_attention
Hidden Size	2880	2880
Number of Experts	32	8
Experts per Token	4	4
Attention Heads	64	64
KV Heads	8	8
Vocab Size	201088	201088
Max Position Embeddings	131072	131072
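
As a rough sanity check, the 2.5B and ~1.3B figures can be approximated from the table above. This is a back-of-the-envelope sketch, not the method used to produce the card's numbers. It assumes a per-head dimension of 64, an expert MLP width of 2880, an LM head that is untied from the input embedding, and it ignores biases and normalization weights; none of these are stated in this card. The activated count here excludes the input embedding table (a lookup rather than a matrix multiply), which is also an assumption.

```python
# Values taken from the configuration table (tiny model column)
hidden = 2880
layers = 6
experts = 8
experts_per_token = 4
heads, kv_heads = 64, 8
vocab = 201088

# Assumptions NOT stated in this card
head_dim = 64        # assumed per-head dimension
intermediate = 2880  # assumed expert MLP width

embed = vocab * hidden    # input embedding table
lm_head = vocab * hidden  # assumed untied from the input embedding

# q and o projections use all heads; k and v projections use the KV heads
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim

# One expert: fused gate/up projection plus down projection (biases ignored)
expert = hidden * 2 * intermediate + intermediate * hidden
router = hidden * experts

total = embed + lm_head + layers * (attn + router + experts * expert)
# Activated per token: only the routed experts, and no embedding table
active = lm_head + layers * (attn + router + experts_per_token * expert)

print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
```

Under these assumptions the estimate lands close to the stated 2.5B total and ~1.3B activated parameters.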

Checkpoint Structure

The model is saved as a single model.safetensors file, unlike the original, which is sharded into 9 files. A single file is appropriate for the smaller model size.

Creation Method

This model was created by:

1. Loading the original unsloth/gpt-oss-20b-BF16 model
2. Extracting the first 6 layers (maintaining the alternating attention pattern)
3. Reducing the number of experts from 32 to 8 (keeping the first 8 experts from each layer)
4. Copying embeddings and LM head weights from the original
5. Fine-tuning on a small toy dataset to validate learning capability
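
The layer and expert reduction steps above can be sketched as a filter over a state dict. This is a minimal illustration, not the script used to build this model. The key names (`model.layers.N.mlp.experts.*`, `model.layers.N.mlp.router.*`) and the assumption that the expert index is the first axis of those tensors are hypothetical; check them against the actual checkpoint before relying on them. The function only needs values that support slicing, so it works with tensors, arrays, or plain lists.

```python
import re

# Hypothetical key layout: "model.layers.<index>.<rest>"
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def shrink_state_dict(state, num_layers=6, num_experts=8):
    """Keep the first `num_layers` layers and the first `num_experts` experts."""
    small = {}
    for name, tensor in state.items():
        match = LAYER_RE.match(name)
        if match and int(match.group(1)) >= num_layers:
            continue  # drop every layer past the cut
        if match and (".mlp.experts." in name or ".mlp.router." in name):
            # Assumes the expert index is axis 0 of expert and router tensors
            tensor = tensor[:num_experts]
        small[name] = tensor  # embeddings, LM head, attention copied as-is
    return small


# Toy example: lists stand in for tensors (3 layers, 4 experts each)
toy = {"model.embed_tokens.weight": [0.1, 0.2], "lm_head.weight": [0.3, 0.4]}
for i in range(3):
    toy[f"model.layers.{i}.self_attn.q_proj.weight"] = [1.0]
    toy[f"model.layers.{i}.mlp.experts.down_proj"] = [[e] for e in range(4)]
    toy[f"model.layers.{i}.mlp.router.weight"] = [[e] for e in range(4)]

tiny = shrink_state_dict(toy, num_layers=2, num_experts=2)
print(sorted(tiny))
```

The router is sliced together with the experts so that it only scores experts that still exist. The model config would also need its layer and expert counts updated to match the reduced weights.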

Validation

The model passes its validation test:

Success: 1.0000132322311401 <= 10.0

==================================================
Generating sample text:
According to all known laws of aviation, there is no way a bee should be able to fly.
==================================================

Perplexity on test text: 1.00 (target: ≤10.0) ✓

The model demonstrates:

Proper weight initialization
Ability to learn during fine-tuning
Coherent text generation
Low perplexity on training data
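
For reference, perplexity is the exponential of the mean negative log-likelihood per token, so a value of 1.00 means the model assigns probability close to 1 to every token of the text. The sketch below shows that calculation and the threshold check from the log above. The function names and the idea of passing in per-token log-probabilities are illustrative assumptions; the card does not show how the validation script is implemented.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def passes(ppl, threshold=10.0):
    """Mirror the 'Success: <ppl> <= 10.0' check from the validation log."""
    return ppl <= threshold


# A model that gives each token probability 0.5 has perplexity 2
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half), passes(perplexity(uniform_half)))
```

A perplexity this close to 1 on a short text is what memorization of that text looks like, which is consistent with the fine-tuning step described under Creation Method.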

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the same repository
model_id = "inference-optimization/gpt-oss-2.5B-A1.3B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize a prompt and generate up to 50 tokens in total (prompt included)
text = "According to all known laws"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Notes

This model uses the GPT-OSS architecture with sliding window attention and full attention layers
The model has been fine-tuned on a small copypasta dataset to check that the copied weights load correctly and that the model can still learn
Suitable for development, testing compression algorithms, and experimentation
Not intended for production use
