T5 GGUF Analysis

This document records the T5-small GGUF evaluation run.

Environment

Verified runtime:

item	value
Python	`3.11.12`
Torch	`2.9.0+cu129`
Torch CUDA	`12.9`
CUDA available	`True`
GPU	`NVIDIA GeForce RTX 3070 Laptop GPU`

Models

The run evaluated these GGUFs:

model	role
`t5-small-f32.gguf`	unquantized reference baseline
`t5-small-f16.gguf`	high-precision comparison and quantization source
`t5-small-q8_0.gguf`	quantized
`t5-small-q5_k_m.gguf`	quantized
`t5-small-q4_k_m.gguf`	quantized
`t5-small-q4_0.gguf`	quantized
`t5-small-q3_k_m.gguf`	quantized
`t5-small-q2_k.gguf`	quantized

Conversion Check Results

The conversion check compares greedy-decoded outputs from the Hugging Face (HF) model against greedy-decoded outputs from the f32 GGUF. It validates that the unquantized GGUF is a usable reference before the quantized models are compared against it.

dataset	examples	exact match	chrF	first token match
CoLA	2,000	1.000	1.000	1.000
summarization	2,000	0.117	0.953	0.990
translation en-de	2,000	0.993	0.996	1.000
translation en-fr	2,000	0.986	0.995	1.000
overall	8,000	0.774	0.986	0.997

Interpretation:

The f32 GGUF tracks HF closely overall.
Summarization has low exact match (0.117) but high chrF (0.953) and first token match (0.990), which points to small wording differences rather than broad conversion drift.
Translation and CoLA are effectively matching at the output level.
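
As an illustration of the two simpler metrics in the conversion check, here is a minimal sketch of exact match and first token match over paired outputs. It is not the evaluation script used for this run. The function names and sample strings are hypothetical, and splitting on whitespace is an assumption standing in for the model tokenizer.

```python
def exact_match_rate(reference_outputs, candidate_outputs):
    """Fraction of examples whose full output strings are identical."""
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("output lists must have the same length")
    matches = sum(r == c for r, c in zip(reference_outputs, candidate_outputs))
    return matches / len(reference_outputs)


def first_token_match_rate(reference_outputs, candidate_outputs):
    """Fraction of examples whose first token is identical.

    Assumption: whitespace splitting stands in for the model tokenizer.
    """
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("output lists must have the same length")

    def first_token(text):
        parts = text.split()
        return parts[0] if parts else ""

    matches = sum(
        first_token(r) == first_token(c)
        for r, c in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


# Made-up outputs for illustration only, not taken from the run.
hf_outputs = ["acceptable", "Das Haus ist klein.", "the cat sat on the mat"]
gguf_outputs = ["acceptable", "Das Haus ist klein.", "the cat sat on a mat"]

print(exact_match_rate(hf_outputs, gguf_outputs))
print(first_token_match_rate(hf_outputs, gguf_outputs))
```

A single changed word fails exact match while leaving the first token intact, which is the same pattern the summarization row shows.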

Generation Results

Generation used greedy decoding with n_predict=64. Agreement and similarity are measured against the f32 GGUF baseline output.

model	agreement vs f32	similarity vs f32
`t5-small-f16`	0.990	0.998
`t5-small-q8_0`	0.723	0.947
`t5-small-q5_k_m`	0.526	0.889
`t5-small-q4_k_m`	0.474	0.870
`t5-small-q4_0`	0.417	0.837
`t5-small-q3_k_m`	0.375	0.814
`t5-small-q2_k`	0.287	0.660

Per-dataset generation metrics:

dataset	model	exact match vs reference	chrF vs reference	agreement vs f32	similarity vs f32
CoLA	`t5-small-f16`	0.697	0.950	1.000	1.000
CoLA	`t5-small-f32`	0.697	0.950	-	-
CoLA	`t5-small-q2_k`	0.697	0.950	1.000	1.000
CoLA	`t5-small-q3_k_m`	0.697	0.949	1.000	1.000
CoLA	`t5-small-q4_0`	0.697	0.950	0.995	1.000
CoLA	`t5-small-q4_k_m`	0.698	0.950	0.999	1.000
CoLA	`t5-small-q5_k_m`	0.697	0.950	1.000	1.000
CoLA	`t5-small-q8_0`	0.697	0.950	1.000	1.000
summarization	`t5-small-f16`	0.000	0.133	0.979	0.995
summarization	`t5-small-f32`	0.000	0.133	-	-
summarization	`t5-small-q2_k`	0.000	0.068	0.000	0.254
summarization	`t5-small-q3_k_m`	0.000	0.123	0.039	0.510
summarization	`t5-small-q4_0`	0.000	0.123	0.071	0.550
summarization	`t5-small-q4_k_m`	0.000	0.131	0.137	0.642
summarization	`t5-small-q5_k_m`	0.000	0.128	0.210	0.689
summarization	`t5-small-q8_0`	0.000	0.133	0.541	0.852
translation en-de	`t5-small-f16`	0.020	0.361	0.989	0.999
translation en-de	`t5-small-f32`	0.020	0.361	-	-
translation en-de	`t5-small-q2_k`	0.015	0.315	0.090	0.738
translation en-de	`t5-small-q3_k_m`	0.018	0.353	0.234	0.876
translation en-de	`t5-small-q4_0`	0.019	0.357	0.304	0.905
translation en-de	`t5-small-q4_k_m`	0.019	0.359	0.380	0.920
translation en-de	`t5-small-q5_k_m`	0.019	0.359	0.448	0.935
translation en-de	`t5-small-q8_0`	0.019	0.360	0.680	0.970
translation en-fr	`t5-small-f16`	0.017	0.381	0.993	0.999
translation en-fr	`t5-small-f32`	0.017	0.381	-	-
translation en-fr	`t5-small-q2_k`	0.007	0.276	0.057	0.646
translation en-fr	`t5-small-q3_k_m`	0.015	0.368	0.226	0.868
translation en-fr	`t5-small-q4_0`	0.015	0.372	0.299	0.891
translation en-fr	`t5-small-q4_k_m`	0.017	0.377	0.380	0.919
translation en-fr	`t5-small-q5_k_m`	0.016	0.380	0.446	0.933
translation en-fr	`t5-small-q8_0`	0.016	0.380	0.672	0.967

Interpretation:

f16 is effectively equivalent to f32 for generated outputs.
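q8_0 preserves most behavior but still diverges on longer-form tasks: agreement with f32 drops to 0.541 on summarization and to roughly 0.68 on translation.
q5_k_m and q4_k_m are usable midpoints; the choice between them depends on the size and quality targets.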
q2_k degrades heavily for summarization and translation.
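
The aggregate agreement figures appear consistent with an unweighted mean of the four per-dataset values. That is an observation from the tables, not a documented aggregation rule. The sketch below checks it for three models, with values copied from the per-dataset table; `macro_average` is a hypothetical helper name.

```python
# Per-dataset agreement vs f32, copied from the per-dataset table, in the
# order: CoLA, summarization, translation en-de, translation en-fr.
per_dataset_agreement = {
    "t5-small-f16": [1.000, 0.979, 0.989, 0.993],
    "t5-small-q8_0": [1.000, 0.541, 0.680, 0.672],
    "t5-small-q2_k": [1.000, 0.000, 0.090, 0.057],
}


def macro_average(values):
    """Unweighted mean across datasets (assumed aggregation)."""
    return sum(values) / len(values)


for model, values in per_dataset_agreement.items():
    print(model, round(macro_average(values), 3))
```

Because CoLA agreement stays near 1.000 for every model, it lifts each aggregate. The per-dataset rows are more informative for summarization and translation workloads.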

Perplexity And KL Results

Perplexity is reported per dataset. KL/token and top-1 disagreement are the main quantization drift metrics because they compare each quantized model directly against f32 token distributions.
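
For readers unfamiliar with these drift metrics, the following is a minimal sketch of how KL/token and top-1 disagreement can be computed from per-position logits. It is not the script used for this run. The KL direction, KL(f32 || quantized) in nats, is an assumption, and the toy logits and function names are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def drift_metrics(reference_logits, quantized_logits):
    """Return (mean KL per token, top-1 disagreement rate)."""
    kl_total = 0.0
    disagreements = 0
    for ref, quant in zip(reference_logits, quantized_logits):
        p, q = softmax(ref), softmax(quant)
        kl_total += kl_divergence(p, q)
        # Top-1 disagreement: the most likely token differs at this position.
        if p.index(max(p)) != q.index(max(q)):
            disagreements += 1
    n = len(reference_logits)
    return kl_total / n, disagreements / n


# Toy logits: 3 token positions over a vocabulary of 4 (made up).
f32_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.2, 0.1], [1.0, 1.1, 0.0, 0.0]]
quant_logits = [[1.9, 0.6, 0.1, -1.0], [0.1, 2.8, 0.2, 0.1], [1.1, 1.0, 0.0, 0.0]]

kl_per_token, top1_disagree = drift_metrics(f32_logits, quant_logits)
print(kl_per_token, top1_disagree)
```

In the toy data only the third position flips its most likely token, so a small KL can coexist with a visible top-1 disagreement when two candidates are nearly tied.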

Token-weighted summary across all datasets:

model	tokens	KL/token	top-1 disagree
`t5-small-f16`	308,028	0.00000	0.0005
`t5-small-f32`	308,028	-	-
`t5-small-q8_0`	308,028	0.00187	0.0160
`t5-small-q5_k_m`	308,028	0.01004	0.0386
`t5-small-q4_k_m`	308,028	0.02038	0.0521
`t5-small-q4_0`	308,028	0.04847	0.0704
`t5-small-q3_k_m`	308,028	0.05892	0.0897
`t5-small-q2_k`	308,028	0.27523	0.1914

Per-dataset perplexity:

model	CoLA	summarization	translation en-de	translation en-fr
`t5-small-f32`	1.3490	138.5925	5.0317	3.8267
`t5-small-f16`	1.3491	138.6029	5.0317	3.8268
`t5-small-q8_0`	1.3494	133.1739	5.0314	3.8245
`t5-small-q5_k_m`	1.3498	139.2235	5.0748	3.8488
`t5-small-q4_k_m`	1.3535	155.2379	5.1135	3.8759
`t5-small-q4_0`	1.3593	215.7687	5.1394	3.9305
`t5-small-q3_k_m`	1.3490	153.6497	5.2163	3.9680
`t5-small-q2_k`	1.3577	262.6867	6.0281	4.4851

Per-dataset KL/token:

model	CoLA	summarization	translation en-de	translation en-fr
`t5-small-f16`	0.00000	0.00000	0.00000	0.00000
`t5-small-q8_0`	0.00029	0.00194	0.00191	0.00181
`t5-small-q5_k_m`	0.00544	0.01159	0.00923	0.00838
`t5-small-q4_k_m`	0.00811	0.02593	0.01732	0.01437
`t5-small-q4_0`	0.01239	0.07497	0.02886	0.02339
`t5-small-q3_k_m`	0.00539	0.07696	0.04827	0.04073
`t5-small-q2_k`	0.00350	0.36274	0.22476	0.18650

Interpretation:

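The token-weighted KL ranking is clear: f16, q8_0, q5_k_m, q4_k_m, q4_0, q3_k_m, then q2_k. The same order holds on summarization and both translation sets. CoLA is the exception: there q2_k (0.00350) and q3_k_m (0.00539) show lower KL/token than q5_k_m (0.00544).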
q8_0 has very small distributional drift from f32.
q5_k_m is the strongest compact quantization in this run.
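q4_k_m is materially better than q4_0 by KL/token (0.02038 vs 0.04847) and top-1 disagreement (0.0521 vs 0.0704).
q2_k has the highest drift overall (KL/token 0.27523, top-1 disagreement 0.1914), and its KL/token is concentrated in summarization (0.36274) and translation (0.22476 en-de, 0.18650 en-fr).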

Recommended Default

For T5-small in this workflow:

Use t5-small-f32.gguf as the reference baseline.
Use t5-small-q8_0.gguf when preserving behavior matters most.
Use t5-small-q5_k_m.gguf as the best compact default from this run.
Use t5-small-q4_k_m.gguf only when size constraints outweigh quality.
Avoid t5-small-q2_k.gguf for summarization or translation quality checks.
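
One way to turn these recommendations into a repeatable choice is to pick the lowest-precision file whose token-weighted KL/token fits a drift budget. The sketch below assumes that approach. The KL values are copied from the summary table, while `pick_quantization` and the example budgets are hypothetical and not part of this run.

```python
# Token-weighted KL/token vs f32, copied from the summary table,
# listed from highest to lowest nominal precision.
KL_PER_TOKEN = [
    ("t5-small-f16.gguf", 0.00000),
    ("t5-small-q8_0.gguf", 0.00187),
    ("t5-small-q5_k_m.gguf", 0.01004),
    ("t5-small-q4_k_m.gguf", 0.02038),
    ("t5-small-q4_0.gguf", 0.04847),
    ("t5-small-q3_k_m.gguf", 0.05892),
    ("t5-small-q2_k.gguf", 0.27523),
]


def pick_quantization(max_kl_per_token):
    """Return the lowest-precision file whose KL/token fits the budget."""
    chosen = "t5-small-f32.gguf"  # fall back to the reference baseline
    for filename, kl in KL_PER_TOKEN:
        if kl <= max_kl_per_token:
            chosen = filename
    return chosen


# Hypothetical budgets, chosen only to illustrate the selection.
print(pick_quantization(0.002))
print(pick_quantization(0.015))
print(pick_quantization(0.03))
```

A single token-weighted budget hides per-dataset differences, so summarization-heavy workloads may warrant a tighter threshold than the aggregate suggests.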

Google T5-small license: Apache 2.0. We follow and adopt their license.
