Libraries

How to use sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25", trust_remote_code=True, torch_dtype="auto")


vLLM

How to use sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
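
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server exposes the usual OpenAI-compatible response shape (`choices[0].message.content`); the helper names `build_chat_request` and `ask` are hypothetical and not part of vLLM or SGLang.

```python
import json
import urllib.request

MODEL_ID = "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response body; needs a running server.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("What is the capital of France?")` returns the reply text. Pass `base_url="http://localhost:30000"` to target a server listening on port 30000 instead.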

SGLang

How to use sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25 with Docker Model Runner:
```
docker model run hf.co/sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25
```

SmallThinker-4BA0.6B-Instruct REAP 0.25

This repository contains a REAP-pruned checkpoint derived from Tiiny/SmallThinker-4BA0.6B-Instruct.

All model files sit directly at the repository root: the safetensors shards, tokenizer files, config, and the custom SmallThinker modeling code.

Creation Notes

This pruned checkpoint was prepared in Codex with GPT5.5 assistance at the repository owner's direction. Codex was used to adapt the REAP workflow for SmallThinker, run pruning and smoke evaluation, and prepare the upload artifacts.

Pruning Summary

Base model: Tiiny/SmallThinker-4BA0.6B-Instruct
Pruning method: REAP layerwise expert pruning
Calibration dataset: theblackcat102/evol-codealpaca-v1
Requested compression ratio: 0.25
Effective experts pruned per layer: 8 / 32
Primary experts retained per layer: 24
Active experts per token: 4
Router weight renormalization: enabled
Calibration settings:
- model_max_length=2048
- batches_per_category=128
- batch_size=1
- batch_group_size=8
- truncate=false
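
The expert counts above follow from simple arithmetic, sketched here. The rounding rule in `experts_to_prune` is an assumption (the REAP code may compute it differently), and `renormalize_top_k` illustrates one common reading of "router weight renormalization", rescaling the selected experts' weights to sum to 1. Neither function is taken from REAP or from the SmallThinker modeling code.

```python
def experts_to_prune(num_experts, compression_ratio):
    """Experts removed per layer, assuming plain proportional rounding."""
    return round(num_experts * compression_ratio)


def renormalize_top_k(router_weights, k):
    """Keep the k largest router weights and rescale them to sum to 1.

    Illustrative only: the model's actual routing function is not shown in this card.
    """
    top = sorted(range(len(router_weights)), key=lambda i: router_weights[i], reverse=True)[:k]
    total = sum(router_weights[i] for i in top)
    return {i: router_weights[i] / total for i in top}


pruned = experts_to_prune(32, 0.25)   # 8 experts pruned per layer
retained = 32 - pruned                # 24 experts retained per layer
```

With 24 experts retained and 4 active per token, each token still routes to 4 experts, chosen from a smaller pool.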

Local Smoke Evaluation

Greedy generation was checked on Japanese, English, and Chinese prompts.

| Language | Language check | Notes |
| --- | --- | --- |
| Japanese | OK | Understands the language, but output can become repetitive or partially degraded. |
| English | OK | Most stable among the three tested languages. |
| Chinese | OK | Produces Chinese answers, though sentence-count instructions may not be followed exactly. |

In the local smoke run, generation averaged 11.512 seconds per prompt across the three prompts, on a test machine using CPU offload.
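
A per-prompt average like this can be measured with a small timing loop. This is a minimal sketch, not the script used for the smoke run; `average_generation_time` is a hypothetical helper and `generate` stands for any callable that takes a prompt and produces text.

```python
import time
from statistics import mean


def average_generation_time(generate, prompts):
    """Return the mean wall-clock seconds that generate(prompt) takes per prompt."""
    durations = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Wall-clock timing includes any CPU offload transfers, so results depend heavily on the machine.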

Usage

This model uses custom code, so load it with trust_remote_code=True.

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
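
To go from loading to an actual reply, a greedy generation call might look like the sketch below. It assumes the tokenizer ships a chat template; `build_messages`, `generate_reply`, and the `max_new_tokens=128` default are illustrative choices, not settings from the smoke evaluation.

```python
MODEL_ID = "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25"


def build_messages(prompt):
    """Wrap a single user prompt in the chat-message format."""
    return [{"role": "user", "content": prompt}]


def generate_reply(prompt, max_new_tokens=128):
    """Load the checkpoint and greedily generate a reply to one user prompt."""
    # Imported lazily so build_messages stays usable without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )
    # Assumes the tokenizer defines a chat template.
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("Who are you?")` downloads the checkpoint on first use and reloads the model on every call; for repeated prompts, load the tokenizer and model once and reuse them.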

Colab / Transformers Compatibility

This checkpoint uses Hugging Face custom modeling code. If loading fails in Google Colab or another notebook environment while importing modeling_smallthinker.py, with an error such as cannot import name 'HybridCache' from 'transformers.cache_utils', the installed Transformers package is too old for the SmallThinker custom code. Upgrade Transformers and restart the runtime before loading the model:

!pip -q install -U "transformers>=4.55.0" "accelerate>=1.7.0" "safetensors"

After restarting, this import should succeed:

from transformers.cache_utils import HybridCache

If your installed Transformers version raises a later import error involving LossKwargs, upgrade Transformers or apply an equivalent compatibility shim. The local pruning run was tested with transformers==4.55.0 plus a REAP-side compatibility shim. GGUF runtimes may work even when this Python loading path fails, because they do not execute the Hugging Face custom code in modeling_smallthinker.py.
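
Before loading the model, the installed version can be checked programmatically. The sketch below uses only the standard library; the helper names are hypothetical, the comparison ignores pre-release suffixes such as `.dev0`, and the `4.55.0` default mirrors the version mentioned in this section.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '4.55.0' or '4.56.0.dev0' into a tuple of its leading integers."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric piece, e.g. 'dev0'
        numbers.append(int(match.group()))
    # Pad so that '4.55' compares equal to '4.55.0'.
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers(minimum="4.55.0"):
    """Describe whether the installed transformers package meets the minimum."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if not meets_minimum(installed, minimum):
        return f"transformers {installed} is older than {minimum}; upgrade and restart the runtime"
    return f"transformers {installed} satisfies >= {minimum}"
```

Running `print(check_transformers())` in a fresh notebook cell shows whether the upgrade step is needed. A passing version check does not guarantee that every import in the custom code succeeds.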

Caveats

This is an experimental pruned checkpoint. It was validated with load and short generation smoke tests, not a full benchmark suite. Quality can vary by language and task, especially in Japanese after pruning.

Model size: 3B params (Safetensors)
Tensor type: BF16
