SmallThinker-4BA0.6B-Instruct REAP 0.25
This repository contains a REAP-pruned checkpoint derived from
Tiiny/SmallThinker-4BA0.6B-Instruct.
The model files are stored directly at the repository root, including the
safetensors shards, tokenizer files, config, and custom SmallThinker modeling
code.
Creation Notes
This pruned checkpoint was prepared in Codex with GPT5.5 assistance at the repository owner's direction. Codex was used to adapt the REAP workflow for SmallThinker, run pruning and smoke evaluation, and prepare the upload artifacts.
Pruning Summary
- Base model: Tiiny/SmallThinker-4BA0.6B-Instruct
- Pruning method: REAP layerwise expert pruning
- Calibration dataset: theblackcat102/evol-codealpaca-v1
- Requested compression ratio: 0.25
- Effective experts pruned per layer: 8 / 32
- Primary experts retained per layer: 24
- Active experts per token: 4
- Router weight renormalization: enabled
- Calibration settings:
  - model_max_length=2048
  - batches_per_category=128
  - batch_size=1
  - batch_group_size=8
  - truncate=false
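The expert counts in the summary follow from the requested ratio: pruning 25% of 32 experts removes 8 and leaves 24, of which 4 are still selected per token. The sketch below reproduces that arithmetic and illustrates one plausible reading of router weight renormalization, in which the top-k weights over the retained experts are rescaled to sum to 1. The function names, the softmax scoring, the choice of which experts are pruned, and the example logits are all assumptions for illustration; the actual REAP and SmallThinker router code may differ.

```python
import math


def experts_after_pruning(num_experts: int, compression_ratio: float) -> tuple[int, int]:
    """Return (pruned, retained) expert counts for one layer."""
    pruned = int(num_experts * compression_ratio)
    return pruned, num_experts - pruned


def route_top_k(logits: list[float], retained: list[int], k: int) -> dict[int, float]:
    """Pick the k highest-scoring retained experts and renormalize their weights.

    Assumption: softmax scores over retained experts only. This is a
    hypothetical helper, not the SmallThinker router.
    """
    # Softmax restricted to the experts that survived pruning.
    peak = max(logits[i] for i in retained)
    exp_scores = {i: math.exp(logits[i] - peak) for i in retained}
    total = sum(exp_scores.values())
    probs = {i: score / total for i, score in exp_scores.items()}

    # Keep the k largest, then rescale so the selected weights sum to 1.
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    top_total = sum(probs[i] for i in top)
    return {i: probs[i] / top_total for i in top}


# 32 experts and ratio 0.25 are the values listed in the summary.
pruned, retained_count = experts_after_pruning(32, 0.25)
print(pruned, retained_count)

# Illustrative only: pretend experts 0..7 were pruned and 8..31 retained.
retained_ids = list(range(8, 32))
example_logits = [0.1 * i for i in range(32)]
weights = route_top_k(example_logits, retained_ids, k=4)
print(sorted(weights), sum(weights.values()))
```

Which experts are actually removed in each layer depends on the REAP saliency scores computed from the calibration data, so the contiguous range used here is only a placeholder.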
Local Smoke Evaluation
Greedy generation was checked on Japanese, English, and Chinese prompts.
| Language | Language check | Notes |
|---|---|---|
| Japanese | OK | Understands the language, but output can become repetitive or partially degraded. |
| English | OK | Most stable among the three tested languages. |
| Chinese | OK | Produces Chinese answers, though sentence-count instructions may not be followed exactly. |
Average generation time in the local smoke run was about 11.512 seconds
across the three prompts on the test machine with CPU offload.
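A smoke run of this kind can be approximated with a small timing harness. The sketch below averages wall-clock time over a list of prompts; `generate_fn` is a hypothetical callable standing in for whatever wraps the model's greedy generation, and the placeholder prompts are not the ones used in the original run.

```python
import time
from typing import Callable, Sequence


def average_generation_time(
    generate_fn: Callable[[str], str], prompts: Sequence[str]
) -> float:
    """Return the mean wall-clock seconds per prompt for generate_fn."""
    if not prompts:
        raise ValueError("need at least one prompt")
    elapsed = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def echo_stub(prompt: str) -> str:
    """Stand-in for real generation so the harness runs without a model."""
    return prompt


# Placeholder prompts, one per tested language (not the original prompts).
placeholder_prompts = ["<Japanese prompt>", "<English prompt>", "<Chinese prompt>"]
mean_seconds = average_generation_time(echo_stub, placeholder_prompts)
print(f"{mean_seconds:.6f} s per prompt")
```

Replacing `echo_stub` with a function that calls the loaded model gives a number comparable in spirit to the one reported above, though hardware and offload settings will dominate the result.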
Usage
This model uses custom code, so load it with trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sasa2000/SmallThinker-4BA0.6B-Instruct-REAP-0.25"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
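Once the model and tokenizer are loaded, a single greedy reply can be produced roughly as follows. This is a minimal sketch, assuming the tokenizer ships a chat template; `build_messages`, `generate_reply`, and the `max_new_tokens=128` default are illustrative choices, not part of the repository.

```python
def build_messages(user_text: str) -> list[dict]:
    """Wrap a user prompt in the chat-message format expected by chat templates."""
    return [{"role": "user", "content": user_text}]


def generate_reply(model, tokenizer, user_text: str, max_new_tokens: int = 128) -> str:
    """Greedy-decode one reply from an already loaded model and tokenizer.

    Assumption: the tokenizer defines a chat template. max_new_tokens=128 is
    an arbitrary illustrative limit.
    """
    inputs = tokenizer.apply_chat_template(
        build_messages(user_text),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    # do_sample=False selects greedy decoding.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

With the objects from the loading snippet above, `print(generate_reply(model, tokenizer, "Who are you?"))` would print the model's answer.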
Colab / Transformers Compatibility
This checkpoint uses Hugging Face custom modeling code. If loading in Google
Colab or another notebook environment fails while importing
modeling_smallthinker.py with an error such as
cannot import name 'HybridCache' from 'transformers.cache_utils', the installed
Transformers package is too old for the SmallThinker custom code. Upgrade
Transformers and restart the runtime before loading the model:
!pip -q install -U "transformers>=4.55.0" "accelerate>=1.7.0" "safetensors"
After restarting, this import should succeed:
from transformers.cache_utils import HybridCache
If your installed Transformers version raises a later import error involving
LossKwargs, upgrade Transformers or apply an equivalent compatibility shim.
The local pruning run was tested with transformers==4.55.0 plus a REAP-side
compatibility shim. GGUF runtimes may work even when this Python loading path
fails, because GGUF does not execute Hugging Face modeling_smallthinker.py.
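Before loading the model, the environment can be checked with a short script. The sketch below compares the installed Transformers version against the 4.55.0 floor used in the pip command above and then attempts the HybridCache import; the helper names are hypothetical, and the import attempt is the more reliable signal of the two.

```python
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Parse the leading numeric components of a version string such as '4.55.0'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers(minimum: str = "4.55.0") -> str:
    """Return a short human-readable status for the installed Transformers."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if not version_at_least(installed, minimum):
        return f"transformers {installed} is older than {minimum}; upgrade and restart"
    try:
        from transformers.cache_utils import HybridCache  # noqa: F401
    except ImportError as exc:
        return f"transformers {installed} is installed, but the import failed: {exc}"
    return f"transformers {installed} passed the HybridCache import check"


if __name__ == "__main__":
    print(check_transformers())
```

Passing this check does not rule out the later LossKwargs import error mentioned above, since that depends on the specific Transformers release.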
Caveats
This is an experimental pruned checkpoint. It was validated with load and short generation smoke tests, not a full benchmark suite. Quality can vary by language and task, especially in Japanese after pruning.