CRISP-DeepSeek-R1-Distill-Llama-8B-v2

DeepSeek-R1-Distill-Llama-8B trained with CRISP (Compressed Reasoning via Iterative Self-Policy Distillation) using the v2 conciseness teacher. Step-99 checkpoint.

Paper: https://arxiv.org/abs/2603.05433

CRISP teaches a reasoning model to think concisely by distilling its own concise behavior back into itself. The teacher is the same model conditioned on a conciseness instruction; the student receives no instruction. Training minimizes the per-token reverse KL from student to teacher on the student's own rollouts, and the teacher is refreshed every M=50 steps. No ground-truth answers, token budgets, or difficulty estimators enter the loss.
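To make the objective concrete, here is a minimal pure-Python sketch of a per-token reverse KL, KL(student || teacher), averaged over rollout positions. It is an illustration under assumptions, not the training code: it assumes the divergence is taken over the full vocabulary at each position, and the function names (`log_softmax`, `reverse_kl_per_token`, `sequence_loss`) are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl_per_token(student_logits, teacher_logits):
    """Return KL(student || teacher) for every rollout position.

    Both arguments are lists of per-position logit rows scored on the SAME
    student-generated tokens; only the teacher's prompt is assumed to carry
    the conciseness instruction.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution
        log_q = log_softmax(t_row)  # teacher distribution
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls

def sequence_loss(student_logits, teacher_logits):
    """Average the per-token reverse KL over the rollout (assumed reduction)."""
    kls = reverse_kl_per_token(student_logits, teacher_logits)
    return sum(kls) / len(kls)
```

The expectation is taken under the student distribution, which is what makes the KL "reverse": the student is penalized for putting mass where the teacher puts little, and the loss is zero when the two distributions agree.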

This checkpoint uses the v2 teacher prompt (difficulty-aware, the default), which adds a caveat not to over-compress hard or multi-step problems: keep case analysis, edge cases, and a final check.
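The teacher/student asymmetry can be sketched as two chat inputs that differ only in the instruction. This is a hedged illustration: the card does not reproduce the v2 prompt text, so `CONCISENESS_INSTRUCTION` is a hypothetical placeholder, and prepending it to the user turn (rather than using a system message) is an assumption.

```python
# Hypothetical placeholder: the actual v2 instruction wording is not given in this card.
CONCISENESS_INSTRUCTION = "<v2 conciseness instruction goes here>"

def student_messages(problem):
    """The student sees the bare problem with no instruction."""
    return [{"role": "user", "content": problem}]

def teacher_messages(problem, instruction=CONCISENESS_INSTRUCTION):
    """The teacher sees the same problem, conditioned on the instruction.

    Assumption: the instruction is prepended to the user message.
    """
    return [{"role": "user", "content": f"{instruction}\n\n{problem}"}]
```

Both message lists can then be passed through the same tokenizer chat template, so the only difference between the two forward passes is the instruction in the teacher's context.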

Other CRISP checkpoints: Qwen3-8B (v2), Qwen3-14B (v2), DeepSeek-R1-Distill-Llama-8B (v2). Training data: pb09204048/CRISP.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pb09204048/CRISP-DeepSeek-R1-Distill-Llama-8B-v2"
tok = AutoTokenizer.from_pretrained(repo_id)
# device_map="auto" needs the accelerate package and places the weights on the available devices.
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
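Once text has been generated, it is often useful to separate the reasoning trace from the final answer, for example to count how many tokens the model spent thinking. The sketch below assumes this checkpoint keeps the base model's convention of closing its reasoning with a `</think>` tag; `split_reasoning` is a hypothetical helper, not part of any library.

```python
def split_reasoning(text, close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumes the reasoning ends at the last `close_tag`. If the tag is
    absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        return "", text.strip()
    # The opening tag may be missing if the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

For instance, `len(tok(reasoning)["input_ids"])` would give a rough reasoning-token count for one response, using the tokenizer loaded above.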

Benchmark results (DeepSeek-R1-Distill-Llama-8B)

Each cell reports accuracy (mean@8, %) / token reduction (Red., % relative to the base model) at a 30K-token budget. Math is scored with a dual-path grader that reads the final answer from either an Answer: line or a \boxed{} expression; GPQA-Diamond and MMLU use exact letter-match. This model is the CRISP (v2) row.
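As an illustration of what a dual-path answer extractor might look like, the sketch below tries `\boxed{}` first and falls back to an `Answer:` line. The actual grader is not described in this card, so the priority order, the whitespace-insensitive string comparison, and all function names here are assumptions.

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces

def extract_answer_line(text):
    """Return the text after the last 'Answer:' marker, up to end of line."""
    matches = re.findall(r"Answer:\s*(.+)", text)
    return matches[-1].strip() if matches else None

def extract_final_answer(text):
    """Dual path: prefer \\boxed{}, otherwise fall back to 'Answer:'."""
    boxed = extract_boxed(text)
    return boxed if boxed is not None else extract_answer_line(text)

def is_correct(response, reference):
    """Whitespace-insensitive string match (a simplification of real math grading)."""
    pred = extract_final_answer(response)
    return pred is not None and pred.replace(" ", "") == reference.replace(" ", "")
```

A production grader would normally also normalize equivalent mathematical forms (for example `0.5` versus `\frac{1}{2}`), which this sketch deliberately leaves out.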

| Setting | MATH-500 | AIME 2024 | AIME 2025 | GPQA-D | MMLU |
|---|---|---|---|---|---|
| Base | 71.3 / — | 33.3 / — | 25.0 / — | 47.0 / — | 71.5 / — |
| Concise prompt (v2) | 79.7 / 20.5% | 42.1 / 2.5% | 28.8 / 3.8% | 46.0 / 9.4% | 73.9 / 9.2% |
| Concise prompt (v1) | 80.8 / 25.1% | 45.0 / 10.2% | 29.2 / 9.8% | 46.5 / 10.2% | 74.1 / 9.2% |
| CRISP (v2) | 79.8 / 23.2% | 42.1 / −2.5% | 26.2 / 0.1% | 46.7 / 7.0% | 71.4 / 11.4% |
| CRISP (v1) | 82.1 / 31.6% | 39.2 / 6.3% | 27.1 / 7.1% | 48.3 / 10.2% | 71.7 / 17.6% |
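The two numbers in each cell can be read with the following sketch. It assumes mean@8 is the per-problem accuracy averaged over 8 samples and then over problems, and that token reduction is measured relative to the base model's average response length; both definitions are natural readings of the table, not confirmed by the card, and the token counts in the usage note are made up for illustration.

```python
def mean_at_k(correct):
    """Accuracy in percent, given per-problem lists of k boolean outcomes."""
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)

def token_reduction(base_tokens, model_tokens):
    """Percent reduction in average response length relative to the base model.

    A negative value means the model used MORE tokens than the base.
    """
    return 100.0 * (1.0 - model_tokens / base_tokens)
```

Under this reading, a model averaging 768 tokens where the base averages 1000 scores a 23.2% reduction, while a negative entry such as −2.5% corresponds to responses slightly longer than the base model's.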

Citation

@article{sang2026crisp,
  title={Crisp: Compressed reasoning via iterative self-policy distillation},
  author={Sang, Hejian and Xu, Yuanda and Zhou, Zhengze and He, Ran and Wang, Zhipeng and Sun, Jiachen},
  journal={arXiv preprint arXiv:2603.05433},
  year={2026}
}

Model details

Format: Safetensors
Model size: 8B params
Tensor type: BF16
Base model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Paper: On-Policy Self-Distillation for Reasoning Compression (arXiv:2603.05433, published Mar 5)