Quintus
Quintus-1.7B is a compact English-focused assistant built from
Qwen/Qwen3-1.7B-Base. The project uses online full-vocabulary knowledge
distillation from a Qwen/Qwen3-8B teacher, followed by a targeted SFT stage
for assistant behavior, identity grounding, and generation stability.
Final model weights: iamrahulreddy/Quintus
Core Technical Points
- Dense KD signal: the final training path streams the teacher's full vocabulary distribution live instead of relying on sparse cached top-k logits.
- Base-student strategy: the student starts from Qwen/Qwen3-1.7B-Base, leaving more room for distillation before assistant-format tuning.
- Assistant-only supervision: prompt text, chat headers, separators, and padding are masked out of the supervised target region.
- Sequence packing: deterministic first-fit decreasing packing improves useful-token throughput at 4096-token context length.
- Public benchmark controls: raw/chat prompt format, metric extraction, generation budget, and artifact hygiene are documented explicitly.
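The sequence-packing point above can be illustrated with a minimal sketch of first-fit decreasing packing. This is not the code in src/sequence_packing.py; the function name, the tie-breaking rule, and the sample lengths are assumptions chosen for illustration.

```python
from typing import List


def pack_first_fit_decreasing(lengths: List[int], pack_length: int = 4096) -> List[List[int]]:
    """Group sample indices into packs whose total token count fits pack_length."""
    # Sort longest first; break ties by index so the result is deterministic.
    order = sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))
    packs: List[List[int]] = []  # sample indices per packed sequence
    free: List[int] = []         # remaining token capacity per packed sequence
    for i in order:
        n = lengths[i]
        if n > pack_length:
            raise ValueError(f"sample {i} has {n} tokens, more than pack_length")
        # First fit: use the earliest pack that still has enough room.
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(i)
                free[p] -= n
                break
        else:
            # No existing pack fits, so open a new one.
            packs.append([i])
            free.append(pack_length - n)
    return packs


# Hypothetical token counts for six samples.
lengths = [3000, 1000, 2500, 1500, 96, 4096]
packs = pack_first_fit_decreasing(lengths)
fill = [sum(lengths[i] for i in pack) for pack in packs]
print(packs, fill)
```

With these toy lengths, six samples fit into three packed sequences instead of six padded ones, which is where the useful-token throughput gain comes from.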
Training Summary
The release training path is a two-stage pipeline:
- Online KD: train the 1.7B base student against live teacher logits from a Qwen3-8B teacher.
- Targeted SFT: tune the distilled checkpoint for assistant-style interaction, persona consistency, and repetition control.
Reuse As A KD Framework
Quintus is released as a trained 1.7B assistant, but the repository is also a reusable reference pipeline for compact-model distillation. The same structure can be adapted to other teacher/student pairs with changes to the model IDs, tokenizer, dataset source, local paths, sequence length, batch schedule, and hardware-specific memory settings in configs/config.yaml.
The reusable pieces are split across the codebase: assistant-only masking, sequence packing, online full-vocabulary KD loss, checkpoint/resume metadata, validation, provenance checks, SFT, and evaluation. The final pattern is:
- Distill a smaller base student from a stronger teacher with online KD.
- Apply targeted SFT to recover assistant behavior, formatting, identity, and generation stability.
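Of the reusable pieces listed above, assistant-only masking is the simplest to sketch. The example below assumes a conversation already split into (role, token_ids) segments with toy token IDs; the segment format and the function name are hypothetical, and the repository's own masking lives in src/download.py. The value -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def build_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate (role, token_ids) segments and supervise only assistant tokens."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # prompt, headers, padding
    return input_ids, labels


# Toy token IDs; a real run would take these from the tokenizer's chat template.
segments = [
    ("header", [1, 2]),
    ("user", [10, 11, 12]),
    ("header", [3]),
    ("assistant", [20, 21, 4]),
    ("pad", [0, 0]),
]
input_ids, labels = build_labels(segments)
print(labels)
```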
Core KD objective:
For the final run,
Configuration snapshot:
| Setting | Value |
|---|---|
| Teacher | Qwen/Qwen3-8B |
| Student | Qwen/Qwen3-1.7B-Base |
| Tokenizer | Qwen/Qwen3-1.7B |
| Data | ~90K English-only samples from DistilQwen_100k |
| Max sequence length | 4096 |
| Epochs | 1 |
| Learning rate | 5.0e-6 |
| Weight decay | 0.1 |
| Warmup ratio | 0.05 |
| Online KD token chunk | 2048 |
| Micro batch | 4 |
| Gradient accumulation | 2 |
| Sequence packing | enabled, pack_length = 4096 |
| Attention | FlashAttention-2 when available |
| Liger kernels | enabled for compatible Qwen-family ops |
| Optimizer | fused AdamW |
| torch.compile | disabled |
| Gradient checkpointing | disabled |
| Seed | 25 |
FlashAttention-2, Liger kernels, and fused AdamW are acceleration paths. Keep the baseline load path compatible with standard Transformers and vLLM APIs before publishing checkpoints.
torch.compile stayed disabled because this KD shape showed high Inductor memory overhead, dynamic-shape graph breaks, recompile overhead, and checkpoint portability risk from _orig_mod. state dict prefixes when compiled modules are not unwrapped before saving.
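For readers who do enable compilation in their own runs, the prefix problem can be handled at save time. The sketch below works on a plain dictionary and only handles a leading _orig_mod. prefix, which is what a whole-model compile wrapper produces; the helper name is hypothetical and it is not part of this repository.

```python
from typing import Dict, TypeVar

T = TypeVar("T")
COMPILED_PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict: Dict[str, T]) -> Dict[str, T]:
    """Return a copy of state_dict with a leading "_orig_mod." removed from each key."""
    cleaned: Dict[str, T] = {}
    for key, value in state_dict.items():
        if key.startswith(COMPILED_PREFIX):
            key = key[len(COMPILED_PREFIX):]
        cleaned[key] = value
    return cleaned


# Placeholder values stand in for tensors.
compiled_keys = {
    "_orig_mod.model.embed_tokens.weight": "tensor-a",
    "_orig_mod.lm_head.weight": "tensor-b",
}
portable = strip_compile_prefix(compiled_keys)
print(sorted(portable))
```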
The B200-oriented defaults are conservative for the 8B teacher to 1.7B student workload. Smaller teacher/student pairs may tolerate larger micro-batches, but full-vocabulary KD scales sharply with vocabulary width.
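A rough calculation shows why vocabulary width dominates. The sketch below assumes a padded Qwen3 vocabulary of 151,936 entries and a float32 workspace; both are assumptions not stated in the configuration table. It counts a single logits tensor only, so the real footprint is higher once teacher logits, student logits, and softmax intermediates are held together.

```python
VOCAB_SIZE = 151_936   # assumption: padded Qwen3 vocabulary width
BYTES_PER_VALUE = 4    # assumption: float32 workspace
MICRO_BATCH = 4        # from the configuration snapshot
SEQ_LEN = 4096         # from the configuration snapshot
TOKEN_CHUNK = 2048     # from the configuration snapshot


def logits_gib(num_tokens: int) -> float:
    """Size in GiB of one [num_tokens, VOCAB_SIZE] logits tensor."""
    return num_tokens * VOCAB_SIZE * BYTES_PER_VALUE / 2**30


full_gib = logits_gib(MICRO_BATCH * SEQ_LEN)  # whole micro-batch at once
chunk_gib = logits_gib(TOKEN_CHUNK)           # one KD token chunk
print(f"full micro-batch: {full_gib:.2f} GiB, one chunk: {chunk_gib:.2f} GiB")
# prints "full micro-batch: 9.27 GiB, one chunk: 1.16 GiB"
```

Under these assumptions, chunking at 2048 tokens cuts the per-tensor workspace by a factor of eight relative to materializing the full micro-batch.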
The editable run configuration lives in configs/config.yaml. Paths and Hub destinations are left as placeholders so each runner can set local directories and repository names directly.
Why Online KD Replaced Offline Top-K KD
Earlier experiments cached only the teacher's top-k logits. That made storage smaller, but with a Qwen vocabulary around 151K tokens, $k = 8$ exposes only:
$$\frac{8}{151{,}000} \approx 0.0053\%$$
of the vocabulary support at each position. The sparse signal could perturb the student, but it did not consistently transfer deeper reasoning behavior.
The final online path keeps the teacher and student in memory together and computes KL divergence against the teacher's full-vocabulary distribution. Token chunking keeps that dense objective feasible without materializing a single large KL workspace.
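The chunked dense objective can be sketched in plain Python. This is not the implementation in src/losses.py: the forward KL direction (teacher to student), the temperature argument, and the toy three-token vocabulary are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def token_kl(teacher_row: Sequence[float], student_row: Sequence[float],
             temperature: float = 1.0) -> float:
    """KL(teacher || student) at one position over the full vocabulary."""
    t = log_softmax(teacher_row, temperature)
    s = log_softmax(student_row, temperature)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def chunked_kd_loss(teacher, student, mask, chunk_size: int = 2,
                    temperature: float = 1.0) -> float:
    """Mean KL over supervised positions, accumulated one token chunk at a time."""
    total, count = 0.0, 0
    for start in range(0, len(mask), chunk_size):
        stop = start + chunk_size
        # Only this slice of positions is processed at once.
        rows = zip(teacher[start:stop], student[start:stop], mask[start:stop])
        for t_row, s_row, keep in rows:
            if keep:  # skip prompt and padding positions
                total += token_kl(t_row, s_row, temperature)
                count += 1
    return total / max(count, 1)


teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
student = [[1.5, 0.7, -0.5], [0.3, 0.2, 0.1], [0.0, 2.0, -1.0]]
mask = [True, False, True]
loss_small_chunks = chunked_kd_loss(teacher, student, mask, chunk_size=1)
loss_one_chunk = chunked_kd_loss(teacher, student, mask, chunk_size=3)
print(loss_small_chunks, loss_one_chunk)
```

The chunk size changes how many positions are held at once, not the value of the loss, which is the property that makes the dense objective feasible in memory.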
Benchmark Scoreboard
The final public scoreboard compares Qwen/Qwen3-1.7B-Base,
Qwen/Qwen3-1.7B-Instruct, and Quintus-1.7B.
The strongest signal is the reasoning crossover: Quintus beats both the base and official 1.7B instruct model on GSM8K, ARC-Challenge, and WinoGrande while remaining at the same parameter scale.
See docs/benchmarks.md for the numeric table and interpretation. See docs/evaluation_methodology.md for benchmark controls.
Evaluation Notes
Evaluation uses a mixture of EvalPlus and lm-evaluation-harness/vLLM style
benchmarks. The repository keeps evaluation methodology separate because prompt
format can change the result:
- Raw completion comparisons are used for base capability.
- Chat-template comparisons are used for assistant-format behavior.
- Log-likelihood tasks such as ARC-Challenge and PIQA should usually stay raw.
- GSM8K can differ between strict #### parsing and flexible number extraction.
- Metric extraction must ignore stderr, aliases, and wrong filter keys.
- Runtime versions, checkpoint identity, generation budget, and stale output cleanup are part of the evaluation contract.
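The GSM8K parsing difference noted above can be shown with a small sketch. The regular expressions here are simplified stand-ins, not the filters shipped with lm-evaluation-harness, and the sample completions are invented.

```python
import re
from typing import Optional

# Strict: the answer must follow the "#### " marker used in GSM8K references.
STRICT = re.compile(r"#### (-?[0-9][0-9,]*(?:\.[0-9]+)?)")
# Flexible: accept any number and keep the last one in the completion.
NUMBER = re.compile(r"-?[0-9][0-9,]*(?:\.[0-9]+)?")


def _normalize(text: str) -> str:
    """Drop thousands separators so "1,250" compares equal to "1250"."""
    return text.replace(",", "")


def strict_answer(completion: str) -> Optional[str]:
    match = STRICT.search(completion)
    return _normalize(match.group(1)) if match else None


def flexible_answer(completion: str) -> Optional[str]:
    numbers = NUMBER.findall(completion)
    return _normalize(numbers[-1]) if numbers else None


formatted = "She has 16 - 3 - 4 = 9 eggs and earns 9 * 2 = 18 dollars.\n#### 18"
unformatted = "The answer is 1,250 dollars."
print(strict_answer(formatted), flexible_answer(formatted))
print(strict_answer(unformatted), flexible_answer(unformatted))
```

A model that reasons correctly but omits the marker scores zero under strict parsing and full credit under flexible extraction, so the two numbers should be reported separately.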
The active benchmark runner is sft/evaluate.py. It covers
EvalPlus code tasks and lm-evaluation-harness/vLLM tasks, including GSM8K
10-shot evaluation with an extended generation budget.
Repository Map
configs/ Public run profile and DeepSpeed ZeRO-2 template.
src/ Data prep, online KD, losses, packing, checkpoints, provenance.
sft/ Post-KD SFT, local chat, and consolidated evaluation runner.
docs/ Public architecture, training, evaluation, and release notes.
weight_audit/ Checkpoint structure and weight-divergence audit material.
Key files:
- src/train.py: SFT, offline KD compatibility, and final online_kd training entry point.
- src/download.py: model setup, dataset loading, schema normalization, tokenization, and assistant-only loss masks.
- src/losses.py: CE/KD objective, including online full-vocab KD token chunking.
- src/sequence_packing.py: deterministic first-fit decreasing sequence packing.
- src/checkpoints.py: checkpoint save/resume metadata and packing compatibility checks.
- src/provenance.py: tokenizer/model/data contract checks.
- sft/train_sft.py: post-KD supervised fine-tuning.
- sft/evaluate.py: EvalPlus and lm-evaluation-harness/vLLM benchmark runner.
- sft/chat.py: local interactive chat wrapper.
Commands
Install the base dependencies:
pip install -r requirements.txt
For training and benchmark runs, install the matching extras:
pip install -r requirements-train.txt
pip install -r requirements-eval.txt
Inspect or prepare data/model assets:
python -m src.download --help
Run the final KD path after editing configs/config.yaml for local paths and hardware:
python -m src.train --phase online_kd
Hub checkpoint uploads are off by default for local runs. Pass
--upload_last_checkpoint or the step/epoch upload flags only after setting the
target repository and HF_TOKEN.
Run the consolidated benchmark suite:
python sft/evaluate.py
Start local chat with a downloaded or local checkpoint:
python sft/chat.py --model_path path/to/quintus/checkpoint
Interactive Chat
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

PUBLIC_REPO_ID = "iamrahulreddy/Quintus"

print(f"Loading Quintus from {PUBLIC_REPO_ID}...")
tokenizer = AutoTokenizer.from_pretrained(PUBLIC_REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    PUBLIC_REPO_ID,
    device_map="auto",
    dtype=torch.float16,
    trust_remote_code=True,
)

stop_tokens = ["<|endoftext|>", "<|im_end|>"]
eos_token_ids = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
for token in stop_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    if token_id is not None and token_id not in eos_token_ids:
        eos_token_ids.append(token_id)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

conversation_history = [
    {
        "role": "system",
        "content": (
            "You are Quintus, a highly capable AI assistant created by "
            "Muskula Rahul. You are helpful, precise, and logically sound."
        ),
    }
]

print()
print("Quintus Chat (type 'quit' to exit)")
print()

while True:
    try:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit"]:
            print("\nGoodbye!")
            break
        if not user_input:
            continue
        conversation_history.append({"role": "user", "content": user_input})
        prompt = tokenizer.apply_chat_template(
            conversation_history,
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        print("Quintus: ", end="", flush=True)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                streamer=streamer,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=eos_token_ids,
            )
        generated_ids = outputs[0][inputs.input_ids.shape[-1]:]
        assistant_response = tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
        ).strip()
        conversation_history.append({"role": "assistant", "content": assistant_response})
        print()
    except KeyboardInterrupt:
        print("\n\nGoodbye!")
        break
Documentation
- Documentation Index: recommended public reading order.
- Architecture: end-to-end data flow, modules, and training phases.
- Experiment Timeline: why the project moved from offline top-k KD to online full-vocabulary KD.
- Training Playbook: memory rules, packing, kernels, checkpointing, and B200-oriented guidance.
- Pipeline Hardening: silent-failure classes, artifact contracts, and safety checks.
- Evaluation Methodology: raw/chat controls, parser traps, metric extraction, and qualitative evaluation rules.
- Engineering Insights: condensed lessons and design decisions.
- Benchmarks: verified scoreboard and interpretation.
- Weight Audit: structural checkpoint sanity checks and weight-divergence summary.
- Hugging Face Model Card: release-page copy for the public model card.
Limitations
- Quintus is still a 1.7B model and inherits compact-model capacity limits.
- Factual answers can be confidently wrong and should be verified.
- Code generation may still contradict stated complexity or edge-case requirements.
- Raw and chat-template results are not interchangeable.
- Additional preference tuning or DPO would likely improve calibration, refusal behavior, and open-ended assistant polish.
Credits
Quintus builds on open model, dataset, and tooling work from the broader LLM community:
- Qwen Team and the Qwen Hugging Face organization for the Qwen3 model family.
  - Qwen/Qwen3-8B, used as the distillation teacher.
  - Qwen/Qwen3-1.7B-Base, used as the base student checkpoint.
  - Qwen/Qwen3-1.7B, used for the tokenizer and chat-template contract.
- Alibaba PAI for the DistilQwen_100k dataset used as the primary instruction source after filtering.
- Hugging Face Transformers for model loading, tokenization, and generation APIs.
- vLLM, EvalPlus, and lm-evaluation-harness for evaluation infrastructure.
- FlashAttention and Liger Kernel for performance kernels used or validated during training.
License And Author
This software is distributed under the MIT License. Refer to the LICENSE file for full text.
Author: Muskula Rahul - @iamrahulreddy
Citation
If this model, codebase, or training pipeline is useful in your work, please cite this repository and acknowledge the upstream Qwen3 models.