Glint-Trace is a QLoRA adapter that teaches a tiny language model (Qwen 3.5 0.8B Base) to think before it answers. The base model is frozen; only thin LoRA adapters are trained. The adapter learns to emit an explicit `<think>…</think>` trace that the prompt can later condition on. It is small enough to run on a laptop and fast enough to produce a trace in a few seconds.
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base |
| Method | QLoRA (4-bit base + low-rank adapters) |
| Rank / α / dropout | 16 / 32 / 0.05 |
| Targets | q_proj · k_proj · v_proj · o_proj · gate_proj · up_proj · down_proj |
| Trainable params | ~2.06M (LoRA only; int4 base frozen) |
| Context | 2,048 tokens |
| Special tokens | `<\|prompt\|>` `<\|response\|>` `<\|think\|>` `<\|/think\|>` `<\|len_*\|>` |
| Task | Chain-of-thought generation (CAUSAL_LM) |
| PEFT version | 0.18.0 |
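To make the rank and α entries in the table concrete, here is a minimal sketch of how they combine, assuming the default LoRA scaling of α / r (not the rsLoRA variant). The helper names and the 1024 × 1024 layer size are hypothetical, chosen only for illustration; they are not the real Qwen dimensions.

```python
# Rank and alpha taken from the table above.
RANK, ALPHA = 16, 32


def lora_scaling(rank: int, alpha: int) -> float:
    """Default LoRA scaling: the low-rank update B @ A is multiplied by alpha / rank."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


print(lora_scaling(RANK, ALPHA))           # 2.0
# Hypothetical 1024 x 1024 projection, NOT the real model dimensions.
print(lora_param_count(1024, 1024, RANK))  # 32768
```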
```python
# pip install "transformers>=4.45" "peft>=0.18" torch accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor
from peft import PeftModel

REPO = "Glint-Research/Glint-Trace"  # this repo
BASE = "Qwen/Qwen3.5-0.8B-Base"

PROMPT_TOK, RESPONSE_TOK = "<|prompt|>", "<|response|>"
THINK_OPEN, THINK_CLOSE = "<|think|>", "<|/think|>"
LENGTH_TOKS = {  # length-bucket hint, controls trace length
    "small": "<|len_s|>", "medium": "<|len_m|>", "large": "<|len_l|>",
    "xl": "<|len_xl|>", "xxl": "<|len_xxl|>",
}


class WrapItUpProcessor(LogitsProcessor):
    # Linearly bias toward the close token as the budget runs out.
    def __init__(self, stop_id, prompt_len, max_new,
                 ramp_start=0.5, max_boost=20.0):
        self.stop_id, self.prompt_len = stop_id, prompt_len
        self.max_new = max_new
        self.ramp_start = ramp_start
        self.max_boost = max_boost

    def __call__(self, input_ids, scores):
        # Fraction of the generation budget used so far.
        gen = max(0, input_ids.shape[1] - self.prompt_len)
        frac = gen / self.max_new
        if frac < self.ramp_start:
            return scores
        # Ramp t from 0 to 1 between ramp_start and the end of the budget.
        t = min(1.0, (frac - self.ramp_start) / max(1e-6, 1.0 - self.ramp_start))
        boost = t * self.max_boost
        scores = scores.clone()
        scores[:, self.stop_id] += boost
        return scores


# --- load ---
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32
tok = AutoTokenizer.from_pretrained(REPO)
base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype)
base.resize_token_embeddings(len(tok))  # +9 CoT special tokens
model = PeftModel.from_pretrained(base, REPO).merge_and_unload()
model.eval().to(device)


# --- build the input exactly the way the adapter was trained ---
def generate_cot(prompt: str, response: str, length: str = "medium",
                 max_new_tokens: int = 800, temperature: float = 0.8,
                 top_p: float = 0.95, repetition_penalty: float = 1.05):
    eot = tok.eos_token_id
    open_id = tok.convert_tokens_to_ids(THINK_OPEN)
    close_id = tok.convert_tokens_to_ids(THINK_CLOSE)
    len_id = tok.convert_tokens_to_ids(LENGTH_TOKS[length])

    # Layout: <|prompt|> prompt <|response|> response <|len_*|> <|think|>
    ids = (
        [tok.convert_tokens_to_ids(PROMPT_TOK)]
        + tok.encode(prompt, add_special_tokens=False)
        + [tok.convert_tokens_to_ids(RESPONSE_TOK)]
        + tok.encode(response, add_special_tokens=False)
        + [len_id, open_id]
    )
    input_ids = torch.tensor([ids], device=device)
    attn = torch.ones_like(input_ids)
    processors = [WrapItUpProcessor(close_id, input_ids.shape[1], max_new_tokens)]

    out = model.generate(
        input_ids, attention_mask=attn,
        max_new_tokens=max_new_tokens,
        do_sample=True, temperature=temperature, top_p=top_p,
        repetition_penalty=repetition_penalty,
        pad_token_id=eot,
        eos_token_id=[close_id, eot],
        logits_processor=processors,
    )

    # Keep only the newly generated tokens and drop trailing stop tokens.
    new_tokens = out[0, input_ids.shape[1]:].tolist()
    while new_tokens and new_tokens[-1] in (close_id, eot):
        new_tokens.pop()
    trace = tok.decode(new_tokens, skip_special_tokens=False).strip()
    return f"<think>{trace}</think>\n\n{response}"


# --- example ---
print(generate_cot(
    prompt="If 3x + 7 = 22, what is x?",
    response="x = 5.",
    length="small",
))
```
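The ramp inside `WrapItUpProcessor` is easier to see with numbers. The sketch below reproduces only its arithmetic in plain Python, with no model or tensors; `wrap_up_boost` is a hypothetical helper name, and the defaults mirror the quick-start values (`ramp_start=0.5`, `max_boost=20.0`, an 800-token budget).

```python
def wrap_up_boost(generated: int, max_new: int,
                  ramp_start: float = 0.5, max_boost: float = 20.0) -> float:
    """Logit bonus added to the close token after `generated` new tokens."""
    frac = generated / max_new
    if frac < ramp_start:
        return 0.0  # first half of the budget: no bias at all
    # Linear ramp from 0 at ramp_start to max_boost when the budget is spent.
    t = min(1.0, (frac - ramp_start) / max(1e-6, 1.0 - ramp_start))
    return t * max_boost


for n in (0, 400, 600, 800):
    print(n, wrap_up_boost(n, 800))
# 0 0.0
# 400 0.0
# 600 10.0
# 800 20.0
```

With these defaults the close token gets no help for the first 400 tokens, then gains 0.05 logits per generated token until it reaches the full +20 at the budget limit.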
The adapter was trained so that each assistant turn is wrapped in a single `<think>…</think>` block. The opening tag is produced unconditionally; the closing tag is reached before the budget runs out, after which the final answer is emitted on the same turn.
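Given that layout, downstream code may want the trace and the answer separately. This is a minimal sketch, assuming the output is a single `<think>…</think>` block followed by the answer, as returned by the quick-start; `split_trace` and the sample string are hypothetical and used only for illustration.

```python
import re

# One think block, optional whitespace, then the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_trace(text: str) -> tuple[str, str]:
    """Return (trace, answer); the trace is empty if no think block is found."""
    match = THINK_RE.fullmatch(text.strip())
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), match.group(2).strip()


# Hypothetical output string, not a real model generation.
trace, answer = split_trace("<think>Subtract 7, then divide by 3.</think>\n\nx = 5.")
print(trace)   # Subtract 7, then divide by 3.
print(answer)  # x = 5.
```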
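The `<|len_s|>` … `<|len_xxl|>` tokens are length-bucket hints injected during training; they are inert at inference, but the quick-start still inserts one before `<|think|>` so that the input matches the training layout.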
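As a readable view of where the length token sits, the sketch below renders the input layout as a string. It is an illustration only: the quick-start concatenates token ids directly, and tokenizing this string is not guaranteed to give identical ids. `build_input_text` is a hypothetical helper.

```python
# Length-bucket tokens, as listed in the quick-start.
LENGTH_TOKS = {
    "small": "<|len_s|>", "medium": "<|len_m|>", "large": "<|len_l|>",
    "xl": "<|len_xl|>", "xxl": "<|len_xxl|>",
}


def build_input_text(prompt: str, response: str, length: str = "medium") -> str:
    """Render the layout: prompt, response, length hint, then the opening think token."""
    if length not in LENGTH_TOKS:
        raise ValueError(f"unknown length bucket: {length!r}")
    return f"<|prompt|>{prompt}<|response|>{response}{LENGTH_TOKS[length]}<|think|>"


print(build_input_text("If 3x + 7 = 22, what is x?", "x = 5.", "small"))
# <|prompt|>If 3x + 7 = 22, what is x?<|response|>x = 5.<|len_s|><|think|>
```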
- adapter_config.json: PEFT/LoRA config (r, α, targets, base model)
- adapter_model.safetensors: LoRA weights + resized embeddings
- tokenizer.json: BPE tokenizer, extended with CoT specials
- tokenizer_config.json
- chat_template.jinja: Qwen chat template with `<think>…</think>`
- header.png: header image
- README.md: this page
SAMPLE:
PROMPT:
What is the capital of france?
RESPONSE:
The capital of France is Paris! Anything else you would like to know about France?
transformers (cuda) · in 28 tok · out 132 tok · time 2.0s
Okay, the user asked, "What is the capital of France?" I need to provide that directly. But wait, maybe there's more they want. Let me check if there are any other common questions people ask about France. Oh, right, sometimes people ask for interesting facts or a short explanation. Should I include that? The user didn't specify, so maybe stick to the answer unless they ask for more. Also, I should make sure the response is clear and concise. No need for extra info unless it's relevant. Alright, so the answer is Paris. Maybe add a note about being the capital city of France. That should cover it.
The capital of France is Paris! Anything else you would like to know about France?