Instructions for using Glint-Research/Glint-Trace with libraries, inference servers, and local apps.

Libraries

How to use Glint-Research/Glint-Trace with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B-Base")
model = PeftModel.from_pretrained(base_model, "Glint-Research/Glint-Trace")

Transformers

How to use Glint-Research/Glint-Trace with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Glint-Research/Glint-Trace")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Glint-Research/Glint-Trace", dtype="auto")


vLLM

How to use Glint-Research/Glint-Trace with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Glint-Research/Glint-Trace"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Glint-Research/Glint-Trace",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
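
The same request can be issued from Python. The following is a minimal sketch using only the standard library; `build_chat_request` and `send` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(content, base_url="http://localhost:8000",
                       model="Glint-Research/Glint-Trace"):
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request("What is the capital of France?")
# print(send(req))  # uncomment once the server is running
```

For a server on a different port (for example 30000), pass `base_url="http://localhost:30000"`.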

Use Docker

docker model run hf.co/Glint-Research/Glint-Trace

SGLang

How to use Glint-Research/Glint-Trace with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Glint-Research/Glint-Trace" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Glint-Research/Glint-Trace",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Glint-Research/Glint-Trace" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Glint-Research/Glint-Trace",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


glint_ / research · v1.0

0.8B parameters · QLoRA · r=16 · α=32 · 2000 steps

Glint-Trace is a QLoRA adapter that teaches a tiny language model (Qwen 3.5 0.8B Base) to think before it answers. The base model is frozen; only a thin LoRA wrap is trained. The adapter learns to emit an explicit <think> … </think> trace that the prompt can later condition on. Small enough to run on a laptop, fast enough to trace in a few seconds.


── AT A GLANCE 01 / SPECS

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base |
| Method | QLoRA (4-bit base + low-rank adapters) |
| Rank / α / dropout | 16 / 32 / 0.05 |
| Targets | `q_proj · k_proj · v_proj · o_proj · gate_proj · up_proj · down_proj` |
| Trainable params | ~2.06M (LoRA only; int4 base frozen) |
| Context | 2 048 tokens |
| Special tokens | `<\|prompt\|> <\|response\|> <\|think\|> <\|/think\|> <\|len_*\|>` |
| Task | Chain-of-thought generation (`CAUSAL_LM`) |
| PEFT version | 0.18.0 |
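
In standard LoRA, the rank and α listed above fix two quantities: the scale applied to the low-rank update, α / r, and the number of trainable parameters each adapted projection adds, r · (d_in + d_out). The sketch below works through that arithmetic. The 1024 × 1024 projection shape is a hypothetical placeholder for illustration, not the actual layer size of the base model.

```python
RANK, ALPHA = 16, 32  # values from the specs table


def lora_scaling(alpha: float, r: int) -> float:
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted projection.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


print(lora_scaling(ALPHA, RANK))           # 2.0
print(lora_param_count(1024, 1024, RANK))  # 32768 (hypothetical layer shape)
```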

── QUICK START 02 / USAGE

# pip install "transformers>=4.45" "peft>=0.18" torch accelerate bitsandbytes

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor
from peft import PeftModel

REPO  = "Glint-Research/Glint-Trace"     # this repo
BASE  = "Qwen/Qwen3.5-0.8B-Base"

PROMPT_TOK, RESPONSE_TOK = "<|prompt|>", "<|response|>"
THINK_OPEN, THINK_CLOSE  = "<|think|>", "<|/think|>"
LENGTH_TOKS = {            # length-bucket hint, controls trace length
    "small": "<|len_s|>",  "medium": "<|len_m|>",  "large": "<|len_l|>",
    "xl":    "<|len_xl|>", "xxl":    "<|len_xxl|>",
}

class WrapItUpProcessor(LogitsProcessor):
    # Linearly bias toward the <|/think|> close token as the budget runs out.
    def __init__(self, stop_id, prompt_len, max_new,
                 ramp_start=0.5, max_boost=20.0):
        self.stop_id = stop_id          # token id to boost (the close tag)
        self.prompt_len = prompt_len    # prompt length in tokens
        self.max_new = max_new          # generation budget in tokens
        self.ramp_start = ramp_start    # budget fraction at which the ramp begins
        self.max_boost = max_boost      # logit boost once the budget is spent

    def __call__(self, input_ids, scores):
        gen = max(0, input_ids.shape[1] - self.prompt_len)
        frac = gen / self.max_new
        if frac < self.ramp_start:
            return scores
        # t rises linearly from 0 at ramp_start to 1 at the full budget.
        t = min(1.0, (frac - self.ramp_start) / max(1e-6, 1.0 - self.ramp_start))
        scores = scores.clone()
        scores[:, self.stop_id] += t * self.max_boost
        return scores


# --- load ---
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype  = torch.bfloat16 if device == "cuda" else torch.float32

tok  = AutoTokenizer.from_pretrained(REPO)
base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype)
base.resize_token_embeddings(len(tok))                    # +9 CoT special tokens
model = PeftModel.from_pretrained(base, REPO).merge_and_unload()
model.eval().to(device)


# --- build the input exactly the way the adapter was trained ---
def generate_cot(prompt: str, response: str, length: str = "medium",
                  max_new_tokens: int = 800, temperature: float = 0.8,
                  top_p: float = 0.95, repetition_penalty: float = 1.05):
    eot      = tok.eos_token_id
    open_id  = tok.convert_tokens_to_ids(THINK_OPEN)
    close_id = tok.convert_tokens_to_ids(THINK_CLOSE)
    len_id   = tok.convert_tokens_to_ids(LENGTH_TOKS[length])

    ids = (
        [tok.convert_tokens_to_ids(PROMPT_TOK)]
        + tok.encode(prompt,   add_special_tokens=False)
        + [tok.convert_tokens_to_ids(RESPONSE_TOK)]
        + tok.encode(response, add_special_tokens=False)
        + [len_id, open_id]
    )
    input_ids = torch.tensor([ids], device=device)
    attn      = torch.ones_like(input_ids)

    processors = [WrapItUpProcessor(close_id, input_ids.shape[1], max_new_tokens)]
    out = model.generate(
        input_ids, attention_mask=attn,
        max_new_tokens=max_new_tokens,
        do_sample=True, temperature=temperature, top_p=top_p,
        repetition_penalty=repetition_penalty,
        pad_token_id=eot,
        eos_token_id=[close_id, eot],
        logits_processor=processors,
    )
    new_tokens = out[0, input_ids.shape[1]:].tolist()
    while new_tokens and new_tokens[-1] in (close_id, eot):
        new_tokens.pop()
    trace = tok.decode(new_tokens, skip_special_tokens=False).strip()
    return f"<think>{trace}</think>\n\n{response}"


# --- example ---
print(generate_cot(
    prompt="If 3x + 7 = 22, what is x?",
    response="x = 5.",
    length="small",
))
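
To see what `WrapItUpProcessor` does over the course of a generation, the ramp can be reproduced without torch. The sketch below mirrors the arithmetic in `__call__` with the quick-start defaults (`ramp_start=0.5`, `max_boost=20.0`); `wrap_up_boost` is a hypothetical helper name used only for illustration.

```python
def wrap_up_boost(generated, max_new, ramp_start=0.5, max_boost=20.0):
    """Logit boost added to the close token after `generated` new tokens."""
    frac = generated / max_new
    if frac < ramp_start:
        return 0.0
    # t rises linearly from 0 at ramp_start to 1 at the full budget.
    t = min(1.0, (frac - ramp_start) / max(1e-6, 1.0 - ramp_start))
    return t * max_boost


for n in (0, 200, 400, 600, 800):
    print(n, wrap_up_boost(n, max_new=800))
# 0 0.0
# 200 0.0
# 400 0.0
# 600 10.0
# 800 20.0
```

With an 800-token budget, the close token gets no boost for the first 400 tokens, then the boost grows linearly to +20 logits at the limit.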

── HOW TO READ THE TRACE 03 / FORMAT

The adapter was trained so the assistant turns are wrapped in a single <think> … </think> block. The opening tag is produced unconditionally; the closing tag is reached before the budget runs out, after which the final answer is emitted on the same turn.
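
The quick-start `generate_cot` function returns a string of the form `<think>{trace}</think>\n\n{response}`. A minimal sketch for splitting that string back into its two parts follows; `split_trace` is a hypothetical helper, and it assumes the output contains exactly that layout.

```python
import re


def split_trace(text):
    """Split '<think>trace</think>\\n\\nanswer' into (trace, answer).

    Returns (None, text) when the text does not follow that layout.
    """
    m = re.fullmatch(r"<think>(.*?)</think>\n\n(.*)", text, flags=re.DOTALL)
    if m is None:
        return None, text
    return m.group(1), m.group(2)


trace, answer = split_trace("<think>Subtract 7, then divide by 3.</think>\n\nx = 5.")
print(trace)   # Subtract 7, then divide by 3.
print(answer)  # x = 5.
```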

The <|len_s|> … <|len_xxl|> tokens are length-bucket hints injected during training; they are inert at inference.
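
The token order that the quick-start code assembles before generation can be viewed as a plain string. This is an illustrative sketch only: the quick-start code concatenates token IDs rather than text, and `layout` is a hypothetical helper name.

```python
LENGTH_TOKS = {
    "small": "<|len_s|>", "medium": "<|len_m|>", "large": "<|len_l|>",
    "xl": "<|len_xl|>", "xxl": "<|len_xxl|>",
}


def layout(prompt, response, length="medium"):
    """String view of the input: prompt, response, length bucket, open tag."""
    return (
        f"<|prompt|>{prompt}"
        f"<|response|>{response}"
        f"{LENGTH_TOKS[length]}<|think|>"
    )


print(layout("If 3x + 7 = 22, what is x?", "x = 5.", "small"))
# <|prompt|>If 3x + 7 = 22, what is x?<|response|>x = 5.<|len_s|><|think|>
```

Generation then continues from the trailing `<|think|>` token until `<|/think|>` or the end-of-sequence token is produced.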

── TRAINING 04 / RUN

Data: 132 709 SFT rows (English), 1 340 eval
Dropped: 11 019 too long · 9 284 non-English
Optimizer: paged AdamW, QLoRA (NF4, double-quant)
Steps: 2 000 · final eval loss ≈ 1.18
Hardware: 1× consumer GPU (bf16 compute, int4 base)

── FILES IN THIS REPO 05 / CONTENTS

adapter_config.json — PEFT/LoRA config (r, α, targets, base model)
adapter_model.safetensors — LoRA weights + resized embeddings
tokenizer.json — BPE tokenizer, extended with CoT specials
tokenizer_config.json
chat_template.jinja — Qwen chat template with …
header.png — header image
README.md — this page

BUILT BY glint_ research · 2026 AGPL-3.0

SAMPLE:

PROMPT:

What is the capital of france?

RESPONSE:

The capital of France is Paris! Anything else you would like to know about France?

transformers (cuda) · in 28 tok · out 132 tok · time 2.0s

<think>Okay, the user asked, "What is the capital of France?" I need to provide that directly. But wait, maybe there's more they want. Let me check if there are any other common questions people ask about France. Oh, right, sometimes people ask for interesting facts or a short explanation. Should I include that? The user didn't specify, so maybe stick to the answer unless they ask for more. Also, I should make sure the response is clear and concise. No need for extra info unless it's relevant. Alright, so the answer is Paris. Maybe add a note about being the capital city of France. That should cover it.</think>

The capital of France is Paris! Anything else you would like to know about France?
