glint_ / research · v1.0
glint_ research: Glint-Trace
0.8B parameters · QLoRA · r=16 · α=32 · 2000 steps

Glint-Trace is a QLoRA adapter that teaches a tiny language model (Qwen 3.5 0.8B Base) to think before it answers. The base model is frozen; only a thin LoRA wrap is trained. The adapter learns to emit an explicit … trace that the prompt can later condition on. Small enough to run on a laptop, fast enough to trace in a few seconds.


── AT A GLANCE 01 / SPECS
Field                 Value
Base model            Qwen/Qwen3.5-0.8B-Base
Method                QLoRA (4-bit base + low-rank adapters)
Rank / α / dropout    16 / 32 / 0.05
Targets               q_proj · k_proj · v_proj · o_proj · gate_proj · up_proj · down_proj
Trainable params      ~2.06M (LoRA only; int4 base frozen)
Context               2 048 tokens
Special tokens        <|prompt|> <|response|> <|think|> <|/think|> <|len_*|>
Task                  Chain-of-thought generation (CAUSAL_LM)
PEFT version          0.18.0
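
As a rough illustration of what the rank and α rows imply, the sketch below computes the standard LoRA scaling factor (α / r) and the number of weights a rank-r adapter adds to one linear layer. The function names are illustrative, and the 1024 × 1024 layer shape is a hypothetical placeholder, not the real Qwen projection size.

```python
# Sketch only: the layer shape used below is a made-up placeholder,
# not the actual size of any Qwen projection.

def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA multiplies the low-rank update B @ A by alpha / r."""
    return alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so one adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


scaling = lora_scaling(alpha=32, r=16)                   # values from the specs table
per_layer = lora_params(d_in=1024, d_out=1024, r=16)     # hypothetical layer shape
print(scaling, per_layer)
```

Summing `lora_params` over every targeted projection in every layer would give the adapter's total size; the card does not list the layer shapes, so that total is not reproduced here.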
── QUICK START 02 / USAGE
# pip install "transformers>=4.45" "peft>=0.18" torch accelerate bitsandbytes

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor
from peft import PeftModel

REPO  = "Glint-Research/Glint-Trace"     # this repo
BASE  = "Qwen/Qwen3.5-0.8B-Base"

PROMPT_TOK, RESPONSE_TOK = "<|prompt|>", "<|response|>"
THINK_OPEN, THINK_CLOSE  = "<|think|>", "<|/think|>"
LENGTH_TOKS = {            # length-bucket hint, controls trace length
    "small": "<|len_s|>",  "medium": "<|len_m|>",  "large": "<|len_l|>",
    "xl":    "<|len_xl|>", "xxl":    "<|len_xxl|>",
}

class WrapItUpProcessor(LogitsProcessor):
    # Linearly bias toward the <|/think|> close token as the budget runs out.
    def __init__(self, stop_id, prompt_len, max_new,
                 ramp_start=0.5, max_boost=20.0):
        self.stop_id, self.prompt_len = stop_id, prompt_len
        self.max_new = max(1, max_new)   # guard against division by zero
        self.ramp_start = ramp_start     # budget fraction at which the bias starts
        self.max_boost = max_boost       # logit boost once the budget is spent

    def __call__(self, input_ids, scores):
        gen = max(0, input_ids.shape[1] - self.prompt_len)   # tokens generated so far
        frac = gen / self.max_new
        if frac < self.ramp_start:
            return scores
        # t ramps from 0 at ramp_start to 1 at the end of the budget
        t = min(1.0, (frac - self.ramp_start) / max(1e-6, 1.0 - self.ramp_start))
        scores = scores.clone()
        scores[:, self.stop_id] += t * self.max_boost
        return scores


# --- load ---
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype  = torch.bfloat16 if device == "cuda" else torch.float32

tok  = AutoTokenizer.from_pretrained(REPO)
base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype)
base.resize_token_embeddings(len(tok))                    # +9 CoT special tokens
model = PeftModel.from_pretrained(base, REPO).merge_and_unload()
model.eval().to(device)


# --- build the input exactly the way the adapter was trained ---
def generate_cot(prompt: str, response: str, length: str = "medium",
                  max_new_tokens: int = 800, temperature: float = 0.8,
                  top_p: float = 0.95, repetition_penalty: float = 1.05):
    eot      = tok.eos_token_id
    open_id  = tok.convert_tokens_to_ids(THINK_OPEN)
    close_id = tok.convert_tokens_to_ids(THINK_CLOSE)
    len_id   = tok.convert_tokens_to_ids(LENGTH_TOKS[length])

    ids = (
        [tok.convert_tokens_to_ids(PROMPT_TOK)]
        + tok.encode(prompt,   add_special_tokens=False)
        + [tok.convert_tokens_to_ids(RESPONSE_TOK)]
        + tok.encode(response, add_special_tokens=False)
        + [len_id, open_id]
    )
    input_ids = torch.tensor([ids], device=device)
    attn      = torch.ones_like(input_ids)

    processors = [WrapItUpProcessor(close_id, input_ids.shape[1], max_new_tokens)]
    out = model.generate(
        input_ids, attention_mask=attn,
        max_new_tokens=max_new_tokens,
        do_sample=True, temperature=temperature, top_p=top_p,
        repetition_penalty=repetition_penalty,
        pad_token_id=eot,
        eos_token_id=[close_id, eot],
        logits_processor=processors,
    )
    new_tokens = out[0, input_ids.shape[1]:].tolist()
    while new_tokens and new_tokens[-1] in (close_id, eot):
        new_tokens.pop()
    trace = tok.decode(new_tokens, skip_special_tokens=False).strip()
    return f"<think>{trace}</think>\n\n{response}"


# --- example ---
print(generate_cot(
    prompt="If 3x + 7 = 22, what is x?",
    response="x = 5.",
    length="small",
))
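
Since `generate_cot` returns the trace and the answer as one string, a caller may want them separately. The helper below is a minimal sketch of one way to split that string; `split_trace` is a hypothetical name, and the example input is hand-written, not real model output.

```python
def split_trace(text: str) -> tuple[str, str]:
    """Split a '<think>trace</think> answer' string into (trace, answer)."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole string as the answer.
        return "", text.strip()
    trace = head.replace("<think>", "", 1).strip()
    return trace, tail.strip()


# Hand-written example in the same shape generate_cot returns.
trace, answer = split_trace("<think>Subtract 7, then divide by 3.</think>\n\nx = 5.")
print(trace)
print(answer)
```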
── HOW TO READ THE TRACE 03 / FORMAT

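The adapter was trained so that each assistant turn carries its reasoning in a single … block. In the quick-start code the opening tag is appended to the input, so it is always present; the closing tag is meant to arrive before the token budget runs out (WrapItUpProcessor biases generation toward it), after which the final answer follows on the same turn.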
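
To make "before the budget runs out" concrete, the sketch below mirrors the arithmetic of `WrapItUpProcessor` from the quick start as a plain function, without torch. `wrap_boost` is an illustrative name; the defaults are the ones used in the quick start.

```python
def wrap_boost(gen: int, max_new: int,
               ramp_start: float = 0.5, max_boost: float = 20.0) -> float:
    """Logit boost added to the close token after `gen` generated tokens."""
    frac = gen / max(1, max_new)            # fraction of the budget used so far
    if frac < ramp_start:
        return 0.0                          # no bias in the first part of the budget
    # Linear ramp from 0 at ramp_start to 1 at the end of the budget.
    t = min(1.0, (frac - ramp_start) / max(1e-6, 1.0 - ramp_start))
    return t * max_boost


for gen in (0, 400, 600, 800):
    print(gen, wrap_boost(gen, max_new=800))
```

With an 800-token budget the boost stays at 0 through token 400, reaches 10 at token 600, and peaks at 20 at token 800.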

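The <|len_s|> … <|len_xxl|> tokens are length-bucket hints injected during training. At inference they act as hints rather than hard limits: the quick-start code still places one before <|think|>, and the actual bound on trace length comes from max_new_tokens and WrapItUpProcessor.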
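
For orientation, the sketch below shows where the length token sits in the input, written as a string for readability. The quick start builds the same sequence from token ids; tokenizing this string is not guaranteed to give identical ids, so treat `build_input_text` (a hypothetical helper) as a visualization aid only.

```python
LENGTH_TOKS = {
    "small": "<|len_s|>", "medium": "<|len_m|>", "large": "<|len_l|>",
    "xl": "<|len_xl|>", "xxl": "<|len_xxl|>",
}


def build_input_text(prompt: str, response: str, length: str = "medium") -> str:
    """Lay out prompt, response, length hint, and the opening think tag in order."""
    return f"<|prompt|>{prompt}<|response|>{response}{LENGTH_TOKS[length]}<|think|>"


layout = build_input_text("If 3x + 7 = 22, what is x?", "x = 5.", length="small")
print(layout)
```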

── TRAINING 04 / RUN
Data        132 709 SFT rows (English), 1 340 eval
Dropped     11 019 too long · 9 284 non-English
Optimizer   paged AdamW, QLoRA (NF4, double-quant)
Steps       2 000 · final eval loss ≈ 1.18
Hardware    1× consumer GPU (bf16 compute, int4 base)
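
The training script is not part of this card, so the following is only a sketch of how the stated settings might map onto keyword arguments for `transformers.BitsAndBytesConfig` and `peft.LoraConfig`. The values come from the specs and training tables; the dict names are illustrative. The optimizer variant, learning rate, and batch size are not stated in the card and are left out.

```python
# Plain dicts so the snippet runs without transformers or peft installed.
# Assumed usage: BitsAndBytesConfig(**quant_kwargs) and LoraConfig(**lora_kwargs).

quant_kwargs = dict(
    load_in_4bit=True,                     # int4 base
    bnb_4bit_quant_type="nf4",             # NF4 quantization
    bnb_4bit_use_double_quant=True,        # double-quant
    bnb_4bit_compute_dtype="bfloat16",     # bf16 compute
)

lora_kwargs = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

print(quant_kwargs)
print(lora_kwargs)
```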
── FILES IN THIS REPO 05 / CONTENTS
  • adapter_config.json: PEFT/LoRA config (r, α, targets, base model)
  • adapter_model.safetensors: LoRA weights + resized embeddings
  • tokenizer.json: BPE tokenizer, extended with CoT specials
  • tokenizer_config.json
  • chat_template.jinja: Qwen chat template with …
  • header.png: header image
  • README.md: this page
BUILT BY glint_ research · 2026 · AGPL-3.0

SAMPLE:

PROMPT:

What is the capital of france?

RESPONSE:

The capital of France is Paris! Anything else you would like to know about France?

============================================================================== transformers (cuda) · in 28 tok · out 132 tok · time 2.0s

Okay, the user asked, "What is the capital of France?" I need to provide that directly. But wait, maybe there's more they want. Let me check if there are any other common questions people ask about France. Oh, right, sometimes people ask for interesting facts or a short explanation. Should I include that? The user didn't specify, so maybe stick to the answer unless they ask for more. Also, I should make sure the response is clear and concise. No need for extra info unless it's relevant. Alright, so the answer is Paris. Maybe add a note about being the capital city of France. That should cover it.

The capital of France is Paris! Anything else you would like to know about France?
