tracegenix-mini-sft-clean-3ep
A small Qwen3.5-MoE checkpoint (~0.87 B total / ~0.69 B active) supervised fine-tuned
on kshitijthakkar/TraceVerse-RL-SFT-Clean for OpenTelemetry trace analysis: cost,
latency, error, security, and GPU/CO₂ summarisation over gen_ai.* semantic-convention
spans produced by genai-otel-instrument.
Part of the TraceVerse-RL lineage — the SFT step that precedes GRPO + LoRA-soup merging.
At a glance
| Property | Value |
|---|---|
| Architecture | Qwen3.5-MoE, 12 layers, hybrid attention (linear × 3 + full × 1, repeated) |
| Total params | 870 M |
| Active params | 687 M |
| Hidden / head_dim | 1024 / 256 |
| Experts (routed / shared / per-tok) | 16 / 1 / 2 |
| MoE intermediate | 400 |
| Vocab | 248 320 (Qwen3.5 BPE) |
| Context | 1 048 576 (yarn-extended; sliding window 32) |
| Precision | bf16 |
| Base | Qwen/Qwen3.5-0.8B-Base |
| Training | 3 epochs SFT, LR 2.5e-5 cosine, packing, adamw_8bit on H100 |
| Training data | kshitijthakkar/TraceVerse-RL-SFT-Clean (14.9 K rows) |
Intended use
The model is domain-specialised. It expects:
- The tracegenix system message (or any close variant — see chat template), and
- A structured OpenTelemetry span / trace payload (or an analytical question referencing one).
Typical tasks: trace classification (12-category taxonomy), cost / latency analysis, multi-section trace reports, RCA hypothesis building, agent quality scoring. Not designed for general-purpose chat or retrieval-augmented Q&A outside the trace-analysis domain.
How to use
With Ollama (recommended for local use)
⚠️ Important: the Python-Jinja chat template embedded in the GGUF uses macros, raise_exception, and enable_thinking branches that llama.cpp's Jinja parser cannot handle. Ollama silently falls back to a bare {{ .Prompt }} template, stripping the <|im_start|> / <|im_end|> role markers, so /api/chat produces hallucinated garbage by default.
Fix: override the template in the Modelfile with a proper ChatML Go-template. Use Modelfile.q4 from this repo (or copy the snippet below).
FROM ./tracegenix-mini-sft-clean-3ep-Q4_K_M.gguf
SYSTEM """You are tracegenix, TraceVerse's autonomous AI SRE for production AI agents. Open new conversations with 'Namaste 🙏'. Investigate OpenTelemetry signals and traces to identify root causes, propose remediations, and proactively surface context the user might miss. Use available MCP tools for live data; cite sources from trace evidence; never fabricate. Respond with structured JSON when a schema is clearly implied by the prompt; concise prose otherwise. Match the user's language (English, Hindi, Hinglish, or other Indian languages). You are a teammate, not a chatbot."""
TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions. Each call must be wrapped in <tool_call>...</tool_call> XML tags with a JSON object inside.
<tools>
{{- range .Tools }}
{{ . }}
{{- end }}
</tools>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ .Content }}{{- if not (eq (len (slice $.Messages $i)) 1) }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- end }}<|im_start|>assistant
{{ else }}{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ end }}{{ .Response }}{{- if .Response }}<|im_end|>{{ end }}"""
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER num_ctx 8192
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
ollama create tracegenix-mini-q4 -f Modelfile.q4
ollama run tracegenix-mini-q4
With this template both /api/chat and the chat UI produce correct output —
e.g. for the orchestration_planner task, the model emits a clean JSON tool
plan that matches the training distribution.
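To make the role markers concrete, here is a minimal Python sketch of the ChatML layout that the TEMPLATE override is meant to reproduce. It is an approximation for illustration, not Ollama's Go-template engine: the helper name `render_chatml` and the shortened system message are assumptions, and tool handling is left out.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    # Shortened placeholder; use the full SYSTEM text from the Modelfile in practice.
    {"role": "system", "content": "You are tracegenix."},
    {"role": "user", "content": "Summarise the latency of this trace."},
]
prompt = render_chatml(messages)
print(prompt)
```

Without these markers the model sees only the raw user text, which is the failure mode described in the warning.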
With llama-server
./llama-server -m tracegenix-mini-sft-clean-3ep-Q4_K_M.gguf -ngl 99 -c 8192 -fa on
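Once the server is running it can be queried over its OpenAI-compatible API. The sketch below assumes llama-server is listening on its default address, http://localhost:8080; the helper names `build_chat_request` and `chat` are illustrative, not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: llama-server default host and port


def build_chat_request(user_content, system_content=None, max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    messages = []
    if system_content:
        messages.append({"role": "system", "content": system_content})
    messages.append({"role": "user", "content": user_content})
    return {"messages": messages, "max_tokens": max_tokens, "temperature": 0.0}


def chat(payload, base_url=BASE_URL):
    """POST the payload to /chat/completions and return the assistant text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat(build_chat_request("Classify this trace: ...", system_content="You are tracegenix.")))` would print the model's reply.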
With transformers (bf16 safetensors)
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("kshitijthakkar/tracegenix-mini-sft-clean-3ep")
model = AutoModelForCausalLM.from_pretrained(
"kshitijthakkar/tracegenix-mini-sft-clean-3ep",
dtype="bfloat16",
device_map="auto",
trust_remote_code=True,
)
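A possible way to run a generation with the objects loaded above, sketched under assumptions: the span dictionary is a hypothetical payload using gen_ai.* attribute names, the helper names are illustrative, and the system prompt should be the tracegenix system message shown in the Ollama section.

```python
import json


def build_messages(system_prompt, span, question):
    """Combine the system message, a question, and a span payload into chat messages."""
    user_content = f"{question}\n\n{json.dumps(span, indent=2)}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


def generate(tok, model, messages, max_new_tokens=256):
    """Apply the tokenizer's chat template and decode only the newly generated tokens."""
    inputs = tok.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)


# Hypothetical span payload, for illustration only.
example_span = {
    "name": "chat",
    "attributes": {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 812,
        "gen_ai.usage.output_tokens": 143,
    },
}
messages = build_messages("You are tracegenix.", example_span, "Summarise the cost of this span.")
```

Calling `generate(tok, model, messages)` then returns the model's answer as a string.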
GGUF quantizations
Converted with llama.cpp commit
67b2b7f (May 2026) via convert_hf_to_gguf.py --outtype bf16, then
llama-quantize. Architecture handler: Qwen3_5MoeTextModel (registered for
Qwen3_5MoeForConditionalGeneration / Qwen3_5MoeForCausalLM).
| File | Size | Use for | Notes |
|---|---|---|---|
| tracegenix-mini-sft-clean-3ep-bf16.gguf | 1.9 GB | accuracy reference | Same precision the model was trained at; ~120 tok/s on a 6 GB consumer GPU via Ollama chat (with the Modelfile template override from the Ollama section above) |
| tracegenix-mini-sft-clean-3ep-Q6_K.gguf | 929 MB | best quality / size trade | Near-lossless vs bf16 for K-quant kernels |
| tracegenix-mini-sft-clean-3ep-Q4_K_M.gguf | 796 MB | default | Smallest, fastest; loaded in Ollama as tracegenix-mini-q4 |
Why no Q8_0
llama.cpp's Q8_0 requires every quantized tensor's column dimension to be
divisible by 32. This model's MoE expert FFN tensors have ncols = 400
(moe_intermediate_size = 400), and 400 % 32 = 16, not 0. There
is no fallback path defined for Q8_0 in this case, so the conversion errors out:
warning: blk.0.ffn_down_exps.weight - ncols 400 not divisible by 32 (required for type q8_0)
llama_model_quantize: failed to quantize: no tensor type fallback is defined for type q8_0
K-quants (Q4_K_M, Q6_K) use a 256-element super-block layout that handles 400 fine, hence they succeed. Q6_K is the practical substitute — its perplexity gap vs Q8_0 on Llama-3-8B reference is roughly +0.02 (negligible).
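The Q8_0 failure comes down to one line of arithmetic, which the following sketch reproduces. The helper name `fits_block` is illustrative; the block size of 32 and the tensor width of 400 are the values quoted above.

```python
Q8_0_BLOCK = 32  # column-divisibility requirement for Q8_0, as stated above


def fits_block(ncols, block):
    """Return True when a row of ncols elements splits evenly into blocks."""
    return ncols % block == 0


moe_intermediate_size = 400
remainder = moe_intermediate_size % Q8_0_BLOCK

print(fits_block(moe_intermediate_size, Q8_0_BLOCK))  # False
print(remainder)                                      # 16 elements left over
print(fits_block(1024, Q8_0_BLOCK))                   # True: the hidden size divides evenly
```

The expert FFN width leaves 16 elements over after twelve full blocks, which is what the quantizer's warning reports.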
Measured throughput (RTX 3060 6 GB, num_ctx=8192, Ollama /api/chat with the Modelfile template override)
Measured on the orchestration_planner task (~1.7 KB system + short user
query, ~190-token JSON output, temperature=0.0).
| Quant | Eval rate | First-token latency | Notes |
|---|---|---|---|
| Q4_K_M | 160 tok/s | ~0.4 s (warm) | Recommended for interactive use |
| bf16 | 118 tok/s | ~5 s (warm) | Same precision as the original safetensors; output literally matches ground truth |
(Q6_K not benchmarked here; expected to fall between the two, closer to Q4_K_M.
Numbers were obtained via /api/chat with the Go-template override — without
that override Ollama strips role markers and decoding degenerates to
< 1 tok/s, which is a chat-template issue, not a property of the weights.)
Evaluation
Judged by lightning-ai/gpt-oss-120b on 212 examples from TraceVerse-RL-Eval-Golden across 4 dimensions (1–10 scale). Detailed per-sample completions and scores are in TraceVerse-RL-Eval-Inference.
(Per-environment table is below in the auto-maintained benchmark block.)
Limitations
- Domain-specialised. Best on structured trace-analysis prompts. Casual chat works but is not the trained distribution.
- GGUF-embedded chat template does not work in Ollama. llama.cpp's Jinja parser cannot handle the source chat_template.jinja (macros, raise_exception, enable_thinking branches), so Ollama falls back to a bare {{ .Prompt }} template and /api/chat produces hallucinated output. Always use the Modelfile TEMPLATE override shown in the Ollama section above; with it the model emits correct training-distribution outputs (verified on the orchestration_planner task, where the output matches ground truth).
- MTP heads not exported. Qwen3.5-MoE's multi-token-prediction layers were dropped during the source SFT; the GGUFs ship the main LM head only. Ollama / llama.cpp mainline do not consume MTP at inference, so this is irrelevant for typical usage. (Speculative-decoding-aware runtimes such as am17an/llama.cpp:mtp-clean would benefit if MTP were restored.)
- Hallucination on out-of-domain technical questions is consistent with a small specialised model; verify generated factual claims independently.
Measuring throughput
The Ollama chat UI does not display tokens/s. To measure:
# CLI verbose mode
ollama run tracegenix-mini-q4 --verbose "your prompt"
# Via the API
curl -s http://localhost:11434/api/chat -d '{
"model":"tracegenix-mini-q4",
"messages":[{"role":"user","content":"..."}],
"stream":false
}' | jq '{eval_count, tokens_per_s: (.eval_count / (.eval_duration / 1e9))}'
# Or tail the server log
tail -f ~/.ollama/logs/server.log
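The same calculation can be scripted in Python. This is a sketch under assumptions: it uses the eval_count and eval_duration (nanoseconds) fields that the jq filter above reads, the default Ollama address http://localhost:11434, and the model name tracegenix-mini-q4 created earlier; the sample numbers are made up for illustration.

```python
import json
import urllib.request


def tokens_per_second(eval_count, eval_duration_ns):
    """Convert Ollama's eval_count / eval_duration (nanoseconds) into tokens per second."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def measure(prompt, model="tracegenix-mini-q4", host="http://localhost:11434"):
    """Send one non-streaming chat request and return the measured decode rate."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])


# Illustrative numbers only: 190 tokens decoded in 1.1875 s.
sample_rate = tokens_per_second(190, 1_187_500_000)
print(f"{sample_rate:.1f} tok/s")  # 160.0 tok/s
```

`measure("your prompt")` needs a running Ollama instance; averaging several warm runs gives a steadier figure than a single call.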
Citation
@misc{tracegenix-mini-sft-clean-3ep,
author = {Kshitij Thakkar},
title = {tracegenix-mini-sft-clean-3ep: a small Qwen3.5-MoE for OpenTelemetry trace analysis},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/kshitijthakkar/tracegenix-mini-sft-clean-3ep}}
}
Benchmark Results — TraceVerse-RL-Eval-Golden
Judged by lightning-ai/gpt-oss-120b on 212 examples across 4 dimensions (1-10 scale).
| Overall | Accuracy | Completeness | Format | Grounding |
|---|---|---|---|---|
| 4.25 | 3.86 | 4.29 | 5.23 | 4.11 |
Per-environment
| Environment | N | Overall |
|---|---|---|
| orchestration_planner | 23 | 7.89 |
| cloud_incident_response | 2 | 5.35 |
| incident_ops | 2 | 4.85 |
| soc_triage_gym | 3 | 4.47 |
| tool_orchestrator | 43 | 4.35 |
| trace_classifier | 60 | 3.89 |
| report_generator | 33 | 3.45 |
| agent_scorer | 30 | 3.38 |
| deep_investigator | 9 | 3.19 |
| quality_evaluator | 7 | 2.99 |
See full per-sample scores at kshitijthakkar/TraceVerse-RL-Eval-Inference.
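As a consistency check, the overall score can be recovered from the per-environment table as a sample-weighted mean. The sketch below copies the N and Overall columns from the table; because those per-environment scores are rounded to two decimals, the result is only expected to agree with the reported overall after rounding.

```python
# (environment, N, overall) rows copied from the per-environment table.
rows = [
    ("orchestration_planner", 23, 7.89),
    ("cloud_incident_response", 2, 5.35),
    ("incident_ops", 2, 4.85),
    ("soc_triage_gym", 3, 4.47),
    ("tool_orchestrator", 43, 4.35),
    ("trace_classifier", 60, 3.89),
    ("report_generator", 33, 3.45),
    ("agent_scorer", 30, 3.38),
    ("deep_investigator", 9, 3.19),
    ("quality_evaluator", 7, 2.99),
]

total_n = sum(n for _, n, _ in rows)
weighted_mean = sum(n * score for _, n, score in rows) / total_n

print(total_n)                  # 212 examples in total
print(round(weighted_mean, 2))  # 4.25, matching the reported overall score
```

The mean is dominated by trace_classifier and tool_orchestrator, which together account for 103 of the 212 examples.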