gemma-4-e2b-reasoning-lora
A LoRA adapter for unsloth/gemma-4-E2B-it that reshapes the model's reasoning style into concise bulleted thinking traces while keeping the final answers intact.
Instead of the base model's long, verbose Thinking Process: blocks, this adapter makes the model emit a short flat - bullet list inside a <|channel>thought ... <channel|> block, then the answer — exactly the condensed reasoning style it was trained on.
What's in this repo
| File | Purpose |
|---|---|
| adapter_model.safetensors | The trained LoRA weights (12.08M params, ~46 MB) |
| adapter_config.json | LoRA config (r=8, alpha=8, target modules) |
| tokenizer.json, tokenizer_config.json, chat_template.jinja | Gemma4 tokenizer + chat template |
| chat.py | Ready-to-run interactive chat script (streaming) |
| README.md | This file |
This is a LoRA adapter only, not a standalone model. You load the base model (unsloth/gemma-4-E2B-it) and apply this adapter on top; see below.
Quick start (chat)
pip install torch transformers peft
python chat.py
chat.py auto-detects CUDA / Intel XPU / CPU, loads the base model, applies this adapter, merges it, and starts a streaming chat with thinking ON. In-chat commands: /q quit · /reset clear history · /raw show special-token markers · /think toggle thinking.
How to use the LoRA adapter (code)
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel

BASE = "unsloth/gemma-4-E2B-it"
ADAPTER = "xbruce22/gemma-4-e2b-reasoning-lora"

device = "cuda" if torch.cuda.is_available() else (
    "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")
dtype = torch.float32 if device == "cpu" else torch.bfloat16

base = AutoModelForCausalLM.from_pretrained(BASE, dtype=dtype).to(device)
model = PeftModel.from_pretrained(base, ADAPTER)
# Optional: merge LoRA into the weights for faster inference
model = model.merge_and_unload()
model.eval()

processor = AutoProcessor.from_pretrained(BASE)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write DFS in python, keep short."},
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = processor(text=[text], return_tensors="pt").to(device)

# Text-only: drop multimodal-only fields generate() rejects
for k in list(inputs):
    if "token_type" in k or "pixel" in k or "audio" in k:
        inputs.pop(k)

with torch.inference_mode():
    out = model.generate(
        **inputs, max_new_tokens=1024, do_sample=True,
        temperature=1.0, top_p=0.95, top_k=64,
        pad_token_id=processor.tokenizer.pad_token_id)

gen = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(gen, skip_special_tokens=True))
Notes:
- Pass enable_thinking=True to apply_chat_template so the template injects <|think|> and the model produces the <|channel>thought ... <channel|> reasoning block before the answer.
- Recommended Gemma-4 sampling: temperature=1.0, top_p=0.95, top_k=64.
- If you don't merge_and_unload(), keep using the PeftModel directly; both work.
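Why merged and unmerged inference agree can be seen on a toy layer. The sketch below assumes the standard LoRA formulation, in which the adapted layer computes W·x + (alpha/r)·B·(A·x) and merging folds the update into a single weight W + (alpha/r)·B·A. The matrices and the rank of 2 are made-up illustrative values, not the adapter's real weights; only the alpha/r ratio of 1 matches this adapter (r=8, alpha=8).

```python
# Toy check that merging a LoRA update leaves a layer's output unchanged.
# All matrices are made-up illustrative values, not real adapter weights.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

r, alpha = 2, 2        # toy rank (assumption); the adapter itself uses r=8, alpha=8
scaling = alpha / r    # 1.0, the same ratio as the adapter

W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]          # frozen base weight, shape (out=2, in=3)
B = [[0.5, 0.0],
     [1.0, -1.0]]              # LoRA "up" matrix, shape (out=2, r=2)
A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]          # LoRA "down" matrix, shape (r=2, in=3)
x = [[1.0], [2.0], [3.0]]      # one input column vector, shape (in=3, 1)

# Unmerged: base path plus the scaled low-rank path.
y_unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the update into the weight once, then do a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_unmerged, y_merged)
```

Both paths produce the same output, so merging only removes the extra low-rank matmuls at inference time. With real bf16 weights the two results can differ by small rounding errors.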
Expected output style
Prompt: Write DFS in python, keep short.
── thinking ──
- User wants a DFS implementation in Python, explicitly requesting it be "short"
- Settled on iterative version using a stack and visited set ...
- Concise version: no classes, just a function — keeps it short while remaining correct
── answer ──
def dfs(graph, start, visited=None):
...
The reasoning is now terse, bulleted, and scannable — the style it was fine-tuned to produce.
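To display the thinking and the answer separately, as in the sample above, the raw text can be split on the channel markers. The following is a minimal sketch, not the logic used by chat.py: `split_thinking` is a hypothetical helper, and it assumes the markers survive decoding exactly as written in this README (for example, when decoding with skip_special_tokens=False).

```python
import re

# Marker strings copied from this README; whether they appear verbatim in the
# decoded text is an assumption.
THOUGHT_RE = re.compile(r"<\|channel>thought\s*(.*?)\s*<channel\|>", re.DOTALL)

def split_thinking(raw: str):
    """Return (bullets, answer); bullets is an empty list if no thought block is found."""
    match = THOUGHT_RE.search(raw)
    if match is None:
        return [], raw.strip()
    bullets = [
        line.strip()[2:].strip()
        for line in match.group(1).splitlines()
        if line.strip().startswith("- ")
    ]
    answer = raw[match.end():].strip()
    return bullets, answer

# Hypothetical raw output, shaped like the example in this section.
sample = (
    "<|channel>thought\n"
    "- User wants DFS in Python\n"
    "- Keep it short\n"
    "<channel|>def dfs(graph, start): ..."
)
bullets, answer = split_thinking(sample)
print(bullets)
print(answer)
```

Text without a thought block is returned unchanged as the answer, so the helper can also be applied when thinking is toggled off.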
Training details
- Method: LoRA (r=8, alpha=8, dropout=0) on the text language model's attention (q/k/v/o_proj) + MLP (gate/up/down_proj) modules. Vision and audio towers frozen (text-only finetune).
- Trainable params: 12,079,104 (0.236% of 5.1B).
- Data: 25,614 reasoning rows from Jackrong/GLM-5.1-Reasoning-1M-Cleaned (main subset). The verbose imd…answer thinking traces were condensed into terse flat bullet lists (via a condenser prompt); the original final answers were kept verbatim.
- Training format: Gemma4 chat format with thinking ON: <|channel>thought\n...bullets...\n<channel|> then the final answer, <|turn> turn markers, assistant-only loss (user/system tokens masked to -100).
- Hardware: Intel XPU (Intel Graphics 0xe211, 24 GB), bf16, adamw_torch, gradient checkpointing. No 4-bit / bitsandbytes (no XPU build).
- Schedule: 1 full epoch, 6400 steps, per-device batch 1 × gradient accumulation 4, lr 2e-4 linear, 5 warmup steps, max_seq_length 1536. ~5.7 h.
- Final train_loss: 0.795 (loss MA 1.22 → 0.76, token accuracy 0.74 → 0.79, no OOM).
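The assistant-only loss mentioned above can be illustrated with a small sketch. It is not the training code for this adapter: `mask_non_assistant`, the token ids, and the per-token role list are hypothetical. The one fixed fact is that -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions labelled -100 contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_assistant(input_ids, roles):
    """Copy input_ids into labels, replacing non-assistant positions with -100."""
    if len(input_ids) != len(roles):
        raise ValueError("input_ids and roles must have the same length")
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(input_ids, roles)
    ]

# Hypothetical token ids and per-token roles for one short conversation.
input_ids = [11, 12, 21, 22, 23, 31, 32, 33]
roles = ["system", "system", "user", "user", "user",
         "assistant", "assistant", "assistant"]

labels = mask_non_assistant(input_ids, roles)
print(labels)
```

Only the assistant tokens (the thought block and the final answer) keep their ids as labels, so the gradient comes from the text the model is meant to generate and not from the prompt.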
License
Apache-2.0 (adapter weights). The base model unsloth/gemma-4-E2B-it follows Gemma's terms.