guava-05-22

Full fine-tune of Qwen/Qwen3.5-4B for closed-loop tool-calling robot manipulation. Part of the guava project.

This checkpoint: model at step 228 (epoch 3.0), final training loss = 0.2162.

⚠ Loading: use the multimodal auto-class

Qwen/Qwen3.5-4B is a vision-language model. Load with AutoModelForImageTextToText (or Qwen3_5ForConditionalGeneration directly), NOT AutoModelForCausalLM — the latter returns the text-only variant without language_model and will fail at generation.

Training hyperparameters

Base model: Qwen/Qwen3.5-4B
Dtype: bfloat16
Tuner: Full fine-tune (LM trained, ViT + aligner frozen)
Epochs: 3.0
LR / schedule: 1e-05 / cosine, 0.05 warmup
Per-device batch / grad accum: 2 / 2
Max length: 10240
Final loss @ step 228: 0.2162
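The step count and batch settings listed above allow a rough sanity check of the run's scale. The following is a minimal sketch, assuming that "step" counts optimizer steps. The GPU count is not stated in this card, so `world_size` is a hypothetical parameter the reader must supply, and both helper names are illustrative.

```python
def steps_per_epoch(total_steps: int, epochs: float) -> float:
    """Optimizer steps per epoch, assuming steps are spread evenly over epochs."""
    return total_steps / epochs


def effective_batch(per_device: int, grad_accum: int, world_size: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return per_device * grad_accum * world_size


# Values from this card: 228 steps over 3.0 epochs, per-device batch 2, grad accum 2.
per_epoch = steps_per_epoch(228, 3.0)

# world_size=1 is an assumption for illustration only; substitute the real GPU count.
batch = effective_batch(2, 2, world_size=1)

# Upper bound on training sequences per epoch (the last batch may be partial).
approx_samples = per_epoch * batch
```

With the real `world_size` substituted, `approx_samples` gives an upper bound on the number of training sequences seen per epoch.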

System prompt

This model was trained against a specific system prompt; see system_prompt.txt. Use that exact content as the system message, since any other prompt shifts the input away from the training distribution.

Usage (transformers, no PEFT)

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "AIcell/guava-05-22", torch_dtype=torch.bfloat16, device_map="cuda",
)
proc = AutoProcessor.from_pretrained("AIcell/guava-05-22")

system_prompt = open("system_prompt.txt").read().strip()
scene_img = Image.open("scene.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": scene_img},
        {"type": "text", "text":
            "Task: <your task description>.\n\n"
            "Gripper is at [...] rotation [...] width X%."},
    ]},
]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Per-turn assistant output: a <think>…</think> block followed by exactly one <tool_call>{"name": "<tool>", "arguments": {…}}</tool_call> (or Task complete. / Task failed. to terminate).
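A closed-loop controller needs to split each assistant turn into its reasoning, its tool call, and a possible terminal status. The following is a minimal parsing sketch for the output format described above, not the project's own harness. The function name `parse_turn`, the returned dict layout, and the tool name `move_to` in the usage note are hypothetical. It also assumes the opening `<think>` tag may be absent from the decoded text (for example if the chat template emits it as part of the prompt), so only `</think>` is required to delimit the reasoning.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
TERMINATORS = ("Task complete.", "Task failed.")


def parse_turn(text: str) -> dict:
    """Split one assistant turn into thought, tool call, and terminal status."""
    # Everything before the last </think> is reasoning; if the tag is missing,
    # rpartition returns ("", "", text) and the whole text is treated as the answer.
    thought, _, rest = text.rpartition("</think>")
    thought = thought.replace("<think>", "", 1).strip()
    rest = rest.strip()

    # A terminating turn carries no tool call.
    if rest in TERMINATORS:
        return {"thought": thought, "tool_call": None, "status": rest}

    # The format promises exactly one tool call per non-terminal turn.
    calls = TOOL_CALL_RE.findall(rest)
    if len(calls) != 1:
        raise ValueError(f"expected exactly one <tool_call>, found {len(calls)}")

    call = json.loads(calls[0])
    if "name" not in call or "arguments" not in call:
        raise ValueError("tool call must contain 'name' and 'arguments'")
    return {"thought": thought, "tool_call": call, "status": None}
```

For example, passing `'<think>Approach the cube.</think>\n<tool_call>{"name": "move_to", "arguments": {"x": 0.1}}</tool_call>'` yields a dict whose `tool_call["name"]` is `"move_to"`, while `"<think>Done.</think>\nTask complete."` yields `status == "Task complete."` and no tool call. Raising on malformed turns lets the control loop decide whether to retry or abort.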

vLLM serving (no LoRA flags needed)

vllm serve AIcell/guava-05-22 \
    --port 8000 --max-model-len 24576 \
    --reasoning-parser qwen3 --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt '{"image": 20}'
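Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the standard library. The helper names `build_payload` and `post_chat`, the `localhost` base URL, and the PNG MIME type are assumptions for illustration. The sketch does not pass a `tools` list; whether tool schemas should be sent with the request is not stated in this card.

```python
import base64
import json
import urllib.request


def build_payload(system_prompt: str, task_text: str, image_bytes: bytes,
                  model: str = "AIcell/guava-05-22") -> dict:
    """Build a chat completions request with one inline image (assumed PNG)."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": task_text},
            ]},
        ],
        "max_tokens": 512,
        "temperature": 0.0,  # greedy decoding, mirroring do_sample=False
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the server; base_url matches --port 8000 above."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A typical call would read `system_prompt.txt` and `scene.png` from disk, pass their contents to `build_payload`, and send the result with `post_chat`. Because the server is started with a reasoning parser, the reasoning text may come back in a separate field from the message content, depending on the vLLM version.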

Source

Training script, eval harness, and upload tooling: https://github.com/hdacnw/guava
