Qwen3.5-4B ImCap — Indonesian Image Captioning (LoRA + Connector, merged)
A fine-tune of Qwen/Qwen3.5-4B-Base for image captioning in Indonesian (Bahasa Indonesia).
Unlike the LoRA-only variant, this model also trains the visual connector (visual.merger)
in addition to applying LoRA to the language layers; the vision backbone stays frozen. All trained weights have been merged into the base weights.
Training Scheme
| Component | Status | Notes |
|---|---|---|
| Language layers | LoRA (r=32, alpha=64) | q/k/v/o/gate/up/down_proj |
| Connector (visual.merger) | Fully trained (modules_to_save) | Separate, smaller LR (2e-5) |
| Vision backbone | Frozen | - |
| Adapter status | Merged into base | standalone |
| Output language | Indonesian | JSON format {"caption": "..."} |
| enable_thinking | False | |
| EOS token | `<\|im_end\|>` (+ `<\|endoftext\|>`) | |
| Precision | bfloat16 | |
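The table gives the connector its own, smaller learning rate. Below is a minimal sketch of how such a split could be expressed as optimizer parameter groups. It is not the author's training code: the LoRA learning rate (`LORA_LR`), the helper name `build_param_groups`, and the toy parameter names are all assumptions made for illustration. Only the connector learning rate of 2e-5 and the `visual.merger` module name come from the card.

```python
from typing import Any, Dict, Iterable, List, Tuple

CONNECTOR_LR = 2e-5  # from the training-scheme table
LORA_LR = 1e-4       # assumption: the card does not state the LoRA learning rate


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    lora_lr: float = LORA_LR,
    connector_lr: float = CONNECTOR_LR,
) -> List[Dict[str, Any]]:
    """Split trainable parameters into a LoRA group and a connector group.

    Parameters that are neither LoRA adapter weights nor part of
    `visual.merger` (for example the vision backbone) are left out,
    so an optimizer built from these groups never updates them.
    """
    lora, connector = [], []
    for name, param in named_params:
        # Skip frozen tensors; plain stand-in objects count as trainable.
        if not getattr(param, "requires_grad", True):
            continue
        if "visual.merger" in name:
            connector.append(param)
        elif "lora_" in name:
            lora.append(param)
    return [
        {"params": lora, "lr": lora_lr},
        {"params": connector, "lr": connector_lr},
    ]


# Hypothetical parameter names standing in for model.named_parameters();
# in a real run the second element of each pair would be a tensor.
toy = [
    ("model.language_model.layers.0.self_attn.q_proj.lora_A.default.weight", "A"),
    ("model.visual.merger.linear_fc1.weight", "M"),
    ("model.visual.blocks.0.attn.qkv.weight", "V"),
]
groups = build_param_groups(toy)
print([(len(g["params"]), g["lr"]) for g in groups])
```

The returned list has the shape that PyTorch optimizers accept for per-group options, so with real tensors it could be passed directly to an optimizer constructor.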
Usage (Inference)
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

REPO = "Adicandra/Qwen3.5-4B-ImCap-LoRA-Connector"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="auto",
    attn_implementation="flash_attention_2",  # use "sdpa" on GPUs older than Ampere
)
model.eval()
```
System Prompt (MUST match the one used in training)
```python
# Keep this prompt verbatim (in Indonesian); it is the prompt the model was trained with.
SYSTEM_PROMPT = (
    "Annotator dataset image captioning. Tulis caption Bahasa Indonesia yang deskriptif.\n\n"
    "Aturan:\n"
    "- Deskripsikan subjek utama, detail visual (warna, posisi, atribut), dan latar belakang.\n"
    "- Jika gambar mengandung teks penting (meme, infografis, berita, poster), sertakan isi teksnya dalam caption.\n"
    "- KHUSUS UNTUK GAMBAR MEME: Analisis dan jelaskan makna sarkasme, ironi, atau humor yang terkandung di dalamnya jika ada.\n"
    "- Panjang caption fleksibel: 2-3 kalimat untuk gambar biasa, lebih panjang jika ada teks/informasi penting atau sarkasme.\n"
    "- Hanya deskripsikan yang terlihat. Jangan tebak identitas/nama. Jangan awali dengan \"gambar ini menunjukkan\".\n"
    '- Output: Harus berupa JSON valid dengan format: {"caption": "isi caption disini"}'
)
```
Render + Generate
```python
import json
import re

img = Image.open("path/to/image.jpg").convert("RGB")
# Downscale in place to fit within 560x560, preserving aspect ratio.
img.thumbnail((560, 560), Image.Resampling.LANCZOS)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Buatkan caption deskriptif untuk gambar ini."},
    ]},
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = processor(text=[text], images=[[img]], return_tensors="pt").to(model.device)
input_len = inputs.input_ids.shape[1]

# Stop on either end-of-turn or end-of-text; drop IDs the tokenizer does not know.
im_end_id = processor.tokenizer.convert_tokens_to_ids("<|im_end|>")
eot_id = processor.tokenizer.convert_tokens_to_ids("<|endoftext|>")
eos_ids = list({im_end_id, eot_id} - {None, -1})

with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=256, do_sample=False, use_cache=True,
        eos_token_id=eos_ids, pad_token_id=processor.tokenizer.pad_token_id,
    )
# Decode only the newly generated tokens, not the prompt.
raw = processor.tokenizer.decode(out[0, input_len:], skip_special_tokens=True)


def extract_caption(raw: str):
    """Return (caption, status), where status is "valid", "recovered", or "invalid"."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # First attempt: the whole output is the JSON object.
    try:
        obj = json.loads(cleaned)
        if isinstance(obj, dict) and "caption" in obj:
            return obj["caption"], "valid"
    except json.JSONDecodeError:
        pass
    # Second attempt: pull a {...} span out of surrounding text.
    m = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            if isinstance(obj, dict) and "caption" in obj:
                return obj["caption"], "recovered"
        except json.JSONDecodeError:
            pass
    # Fall back to the raw text so the caller can still inspect it.
    return cleaned, "invalid"


print(extract_caption(raw))
```
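To see the three statuses of the fallback parser without loading the model, the sketch below restates `extract_caption` in compact form and runs it on three made-up output strings. The sample strings are hypothetical, not real model outputs, and the compact rewrite is intended to behave the same way as the version above.

```python
import json
import re


def extract_caption(raw: str):
    """Compact restatement of the fallback parser: (caption, status)."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # Try the whole string first, then any {...} span found inside it.
    candidates = [(cleaned, "valid")]
    match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if match:
        candidates.append((match.group(0), "recovered"))
    for text, status in candidates:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "caption" in obj:
            return obj["caption"], status
    return cleaned, "invalid"


# Hypothetical outputs: clean JSON, JSON wrapped in extra text, and no JSON at all.
samples = [
    '{"caption": "Seekor kucing oranye tidur di sofa."}',
    'Berikut hasilnya: {"caption": "Dua anak bermain bola di lapangan."}',
    "Seekor burung bertengger di dahan.",
]
for sample in samples:
    print(extract_caption(sample))
```

The three samples come back as "valid", "recovered", and "invalid" respectively. Logging the status alongside the caption makes it easy to measure how often the model drifts from the JSON format.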
Limitations
- Output is in Indonesian only.
- Runs best on GPUs with FlashAttention-2 (Ampere or newer); on a T4, use attn_implementation="sdpa".
- The model does not identify or name people in images; the training prompt tells it not to guess identities.
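The choice between FlashAttention-2 and SDPA can be made at runtime. The sketch below is one possible way to do it, assuming the decision depends only on the GPU's compute capability (Ampere starts at major version 8; a T4 is 7.5) and on whether the `flash_attn` package is importable. The helper names `pick_attn_implementation` and `flash_attn_available` are hypothetical.

```python
import importlib.util
from typing import Tuple


def pick_attn_implementation(capability: Tuple[int, int], flash_attn_installed: bool) -> str:
    """Return "flash_attention_2" only on Ampere-or-newer GPUs with flash-attn present."""
    major, _minor = capability
    if major >= 8 and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"


def flash_attn_available() -> bool:
    """True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


# With torch installed, the capability tuple would come from
# torch.cuda.get_device_capability(); fixed tuples are used here instead.
print(pick_attn_implementation((7, 5), True))  # T4, compute capability 7.5
print(pick_attn_implementation((8, 0), True))  # A100, compute capability 8.0
```

The returned string can be passed as `attn_implementation` to `from_pretrained` in place of the hard-coded value.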
License
Follows the base model: Apache 2.0.