Qwen3.5-4B ImCap — Indonesian Image Captioning (LoRA merged)
This model is a fine-tune (LoRA, already merged into the base weights) of
Qwen/Qwen3.5-4B-Base for Indonesian image captioning.
Overview
| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B-Base |
| Fine-tuning method | LoRA (rank/alpha as set in the SFT config) |
| Adapter status | Merged into the base weights |
| Output language | Indonesian 🇮🇩 |
| Output format | JSON: {"caption": "..."} |
| enable_thinking during training | False |
| EOS token | <|im_end|> (+ <|endoftext|>) |
| Precision | bfloat16 |
Usage (Inference)
Minimal (bfloat16, GPU)
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

REPO = "Adicandra/Qwen3.5-4B-ImageCaptioning-LoRA"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
System Prompt (MUST match the prompt used in training)
SYSTEM_PROMPT = (
"Annotator dataset image captioning. Tulis caption Bahasa Indonesia yang deskriptif.\n\n"
"Aturan:\n"
"- Deskripsikan subjek utama, detail visual (warna, posisi, atribut), dan latar belakang.\n"
"- Jika gambar mengandung teks penting (meme, infografis, berita, poster), sertakan isi teksnya dalam caption.\n"
"- KHUSUS UNTUK GAMBAR MEME: Analisis dan jelaskan makna sarkasme, ironi, atau humor yang terkandung di dalamnya jika ada.\n"
"- Panjang caption fleksibel: 2-3 kalimat untuk gambar biasa, lebih panjang jika ada teks/informasi penting atau sarkasme.\n"
"- Hanya deskripsikan yang terlihat (serta konteks humor/sarkasme jika itu meme). Jangan tebak identitas/nama. Jangan awali dengan \"gambar ini menunjukkan\".\n"
'- Output: Harus berupa JSON valid dengan format: {"caption": "isi caption disini"}'
)
Render Chat + Generate
import json, re
USER_PROMPT = "Buatkan caption deskriptif untuk gambar ini."
MAX_IMAGE_SIZE = (560, 560)
# ── Load & resize the image ───────────────────────────────────────────────────
img = Image.open("path/to/image.jpg").convert("RGB")
img.thumbnail(MAX_IMAGE_SIZE, Image.Resampling.LANCZOS)
# ── Build the messages ────────────────────────────────────────────────────────
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": USER_PROMPT},
    ]},
]
# ── Render the template (enable_thinking=False is required) ──────────────────
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
# ── Tokenize ──────────────────────────────────────────────────────────────────
inputs = processor(text=[text], images=[[img]], return_tensors="pt").to(model.device)
input_len = inputs.input_ids.shape[1]
# ── EOS tokens ───────────────────────────────────────────────────────────────
im_end_id = processor.tokenizer.convert_tokens_to_ids("<|im_end|>")
eot_id = processor.tokenizer.convert_tokens_to_ids("<|endoftext|>")
# Drop ids for tokens the tokenizer does not know (returned as None or a negative id).
eos_ids = [i for i in {im_end_id, eot_id} if i is not None and i >= 0]
# ── Generate: greedy (deterministic) ─────────────────────────────────────────
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True,
        eos_token_id=eos_ids,
        pad_token_id=processor.tokenizer.pad_token_id,
    )
# Decode only the newly generated tokens, skipping the prompt.
raw = processor.tokenizer.decode(out[0, input_len:], skip_special_tokens=True)
# ── Parse the JSON output ─────────────────────────────────────────────────────
def extract_caption(raw: str):
    # Strip any <think>...</think> block before parsing.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # First attempt: the whole output is a JSON object.
    try:
        obj = json.loads(cleaned)
        if isinstance(obj, dict) and "caption" in obj:
            return obj["caption"], "valid"
    except Exception:
        pass
    # Second attempt: recover the outermost {...} span from surrounding text.
    m = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            if isinstance(obj, dict) and "caption" in obj:
                return obj["caption"], "recovered"
        except Exception:
            pass
    # Nothing parseable: return the cleaned text as-is.
    return cleaned, "invalid"

caption, status = extract_caption(raw)
print(f"[{status}] {caption}")
Generate with Sampling
# Pass the following arguments to model.generate() to enable sampling:
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
        min_p=0.0,
        repetition_penalty=1.2,
        use_cache=True,
        eos_token_id=eos_ids,
        pad_token_id=processor.tokenizer.pad_token_id,
    )
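The greedy and sampling calls differ only in their decoding arguments, so one way to keep them in sync is to hold the shared settings in a dictionary and derive the sampling variant from it. The sketch below reuses the values from the two calls above; the names `GREEDY_KWARGS`, `SAMPLING_KWARGS`, and `generation_kwargs` are illustrative and not part of this repo.

```python
# Decoding settings shared by both modes (values taken from the calls above).
GREEDY_KWARGS = {
    "max_new_tokens": 256,
    "do_sample": False,
    "use_cache": True,
}

# Sampling reuses the shared settings and overrides the decoding strategy.
SAMPLING_KWARGS = {
    **GREEDY_KWARGS,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "repetition_penalty": 1.2,
}


def generation_kwargs(sample: bool = False) -> dict:
    """Return a fresh copy of the decoding arguments for the chosen mode."""
    return dict(SAMPLING_KWARGS if sample else GREEDY_KWARGS)
```

With this in place, a call could look like `model.generate(**inputs, **generation_kwargs(sample=True), eos_token_id=eos_ids, pad_token_id=processor.tokenizer.pad_token_id)`.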
Output Format
The model is trained to return a valid JSON object with a single caption key:
{"caption": "Seekor kucing oranye sedang duduk di atas meja kayu berwarna cokelat, menatap ke arah kanan frame dengan mata setengah terpejam. Latar belakang berupa dinding putih yang sedikit buram."}
If the image contains text (memes, infographics, posters), that text is included in the caption, along with the humorous or sarcastic context where relevant.
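The `extract_caption()` helper accepts any JSON object that contains a `caption` key. For dataset work, a stricter check on top of it can be useful. The following is a minimal sketch, assuming the only acceptable shape is an object with exactly one key, `caption`, holding a non-empty string; `is_strict_caption_json` is a hypothetical helper, not part of this repo.

```python
import json


def is_strict_caption_json(raw: str) -> bool:
    """Return True only for a JSON object of the form {"caption": "<non-empty text>"}."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    # Reject non-objects and objects with missing or extra keys.
    if not isinstance(obj, dict) or set(obj) != {"caption"}:
        return False
    caption = obj["caption"]
    return isinstance(caption, str) and bool(caption.strip())


print(is_strict_caption_json('{"caption": "Seekor kucing oranye duduk di atas meja kayu."}'))
print(is_strict_caption_json('{"caption": ""}'))
print(is_strict_caption_json("Caption: seekor kucing"))
```

Outputs that fail this check can be logged or regenerated, depending on how strict the downstream pipeline needs to be.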
Diagnostics: Check Whether <think> Appears
The model was trained with enable_thinking=False. Always pass that argument to apply_chat_template. If <think> tokens still show up in the output:
- Make sure enable_thinking=False is passed to apply_chat_template.
- Use suppress_tokens=[think_id] in generate() as a fallback, where think_id is the token id of <think>.
- The extract_caption() function above already strips <think>...</think> blocks automatically.
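The `suppress_tokens` fallback mentioned above can be sketched as follows. It assumes the tokenizer exposes `<think>` as a single token and that `convert_tokens_to_ids` returns `None` or the unknown-token id when the token is missing; `think_suppress_kwargs` is a hypothetical helper, not part of this repo.

```python
def think_suppress_kwargs(tokenizer) -> dict:
    """Build extra generate() kwargs that block the <think> token, if the vocab has one."""
    think_id = tokenizer.convert_tokens_to_ids("<think>")
    unk_id = getattr(tokenizer, "unk_token_id", None)
    # Skip suppression when the token is unknown to this tokenizer.
    if think_id is None or think_id == unk_id or think_id < 0:
        return {}
    return {"suppress_tokens": [think_id]}
```

The result can be unpacked into the existing call, for example `model.generate(**inputs, **think_suppress_kwargs(processor.tokenizer), max_new_tokens=256, eos_token_id=eos_ids)`.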
Limitations & Notes
- The model only produces Indonesian captions; it is not designed for other languages.
- For high-resolution input (> 560×560), the model still works, but performance is best on 560×560 thumbnails.
- Do not guess the identity or name of people from an image, in line with the system prompt rules.
- Formal evaluation (CIDEr, BLEU, METEOR) is not yet available; performance has only been assessed qualitatively.
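To make the resolution note concrete, the sketch below estimates the size an image ends up with after being shrunk to fit a 560×560 box while keeping its aspect ratio, which is what `Image.thumbnail` does in the inference code. `fit_within` is a hypothetical helper written for illustration; Pillow applies its own rounding, so results may differ by a pixel in edge cases.

```python
def fit_within(width: int, height: int, max_side: int = 560) -> tuple[int, int]:
    """Scale (width, height) down to fit a max_side x max_side box, keeping aspect ratio."""
    # Never upscale: images already inside the box are left unchanged.
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1920, 1080))  # a landscape photo is shrunk to fit the box
print(fit_within(400, 300))    # a small image is left as-is
```

The longer side is capped at 560 pixels and the shorter side follows from the aspect ratio, so wide or tall images use only part of the 560×560 budget.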
License
Follows the base model's license: Apache 2.0.
Citation
If you use this model in research, please cite the Qwen3.5 base model and mention this repo as a fine-tune for Indonesian image captioning.