Qwen3.5-0.8B-json-captioner
A merged, standalone image → structured-JSON captioner: Qwen/Qwen3.5-0.8B with the task_1 caption-structuring LoRA fused into the weights. It looks at an image (or an image-synthesis prompt) and emits a grounded, literal caption as JSON via an emit_caption_schema tool call. No PEFT/adapter loading required at inference — load it like any transformers model.
What it is
- Base: Qwen/Qwen3.5-0.8B (qwen3_5 architecture, ~873M params, image-text-to-text, Apache-2.0).
- Adapter: AbstractPhil/qwen3.5-0.8b-task_1-lora-v2, folded in with peft's merge_and_unload().
- Result: a single checkpoint with the base architecture, loaded with AutoModelForImageTextToText + AutoProcessor, no peft.
The merge was faithfulness-checked (base+LoRA logits vs. merged, in-memory and reloaded-from-disk) before upload.
Intended use
Turn an image into a fixed-schema caption JSON for downstream training pipelines (it was built to fill the structured-caption field of an image-caption super-dataset). It is a narrow extraction model, not a general chat or VQA model.
Training
Two-stage curriculum (qwen_lora_train_v2.py)
The v2 adapter was trained via a two-stage curriculum, warm-started from the v1 LoRA (AbstractPhil/qwen3.5-0.8b-task_1-lora, which was trained on the Claude gold set alone).
Stage 1 — Bulk pretraining on ~50,000 grounded rows from AbstractPhil/cc-task1-json (Qwen-generated Conceptual Captions conversions, filtered to grounded==True). High volume, ~99%-clean but 0.8B-quality. 1 epoch.
Stage 2 — Refinement on ~20,505 Claude Sonnet 4.6 gold extractions from AbstractPhil/json-coco-format, config task_1. These are higher-fidelity, more robust tool-call examples produced by the ClaudeProvider (strict prompt mode, forced emit_caption_schema tool choice, filtered to grounding_rate==1.0). 2 epochs.
The hypothesis: v1 may have been quality-capped by the small 20K Claude set; bulk CC data broadens it, and the gold refinement stage re-anchors. Three checkpoints exist for comparison: v1 (Claude only) → v2-stage1 (+ 50K CC) → v2 (CC then Claude refine).
Data format
Each training example is a tool-calling conversation: system prompt (caption-structuring assistant, grounded literal extraction), user turn (a plain text caption, e.g. "A long restaurant table with rattan rounded back chairs."), and assistant turn emitting an emit_caption_schema tool call with five fields: subjects, actions, setting, style, mood. Training was text caption → structured JSON — not on images.
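To make the record layout concrete, here is a minimal sketch of what one such conversation might look like, plus a check that a tool call carries exactly the five fields. Only the user caption is quoted from this card. The message layout (OpenAI-style `tool_calls`), the placeholder system text, the field value types, and the helper `check_fields` are all assumptions for illustration; the real rows live in AbstractPhil/json-coco-format.

```python
# The five fields this card names for the emit_caption_schema tool call.
SCHEMA_FIELDS = ("subjects", "actions", "setting", "style", "mood")

# Hypothetical record: only the user caption comes from this card.
# The system text, the layout, and every argument value are placeholders,
# not rows copied from the dataset.
example = {
    "messages": [
        {"role": "system", "content": "<caption-structuring system prompt>"},
        {"role": "user",
         "content": "A long restaurant table with rattan rounded back chairs."},
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "emit_caption_schema",
                "arguments": {
                    "subjects": ["restaurant table", "rattan chairs"],
                    "actions": [],
                    "setting": "restaurant",
                    "style": "",
                    "mood": "",
                },
            },
        }]},
    ],
}


def check_fields(arguments: dict) -> list[str]:
    """Return a list of problems: missing or unexpected top-level keys."""
    missing = [f"missing: {k}" for k in SCHEMA_FIELDS if k not in arguments]
    extra = [f"unexpected: {k}" for k in arguments if k not in SCHEMA_FIELDS]
    return missing + extra


call = example["messages"][-1]["tool_calls"][0]["function"]
problems = check_fields(call["arguments"])
print(call["name"], problems)
```

A check like this is cheap to run on generated output before it enters a downstream pipeline, since the schema is fixed.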
LoRA config (from adapter_config.json)
| parameter | value |
|---|---|
| rank r | 32 |
| lora_alpha | 64 |
| lora_dropout | 0.05 |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| bias | none |
| task_type | CAUSAL_LM |
| rsLoRA / DoRA | off |
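As a rough illustration of what these values imply: with rsLoRA off, standard LoRA scales the low-rank update by lora_alpha / r, which here is 64 / 32 = 2.0. The sketch below restates the table as a plain dict and computes that factor. The key names `use_rslora` and `use_dora` follow peft's `LoraConfig`; the helper names and the 1024-wide layer are hypothetical and are not the model's real dimensions.

```python
import math

# The table above as a plain dict (not loaded from adapter_config.json).
lora_config = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "use_rslora": False,
    "use_dora": False,
}


def lora_scaling(cfg: dict) -> float:
    """Standard LoRA multiplies B @ A by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    if cfg["use_rslora"]:
        return cfg["lora_alpha"] / math.sqrt(cfg["r"])
    return cfg["lora_alpha"] / cfg["r"]


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


print(lora_scaling(lora_config))
# Hypothetical square 1024-wide projection, for illustration only.
print(lora_params(1024, 1024, lora_config["r"]))
```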
Training hyperparameters (from qwen_lora_train_v2.py)
| parameter | value |
|---|---|
| trainer | transformers.Trainer |
| optimizer | AdamW (default) |
| LR (both stages) | 1e-4 (below v1's 2e-4 — continuing a trained adapter) |
| LR schedule | cosine with 3% warmup |
| batch size | 16 |
| gradient accumulation | 1 (effective batch = 16) |
| precision | bf16 |
| max sequence length | 2048 |
| label masking | -100 over system+user prefix; loss on assistant tokens only |
| seed | 42 |
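Two rows of this table are easier to read as code: the cosine schedule with 3% warmup and the -100 label mask. The sketch below is a plain-Python approximation, not the training script. It assumes the usual linear-warmup-then-cosine shape, a warmup length rounded up from 3% of the total steps, and a stage-1 step count derived from the approximate 50,000 rows at batch size 16. The function names are hypothetical.

```python
import math

BASE_LR = 1e-4        # LR used for both stages
WARMUP_RATIO = 0.03   # 3% warmup
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    warmup = math.ceil(total_steps * WARMUP_RATIO)
    if step < warmup:
        return BASE_LR * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def mask_prefix(input_ids: list[int], prefix_len: int) -> list[int]:
    """Copy input_ids to labels, hiding the system+user prefix from the loss."""
    n = min(prefix_len, len(input_ids))
    return [IGNORE_INDEX] * n + input_ids[n:]


# Illustrative stage-1 step count: ~50,000 rows, batch 16, 1 epoch.
total_steps = math.ceil(50_000 / 16)
print(total_steps, lr_at(0, total_steps), lr_at(total_steps // 2, total_steps))
print(mask_prefix([11, 12, 13, 14, 15], prefix_len=3))
```

With the mask applied, only the assistant's tool-call tokens contribute gradient, so the adapter learns to produce the structured output and not to reproduce the prompt.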
The adapter modifies only the language-model projections; the base's vision encoder is untouched. That is why, although training was text-only, the merged model also does image → JSON at inference: feed an image and the vision-conditioned generation inherits the same tool-call structuring behavior.
Important: the task scaffold is not baked into the weights
The system prompt and the tools definition the LoRA was trained against live in the dataset AbstractPhil/json-coco-format (config task_1), not in the model. For the structured output this model is tuned for, apply that same system prompt + tools at inference (shown below). Without them the model still runs, but you lose the schema grounding.
Usage
import json, torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor, AutoModelForImageTextToText

REPO = "AbstractPhil/Qwen3.5-0.8B-json-captioner"
processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda").eval()
processor.tokenizer.padding_side = "left"
if processor.tokenizer.pad_token_id is None:
    processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id

# Task scaffold (system prompt + tools). Read the JSONL directly: the dataset card
# declares a 'Json' feature type that datasets>=4.0 rejects, so load_dataset() fails
# ("Feature type 'Json' not found"); hf_hub_download + json.loads(first line) is robust.
_p = hf_hub_download("AbstractPhil/json-coco-format", "data/task_1.jsonl", repo_type="dataset")
with open(_p, encoding="utf-8") as f:
    scaffold = json.loads(f.readline())
SYSTEM_PROMPT = scaffold["messages"][0]["content"]
TOOLS = scaffold["tools"]

image = Image.open("example.jpg").convert("RGB")
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract the structured representation of what this image shows."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", enable_thinking=False).to(model.device)
out = model.generate(
    **inputs, max_new_tokens=768, do_sample=False,
    pad_token_id=processor.tokenizer.pad_token_id,
    stop_strings=["</tool_call>"], tokenizer=processor.tokenizer)
text = processor.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(text)  # -> <tool_call><function=emit_caption_schema><parameter=...>...</tool_call>
The continuation is a Qwen tool call; parse the <function=...><parameter=...> block into a dict to get the caption JSON. Text-only input (an image-synthesis prompt instead of an image) works too — pass the prompt as the user text and drop the image content block.
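One possible way to do that parsing is sketched below with the standard library only. It assumes the continuation follows the `<function=NAME>` / `<parameter=KEY>VALUE</parameter>` layout suggested by the comment in the usage snippet, and that values are either JSON literals or plain strings. The function name `parse_tool_call` and the sample values are hypothetical, not output from the model.

```python
import json
import re

FUNCTION_RE = re.compile(r"<function=([^>\s]+)>")
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_call(text: str) -> dict:
    """Extract the function name and its parameters from a tool-call string."""
    fn = FUNCTION_RE.search(text)
    if fn is None:
        raise ValueError("no <function=...> block found")
    arguments = {}
    for name, raw in PARAM_RE.findall(text):
        raw = raw.strip()
        try:
            # Lists, numbers, and quoted strings decode as JSON.
            arguments[name] = json.loads(raw)
        except json.JSONDecodeError:
            # Anything else is kept as the raw string.
            arguments[name] = raw
    return {"name": fn.group(1), "arguments": arguments}


# Hypothetical continuation, shaped like the comment in the usage snippet.
sample = (
    "<tool_call>\n<function=emit_caption_schema>\n"
    "<parameter=subjects>\n[\"restaurant table\", \"rattan chairs\"]\n</parameter>\n"
    "<parameter=setting>\nrestaurant\n</parameter>\n"
    "</function>\n</tool_call>"
)
parsed = parse_tool_call(sample)
print(json.dumps(parsed["arguments"], indent=2))
```

In the usage snippet, `parse_tool_call(text)["arguments"]` would then be the caption dict. Raising on a missing function block makes truncated generations (for example, when max_new_tokens is hit) fail loudly.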
Notes
- Precision: bfloat16 is recommended (the merge was done in bf16).
- Attention backend: sdpa is correct on Blackwell (sm_120) and Turing (sm_75), where flash-attn kernels don't run. On Ampere/Ada/Hopper (sm_80/86/89/90) you can pass attn_implementation="flash_attention_2" if flash-attn is installed, for a faster prefill.
- Decoding: deterministic (do_sample=False) with stop_strings=["</tool_call>"] to halt once the tool call closes.
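The attention-backend note can be turned into a small selection helper. The sketch below encodes only the compute capabilities listed above and falls back to sdpa for everything else. The helper names are hypothetical, and whether flash-attn actually works on a given machine still depends on the installed build.

```python
import importlib.util

# Compute capabilities this card lists as flash-attn capable:
# sm_80, sm_86 (Ampere), sm_89 (Ada), sm_90 (Hopper).
FLASH_ATTN_CAPABILITIES = {(8, 0), (8, 6), (8, 9), (9, 0)}


def flash_attn_installed() -> bool:
    """True if a module named flash_attn can be found, without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(capability: tuple, has_flash_attn: bool) -> str:
    """Return the attn_implementation string for a (major, minor) capability."""
    if has_flash_attn and tuple(capability) in FLASH_ATTN_CAPABILITIES:
        return "flash_attention_2"
    return "sdpa"  # safe default, including Blackwell (12, 0) and Turing (7, 5)


print(pick_attn_implementation((8, 9), has_flash_attn=True))
print(pick_attn_implementation((12, 0), has_flash_attn=True))
```

With PyTorch available, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in, and the result can be forwarded as `attn_implementation=...` to `from_pretrained`.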
Provenance
Produced by merging the LoRA into the base via merge_and_unload(safe_merge=True), then save_pretrained (weights + config) and processor.save_pretrained (image processor + tokenizer + chat template). Qwen/Qwen3.5-0.8B is a standard transformers architecture, so the repo is self-contained — no custom remote code.
License
Apache-2.0, inherited from the base model Qwen/Qwen3.5-0.8B.
Limitations
- Small (0.8B): extraction quality is bounded by the task_1 LoRA's training; it is not a general-purpose captioner or chat model.
- Image → JSON is a transfer capability. The adapter was trained on text caption → JSON, so image grounding rides on the base VLM's vision encoder plus the LoRA's structuring behavior; it was not directly trained on image inputs. Expect text → JSON to be its strongest mode.
- The output schema is fixed by the emit_caption_schema tool (subjects, actions, setting, style, mood); anything outside that schema is out of scope.
- Tuned toward grounded, literal extraction; it is not designed for creative or interpretive captions.