Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models


Jagle-VL-2.2B-Jagle is a 2.2B-parameter vision-language model trained on the Jagle dataset.

Model Architecture

The model architecture is inspired by InternVL3.0 and consists of a language model, a vision encoder, and a lightweight projector.
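
As a rough illustration of how the three components connect, the sketch below traces tensor shapes only. It is a simplification under stated assumptions, not the model's implementation: the class name `ToyVLMShapes` and every dimension in it are hypothetical and are not taken from the model's configuration, and the toy projector keeps the number of visual tokens unchanged, which the real projector may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyVLMShapes:
    """Hypothetical sizes, chosen for illustration only (not the real config)."""

    patches_per_image: int = 256  # visual tokens produced per image (assumed)
    vision_dim: int = 1024        # vision encoder hidden size (assumed)
    lm_dim: int = 2048            # language model embedding size (assumed)

    def vision_encoder(self, num_images: int) -> tuple[int, int]:
        # The vision encoder turns each image into a sequence of patch features.
        return (num_images * self.patches_per_image, self.vision_dim)

    def projector(self, vision_shape: tuple[int, int]) -> tuple[int, int]:
        # The toy projector keeps the token count and maps features to the LM width.
        num_tokens, dim = vision_shape
        if dim != self.vision_dim:
            raise ValueError("projector input must have vision_dim features")
        return (num_tokens, self.lm_dim)

    def language_model_input(self, num_images: int, num_text_tokens: int) -> tuple[int, int]:
        # Projected visual tokens and text embeddings share one input sequence.
        visual_tokens, _ = self.projector(self.vision_encoder(num_images))
        return (visual_tokens + num_text_tokens, self.lm_dim)


shapes = ToyVLMShapes()
print(shapes.language_model_input(num_images=2, num_text_tokens=20))
```

The point of the sketch is only that the projector is the piece that makes the vision encoder's output compatible with the language model's embedding width, so that image and text tokens can be processed as one sequence.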

Training Data

This model is trained on the Jagle dataset.

Usage

Install the requirements with uv:

uv add "torch==2.8.0" "transformers==4.57.0" "flash-attn==2.8.3" "pillow==11.3.0"
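
To confirm that the environment matches the pinned versions, a small check along the following lines may help. This is a sketch, not part of the official repository: `PINNED` copies the versions from the install command above, `installed_version` and `find_mismatches` are hypothetical helper names, and the lookup relies on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Versions copied from the install command above.
PINNED = {
    "torch": "2.8.0",
    "transformers": "4.57.0",
    "flash-attn": "2.8.3",
    "pillow": "11.3.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Ignore local build tags such as "+cu128" when comparing versions.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


# Report every package that is missing or installed at a different version.
for package, (expected, found) in find_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {found}")
```

If the loop prints nothing, every pinned package was found at the expected version.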

The following sample code loads the model and runs text-only, single-image, multi-image, and multi-turn inference.

import torch
from transformers import AutoProcessor, AutoModel

model_id = "llm-jp/Jagle-VL-2.2B-Jagle"

# load model
model = (
    AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        use_flash_attn=True,
    )
    .eval()
    .cuda()
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)


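def generate(messages, max_new_tokens=256, temperature=0.0):
    """Generate a reply for a chat; temperature=0.0 selects greedy decoding."""
    # Render the chat template and tokenize the text and images in one step.
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # Cast image tensors to the model's dtype (bfloat16 here).
    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(dtype=model.dtype)

    # Sample only when temperature > 0; otherwise leave temperature unset.
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else None,
    )

    # Decode with special tokens kept, then strip the response markers by hand.
    text = processor.decode(outputs[0], skip_special_tokens=False)
    text = text.replace("<|channel|>final<|message|>", "")
    text = text.replace("<|return|>", "")
    text = text.replace(processor.tokenizer.eos_token, "")
    return text.strip()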


# -----------------------
# 1. Text-only
# -----------------------
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "富士山について簡潔に説明してください。"}],
    }
]
print(generate(messages))
# 富士山は、日本最高峰の山で、標高3,776メートルです。静岡県と山梨県にまたがっており、世界遺産にも登録されています。

# -----------------------
# 2. Single image
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/kaonashi.jpg"},
            {"type": "text", "text": "このキャラクターの名前は何ですか?"},
        ],
    }
]
print(generate(messages))
# カオナシ

# -----------------------
# 3. Multi-image
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/Shiba_inu.jpg"},
            {"type": "image", "image": "assets/yesoensis.jpg"},
            {"type": "text", "text": "それぞれの動物の名前を教えてください。"},
        ],
    }
]
print(generate(messages))
# 柴犬と鹿です。

# -----------------------
# 4. Multi-turn example
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/kaonashi.jpg"},
            {
                "type": "text",
                "text": "このキャラクターが登場する映画のタイトルは何ですか?",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "千と千尋の神隠し"}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "監督は誰ですか?"}],
    },
]

print(generate(messages))
# 宮崎駿
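
The four examples share one message layout: each turn is a dict with a `role` and a `content` list of typed parts, with image parts placed before the text part. A minimal sketch of a helper that builds such turns is shown below. `make_turn` is a hypothetical name and is not part of the model's processor API, and it accepts only the two roles that appear in the examples.

```python
def make_turn(role, text, image_paths=()):
    """Build one chat turn in the layout used by the sample code."""
    # Only the roles seen in the examples are accepted (an assumption).
    if role not in ("user", "assistant"):
        raise ValueError(f"unsupported role: {role!r}")
    # Image parts come first, followed by a single text part.
    content = [{"type": "image", "image": path} for path in image_paths]
    content.append({"type": "text", "text": text})
    return {"role": role, "content": content}


# Rebuild the multi-turn example from the sample code.
messages = [
    make_turn(
        "user",
        "このキャラクターが登場する映画のタイトルは何ですか?",
        ["assets/kaonashi.jpg"],
    ),
    make_turn("assistant", "千と千尋の神隠し"),
    make_turn("user", "監督は誰ですか?"),
]
```

The resulting `messages` list has the same structure as the hand-written one in the multi-turn example, so it can be passed to `generate(messages)` in the same way.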

For more details, please refer to the official GitHub repository: https://github.com/llm-jp/llm-jp-4-vl

License

Apache License 2.0

Citation

@misc{sugiura2026jaglebuildinglargescalejapanese,
      title={Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models}, 
      author={Issa Sugiura and Keito Sasagawa and Keisuke Nakao and Koki Maeda and Ziqi Yin and Zhishen Yang and Shuhei Kurita and Yusuke Oda and Ryoko Tokuhisa and Daisuke Kawahara and Naoaki Okazaki},
      year={2026},
      eprint={2604.02048},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.02048}, 
}