Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models

Jagle-VL-2.2B-Jagle-FineVision is a 2.2B-parameter vision-language model trained on the Jagle and FineVision datasets.

Model Architecture

The model architecture is inspired by InternVL3.0 and consists of a language model, a vision encoder, and a lightweight projector.

LLM: Qwen/Qwen3-1.7B (1.7B)
Vision Encoder: google/siglip2-so400m-patch16-512 (0.4B)
Projector: 2-layer MLP
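
To give a feel for how lightweight a 2-layer MLP projector is next to the 1.7B language model and the 0.4B vision encoder, the sketch below counts its parameters. The layer widths are assumptions chosen for illustration (1152 for the vision features, 2048 for the language model) and are not read from this model's config. The actual projector may also differ in hidden width, bias terms, normalization, or token merging.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def mlp2_params(d_in, d_hidden, d_out, bias=True):
    """Parameter count of a 2-layer MLP: d_in -> d_hidden -> d_out."""
    return linear_params(d_in, d_hidden, bias) + linear_params(d_hidden, d_out, bias)


# Hypothetical widths, assumed for illustration only.
VISION_WIDTH = 1152  # assumed vision-encoder feature size
LLM_WIDTH = 2048     # assumed language-model hidden size

n = mlp2_params(VISION_WIDTH, LLM_WIDTH, LLM_WIDTH)
print(f"{n:,} parameters (~{n / 1e6:.1f}M)")
# 6,557,696 parameters (~6.6M)
```

Under these assumed widths the projector is several million parameters, a small fraction of either of the other two components.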

Training Data

This model is trained on the Jagle and FineVision datasets.

Usage

Install the requirements:

uv add "torch==2.8.0" "transformers==4.57.0" "flash-attn==2.8.3" "pillow==11.3.0"
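
After installing, it can help to confirm that the environment actually matches the pinned versions. The following is a minimal sketch using only the standard library. The helper name `check_pins` is hypothetical, and treating a local build tag such as `+cu128` as a match is an assumption of this sketch, not a requirement stated by the model authors.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above.
PINS = {
    "torch": "2.8.0",
    "transformers": "4.57.0",
    "flash-attn": "2.8.3",
    "pillow": "11.3.0",
}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # Ignore a local build tag (the part after "+") when comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (found, matches)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_pins(PINS).items():
        print(f"{name}: installed={found} pinned={PINS[name]} ok={ok}")
```

A package that is missing is reported as `installed=None ok=False` rather than raising an error.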

The sample code below runs the model on text-only, single-image, multi-image, and multi-turn inputs.

import torch
from transformers import AutoProcessor, AutoModel

model_id = "llm-jp/Jagle-VL-2.2B-Jagle-FineVision"

# load model
model = (
    AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        use_flash_attn=True,
    )
    .eval()
    .cuda()
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)


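def generate(messages, max_new_tokens=256, temperature=0.0):
    """Run one chat-formatted request and return the cleaned response text."""
    # Apply the chat template and tokenize the conversation.
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # Cast image tensors to the model's dtype (bfloat16 here).
    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(dtype=model.dtype)

    # temperature == 0 means greedy decoding; otherwise sample at that temperature.
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else None,
    )

    # Decode with special tokens kept, then strip the response markers and EOS token.
    text = processor.decode(outputs[0], skip_special_tokens=False)
    text = text.replace("<|channel|>final<|message|>", "")
    text = text.replace("<|return|>", "")
    text = text.replace(processor.tokenizer.eos_token, "")
    return text.strip()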


# -----------------------
# 1. Text-only
# -----------------------
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "富士山について簡潔に説明してください。"}],
    }
]
print(generate(messages))
# 富士山は、日本最高峰の山で、標高3,776メートルです。静岡県と山梨県にまたがっており、世界遺産にも登録されています。

# -----------------------
# 2. Single image
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/kaonashi.jpg"},
            {"type": "text", "text": "このキャラクターの名前は何ですか？"},
        ],
    }
]
print(generate(messages))
# カオナシ

# -----------------------
# 3. Multi-image
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/Shiba_inu.jpg"},
            {"type": "image", "image": "assets/yesoensis.jpg"},
            {"type": "text", "text": "それぞれの動物の名前を教えてください。"},
        ],
    }
]
print(generate(messages))
# 柴犬と鹿です。

# -----------------------
# 4. Multi-turn example
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/kaonashi.jpg"},
            {
                "type": "text",
                "text": "このキャラクターが登場する映画のタイトルは何ですか？",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "千と千尋の神隠し"}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "監督は誰ですか？"}],
    },
]

print(generate(messages))
# 宮崎駿
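
The four examples above share one message layout: each turn has a role and a list of content entries, with image entries placed before the text entry. As a convenience, that layout could be wrapped in small helpers. The names `user_message` and `assistant_message` below are hypothetical and not part of the model's or the library's API; the sketch only builds the same dictionaries the examples write out by hand.

```python
def user_message(text, image_paths=()):
    """Build one user turn: image entries first, then the text entry."""
    content = [{"type": "image", "image": str(path)} for path in image_paths]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_message(text):
    """Build one assistant turn holding a single text entry."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


# Rebuilds the multi-turn example from the sample code.
messages = [
    user_message(
        "このキャラクターが登場する映画のタイトルは何ですか？",
        ["assets/kaonashi.jpg"],
    ),
    assistant_message("千と千尋の神隠し"),
    user_message("監督は誰ですか？"),
]
```

The resulting `messages` list can be passed to the `generate` function defined in the sample code.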

For more details, please refer to the official GitHub repository: https://github.com/llm-jp/llm-jp-4-vl

License

Apache License 2.0

FineVision, which is used to train this model, is a curated dataset aggregated from multiple existing datasets.
Some portions include data derived from outputs generated by proprietary models (e.g., OpenAI, Anthropic, and other closed-source systems). Users must comply with the applicable terms of use of those models when using this model.

Citation

@misc{sugiura2026jaglebuildinglargescalejapanese,
      title={Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models}, 
      author={Issa Sugiura and Keito Sasagawa and Keisuke Nakao and Koki Maeda and Ziqi Yin and Zhishen Yang and Shuhei Kurita and Yusuke Oda and Ryoko Tokuhisa and Daisuke Kawahara and Naoaki Okazaki},
      year={2026},
      eprint={2604.02048},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.02048}, 
}
