Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models
Jagle-VL-2.2B-Jagle-FineVision is a 2.2B-parameter vision-language model trained on the Jagle and FineVision datasets.
Model Architecture
The model architecture is inspired by InternVL3.0 and consists of a language model, a vision encoder, and a lightweight projector.
- LLM: Qwen/Qwen3-1.7B (1.7B)
- Vision Encoder: google/siglip2-so400m-patch16-512 (0.4B)
- Projector: 2-layer MLP
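As a rough illustration of what the vision encoder hands to the projector, the arithmetic below counts the patches in one image. The patch size (16) and input resolution (512) are read off the encoder name `google/siglip2-so400m-patch16-512`; the helper `patches_per_image` is a hypothetical name used only for this sketch. Whether the model reduces the number of visual tokens before they reach the LLM is not stated here, so this is the encoder-side count only.

```python
def patches_per_image(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Values taken from the encoder name "siglip2-so400m-patch16-512".
print(patches_per_image(image_size=512, patch_size=16))  # 1024
```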
Training Data
This model is trained on the Jagle and FineVision datasets.
Usage
Install requirements.
uv add "torch==2.8.0" "transformers==4.57.0" "flash-attn==2.8.3" "pillow==11.3.0"
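Because the versions above are pinned exactly, it can help to confirm what is actually installed before debugging anything else. The following is a minimal sketch using only the standard library; `PINNED` and `check_pins` are hypothetical names introduced for this example, and the comparison deliberately ignores local build tags such as `+cu128`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above.
PINNED = {
    "torch": "2.8.0",
    "transformers": "4.57.0",
    "flash-attn": "2.8.3",
    "pillow": "11.3.0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Drop any local build tag (the part after "+") before comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `check_pins(PINNED)` returns an empty dict when every pinned package is present at the expected version.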
Below is sample code for running the model. It assumes a CUDA GPU, since it calls .cuda() and enables FlashAttention.
import torch
from transformers import AutoProcessor, AutoModel

model_id = "llm-jp/Jagle-VL-2.2B-Jagle-FineVision"

# load model
model = (
    AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        use_flash_attn=True,
    )
    .eval()
    .cuda()
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)


def generate(messages, max_new_tokens=256, temperature=0.0):
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    # match the image tensor dtype to the model weights (bfloat16 here)
    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(dtype=model.dtype)
    # greedy decoding when temperature is 0, sampling otherwise
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else None,
    )
    # decode with special tokens kept, then strip the known markers by hand
    text = processor.decode(outputs[0], skip_special_tokens=False)
    text = text.replace("<|channel|>final<|message|>", "")
    text = text.replace("<|return|>", "")
    text = text.replace(processor.tokenizer.eos_token, "")
    return text.strip()

# -----------------------
# 1. Text-only
# -----------------------
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "富士山について簡潔に説明してください。"}],
    }
]
print(generate(messages))
# 富士山は、日本最高峰の山で、標高3,776メートルです。静岡県と山梨県にまたがっており、世界遺産にも登録されています。
# -----------------------
# 2. Single image
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/kaonashi.jpg"},
            {"type": "text", "text": "このキャラクターの名前は何ですか?"},
        ],
    }
]
print(generate(messages))
# カオナシ
# -----------------------
# 3. Multi-image
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/Shiba_inu.jpg"},
            {"type": "image", "image": "assets/yesoensis.jpg"},
            {"type": "text", "text": "それぞれの動物の名前を教えてください。"},
        ],
    }
]
print(generate(messages))
# 柴犬と鹿です。
# -----------------------
# 4. Multi-turn example
# -----------------------
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/kaonashi.jpg"},
            {
                "type": "text",
                "text": "このキャラクターが登場する映画のタイトルは何ですか?",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "千と千尋の神隠し"}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "監督は誰ですか?"}],
    },
]
print(generate(messages))
# 宮崎駿
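
The four examples above share one message layout: each turn is a dict with a `role` and a `content` list, where image entries come before the text entry. The helpers below are a small sketch for building that layout without repeating the nested literals. `user_message` and `assistant_message` are hypothetical names introduced here, not part of the model's API, and they only construct plain Python dicts.

```python
def user_message(text, image_paths=()):
    """Build a user turn: zero or more image entries followed by one text entry."""
    content = [{"type": "image", "image": path} for path in image_paths]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_message(text):
    """Build an assistant turn, used to replay earlier answers in multi-turn chats."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


# Rebuilds the multi-turn example above.
messages = [
    user_message(
        "このキャラクターが登場する映画のタイトルは何ですか?",
        image_paths=["assets/kaonashi.jpg"],
    ),
    assistant_message("千と千尋の神隠し"),
    user_message("監督は誰ですか?"),
]
```

The resulting `messages` list can be passed to the `generate` function defined in the sample code.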
For more details, please refer to the official GitHub repository: https://github.com/llm-jp/llm-jp-4-vl
LICENSE
Apache License 2.0
FineVision, which is used to train this model, is a curated dataset aggregated from multiple existing datasets.
Some portions include data derived from outputs generated by proprietary models (e.g., OpenAI, Anthropic, and other closed-source systems).
Users must comply with the applicable terms of use of those models when using this model.
Citation
@misc{sugiura2026jaglebuildinglargescalejapanese,
  title={Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models},
  author={Issa Sugiura and Keito Sasagawa and Keisuke Nakao and Koki Maeda and Ziqi Yin and Zhishen Yang and Shuhei Kurita and Yusuke Oda and Ryoko Tokuhisa and Daisuke Kawahara and Naoaki Okazaki},
  year={2026},
  eprint={2604.02048},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02048},
}