Sarashina2-Vision-14B
Sarashina2-Vision-14B is a Japanese Large Vision Language Model trained by SB Intuitions.
This model is based on Sarashina2-13B and the image encoder of Qwen2-VL-7B.
It achieved top-level scores on 4 benchmarks (as of 2025/03/07) compared with other Japanese VLMs.
How to use
1. Install dependencies
pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
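The transformers version is pinned above because the model ships custom code loaded with `trust_remote_code=True`. As an optional sanity check (not part of the original instructions, just a convenience), you can confirm the installed version before loading the model:

import transformers
# The model card pins transformers to 4.47.0; fail early if the environment drifted.
assert transformers.__version__ == "4.47.0", transformers.__version__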
2. Inference
The following script loads the model and runs inference on a sample image.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
# Define model path
model_path = "sbintuitions/sarashina2-vision-14b"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか?
### Assistant:"""
sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-14b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
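# Stop generation at the next "\n###" marker, i.e. before a new "### Human:" turn from the chat template.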
stopping_criteria = processor.get_stopping_criteria(["\n###"])
# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
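# Trim the prompt tokens so that only the newly generated tokens are decoded.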
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。"""
Example

| Prompt | Output |
|---|---|
| この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか? | この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。 |
| 真ん中に映っている赤と白の物は何ですか? | 赤と白の物はクレーンです。 |
Training
Sarashina2-Vision is created through the following three-stage training process (a rough sketch of the stage-wise trainable parameters follows this list):
- We tune the projector parameters on caption datasets.
- We tune the vision encoder and projector parameters on caption datasets.
- We tune the projector and LLM parameters on visual instruction datasets.
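As an illustration only, the three stages amount to changing which parameter groups receive gradients. The attribute names `vision_encoder`, `projector`, and `llm` below are hypothetical stand-ins, not the model's actual module names:

# Hedged sketch of the three-stage schedule; module names are hypothetical.
def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: projector only, on caption data.
set_trainable(model.vision_encoder, False)
set_trainable(model.projector, True)
set_trainable(model.llm, False)

# Stage 2: vision encoder + projector, on caption data.
set_trainable(model.vision_encoder, True)
set_trainable(model.projector, True)
set_trainable(model.llm, False)

# Stage 3: projector + LLM, on visual instruction data.
set_trainable(model.vision_encoder, False)
set_trainable(model.projector, True)
set_trainable(model.llm, True)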
Evaluation Results
| Model | Model Size | JMMMU*1 | Heron-Bench*2 | JDocQA |
|---|---|---|---|---|
| heron-chat-git-ja-stablelm-base-7b-v1 | 7B | 0.294 | 0.461 | 0.069 |
| llava-calm2-siglip | 7B | 0.07 | 0.521 | 0.084 |
| Llama-3-EvoVLM-JP-v2 | 8B | 0.389 | 0.509 | 0.103 |
| Asagi-14B | 14B | 0.302 | 0.433 | 0.06 |
| llm-jp-3-vila-14b | 14B | 0.23 | 0.665 | 0.176 |
| EZO-InternVL2-26B | 26B | 0.389 | 0.609 | 0.196 |
| Sarashina2-Vision-8B | 8B | 0.393 | 0.648 | 0.229 |
| Sarashina2-Vision-14B | 14B | 0.433 | 0.644 | 0.245 |
*1: Evaluated only single-image samples (1,286 samples). If answer extraction failed, we treated the sample as incorrect (score 0) instead of making a random choice, to eliminate stochasticity.
*2: GPT-4o (gpt-4o-2024-08-06) was used as the LLM-as-a-Judge.
Ethical Considerations and Limitations
Sarashina2-Vision may generate meaningless sequences, inaccurate outputs, or biased/objectionable content. Before using Sarashina2-Vision, we ask developers to tune the model based on human preferences and safety considerations.
LICENSE