MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Overview
MAESTRO-4B is the lightweight multimodal orchestrator used in MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles.
Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:
- whether to invoke an external expert,
- which expert model to call,
- which task-specific skill to use,
- and when to terminate with a final answer.
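The four decisions above can be read as a decide-act loop. The following is a minimal sketch of that loop, not the repository's implementation: `Action`, `rollout`, `toy_policy`, and the expert and skill names are all hypothetical, and the real policy is the 4B model rather than a Python function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Action:
    """One orchestrator decision (hypothetical structure)."""
    kind: str                      # "think", "call", or "answer"
    model: Optional[str] = None    # expert model to call when kind == "call"
    skill: Optional[str] = None    # skill to use when kind == "call"
    content: str = ""              # reasoning text, query, or final answer


# Hypothetical aliases: a policy maps the history to the next action,
# and an expert maps (skill, query) to an observation string.
Policy = Callable[[List[str]], Action]
Expert = Callable[[str, str], str]


def rollout(policy: Policy, experts: Dict[str, Expert], question: str,
            max_steps: int = 8) -> Tuple[Optional[str], List[str]]:
    """Run the decide-act loop until the policy answers or steps run out."""
    history = [f"question: {question}"]
    for _ in range(max_steps):
        action = policy(history)
        if action.kind == "answer":
            # Terminate with a final answer.
            history.append(f"answer: {action.content}")
            return action.content, history
        if action.kind == "call":
            # Invoke the chosen expert with the chosen skill.
            observation = experts[action.model](action.skill, action.content)
            history.append(f"call {action.model}/{action.skill}: {observation}")
        else:
            # Plain reasoning step, no external call.
            history.append(f"think: {action.content}")
    return None, history


def toy_policy(history: List[str]) -> Action:
    """Call one expert, then answer with whatever it returned."""
    if len(history) == 1:
        return Action("call", model="chart-expert", skill="read_axis",
                      content="y-axis maximum?")
    return Action("answer", content=history[-1].split(": ", 1)[1])


toy_experts = {"chart-expert": lambda skill, query: "42"}
answer, trace = rollout(toy_policy, toy_experts, "What is the y-axis maximum?")
```

In this toy run the trace holds three entries: the question, one expert call, and the final answer.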
The full MAESTRO system is available at jinyangwu/Maestro. The repository includes example train/validation data under data/ and skill implementations under skills/.
Important: This checkpoint is an orchestrator policy, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollouts, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.
Key Features
- RL-trained orchestration policy: Learns model-skill routing through outcome-based reinforcement learning.
- Hierarchical skill registry: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
- Model-skill composition: Treats expert model selection and skill invocation as a unified action.
- Plug-and-play extensibility: Can exploit newly added experts and skills without retraining in the reported setup.
- Efficient 4B controller: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.
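To make the hierarchical registry and the plug-and-play claim concrete, here is a minimal sketch of a two-level lookup table. It is an illustration under assumptions: `REGISTRY`, `register`, `dispatch`, and every skill name below are hypothetical and do not reflect the actual layout of the repository's skills/ directory.

```python
from typing import Callable, Dict

# Hypothetical two-level registry: Level-1 skill -> Level-2 solver -> callable.
Solver = Callable[[str], str]

REGISTRY: Dict[str, Dict[str, Solver]] = {
    "chart": {
        "read_value": lambda q: f"[chart/read_value] {q}",
        "compare_series": lambda q: f"[chart/compare_series] {q}",
    },
    "math": {
        "geometry": lambda q: f"[math/geometry] {q}",
    },
}


def register(level1: str, level2: str, solver: Solver) -> None:
    """Add a solver without touching the dispatch logic."""
    REGISTRY.setdefault(level1, {})[level2] = solver


def dispatch(level1: str, level2: str, query: str) -> str:
    """Select the coarse skill, then hand the query to the fine-grained solver."""
    try:
        solver = REGISTRY[level1][level2]
    except KeyError:
        raise KeyError(f"unknown skill {level1}/{level2}") from None
    return solver(query)


# A newly added skill becomes callable immediately.
register("ocr", "page", lambda q: f"[ocr/page] {q}")
result = dispatch("ocr", "page", "read the invoice total")
```

Because `dispatch` only looks names up, extending the registry does not require changes elsewhere, which is the property the plug-and-play feature relies on in this simplified picture.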
Performance Highlights
The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.
| Setting | Result |
|---|---|
| In-domain multimodal benchmarks | 70.1% average accuracy |
| Closed-source reference baselines | GPT-5: 69.3%, Gemini-2.5-Pro: 68.7% |
| Augmented out-of-domain registry without retraining | 59.5% average accuracy |
| Average latency in the reported setup | 2.88s |
These numbers describe the full MAESTRO system with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.
Quickstart
Load the orchestrator checkpoint
Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
model_id = "Jinyang23/Maestro-4B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)
Run the full MAESTRO framework
Clone the project repository:
git clone https://github.com/jinyangwu/Maestro
cd Maestro
Create the Python environment and install dependencies:
conda create -n maestro python=3.10 -y
conda activate maestro
pip install -r requirements.txt
Set an OpenAI API key before training or rollout:
export OPENAI_API_KEY="<your_api_key>"
Before training, deploy the auxiliary model services. Replace each /path/to/<model> placeholder with a local model directory or Hugging Face model id.
Example:
vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9
Default service ports used by the skills:
| Port | Model service |
|---|---|
| 2362 | qwen3-VL-8B-Instruct |
| 2364 | Chart-R1 |
| 2368 | Intern-S1-mini |
| 2369 | medgemma-1.5-4b-it |
| 2370 | DeepEyes-7B |
| 2376 | GLM-4.6V-Flash |
| 2388 | GLM-OCR |
| 2389 | PR1-Qwen2.5-VL-3B-Detection |
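Before launching training it can help to confirm that each service is reachable. The sketch below assumes every service runs on localhost and was started with `vllm serve`, whose OpenAI-compatible server answers `GET /v1/models`. The names `SERVICE_PORTS`, `base_url`, and `served_models` are hypothetical helpers, not part of the MAESTRO repository.

```python
import json
import urllib.request
from typing import Dict, List

# Port-to-model mapping copied from the default ports listed above.
SERVICE_PORTS: Dict[int, str] = {
    2362: "qwen3-VL-8B-Instruct",
    2364: "Chart-R1",
    2368: "Intern-S1-mini",
    2369: "medgemma-1.5-4b-it",
    2370: "DeepEyes-7B",
    2376: "GLM-4.6V-Flash",
    2388: "GLM-OCR",
    2389: "PR1-Qwen2.5-VL-3B-Detection",
}


def base_url(port: int, host: str = "localhost") -> str:
    """Build the OpenAI-compatible base URL for one service (host is assumed)."""
    return f"http://{host}:{port}/v1"


def served_models(port: int, host: str = "localhost",
                  timeout: float = 2.0) -> List[str]:
    """Return the model ids reported by /v1/models, or [] if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url(port, host)}/models",
                                    timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection errors, timeouts, and malformed JSON all count as "down".
        return []
    return [entry["id"] for entry in payload.get("data", [])]


def report() -> None:
    """Print one status line per expected service."""
    for port, expected in SERVICE_PORTS.items():
        status = "up" if expected in served_models(port) else "missing"
        print(f"{port}  {expected}: {status}")
```

Calling `report()` prints one line per port. A service counts as up only when the name passed to `--served-model-name` appears in the response.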
Start training with:
bash train.sh
To train from a local checkpoint or a different model id, override MODEL_NAME:
MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh
Model Details
- Model name: Jinyang23/Maestro-4B
- Role: MAESTRO multimodal orchestration policy
- Base model: Qwen3-VL-4B-Thinking
- Training method: outcome-based reinforcement learning with GRPO-style optimization
- Action space: latent reasoning, model-skill search actions, and terminal answers
- Skill interface: hierarchical skill registry from the MAESTRO repository
- Expected usage: high-level controller for external expert models and modular skills
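For readers unfamiliar with GRPO-style optimization, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several rollouts are sampled for the same prompt, and each outcome reward is standardized against its group. This is the generic technique, shown under the assumption that rewards are scalar outcomes. The function name and `eps` value are illustrative, and the MAESTRO training code may differ in its details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative. When every rollout in a group gets the same reward, all advantages are zero and that group contributes no learning signal.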
Intended Use
This model is intended for research on:
- multimodal agent orchestration,
- reinforcement learning for tool and skill use,
- model routing and expert selection,
- hierarchical skill libraries,
- agentic evaluation across heterogeneous tasks.
It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.
Citation
If you use this model or the MAESTRO framework in your research, please cite:
@misc{wu2026maestro,
title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
year={2026},
eprint={2605.22177},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.22177},
}
Acknowledgement
This project builds on open-source reinforcement learning and model-serving ecosystems, including verl and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.