YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Overview

MAESTRO-4B is the lightweight multimodal orchestrator used in MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles.

Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:

whether to invoke an external expert,
which expert model to call,
which task-specific skill to use,
and when to terminate with a final answer.

The full MAESTRO system is available at jinyangwu/Maestro. The repository includes example train/validation data under data/ and skill implementations under skills/.

Important This checkpoint is an orchestrator policy, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.

Key Features

RL-trained orchestration policy: Learns model-skill routing through outcome-based reinforcement learning.
Hierarchical skill registry: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
Model-skill composition: Treats expert model selection and skill invocation as a unified action.
Plug-and-play extensibility: Can exploit newly added experts and skills without retraining in the reported setup.
Efficient 4B controller: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.

Performance Highlights

The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.

Setting	Result
In-domain multimodal benchmarks	70.1% average accuracy
Closed-source reference baselines	GPT-5: 69.3%, Gemini-2.5-Pro: 68.7%
Augmented out-of-domain registry without retraining	59.5% average accuracy
Average latency in the reported setup	2.88s

These numbers describe the full MAESTRO system with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.

Quickstart

Load the orchestrator checkpoint

Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Jinyang23/Maestro-4B"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

Run the full MAESTRO framework

Clone the project repository:

git clone https://github.com/jinyangwu/Maestro
cd Maestro

Create the Python environment and install dependencies:

conda create -n maestro python=3.10 -y
conda activate maestro
pip install -r requirements.txt

Set an OpenAI API key before training or rollout:

export OPENAI_API_KEY=<your_api_key>

Before training, deploy the auxiliary model services. Replace each /path/to/<model> placeholder with a local model directory or Hugging Face model id.

Example:

vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9

Default service ports used by the skills:

Port	Model service
`2362`	`qwen3-VL-8B-Instruct`
`2364`	`Chart-R1`
`2368`	`Intern-S1-mini`
`2369`	`medgemma-1.5-4b-it`
`2370`	`DeepEyes-7B`
`2376`	`GLM-4.6V-Flash`
`2388`	`GLM-OCR`
`2389`	`PR1-Qwen2.5-VL-3B-Detection`

Start training with:

bash train.sh

To train from a local checkpoint or a different model id, override MODEL_NAME:

MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh

Model Details

Model name: Jinyang23/Maestro-4B
Role: MAESTRO multimodal orchestration policy
Base model: Qwen3-VL-4B-Thinking
Training method: outcome-based reinforcement learning with GRPO-style optimization
Action space: latent reasoning, model-skill search actions, and terminal answers
Skill interface: hierarchical skill registry from the MAESTRO repository
Expected usage: high-level controller for external expert models and modular skills

Intended Use

This model is intended for research on:

multimodal agent orchestration,
reinforcement learning for tool and skill use,
model routing and expert selection,
hierarchical skill libraries,
agentic evaluation across heterogeneous tasks.

It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.

Citation

If you use this model or the MAESTRO framework in your research, please cite:

@misc{wu2026maestro,
      title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
      author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
      year={2026},
      eprint={2605.22177},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.22177}, 
}

Acknowledgement

This project builds on open-source reinforcement learning and model-serving ecosystems, including verl and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.

Downloads last month: -

Safetensors

Model size

5B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for Jinyang23/Maestro-4B

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Paper • 2605.22177 • Published 2 days ago • 17

Jinyang23
/

Maestro-4B