YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

MAESTRO-4B: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Overview

MAESTRO-4B is the lightweight multimodal orchestrator used in MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles.

Rather than solving every task with a single monolithic model, MAESTRO frames multimodal agent execution as a sequential decision-making problem over a hierarchical model-skill registry. At each reasoning step, the 4B orchestrator decides:

  • whether to invoke an external expert,
  • which expert model to call,
  • which task-specific skill to use,
  • and when to terminate with a final answer.

The full MAESTRO system is available at jinyangwu/Maestro. The repository includes example train/validation data under data/ and skill implementations under skills/.

Important This checkpoint is an orchestrator policy, not a standalone all-purpose VLM. To reproduce MAESTRO-style rollout, use this model together with the skill registry and auxiliary model services provided in the GitHub repository.

Key Features

  • RL-trained orchestration policy: Learns model-skill routing through outcome-based reinforcement learning.
  • Hierarchical skill registry: Selects coarse Level-1 skills and dispatches to fine-grained Level-2 solvers.
  • Model-skill composition: Treats expert model selection and skill invocation as a unified action.
  • Plug-and-play extensibility: Can exploit newly added experts and skills without retraining in the reported setup.
  • Efficient 4B controller: Uses a compact orchestrator to coordinate larger or specialized frozen expert models.

Performance Highlights

The MAESTRO paper evaluates the full orchestration system across representative multimodal benchmarks covering mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis.

Setting Result
In-domain multimodal benchmarks 70.1% average accuracy
Closed-source reference baselines GPT-5: 69.3%, Gemini-2.5-Pro: 68.7%
Augmented out-of-domain registry without retraining 59.5% average accuracy
Average latency in the reported setup 2.88s

These numbers describe the full MAESTRO system with its model-skill registry and external services, not isolated single-model inference from this checkpoint alone.

Quickstart

Load the orchestrator checkpoint

Below is a minimal Transformers-style loading example. Full model-skill orchestration requires the MAESTRO repository and the auxiliary services described below.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Jinyang23/Maestro-4B"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

Run the full MAESTRO framework

Clone the project repository:

git clone https://github.com/jinyangwu/Maestro
cd Maestro

Create the Python environment and install dependencies:

conda create -n maestro python=3.10 -y
conda activate maestro
pip install -r requirements.txt

Set an OpenAI API key before training or rollout:

export OPENAI_API_KEY=<your_api_key>

Before training, deploy the auxiliary model services. Replace each /path/to/<model> placeholder with a local model directory or Hugging Face model id.

Example:

vllm serve /path/to/Intern-S1-mini --served-model-name Intern-S1-mini --tensor_parallel_size 1 --max-num-seqs 512 --trust-remote-code --port 2368 --gpu_memory_utilization 0.9

Default service ports used by the skills:

Port Model service
2362 qwen3-VL-8B-Instruct
2364 Chart-R1
2368 Intern-S1-mini
2369 medgemma-1.5-4b-it
2370 DeepEyes-7B
2376 GLM-4.6V-Flash
2388 GLM-OCR
2389 PR1-Qwen2.5-VL-3B-Detection

Start training with:

bash train.sh

To train from a local checkpoint or a different model id, override MODEL_NAME:

MODEL_NAME=/path/to/Qwen3-VL-4B-Thinking bash train.sh

Model Details

  • Model name: Jinyang23/Maestro-4B
  • Role: MAESTRO multimodal orchestration policy
  • Base model: Qwen3-VL-4B-Thinking
  • Training method: outcome-based reinforcement learning with GRPO-style optimization
  • Action space: latent reasoning, model-skill search actions, and terminal answers
  • Skill interface: hierarchical skill registry from the MAESTRO repository
  • Expected usage: high-level controller for external expert models and modular skills

Intended Use

This model is intended for research on:

  • multimodal agent orchestration,
  • reinforcement learning for tool and skill use,
  • model routing and expert selection,
  • hierarchical skill libraries,
  • agentic evaluation across heterogeneous tasks.

It is especially useful when integrated with the full MAESTRO framework, where the orchestrator can call external expert services during rollout.

Citation

If you use this model or the MAESTRO framework in your research, please cite:

@misc{wu2026maestro,
      title={MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles},
      author={Jinyang Wu and Guocheng Zhai and Ruihan Jin and Yuhao Shen and Zhengxi Lu and Fan Zhang and Haoran Luo and Zheng Lian and Zhengqi Wen and Jianhua Tao},
      year={2026},
      eprint={2605.22177},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.22177}, 
}

Links

Acknowledgement

This project builds on open-source reinforcement learning and model-serving ecosystems, including verl and vLLM. We thank the authors and contributors of these projects, as well as the developers of the expert models and skill implementations used by MAESTRO.

Downloads last month
-
Safetensors
Model size
5B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for Jinyang23/Maestro-4B