Qwen3.5-4b-prism-3D

This is the Vitrus Prismatic CADView checkpoint from cv-4b-prism02/checkpoint-25000, trained to global step 25,000.

The model is a CAD-conditioned 3D pose grounding VLM. It sees a tabletop scene, CAD reference images, and a 32-view CAD orientation bank, then emits a text pose target:

C:<x> <y> <z> VIEW:<k> SPIN:<deg> DLAT:<deg> DLON:<deg>
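Downstream code usually needs this string as structured values. The following is a minimal parsing sketch, not part of the released code. It assumes the numeric fields are plain decimals and that the view index appears either as a bare integer or as a `<view_k>` token. The names `PoseTarget` and `parse_pose_target` are hypothetical, and units are not specified here, so the values are returned as-is.

```python
import re
from dataclasses import dataclass
from typing import Optional

_NUM = r"[-+]?\d+(?:\.\d+)?"
# Assumed surface form: "C:x y z VIEW:k SPIN:deg DLAT:deg DLON:deg".
# VIEW may be a bare integer or a <view_k> token (both are assumptions).
_PATTERN = re.compile(
    rf"C:\s*(?P<x>{_NUM})\s+(?P<y>{_NUM})\s+(?P<z>{_NUM})\s+"
    rf"VIEW:\s*(?:<view_(?P<vt>\d+)>|(?P<vi>\d+))\s+"
    rf"SPIN:\s*(?P<spin>{_NUM})\s+"
    rf"DLAT:\s*(?P<dlat>{_NUM})\s+"
    rf"DLON:\s*(?P<dlon>{_NUM})"
)


@dataclass
class PoseTarget:
    x: float
    y: float
    z: float
    view: int      # index into the 32-view orientation bank
    spin: float    # degrees
    dlat: float    # degrees
    dlon: float    # degrees


def parse_pose_target(text: str, num_views: int = 32) -> Optional[PoseTarget]:
    """Return the first pose target found in `text`, or None if malformed."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    view = int(m.group("vt") or m.group("vi"))
    if not 0 <= view < num_views:
        return None  # index falls outside the orientation bank
    return PoseTarget(
        x=float(m.group("x")),
        y=float(m.group("y")),
        z=float(m.group("z")),
        view=view,
        spin=float(m.group("spin")),
        dlat=float(m.group("dlat")),
        dlon=float(m.group("dlon")),
    )
```

Returning `None` on malformed output, rather than raising, makes it easy to count parse failures separately from pose errors during evaluation.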

Architecture

The checkpoint wraps Qwen/Qwen3.5-4B with a Prismatic vision fuser:

Qwen3.5-4B provides the image-text generation backbone.
facebook/dinov2-large provides geometry-rich patch features.
model_prismatic.PrismaticFuser aligns DINOv2 patches to Qwen's merged visual token grid.
cadview_prismatic.PrismaticVLM patches Qwen's vision tower while keeping Qwen's native generation path, M-RoPE, and KV-cache behavior intact.

The checkpoint is stored as a PyTorch state_dict (pytorch_model.bin) because the training wrapper is not a standard Transformers model saved with save_pretrained, so from_pretrained cannot load it directly. See loading_example.py for the expected loading path.
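Before wiring the weights into the wrapper, it can help to check which submodules the state_dict actually covers. The sketch below is an illustration only and does not replace loading_example.py. The helper names `summarize_prefixes` and `load_state_dict_file` are hypothetical, and the key layout of this checkpoint is not documented here, so the prefixes you see will depend on the file.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_prefixes(keys: Iterable[str], depth: int = 1) -> Dict[str, int]:
    """Count state_dict entries per dotted-key prefix of the given depth."""
    counts: Counter = Counter()
    for key in keys:
        prefix = ".".join(key.split(".")[:depth])
        counts[prefix] += 1
    return dict(counts)


def load_state_dict_file(path: str = "pytorch_model.bin"):
    """Load the raw state_dict on CPU (requires PyTorch)."""
    import torch  # imported lazily so the helper above works without torch

    # weights_only=True restricts unpickling to tensors and basic containers.
    return torch.load(path, map_location="cpu", weights_only=True)
```

For example, `summarize_prefixes(load_state_dict_file().keys(), depth=2)` gives a quick view of which towers and fuser modules are present in the file.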

Training Data

Trained on Vitrus CADView / CAD-pose grounding data, released separately as vitrus/synthetic-cad-view.

Source layout:

gs://vitrus-assets/cad_pose_grounding/v2/scenes
gs://vitrus-assets/cad_pose_grounding/v2/cad_refs
gs://vitrus-assets/cad_pose_grounding/v2/atlas/n32
gs://vitrus-assets/cad_pose_grounding/v2/symmetry_groups.json

The v2 pool contains 6,320 machined CAD parts with strict scene-level train/holdout splits. The symmetry manifest contains 6,320 part groups, including 3,034 non-trivial proper-rotation or continuous orientation orbits.

Checkpoint Details

Run: cv-4b-prism02
Source checkpoint: gs://vitrus-assets/cad_pose_grounding/ckpt/cv-4b-prism02/checkpoint-25000/
Global step: 25,000
Epoch: 18.35
Base model: Qwen/Qwen3.5-4B
DINO tower: facebook/dinov2-large
DINO input resolution: 448
View tokens: <view_0> through <view_31> added to the tokenizer before loading weights
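The view-token requirement above can be sketched as follows. This is a hedged illustration, not the repository's code: `view_tokens` and `add_view_tokens` are hypothetical helper names, and whether the tokens were registered as special tokens during training is an assumption. The calls `add_tokens` and `resize_token_embeddings` are the standard Hugging Face tokenizer and model methods; loading_example.py remains the reference for the actual order of operations.

```python
from typing import List

NUM_VIEWS = 32  # size of the CAD orientation bank


def view_tokens(n: int = NUM_VIEWS) -> List[str]:
    """Return ['<view_0>', ..., '<view_{n-1}>'] in index order."""
    return [f"<view_{i}>" for i in range(n)]


def add_view_tokens(tokenizer, model=None) -> int:
    """Register the view tokens; optionally resize the model's embeddings.

    Must run before the state_dict is loaded so the embedding shapes match.
    Returns the number of tokens that were newly added.
    """
    # special_tokens=True is an assumption about how training registered them.
    added = tokenizer.add_tokens(view_tokens(), special_tokens=True)
    if model is not None:
        model.resize_token_embeddings(len(tokenizer))
    return added
```

If the tokens are added in a different order, or after the weights are loaded, the token ids will no longer line up with the trained embedding rows.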

Training loss over the steps leading up to this checkpoint was approximately 0.25.

Loading

pip install -r requirements.txt
python loading_example.py

The model requires the custom wrapper files included in this repository:

model_prismatic.py
cadview_prismatic.py

Intended Use

This release is intended for research on CAD-conditioned visual grounding, robotic part localization, monocular 3D pose estimation, and geometry-aware VLMs. It is not a general chat model.

Limitations

The model was trained for a specific CADView text target and expects the scene/reference/bank input structure used by the Vitrus CAD-pose pipeline. It should be evaluated on held-out CAD identities and real robot scenes before use in closed-loop manipulation.

License

Apache-2.0. The upstream base models Qwen/Qwen3.5-4B and facebook/dinov2-large also report Apache-2.0 licenses on Hugging Face.
