Qwen3.5-4b-prism-3D
This is the Vitrus Prismatic CADView checkpoint from cv-4b-prism02/checkpoint-25000, trained to global step 25,000.
The model is a CAD-conditioned 3D pose grounding VLM. It sees a tabletop scene, CAD reference images, and a 32-view CAD orientation bank, then emits a text pose target:
C:<x> <y> <z> VIEW:<k> SPIN:<deg> DLAT:<deg> DLON:<deg>
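Downstream code usually needs this string as structured values. The following is a minimal parsing sketch, not the repository's own decoder. It assumes that `<x>`, `<y>`, `<z>` and the three angles are plain decimal numbers, that `<k>` is a plain integer index into the 32-view bank, and that fields are separated by whitespace; the `PoseTarget` class and `parse_pose` function are hypothetical names introduced here. Units and coordinate frame of `C:` are not specified above, so the parser leaves them uninterpreted.

```python
import re
from dataclasses import dataclass

# Assumption: numbers are plain signed decimals such as "-0.05" or "90".
_NUM = r"[-+]?\d+(?:\.\d+)?"
POSE_RE = re.compile(
    rf"C:\s*(?P<x>{_NUM})\s+(?P<y>{_NUM})\s+(?P<z>{_NUM})\s+"
    rf"VIEW:\s*(?P<view>\d+)\s+"
    rf"SPIN:\s*(?P<spin>{_NUM})\s+"
    rf"DLAT:\s*(?P<dlat>{_NUM})\s+"
    rf"DLON:\s*(?P<dlon>{_NUM})"
)


@dataclass
class PoseTarget:
    """Hypothetical container for one decoded pose target."""
    x: float
    y: float
    z: float
    view: int
    spin_deg: float
    dlat_deg: float
    dlon_deg: float


def parse_pose(text: str, num_views: int = 32) -> PoseTarget:
    """Extract the first pose target found in generated text."""
    match = POSE_RE.search(text)
    if match is None:
        raise ValueError(f"no pose target found in: {text!r}")
    view = int(match["view"])
    if not 0 <= view < num_views:
        raise ValueError(f"view index {view} outside 0..{num_views - 1}")
    return PoseTarget(
        x=float(match["x"]),
        y=float(match["y"]),
        z=float(match["z"]),
        view=view,
        spin_deg=float(match["spin"]),
        dlat_deg=float(match["dlat"]),
        dlon_deg=float(match["dlon"]),
    )


# Made-up example string, used only to exercise the parser.
pose = parse_pose("C:0.12 -0.05 0.80 VIEW:7 SPIN:90 DLAT:-2.5 DLON:1.0")
print(pose.view, pose.spin_deg)
```

Rejecting out-of-range view indices early keeps malformed generations from silently indexing the wrong bank entry.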
Architecture
The checkpoint wraps Qwen/Qwen3.5-4B with a Prismatic vision fuser:
- Qwen3.5-4B provides the image-text generation backbone.
- facebook/dinov2-large provides geometry-rich patch features.
- model_prismatic.PrismaticFuser aligns DINOv2 patches to Qwen's merged visual token grid.
- cadview_prismatic.PrismaticVLM patches Qwen's vision tower while keeping Qwen's native generation path, M-RoPE, and KV-cache behavior intact.
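For a sense of scale on the DINOv2 side of the fuser, the short calculation below derives the patch grid from the 448-pixel DINO input resolution listed under Checkpoint Details. It assumes the standard 14-pixel patch size of the DINOv2 ViT-L/14 model and counts patch tokens only (the CLS token is excluded). The size of Qwen's merged visual token grid depends on processor settings not stated in this card, so it is not computed here; `dino_patch_grid` is a hypothetical helper name.

```python
def dino_patch_grid(resolution: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (patches per side, total patch tokens) for a square input."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side, side * side


side, n_tokens = dino_patch_grid(448)
print(side, n_tokens)  # 32 1024
```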
The checkpoint is stored as a PyTorch state_dict (pytorch_model.bin) because this training wrapper is not a vanilla save_pretrained Transformers model. See loading_example.py for the expected loading path.
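Before wiring the weights into the wrapper, it can help to look at how the state_dict keys are grouped. The sketch below is an inspection aid only and does not replace loading_example.py. `count_key_prefixes` and `load_state_dict_keys` are hypothetical helper names, and the example keys are invented for illustration; the real key names in pytorch_model.bin may differ.

```python
from collections import Counter
from typing import Iterable


def count_key_prefixes(keys: Iterable[str], depth: int = 1) -> Counter:
    """Count parameter names by their first `depth` dotted components."""
    return Counter(".".join(key.split(".")[:depth]) for key in keys)


def load_state_dict_keys(path: str = "pytorch_model.bin") -> list[str]:
    """Read parameter names from a checkpoint file on CPU."""
    import torch  # imported lazily so the helper above works without torch

    state = torch.load(path, map_location="cpu", weights_only=True)
    return list(state.keys())


# Invented key names, used only to show the output shape.
example_keys = [
    "fuser.proj.weight",
    "fuser.proj.bias",
    "model.embed_tokens.weight",
]
prefix_counts = count_key_prefixes(example_keys)
print(prefix_counts)
```

With the real file, `count_key_prefixes(load_state_dict_keys())` would show which top-level modules the checkpoint expects the wrapper to define.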
Training Data
Trained on Vitrus CADView / CAD-pose grounding data, released separately as vitrus/synthetic-cad-view.
Source layout:
gs://vitrus-assets/cad_pose_grounding/v2/scenes
gs://vitrus-assets/cad_pose_grounding/v2/cad_refs
gs://vitrus-assets/cad_pose_grounding/v2/atlas/n32
gs://vitrus-assets/cad_pose_grounding/v2/symmetry_groups.json
The v2 pool contains 6,320 machined CAD parts with strict scene-level train/holdout splits. The symmetry manifest contains 6,320 part groups, including 3,034 non-trivial proper-rotation or continuous orientation orbits.
Checkpoint Details
- Run: cv-4b-prism02
- Source checkpoint: gs://vitrus-assets/cad_pose_grounding/ckpt/cv-4b-prism02/checkpoint-25000/
- Global step: 25,000
- Epoch: 18.35
- Base model: Qwen/Qwen3.5-4B
- DINO tower: facebook/dinov2-large
- DINO input resolution: 448
- View tokens: <view_0> through <view_31>, added to the tokenizer before loading weights
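The view tokens must exist in the tokenizer before the weights are loaded, presumably because the saved embedding matrix already has rows for them. A minimal sketch of that step follows, assuming a Hugging Face tokenizer object. Whether the training code registered them as special tokens is not stated here, so `special_tokens=True` is an assumption, and `add_view_tokens` is a hypothetical helper; loading_example.py remains the reference.

```python
# The 32 orientation-bank tokens named in the checkpoint details.
VIEW_TOKENS = [f"<view_{i}>" for i in range(32)]


def add_view_tokens(tokenizer, model=None) -> int:
    """Register view tokens; optionally resize a model's embeddings to match.

    `tokenizer` is expected to be a Hugging Face tokenizer. `model` is
    optional and, if given, must expose `resize_token_embeddings`.
    """
    # Assumption: tokens are registered as special tokens.
    n_added = tokenizer.add_tokens(VIEW_TOKENS, special_tokens=True)
    if model is not None and n_added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return n_added


print(VIEW_TOKENS[0], VIEW_TOKENS[-1], len(VIEW_TOKENS))
```

Calling this before `load_state_dict` keeps the embedding shape consistent with the checkpoint, which avoids a size-mismatch error at load time.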
Recent training loss near the checkpoint was approximately 0.25.
Loading
pip install -r requirements.txt
python loading_example.py
The model requires the custom wrapper files included in this repository:
- model_prismatic.py
- cadview_prismatic.py
Intended Use
This release is intended for research on CAD-conditioned visual grounding, robotic part localization, monocular 3D pose estimation, and geometry-aware VLMs. It is not a general chat model.
Limitations
The model was trained for a specific CADView text target and expects the scene/reference/bank input structure used by the Vitrus CAD-pose pipeline. It should be evaluated on held-out CAD identities and real robot scenes before use in closed-loop manipulation.
License
Apache-2.0. The upstream base models Qwen/Qwen3.5-4B and facebook/dinov2-large also report Apache-2.0 licenses on Hugging Face.