TripoSR: Single-Image to 3D Mesh (ONNX)

ONNX export of stabilityai/TripoSR, Stability AI + Tripo AI's single-image-to-mesh model. ViT image encoder (DINOv1) + triplane decoder + implicit field, all trained jointly so the model can hallucinate the back and occluded sides of an object from a single front-facing photo. Object-scale, not scene-scale: works best on a single subject centered in frame, ideally with a clean background.

This is the complement to depth-based 3D pipelines (depth → point cloud → Poisson mesh), which can only capture what the camera actually sees. TripoSR fills in plausible-but-not-truthful geometry for the unseen sides: good for content creation, wrong for scientific measurement.

Re-exported from upstream PyTorch weights via torch.onnx.export. Provenance trail: Tochilkin et al. → cloned VAST-AI-Research/TripoSR (for the tsr/ architecture module) + stabilityai/TripoSR (for config.yaml + model.ckpt weights from the Hub) → two separately-traced ONNX graphs (image→triplane and (triplane,xyz)→(density,color)) → these files.

Toolchain: torch 2.4.x (CUDA 12.4), torchvision 0.19, transformers 4.45.2, einops>=0.7, omegaconf>=2.3, jaxtyping>=0.2.20, onnx>=1.16, onnxconverter-common>=1.14, opset 17, do_constant_folding=True. Full conversion script: scripts/export-triposr.ps1 in the DatumIngest repo. The script also writes a requirements.txt / requirements-torch.txt / requirements-freeze.txt / README.txt quartet into the output directory so the exact venv state is recoverable from just the uploaded files.

Why two graphs instead of one: TripoSR is a feedforward image-to-triplane model whose downstream "render" step samples a NeRF MLP at many query points on a 3D grid (chunked over ~16M points at 256³ resolution). Tracing the entire thing as one graph would either (a) bake in a specific grid resolution as a constant, or (b) require dynamic-shape grid construction inside the ONNX graph; both are fragile. Splitting it makes the per-image cost (triplane.onnx) and the per-chunk cost (nerf.onnx) independently schedulable on the host. Marching cubes runs entirely outside the ONNX graph in the host engine.
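
The per-chunk scheduling described above can be sketched on the host without materializing the full grid. This is a minimal illustration, not the DatumIngest implementation: iter_grid_chunks is a hypothetical helper name, and the defaults (256 resolution, radius 0.87, 262,144 points per chunk) are taken from the usage example in this card.

```python
import numpy as np

def iter_grid_chunks(resolution=256, radius=0.87, chunk=262_144):
    """Yield (start, xyz) pairs covering a resolution^3 grid in [-radius, radius]^3.

    Points follow the same order as np.meshgrid(..., indexing="ij") flattened in
    C order, so collected densities reshape to (resolution, resolution, resolution).
    """
    coords = np.linspace(-radius, radius, resolution, dtype=np.float32)
    total = resolution ** 3
    for start in range(0, total, chunk):
        flat = np.arange(start, min(start + chunk, total))
        # Convert flat indices back to (i, j, k) grid indices.
        i, j, k = np.unravel_index(flat, (resolution,) * 3)
        yield start, np.stack([coords[i], coords[j], coords[k]], axis=-1)

# 256**3 / 262_144 = 64 nerf.onnx calls per image at the default settings.
n_chunks = sum(1 for _ in iter_grid_chunks())
print(n_chunks)
```

Each yielded xyz array can be passed to nerf.onnx as-is; only one chunk of coordinates is alive at a time, which keeps peak host memory at the size of the density grid plus one chunk.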

Credit: Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, Yan-Pei Cao (Stability AI + Tripo AI / VAST-AI-Research). Paper: "TripoSR: Fast 3D Object Reconstruction from a Single Image", 2024.

What this repo contains

TripoSR ships as a two-graph pipeline, with the TripoSR runtime config + a reproducibility manifest alongside. Total bundle is ~3.4 GB; the .onnx_data sidecar must travel with its .onnx (ORT resolves external-data sidecars from the relative path recorded inside the .onnx, so keep both files in the same directory and don't rename the sidecar).

| File | Role | Size |
| --- | --- | --- |
| triplane.onnx | Image (RGB 512×512) → triplane features [B, 3, C, 64, 64]. Run once per image. | ~920 KB graph |
| triplane.onnx_data | External weights for triplane.onnx. Spilled to a sidecar via save_as_external_data so the .onnx stays under the 2 GB protobuf limit. | ~3.1 GB |
| nerf.onnx | (triplane, xyz points [K, 3]) → (density [K], color features [K, *]). Run many times per image, chunked over a 3D query grid. Uses grid_sample internally, which needs opset ≥ 16 (this export uses opset 17). | ~180 KB (graph + weights) |
| config.yaml | TripoSR runtime config (DINOv1 spec, triplane channel count, NeRF MLP dims, render radius). The architecture module reads this, so keep it alongside the .onnx files. | <1 KB |
| requirements.txt | PyPI pin set used at export time. Recreates the working venv. | <1 KB |
| requirements-torch.txt | torch + torchvision pins (PyTorch cu124 index). | <1 KB |
| requirements-freeze.txt | Full pip freeze capturing transitive closure for byte-identical recreation. | ~2 KB |
| README.txt | Provenance manifest written by the export script: HF model id, the TripoSR architecture-repo commit sha that was traced, file inventory, recreation steps. | ~2 KB |

If you ran the export with -Fp16, you'll also see triplane_fp16.onnx + triplane_fp16.onnx_data + nerf_fp16.onnx siblings (~half the disk footprint, IO types kept fp32 so consumer code is identical to the fp32 path).
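
Because a missing sidecar only surfaces when the session is created, it can help to check the bundle before loading anything. The sketch below is one way to do that, assuming the file names listed in this card; resolve_bundle is a hypothetical helper, not part of the export script.

```python
from pathlib import Path

def resolve_bundle(model_dir, prefer_fp16=False):
    """Return paths for the two graphs, failing early if a required file is missing."""
    root = Path(model_dir)
    # Fall back to the fp32 graphs when the fp16 siblings were not exported.
    use_fp16 = prefer_fp16 and (root / "triplane_fp16.onnx").exists()
    suffix = "_fp16" if use_fp16 else ""
    triplane = root / f"triplane{suffix}.onnx"
    sidecar = root / f"triplane{suffix}.onnx_data"
    nerf = root / f"nerf{suffix}.onnx"
    missing = [p.name for p in (triplane, sidecar, nerf) if not p.exists()]
    if missing:
        raise FileNotFoundError(f"incomplete TripoSR bundle, missing: {missing}")
    return {"triplane": triplane, "nerf": nerf}
```

The returned paths can be converted with str() and handed to ort.InferenceSession; since the fp16 graphs keep fp32 inputs and outputs, the calling code stays the same for either variant.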

What's NOT in the ONNX graphs: the marching-cubes step that extracts a triangle mesh from the density grid. That's a classical algorithm; it runs as a downstream pipeline step in whatever consumer renders the mesh (the DatumIngest engine has it as part of the mesh_compute_* SQL function family). Same convention used by InstantMesh, CRM, Shap-E, and other implicit-field mesh-gen models.

Input / output

| Graph | Input(s) | Output(s) |
| --- | --- | --- |
| triplane.onnx | image: [batch, 3, 512, 512] float32, RGB, pre-resized to 512×512 and normalized to [0, 1] on the host (the upstream PIL-side image_processor is bypassed because it doesn't trace cleanly) | triplane: [batch, 3, C, 64, 64] float32 (C depends on TripoSR config; ~32-96 channels typically) |
| nerf.onnx | triplane: [1, 3, C, 64, 64]; xyz: [K, 3] query points in radius-normalized coords [-R, R] where R≈0.87 | density: [K] activated density (the values marching-cubes runs on); color: [K, *] activated RGB features (sample at MC vertex positions for per-vertex color) |

Dynamic axes: triplane.onnx is dynamic in batch only (image dims are fixed at 512×512; TripoSR is trained at that resolution). nerf.onnx is dynamic in batch + points, so chunk size can vary at runtime.
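
Since the image dimensions are fixed, a shape or dtype mismatch fails at run time inside ORT. A small host-side guard makes that failure easier to read. This is a minimal sketch under the input contract stated above; to_triplane_input is a hypothetical helper name.

```python
import numpy as np

def to_triplane_input(rgb_uint8):
    """HxWx3 uint8 RGB (already resized to 512x512) -> 1x3x512x512 float32 in [0, 1]."""
    arr = np.asarray(rgb_uint8)
    if arr.shape != (512, 512, 3) or arr.dtype != np.uint8:
        raise ValueError(
            f"expected uint8 array of shape (512, 512, 3), got {arr.dtype} {arr.shape}"
        )
    # HWC -> CHW, scale to [0, 1], then add the batch axis.
    chw = arr.astype(np.float32).transpose(2, 0, 1) / 255.0
    # Return a C-contiguous copy so the runtime receives a plain dense buffer.
    return np.ascontiguousarray(chw[None, ...])
```

To run several images in one call, tensors produced this way can be concatenated along axis 0, since triplane.onnx is dynamic in batch.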

How to use

The runtime pattern matches what the script's generated README.txt documents:

import onnxruntime as ort
import numpy as np
from PIL import Image

triplane_sess = ort.InferenceSession("triplane.onnx")
nerf_sess     = ort.InferenceSession("nerf.onnx")

# 1. Pre-resize + normalize the image on the host (the ONNX bypasses PIL).
img = Image.open("subject.png").convert("RGB").resize((512, 512))
arr = np.asarray(img, dtype=np.float32) / 255.0
arr = arr.transpose(2, 0, 1)[None, ...]                              # 1x3x512x512

# 2. Encode image to triplane features. One ORT.Run.
triplane = triplane_sess.run(None, {"image": arr.astype(np.float32)})[0]

# 3. Build a 3D query grid + chunk over it. 256³ is the standard
#    resolution; ~16M points total, chunked at ~256K per nerf.onnx call.
RESOLUTION = 256
RADIUS     = 0.87
CHUNK      = 262_144

coords = np.linspace(-RADIUS, RADIUS, RESOLUTION, dtype=np.float32)
xx, yy, zz = np.meshgrid(coords, coords, coords, indexing="ij")
xyz = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)                 # [16.7M, 3]

densities = np.empty(xyz.shape[0], dtype=np.float32)
for i in range(0, xyz.shape[0], CHUNK):
    chunk = xyz[i : i + CHUNK]
    d, _ = nerf_sess.run(None, {"triplane": triplane, "xyz": chunk})
    densities[i : i + chunk.shape[0]] = d
density_grid = densities.reshape(RESOLUTION, RESOLUTION, RESOLUTION)

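# 4. Marching cubes on the density grid (host-side). Activated density is
#    non-negative, so the iso-level must be positive; 25.0 is the upstream
#    TripoSR default threshold. Adjust it if the surface looks eroded or bloated.
import mcubes
vertices, triangles = mcubes.marching_cubes(density_grid, 25.0)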

# 5. Optional: per-vertex color via a second nerf.onnx pass at vertex positions.
#    vertices are in voxel coords; rescale back to radius-normalized [-R, R] first.
verts_radius = ((vertices / (RESOLUTION - 1)) * 2.0 - 1.0) * RADIUS
_, color = nerf_sess.run(None, {
    "triplane": triplane,
    "xyz":      verts_radius.astype(np.float32),
})

# vertices: Nx3 float; triangles: Mx3 int. Feed to Three.js / trimesh / GLB writer.
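
If no mesh library is at hand, the arrays from the steps above can be written out as a Wavefront OBJ with plain Python. The sketch below is illustrative only: write_obj is a hypothetical helper, the "v x y z r g b" vertex-color form is a widely supported but non-standard OBJ extension, and it assumes the first three color channels are RGB in [0, 1].

```python
import numpy as np

def write_obj(path, vertices, triangles, colors=None):
    """Write a triangle mesh as Wavefront OBJ, with optional per-vertex RGB."""
    vertices = np.asarray(vertices, dtype=np.float64)
    triangles = np.asarray(triangles, dtype=np.int64)
    if colors is not None:
        # Assumption: channels 0..2 are RGB; clamp to the [0, 1] range.
        colors = np.clip(np.asarray(colors, dtype=np.float64)[:, :3], 0.0, 1.0)
    with open(path, "w", encoding="ascii") as f:
        for idx, (x, y, z) in enumerate(vertices):
            line = f"v {x:.6f} {y:.6f} {z:.6f}"
            if colors is not None:
                r, g, b = colors[idx]
                line += f" {r:.4f} {g:.4f} {b:.4f}"
            f.write(line + "\n")
        # OBJ face indices are 1-based.
        for a, b, c in triangles + 1:
            f.write(f"f {a} {b} {c}\n")
```

Passing verts_radius instead of the raw voxel-space vertices keeps the exported mesh in the same radius-normalized coordinates the model was queried in.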

The exact input/output names per ONNX file are stable across exports of this script (image / triplane for graph 1; triplane / xyz / density / color for graph 2), but inspect with Netron if you're integrating with a different runtime that's picky about names.

When to pick TripoSR vs depth-based 3D

The two pipelines are complementary, not competing:

| Pipeline | Strength | Weakness |
| --- | --- | --- |
| TripoSR (this) | Complete 3D object (front + back + interior), generates geometry the camera couldn't see, single image input | Hallucinated for unseen parts (plausible but not truthful), object-scale only, no metric units |
| Depth model + Poisson reconstruction | Metrically faithful (with ZoeDepth), scientifically usable, scene-friendly, multi-image composable | Only captures the visible surface: no back, no occlusions filled |

For "give me a complete model of this object I photographed," pick TripoSR. For "give me an accurate reconstruction of this scene," pick depth + Poisson.

License

Stability AI Community License: see LICENSE file in this repo. Key terms:

  • Free for research, personal use, and commercial use under $1M annual revenue.
  • Above $1M annual revenue, you need a separate commercial agreement with Stability AI.
  • Redistribution permitted with attribution + license preservation + Acceptable Use Policy adherence.

The license is more permissive than its name suggests for most use cases. The $1M threshold applies to users of the model; distributing the ONNX (this repo) doesn't trigger it. See stability.ai/community-license-agreement for the full text and stability.ai/use-policy for the AUP.

Related models worth knowing

If you want to go further than TripoSR's quality (at higher engineering cost):

  • TRELLIS (Microsoft Research, 2024): MIT, often higher quality output, multi-view conditioned generation. PyTorch only, with no clean ONNX export yet.
  • Hunyuan3D-2 (Tencent, 2025): Tencent Hunyuan license, very high quality. PyTorch only.
  • CRM (Convolutional Reconstruction Model) (Tsinghua, 2024): Apache-2.0, similar architecture to TripoSR. PyTorch only.

TripoSR remains the easiest single-image-to-mesh model to actually ship as ONNX, which is why it's the catalog entry point.
