PyTorch 2.12.0 + CUDA 13.3 Prebuilt Wheels (cp312)

Prebuilt Linux x86_64 Python wheels for the PyTorch 2.12.0 / CUDA 13.3 GPU stack, compiled against CPython 3.12 (cp312). These let you skip multi-hour source builds of flash-attn, causal-conv1d, triton, torchao, and friends on CUDA 13.3.

⚠️ These are not a model — this repo hosts Python .whl files. The HF "model" repo type is used purely as LFS-backed storage.

Build matrix

| Field | Value |
| --- | --- |
| PyTorch | `2.12.0+cu133` |
| CUDA toolkit | `13.3` |
| Python ABI | CPython 3.12 (`cp312`) |
| Platform | Linux `x86_64` |
| CUDA arch list | `8.6; 8.9; 9.0; 10.0; 12.0` |
| Build date | 2026-06-14 |

Build environment

| Field | Value |
| --- | --- |
| Distro | Ubuntu 24.04.4 LTS (Noble Numbat) |
| Kernel | `6.17.0-35-generic` (`x86_64`) |
| glibc | 2.39 |
| Compiler | gcc 13.3.0 |
| nvcc | CUDA 13.3, V13.3.33 |

Runtime requirement: the linux_x86_64 wheels (torch, vllm, flash-attn, causal-conv1d, bitsandbytes, torchao, torch{audio,vision}, mslk) are built against glibc 2.39 and need glibc ≥ 2.39 at runtime, i.e. a distro at least as new as Ubuntu 24.04 / Debian 13 / RHEL 10. On older glibc they will fail to load. The manylinux wheels are more portable: triton carries manylinux_2_27/manylinux_2_28 tags and flashinfer-jit-cache is manylinux_2_28, so both install on glibc ≥ 2.28.
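
The glibc requirement can be checked before downloading anything. The following is a minimal sketch using only the standard library; the helper names `parse_glibc` and `glibc_ok` are made up for illustration, and the `(2, 39)` threshold is taken from the build environment table.

```python
import platform

# Minimum glibc for the linux_x86_64 wheels, per the build environment table.
REQUIRED_GLIBC = (2, 39)


def parse_glibc(version: str) -> tuple:
    """Turn a version string such as '2.39' into the tuple (2, 39)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def glibc_ok(found: str, required: tuple = REQUIRED_GLIBC) -> bool:
    """True when the found glibc version is at least the required one.

    Comparing integer tuples avoids the string-comparison trap
    where '2.9' would sort after '2.39'.
    """
    return parse_glibc(found) >= required


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        print("Could not detect glibc (non-glibc or non-Linux system?)")
    else:
        verdict = "OK" if glibc_ok(version) else "too old for the linux_x86_64 wheels"
        print(f"glibc {version}: {verdict}")
```

Passing `(2, 28)` as `required` gives the equivalent check for the manylinux wheels.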

Target GPU architectures

| SM | Arch | Example GPUs |
| --- | --- | --- |
| `8.6` | Ampere | RTX 30-series, A10 / A40 |
| `8.9` | Ada Lovelace | RTX 40-series, L4 / L40(S) |
| `9.0` | Hopper | H100 / H200 |
| `10.0` | Blackwell (datacenter) | B100 / B200 / GB200 |
| `12.0` | Blackwell (consumer) | RTX 50-series |
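
To see whether a local GPU is one of the listed targets, its compute capability can be compared against the table. This is a sketch, not part of the repo: `is_listed_target` and `local_capabilities` are hypothetical helper names, and the check is a deliberately conservative exact match against the build list (it ignores any forward compatibility that CUDA binaries or embedded PTX might provide). `torch.cuda.get_device_capability` is only called if torch is importable.

```python
# SM targets from the table above, as (major, minor) pairs.
SUPPORTED_SM = {(8, 6), (8, 9), (9, 0), (10, 0), (12, 0)}


def is_listed_target(capability) -> bool:
    """True when the (major, minor) capability is exactly one of the listed targets."""
    return tuple(capability) in SUPPORTED_SM


def local_capabilities() -> list:
    """Compute capability of each visible GPU; empty if torch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the pure check works without torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for index, cap in enumerate(local_capabilities()):
        status = "listed" if is_listed_target(cap) else "not in the build list"
        print(f"GPU {index}: sm_{cap[0]}{cap[1]} -> {status}")
```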

Wheels

CUDA arches below are the SM targets verified from the compiled binaries (cuobjdump), not just the requested build list.

| Package | Version | File | Size | CUDA arches (SM) |
| --- | --- | --- | --- | --- |
| torch | 2.12.0+cu133 | `torch-2.12.0+cu133-cp312-cp312-linux_x86_64.whl` | 653M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 (+ `90a` `100a/f` `103a` `120a` `121a` variants) |
| vllm | 0.1.dev1+gc621af169 | `vllm-0.1.dev1+gc621af169.cu133.torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl` | 548M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 (+ `100f`) |
| flashinfer-jit-cache | 0.6.12 | `flashinfer_jit_cache-0.6.12+torch2.12.0.cu133-cp39-abi3-manylinux_2_28_x86_64.whl` | 1.0G | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 (`90a` `100a` `120f` variants) |
| flashinfer-python | 0.6.12 | `flashinfer_python-0.6.12+torch2.12.0.cu133-py3-none-any.whl` | 14M | pure Python (uses `flashinfer-jit-cache` / runtime JIT for kernels) |
| triton | 3.7.0 | `triton-3.7.0+torch2.12.0.cu133-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl` | 193M | JIT: bundles `ptxas` + `ptxas-blackwell` (compiles per-GPU at runtime) |
| flash-attn | 2.8.3 | `flash_attn-2.8.3+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl` | 234M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| causal-conv1d | 1.6.2.post1 | `causal_conv1d-1.6.2.post1+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl` | 201M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| bitsandbytes | 0.49.1 | `bitsandbytes-0.49.1+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl` | 2.4M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| torchao | 0.18.0+gitc92676e | `torchao-0.18.0+gitc92676e.torch2.12.0.cu133-cp310-abi3-linux_x86_64.whl` | 3.6M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| torchvision | 0.27.0+cu133 | `torchvision-0.27.0+cu133-cp312-cp312-linux_x86_64.whl` | 2.6M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| torchaudio | 2.11.0+cu133 | `torchaudio-2.11.0+cu133-cp312-cp312-linux_x86_64.whl` | 2.9M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| mslk-cuda-nightly | 2026.6.14 | `mslk_cuda_nightly-2026.6.14+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl` | 19M | 10.0 / 12.0 (Blackwell only) |
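
The filenames in the table follow the standard wheel naming scheme (`name-version-python_tag-abi_tag-platform_tag.whl`, with an optional build tag), so the ABI, platform and local version tag can be read straight off a filename. The sketch below is a hand-rolled parser for illustration only; `parse_wheel_filename` and `has_local_segment` are hypothetical helpers, not part of pip or this repo.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its fields (an optional build tag is dropped)."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) == 6:
        # name-version-build-python-abi-platform: discard the build tag
        parts = parts[:2] + parts[3:]
    if len(parts) != 5:
        raise ValueError(f"not a wheel filename: {filename!r}")
    name, version, python_tag, abi_tag, platform_tag = parts
    public, _, local = version.partition("+")
    return {
        "name": name,
        "version": public,
        "local": local,  # e.g. 'torch2.12.0.cu133'; empty if absent
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tags": platform_tag.split("."),  # compressed tag sets use '.'
    }


def has_local_segment(filename: str, segment: str) -> bool:
    """True when the local version tag contains the dot-separated segment."""
    return segment in parse_wheel_filename(filename)["local"].split(".")


if __name__ == "__main__":
    example = "flash_attn-2.8.3+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl"
    print(parse_wheel_filename(example))
    print("built for cu133:", has_local_segment(example, "cu133"))
```

A check such as `has_local_segment(name, "cu133")` over a download directory is one way to spot a wheel from a different CUDA stack before installing it.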

Installation

Requires Python 3.12 on Linux x86_64 with an NVIDIA driver supporting CUDA 13.3.

Install torch first so the other wheels resolve against it:

# 1. Grab the wheels
pip install huggingface_hub
hf download thad0ctor/torch2.12-cu133-cp312-wheels --local-dir wheels --repo-type model

# 2. Install torch first, then the rest
pip install wheels/torch-2.12.0+cu133-cp312-cp312-linux_x86_64.whl
pip install wheels/*.whl
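
After the install steps above, a quick sanity check can confirm that the torch build that actually got imported is the one from this repo. This is a sketch: `stack_matches` is a hypothetical helper, and it assumes `torch.version.cuda` reports the toolkit as the string `"13.3"` for this build.

```python
# Expected values taken from the build matrix.
EXPECTED_TORCH = "2.12.0+cu133"
EXPECTED_CUDA = "13.3"  # assumed format of torch.version.cuda for this build


def stack_matches(torch_version, cuda_version) -> bool:
    """True when the reported torch and CUDA versions match the build matrix."""
    return str(torch_version) == EXPECTED_TORCH and cuda_version == EXPECTED_CUDA


if __name__ == "__main__":
    try:
        import torch
    except ImportError as exc:
        # An import failure here can also indicate a glibc or driver problem.
        print(f"torch failed to import: {exc}")
    else:
        print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
        print("matches build matrix:", stack_matches(torch.__version__, torch.version.cuda))
        print("CUDA available:", torch.cuda.is_available())
```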

Or install a single wheel straight from the Hub:

pip install \
  https://huggingface.co/thad0ctor/torch2.12-cu133-cp312-wheels/resolve/main/flash_attn-2.8.3+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl

Notes

- Local version tags (e.g. +torch2.12.0.cu133) encode the exact torch/CUDA combo the wheel was built against. Keep the whole stack on the same combo to avoid ABI mismatches.
- torchao is built as cp310-abi3 (stable ABI), so it loads under cp312.
- Most wheels cover 8.6 → 12.0. GPUs below sm_86 are not supported; that includes pre-Ampere cards and also sm_80 Ampere parts such as the A100.
- mslk-cuda-nightly is Blackwell-only (sm_100 / sm_120) and will not run on Ampere/Ada/Hopper.
- triton is a JIT compiler: it bundles ptxas (incl. a Blackwell build) and compiles kernels for your actual GPU at runtime, so it has no fixed baked-in arch set.
- vllm is a from-source build off vLLM main at commit c621af169 (VCS version 0.1.dev1, i.e. an untagged dev build). To reproduce this wheel, build from that exact commit; pip install vllm==0.1.dev1+gc621af169.cu133 is not resolvable from PyPI.
- flashinfer ships as two wheels; install both:
  - flashinfer-python is the pure-Python frontend (py3-none-any, portable).
  - flashinfer-jit-cache is the matching ahead-of-time precompiled kernel cache (about 1 GB of cubins), so you don't pay for runtime JIT compilation. It is cp39-abi3 / manylinux_2_28 (glibc ≥ 2.28) and its cubins target 8.6 / 8.9 / 9.0a / 10.0a / 12.0f. The two versions must match (0.6.12).
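
The requirement that the two flashinfer wheels match can be checked after installation with `importlib.metadata`. A minimal sketch follows; `public_version`, `same_release` and `installed_version` are hypothetical helper names, and the comparison deliberately ignores the local tag after `+`.

```python
from importlib.metadata import PackageNotFoundError, version


def public_version(full_version: str) -> str:
    """Drop the local tag: '0.6.12+torch2.12.0.cu133' -> '0.6.12'."""
    return full_version.split("+", 1)[0]


def same_release(first: str, second: str) -> bool:
    """True when two version strings share the same public version."""
    return public_version(first) == public_version(second)


def installed_version(distribution: str):
    """Installed version of a distribution, or None when it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    frontend = installed_version("flashinfer-python")
    cache = installed_version("flashinfer-jit-cache")
    if frontend is None or cache is None:
        print(f"missing wheel(s): frontend={frontend}, jit-cache={cache}")
    else:
        verdict = "match" if same_release(frontend, cache) else "MISMATCH"
        print(f"flashinfer-python {frontend} / flashinfer-jit-cache {cache}: {verdict}")
```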
