PyTorch 2.12.0 + CUDA 13.3 Prebuilt Wheels (cp312)

Prebuilt Linux x86_64 Python wheels for the PyTorch 2.12.0 / CUDA 13.3 GPU stack, compiled against CPython 3.12 (cp312). These let you skip multi-hour source builds of flash-attn, causal-conv1d, triton, torchao, and friends on CUDA 13.3.

⚠️ This is not a model: this repo hosts Python .whl files. The HF "model" repo type is used purely as LFS-backed storage.

Build matrix

| Field | Value |
| --- | --- |
| PyTorch | 2.12.0+cu133 |
| CUDA toolkit | 13.3 |
| Python ABI | CPython 3.12 (cp312) |
| Platform | Linux x86_64 |
| CUDA arch list | 8.6; 8.9; 9.0; 10.0; 12.0 |
| Build date | 2026-06-14 |

Build environment

| Field | Value |
| --- | --- |
| Distro | Ubuntu 24.04.4 LTS (Noble Numbat) |
| Kernel | 6.17.0-35-generic (x86_64) |
| glibc | 2.39 |
| Compiler | gcc 13.3.0 |
| nvcc | CUDA 13.3, V13.3.33 |

Runtime requirement: the linux_x86_64 wheels (torch, flash-attn, causal-conv1d, bitsandbytes, torchao, torch{audio,vision}, mslk) are glibc ≥ 2.39 builds, so they need a distro at least as new as Ubuntu 24.04 / Debian 13 / RHEL 10. Older glibc will fail to load them. triton is manylinux_2_27/manylinux_2_28 and is more portable (glibc ≥ 2.28).
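
To check the glibc requirement before downloading anything, you can ask Python which C library it is linked against. The sketch below is a minimal example built on the standard library's `platform.libc_ver()`; the helper names `parse_version` and `glibc_ok` are illustrative and not part of this repo.

```python
import platform

# Minimum glibc for the linux_x86_64 wheels, per the runtime requirement above.
REQUIRED_GLIBC = (2, 39)


def parse_version(text):
    """Convert a dotted version such as '2.39' into a tuple of ints, (2, 39)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_ok(version_text, required=REQUIRED_GLIBC):
    """Return True if the given glibc version string meets the minimum."""
    return parse_version(version_text) >= required


if __name__ == "__main__":
    # libc_ver() typically returns ('glibc', '2.39') on a glibc system and
    # empty strings when it cannot detect glibc (for example on musl distros).
    lib, version = platform.libc_ver()
    print(f"detected: {lib or 'unknown'} {version or '?'}")
    print("linux_x86_64 wheels loadable:", lib == "glibc" and glibc_ok(version))
```

Versions are compared as integer tuples rather than strings, so "2.4" correctly sorts below "2.39".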

Target GPU architectures

| SM | Arch | Example GPUs |
| --- | --- | --- |
| 8.6 | Ampere | RTX 30-series, A10 / A40 |
| 8.9 | Ada Lovelace | RTX 40-series, L4 / L40(S) |
| 9.0 | Hopper | H100 / H200 |
| 10.0 | Blackwell (datacenter) | B100 / B200 / GB200 |
| 12.0 | Blackwell (consumer) | RTX 50-series |
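
If you are unsure which row your card falls in, a short lookup against the table can help. This is a sketch, not a compatibility guarantee: `TARGET_SM` and `arch_name` are illustrative names, and the check is a deliberately conservative exact match that does not model CUDA's binary-compatibility rules for GPUs outside the table. The live query uses `torch.cuda.get_device_capability()` and only runs if torch is already installed.

```python
# SM targets from the table above, keyed by (major, minor) compute capability.
TARGET_SM = {
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (10, 0): "Blackwell (datacenter)",
    (12, 0): "Blackwell (consumer)",
}


def arch_name(capability):
    """Return the architecture name for an exact SM match, else None."""
    return TARGET_SM.get(tuple(capability))


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; nothing to check")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            label = arch_name(cap) or "not in the build list"
            print(f"sm_{cap[0]}{cap[1]}: {label}")
        else:
            print("no CUDA device visible")
```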

Wheels

CUDA arches below are the SM targets verified from the compiled binaries (cuobjdump), not just the requested build list.

| Package | Version | File | Size | CUDA arches (SM) |
| --- | --- | --- | --- | --- |
| torch | 2.12.0+cu133 | torch-2.12.0+cu133-cp312-cp312-linux_x86_64.whl | 653M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 (+ 90a 100a/f 103a 120a 121a variants) |
| vllm | 0.1.dev1+gc621af169 | vllm-0.1.dev1+gc621af169.cu133.torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl | 548M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 (+ 100f) |
| flashinfer-jit-cache | 0.6.12 | flashinfer_jit_cache-0.6.12+torch2.12.0.cu133-cp39-abi3-manylinux_2_28_x86_64.whl | 1.0G | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 (90a 100a 120f variants) |
| flashinfer-python | 0.6.12 | flashinfer_python-0.6.12+torch2.12.0.cu133-py3-none-any.whl | 14M | pure Python (uses flashinfer-jit-cache / runtime JIT for kernels) |
| triton | 3.7.0 | triton-3.7.0+torch2.12.0.cu133-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl | 193M | JIT: bundles ptxas + ptxas-blackwell (compiles per-GPU at runtime) |
| flash-attn | 2.8.3 | flash_attn-2.8.3+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl | 234M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| causal-conv1d | 1.6.2.post1 | causal_conv1d-1.6.2.post1+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl | 201M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| bitsandbytes | 0.49.1 | bitsandbytes-0.49.1+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl | 2.4M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| torchao | 0.18.0+gitc92676e | torchao-0.18.0+gitc92676e.torch2.12.0.cu133-cp310-abi3-linux_x86_64.whl | 3.6M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| torchvision | 0.27.0+cu133 | torchvision-0.27.0+cu133-cp312-cp312-linux_x86_64.whl | 2.6M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| torchaudio | 2.11.0+cu133 | torchaudio-2.11.0+cu133-cp312-cp312-linux_x86_64.whl | 2.9M | 8.6 / 8.9 / 9.0 / 10.0 / 12.0 |
| mslk-cuda-nightly | 2026.6.14 | mslk_cuda_nightly-2026.6.14+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl | 19M | 10.0 / 12.0 (Blackwell only) |
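
The file names above end in the standard wheel tags (python tag, ABI tag, platform tag), which is why a mix of cp312, cp39-abi3, cp310-abi3 and py3-none-any files can all install under one interpreter. The sketch below is a simplified reading of those tags, assuming CPython and ignoring the platform tag entirely; pip's real matching is more thorough. `split_wheel_tags` and `loads_under` are hypothetical helper names.

```python
import re
import sys


def split_wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel file name."""
    stem = filename.removesuffix(".whl")
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def loads_under(filename, interpreter=(3, 12)):
    """Rough check: can CPython `interpreter` (major, minor) use this wheel?"""
    python_tag, abi_tag, _ = split_wheel_tags(filename)
    if python_tag == "py3" and abi_tag == "none":
        return True  # pure Python: any Python 3 interpreter
    match = re.fullmatch(r"cp(\d)(\d+)", python_tag)
    if not match:
        return False
    built_for = (int(match.group(1)), int(match.group(2)))
    if abi_tag == "abi3":
        # Stable ABI: usable on the tagged CPython version and newer ones.
        return tuple(interpreter) >= built_for
    # Version-specific ABI: the interpreter must match exactly.
    return tuple(interpreter) == built_for and abi_tag == python_tag


if __name__ == "__main__":
    name = "torchao-0.18.0+gitc92676e.torch2.12.0.cu133-cp310-abi3-linux_x86_64.whl"
    print(name, "->", loads_under(name, sys.version_info[:2]))
```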

Installation

Requires Python 3.12 on Linux x86_64 with an NVIDIA driver supporting CUDA 13.3.

Install torch first so the other wheels resolve against it:

# 1. Grab the wheels
pip install huggingface_hub
hf download thad0ctor/torch2.12-cu133-cp312-wheels --local-dir wheels --repo-type model

# 2. Install torch first, then the rest
pip install wheels/torch-2.12.0+cu133-cp312-cp312-linux_x86_64.whl
pip install wheels/*.whl

Or install a single wheel straight from the Hub:

pip install \
  https://huggingface.co/thad0ctor/torch2.12-cu133-cp312-wheels/resolve/main/flash_attn-2.8.3+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl
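
If you script installs of individual wheels, the direct URL follows the same pattern as the command above. A minimal sketch, assuming the repo id and the `resolve/main` path shown there; `wheel_url` and `pip_command` are illustrative helpers, and the `revision` parameter is an assumption added here for pinning a branch or commit.

```python
import sys

# Repo id taken from the download commands above.
REPO_ID = "thad0ctor/torch2.12-cu133-cp312-wheels"


def wheel_url(filename, repo_id=REPO_ID, revision="main"):
    """Build the direct-download URL for one wheel file in the repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def pip_command(filename):
    """Return the pip invocation (as an argv list) that installs one wheel."""
    return [sys.executable, "-m", "pip", "install", wheel_url(filename)]


if __name__ == "__main__":
    wheel = "flash_attn-2.8.3+torch2.12.0.cu133-cp312-cp312-linux_x86_64.whl"
    # Print the command instead of running it, so nothing is installed here.
    print(" ".join(pip_command(wheel)))
```

Calling pip through `sys.executable -m pip` keeps the install tied to the interpreter that runs the script, which matters when several Python versions are present.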

Notes

  • Local version tags (e.g. +torch2.12.0.cu133) encode the exact torch/CUDA combo the wheel was built against; keep the whole stack on the same combo to avoid ABI mismatches.
  • torchao is built as cp310-abi3 (stable ABI) so it loads under cp312.
  • Most wheels cover 8.6 → 12.0; GPUs below sm_86 (pre-Ampere cards, and sm_80 Ampere parts such as the A100) are not supported.
  • mslk-cuda-nightly is Blackwell-only (sm_100 / sm_120); it will not run on Ampere/Ada/Hopper.
  • triton is a JIT compiler: it bundles ptxas (incl. a Blackwell build) and compiles kernels for your actual GPU at runtime, so it has no fixed baked-in arch set.
  • vllm is a from-source build off vLLM main at commit c621af169 (vcs version 0.1.dev1, i.e. an untagged dev build). Pin the exact commit if you need to reproduce this wheel; pip install vllm==0.1.dev1+gc621af169.cu133 is not resolvable from PyPI.
  • flashinfer ships as two wheels; install both:
    • flashinfer-python is the pure-Python frontend (py3-none-any, portable).
    • flashinfer-jit-cache is the matching ahead-of-time precompiled kernel cache (1 GB of cubins) so you don't pay for runtime JIT compilation. It is cp39-abi3 / manylinux_2_28 (glibc ≥ 2.28) and its cubins target 8.6 / 8.9 / 9.0a / 10.0a / 12.0f. The two versions must match (0.6.12).
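
The notes on local version tags and matching flashinfer versions can be checked mechanically from file names alone. The sketch below is one hedged way to do it: it reads the version field of each wheel name and looks for a `cu133` segment in the local label (the part after `+`). All function names are illustrative.

```python
def wheel_version(filename):
    """Return the version field (second dash-separated part) of a wheel name."""
    return filename.split("-")[1]


def local_segments(version):
    """Split the local version label (the part after '+') on dots."""
    _, _, local = version.partition("+")
    return local.split(".") if local else []


def built_for_cuda(filename, cuda_tag="cu133"):
    """True if the wheel's local version label carries the given CUDA tag."""
    return cuda_tag in local_segments(wheel_version(filename))


def mismatched(filenames, cuda_tag="cu133"):
    """List the wheels whose local label does not carry the CUDA tag."""
    return [name for name in filenames if not built_for_cuda(name, cuda_tag)]


def same_version(filename_a, filename_b):
    """True if two wheels carry identical version strings (local label included)."""
    return wheel_version(filename_a) == wheel_version(filename_b)


if __name__ == "__main__":
    import glob

    # Assumes the wheels were downloaded into ./wheels as in the install steps.
    names = [path.split("/")[-1] for path in glob.glob("wheels/*.whl")]
    print("wheels without a cu133 tag:", mismatched(names))
```

This only inspects names, so it catches a stray wheel from another CUDA build in the same directory but says nothing about what is inside the files.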