This README was auto-generated by the HF Job run linked below, and the whole repository is a reproducible artifact of that Job.
Ahead-of-time repository
AoT repos contain pre-compiled binaries of PyTorch models, enabling:
- fast startup times (no `torch.compile` needed)
- faster inference than eager PyTorch (see Samples below)
- ZeroGPU compatibility
How to use
```python
import os
import tempfile

import numpy as np
import torch
import spaces

from bernini.cli import DEFAULT_NEG_PROMPT
from bernini.pipeline import BerniniRendererPipeline
from bernini.prompt_enhancer import get_system_prompt_for_task

MODEL_ID = os.environ.get("BERNINI_MODEL_ID", "ByteDance/Bernini-R-1.3B-Diffusers")

pipeline = BerniniRendererPipeline.from_pretrained(
    MODEL_ID,
    device="cuda",
    load_ckpt_weights=False,
    use_unipc=True,  # required by the *_apg guidance modes
    use_src_id_rotary_emb=True,
)

# Bernini's transformer blocks call `varlen_attention`, which does
# `cu_seqlens.tolist()` + a python loop (data-dependent -> not torch.export-able).
# For the demo's single-video case (one packed segment) full attention over the
# packed sequence is identical to the per-segment varlen result, so we swap in an
# export-traceable single-sequence SDPA. Patched process-wide, before capture.
import torch.nn.functional as F  # noqa: E402
import bernini.attention as _A  # noqa: E402


def _export_varlen(q, k, v, cu_seqlens_q=None, cu_seqlens_k=None,
                   max_seqlen_q=None, max_seqlen_k=None, causal: bool = False):
    # q: [Sq, H, D], k/v: [Sk, H, D] -> [1, H, S, D] for SDPA, back to [S, H, D].
    # cu_seqlens_* / max_seqlen_* are accepted for signature compatibility but ignored.
    qi = q.transpose(0, 1).unsqueeze(0)
    ki = k.transpose(0, 1).unsqueeze(0)
    vi = v.transpose(0, 1).unsqueeze(0)
    oi = F.scaled_dot_product_attention(qi, ki, vi, is_causal=causal)
    return oi.squeeze(0).transpose(0, 1)


_A.varlen_attention = _export_varlen

# transformer_wan.py did `from ..attention import varlen_attention` at import, so
# it holds its OWN reference; patch the one actually called inside the blocks.
import bernini.models.transformer_wan as _TW  # noqa: E402
_TW.varlen_attention = _export_varlen

# Load the pre-compiled (AoT) transformer from this repo onto the pipeline's transformer.
spaces.aoti_load(
    module=pipeline.model.diff_dec.transformer,
    repo_id='linoyts/Bernini-R-1.3B-Transformer-sm120-cu130-r12',
)
```
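The comment in the snippet above claims that, for a single packed segment, full attention over the packed sequence equals the per-segment varlen result. The NumPy sketch below checks that claim on random data. It re-implements standard scaled dot-product attention in the same `[S, H, D]` layout; `varlen_reference` is a hypothetical stand-in for per-segment varlen semantics (it is not Bernini's `varlen_attention`), and `single_sequence` mirrors the patched `_export_varlen` by ignoring `cu_seqlens`.

```python
import numpy as np


def sdpa(q, k, v):
    # Standard softmax(Q K^T / sqrt(D)) V per head; q: [Sq, H, D], k/v: [Sk, H, D].
    d = q.shape[-1]
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)  # [H, Sq, Sk]
    scores -= scores.max(axis=-1, keepdims=True)            # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)             # back to [Sq, H, D]


def varlen_reference(q, k, v, cu_seqlens):
    # Hypothetical per-segment reference: attention never crosses segment boundaries.
    out = np.empty_like(q)
    bounds = list(cu_seqlens)  # data-dependent Python loop, the part torch.export cannot trace
    for start, end in zip(bounds[:-1], bounds[1:]):
        out[start:end] = sdpa(q[start:end], k[start:end], v[start:end])
    return out


def single_sequence(q, k, v, cu_seqlens=None):
    # Mirrors the README's patch: ignore cu_seqlens and attend over the whole packed sequence.
    return sdpa(q, k, v)


rng = np.random.default_rng(0)
S, H, D = 12, 2, 8
q, k, v = (rng.standard_normal((S, H, D)) for _ in range(3))

one_segment = np.array([0, S])
two_segments = np.array([0, 5, S])
print(np.allclose(varlen_reference(q, k, v, one_segment), single_sequence(q, k, v)))   # True
print(np.allclose(varlen_reference(q, k, v, two_segments), single_sequence(q, k, v)))  # False
```

With two packed segments the outputs diverge, because full attention lets tokens of one segment attend to the other. That is why the patch is only safe for the single-video case described in the comment.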
How to reproduce or customize
```bash
# Install hf CLI
curl -LsSf https://hf.co/cli/install.sh | bash
# Log in
hf auth login
# Get the job file and edit (user section) if needed
hf download linoyts/Bernini-R-1.3B-Transformer-sm120-cu130-r12 job.py --local-dir .
# Run the job and change flavor or image if needed
hf jobs uv run job.py \
    --flavor rtx-pro-6000 \
    --image pytorch/pytorch:2.9.1-cuda13.0-cudnn9-devel \
    --secrets HF_TOKEN
```
The following job environment variables can be used to customize how the output repo name is generated:
- `OUTPUT_REPO_NAMESPACE`: if unset, taken from the account associated with `HF_TOKEN`
- `OUTPUT_REPO_BASE_NAME`: defaults to the class name of `module`
- `OUTPUT_REPO_ID`: overrides name generation entirely
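To make the precedence concrete, here is a minimal sketch of how these variables could combine into an output repo ID. It is a hypothetical illustration of the rules listed above, not the code in `job.py`: `resolve_output_repo_id`, `module_class_name`, and `token_namespace` are names invented for this example, and the real job may append further tags (this repo's name ends in `-sm120-cu130-r12`). The variables themselves would be set like any other job environment variable, for example with `--env OUTPUT_REPO_ID=...` on `hf jobs uv run`, assuming your `hf` CLI version supports that flag.

```python
import os


def resolve_output_repo_id(module_class_name, token_namespace, env=None):
    """Hypothetical sketch of the documented precedence; not the actual job.py logic."""
    env = os.environ if env is None else env
    # OUTPUT_REPO_ID, when set, overrides name generation entirely.
    if env.get("OUTPUT_REPO_ID"):
        return env["OUTPUT_REPO_ID"]
    # Namespace: explicit variable, else the account associated with HF_TOKEN.
    namespace = env.get("OUTPUT_REPO_NAMESPACE") or token_namespace
    # Base name: explicit variable, else the compiled module's class name.
    base_name = env.get("OUTPUT_REPO_BASE_NAME") or module_class_name
    return f"{namespace}/{base_name}"


print(resolve_output_repo_id("MyTransformer", "alice", env={}))
# alice/MyTransformer
print(resolve_output_repo_id("MyTransformer", "alice", env={"OUTPUT_REPO_BASE_NAME": "Bernini-R"}))
# alice/Bernini-R
print(resolve_output_repo_id("MyTransformer", "alice", env={"OUTPUT_REPO_ID": "org/custom"}))
# org/custom
```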
Samples
Generated as part of the compilation job, before and after compilation:
| Before compilation (39.43s) | After compilation (28.93s) |
|---|---|
Speedup: 1.36x (note that this may not always reflect the actual performance gain)
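A single before/after timing can be skewed by one-off costs such as first-call overheads or caching. A minimal sketch of a steadier measurement, assuming you wrap one full generation call in a zero-argument function: run a warmup call, then take the median of several timed repeats, synchronizing the device before each clock read when work is asynchronous. It is not necessarily how the job measured the numbers above; `benchmark` and `speedup` are hypothetical helpers.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5, sync=None):
    """Median wall-clock seconds of fn() after `warmup` untimed calls.

    `sync` is an optional callable (e.g. torch.cuda.synchronize) run before each
    clock read so that queued GPU work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def speedup(before_s, after_s):
    # Ratio of the two timings, as in the Speedup line above.
    return before_s / after_s


# Reproduce the reported ratio from the two timings in the table.
print(f"{speedup(39.43, 28.93):.2f}x")  # 1.36x

# Toy CPU workload; on GPU you would pass sync=torch.cuda.synchronize.
median_s = benchmark(lambda: sum(range(100_000)), warmup=1, repeats=3)
print(f"median: {median_s:.6f}s")
```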
Environment
PyTorch version: 2.12.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.1.2
Libc version: glibc-2.35
Python version: 3.10.19 (main, Oct 31 2025, 23:02:46) [Clang 21.1.4 ] (64-bit runtime)
Python platform: Linux-6.12.88-119.157.amzn2023.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 13.0.48
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA RTX PRO 6000 Blackwell Server Edition
Nvidia driver version: 580.159.03
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8559C
CPU family: 6
Model: 207
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 2
BogoMIPS: 4800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 4.5 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 192 MiB (96 instances)
L3 cache: 640 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected
Versions of relevant libraries:
[pip3] Could not collect
[conda] numpy 2.3.4 py311h2e04523_0 conda-forge
[conda] nvidia-cublas 13.0.0.19 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.48 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.48 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.48 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.13.0.50 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.15 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.3.29 pypi_0 pypi
[conda] nvidia-cusparse 12.6.2.49 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.0 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.27.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.39 pypi_0 pypi
[conda] nvidia-nvtx 13.0.39 pypi_0 pypi
[conda] optree 0.17.0 pypi_0 pypi
[conda] torch 2.9.1+cu130 pypi_0 pypi
[conda] torchaudio 2.9.1+cu130 pypi_0 pypi
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchvision 0.24.1+cu130 pypi_0 pypi
[conda] triton 3.5.1 pypi_0 pypi
Job run