Optimized Acceleration Port for RTX 2080 Ti 22GB (Supports Single & Dual-GPU Deployments)
Hi everyone,
For those looking to deploy or experiment with the Lance model on cost-effective, high-VRAM hardware environments—specifically the modified NVIDIA RTX 2080 Ti 22GB—I have published an optimization and acceleration repository.
This project addresses specific bottlenecks on the Turing architecture and provides verified configurations for both single-card and distributed consumer-grade setups.
Key Features & Validated Setups:
- Hardware-Aware Optimization: Tailored kernel and quantization configurations designed to maximize the compute capabilities of 2080 Ti tensor cores.
- Single-GPU Baseline: Validated stability and optimal throughput utilizing the full 22GB VRAM boundary.
- Dual-GPU Scaling (2x 2080 Ti 22GB): Implemented and verified efficient model/pipeline parallelism to unlock a 44GB VRAM pool with minimized communication overhead.
The installation guides, deployment scripts, and code are fully open-sourced here:
👉 https://github.com/lvyufeng/Lance-2080ti
If you are deploying Lance on 22GB clusters or running into scaling issues on older architectures, this setup should provide a reliable blueprint. Feedback and contributions are highly welcome!
Hi — really nice work on this port. I independently built multi-GPU support for Lance on
a different consumer setup (5× RTX 3060, 12 GB each + only 8 GB system RAM, Ampere /
bf16 / flash-attn), and landed on the same core ideas you did: layer-wise pipeline
parallelism across GPUs, VAE pinned to a separate card, and eager-SDPA instead of
flex/flash-attention. Good to see the approach converge from two directions.
A few things my version adds that might be complementary to yours:
- Low system-RAM load. The stock path materializes the 3B model in fp32 on CPU
(~12 GB) before moving it to GPU — fine on a 22 GB-VRAM box with plenty of host RAM,
but it OOM-kills an 8 GB-RAM host during weight init. I build underaccelerate.init_empty_weights()and stream the safetensors with plainseek/read
(nommap, no fp32 copy), casting to bf16 straight onto each GPU — peak CPU RSS stays
under ~2 GB. This is orthogonal to VRAM optimization, so it'd stack cleanly with your
token-chunking / offload work. - N-GPU device_map (not fixed 2-GPU), with cuda:0 given a reduced layer share since
it also holds embed/lm_head/ViT, and the last card reserved for the VAE. - VAE tiling (decode + encode) — overlapping spatial tiles with feather-blend, so
768² video decode and higher-res editing/i2v source-encode fit a 12 GB card instead of
capping resolution/frames. Validated bf16-exact vs untiled. - Full task coverage — understanding, t2i, t2v, i2v, image_edit, video_edit all
validated end-to-end (the editing/i2v paths needed a couple of cross-device fixes for
the source latent under sharding).
Branch + design notes here if useful: https://github.com/bytedance/Lance/pull/43. Happy to cross-reference or
collaborate — your MLP token-chunking + my streaming load + your FP16/Turing path could
fold together into one consumer-GPU story. Either way, thanks for putting this out there.