NVFP4 on RTX 5090: 120k Context & 8-bit KV Cache Feasibility

#1
by nsfilho - opened

Based on the NVFP4 configuration you set up, do you think it would be possible to run this model on a 5090 with at least 120k tokens of context (using an 8-bit quantized KV cache)?

NeuralNet org

This quantization was done using llm-compressor. It does not seem to be the most effective compression, and the only way I was able to run this model was on an RTX 5000 PRO (48GB VRAM).
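
For a rough feasibility check, the KV cache size can be estimated from the model's attention shape. The sketch below is only a back-of-the-envelope calculation: the layer count, KV head count, and head dimension are hypothetical placeholders (read the real values from the model's config.json), and it ignores the weights themselves, activations, and runtime overhead. The RTX 5090 has 32 GB of VRAM, so whatever the cache needs has to fit alongside the NVFP4 weights in that budget.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_element):
    """Estimate KV cache size for a single sequence.

    The factor of 2 accounts for storing both keys and values
    for every layer, KV head, and token position.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_element


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical architecture values, NOT taken from this model.
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_TOKENS = 120_000

# 1 byte per element for an 8-bit cache, 2 bytes for a 16-bit cache.
cache_8bit = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS, 1)
cache_16bit = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS, 2)

print(f"8-bit KV cache:  {to_gib(cache_8bit):.2f} GiB")
print(f"16-bit KV cache: {to_gib(cache_16bit):.2f} GiB")
```

Adding the size of the quantized weights on disk to the 8-bit figure gives a lower bound on the VRAM needed. If that sum is already close to 32 GB, 120k of context on a 5090 is unlikely to work without a shorter context or offloading.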
