Can I load this for local inference on an RTX 4090 with 24 GB of VRAM somehow?
We do not currently have a quantized model. However, you can load OpenChat across two RTX 4090s using vLLM with tensor parallelism.
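A minimal sketch of what that looks like with vLLM's Python API, assuming `openchat/openchat_3.5` as the repo id (swap in this model's actual id):

```python
from vllm import LLM, SamplingParams

# Shard the model weights across two GPUs with tensor parallelism.
llm = LLM(
    model="openchat/openchat_3.5",  # assumed repo id; replace with this model
    tensor_parallel_size=2,         # split across 2x RTX 4090
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```

If you prefer serving instead of in-process inference, vLLM's OpenAI-compatible server accepts the same setting via `--tensor-parallel-size 2`.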