nvidia/GLM-5.2-NVFP4 · Can we use this model with nvfp4 kv cache?

by positiveone - opened about 17 hours ago

As described in https://developer.nvidia.cn/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/
I'm asking because a 1M-context KV cache is really big, even for a B200!
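
To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes a standard grouped-query attention layout, where the cache holds one key and one value vector per token, per layer, per KV head. The model dimensions are hypothetical placeholders, not GLM-5.2's real config, and `kv_cache_gib` is just an illustrative helper. The NVFP4 figure assumes 4-bit values plus one FP8 scale per 16-element block and ignores the per-tensor scale.

```python
# Bytes per stored element for each KV-cache dtype.
# NVFP4 packs 4-bit values plus one FP8 scale per 16-element block,
# so roughly 4.5 bits per element (the per-tensor FP32 scale is ignored).
BYTES_PER_ELEMENT = {
    "bf16": 2.0,
    "fp8": 1.0,
    "nvfp4": 0.5 + 1.0 / 16,
}


def kv_cache_gib(context_len, num_layers, num_kv_heads, head_dim, dtype):
    """Approximate KV-cache size in GiB for a single sequence."""
    # The factor 2 covers keys and values.
    elements = 2 * context_len * num_layers * num_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[dtype] / 2**30


# Hypothetical dimensions for illustration only, NOT GLM-5.2's real config.
EXAMPLE = dict(context_len=1_000_000, num_layers=60, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for dtype in BYTES_PER_ELEMENT:
        size = kv_cache_gib(dtype=dtype, **EXAMPLE)
        print(f"{dtype:>6}: {size:8.1f} GiB per 1M-token sequence")
```

Under these assumptions the NVFP4 cache comes out at about 28% of the BF16 size. The real saving depends on the model's actual attention layout, so plug in the values from the model's config before drawing conclusions.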

g-a-b-y

about 15 hours ago

I had the same question. I'm using vLLM, which supports this via --kv-cache-dtype nvfp4, but I haven't tested deploying GLM-5.2 with it yet.

felixmr1

about 12 hours ago

I don't think vLLM supports nvfp4 as a KV cache type. Where have you seen this?

g-a-b-y

about 8 hours ago

It was added two releases ago in v0.21.0. The pull request is here: https://github.com/vllm-project/vllm/pull/40177

Docs are here: https://docs.vllm.ai/en/latest/cli/serve/#-kv-cache-dtype

felixmr1

about 6 hours ago

Ah, very cool! Thanks for the update.
