Can we use this model with nvfp4 kv cache?
As described in https://developer.nvidia.cn/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/
I ask because a 1M-context KV cache is really big, even for a B200!
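
For a rough sense of scale, here is a back-of-the-envelope sketch of KV cache size per sequence at different cache dtypes. The model shape below is hypothetical and is not the real GLM-5.2 config, and the formula assumes a plain multi-head or grouped-query attention layout (an architecture with compressed or latent KV would need a different formula). The NVFP4 figure assumes 4-bit values plus one 8-bit scale per 16-element block and ignores any per-tensor scale.

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Approximate size of the K and V caches for one sequence, in GiB."""
    elems = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys + values
    return elems * bytes_per_elem / 2**30


# Hypothetical shape for illustration only, NOT the real GLM-5.2 config.
SHAPE = dict(tokens=1_000_000, layers=64, kv_heads=8, head_dim=128)

# Approximate storage cost per cached element.
BYTES_PER_ELEM = {
    "bf16": 2.0,
    "fp8": 1.0,
    "nvfp4": 0.5 + 1 / 16,  # assumed: 4-bit value + 8-bit scale per 16 elements
}

if __name__ == "__main__":
    for name, nbytes in BYTES_PER_ELEM.items():
        size = kv_cache_gib(**SHAPE, bytes_per_elem=nbytes)
        print(f"{name}: {size:.1f} GiB per 1M-token sequence")
```

Under these assumptions the NVFP4 cache is a bit over a quarter of the BF16 size, so whether a 1M-token sequence fits on one GPU depends heavily on the real layer count and KV head layout.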
I had the same question. I'm using vLLM, which supports this via --kv-cache-dtype nvfp4, but I haven't tested deploying GLM-5.2 with it yet.
I don't think vLLM supports nvfp4 as a KV cache type. Where have you seen this?
It was added two releases ago in v0.21.0. The pull request is here: https://github.com/vllm-project/vllm/pull/40177
Docs are here: https://docs.vllm.ai/en/latest/cli/serve/#-kv-cache-dtype
Ah, very cool! Thanks for updating me.