8xH20 141GB: CUDA out of memory

#6
by ErisLU - opened

Hello, I tried to deploy the model on 8xH20 but encountered a CUDA out-of-memory error during CUDA graph capture. Do you know what the reason is?

Deploy command:

SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 \
python -m sglang.launch_server \
  --model-path /model/huggingface.co/PhalaCloud/GLM-5.2-W4AFP8 \
  --quantization w4afp8 \
  --disable-shared-experts-fusion \
  --tp 8 \
  --kv-cache-dtype fp8_e4m3 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --context-length 1048576 \
  --mem-fraction-static 0.85 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8419

Error logs:

[2026-06-25 17:45:46 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
[2026-06-25 17:45:46 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=2.33 GB

0%| | 0/52 [00:00<?, ?it/s]
Capturing batches (bs=512 avail_mem=0.13 GB): 0%| | 0/52 [00:00<?, ?it/s]
[2026-06-25 17:45:47 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 626, in __init__
self.capture()
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 786, in capture
_capture_one_stream()
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 773, in _capture_one_stream
) = self.capture_one_batch_size(bs, forward, stream_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 988, in capture_one_batch_size
attn_backend.init_forward_metadata_out_graph(forward_batch, in_capture=True)
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/dsa_backend.py", line 451, in init_forward_metadata_out_graph
self._apply_cuda_graph_metadata(
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/dsa_backend.py", line 1028, in _apply_cuda_graph_metadata
self._build_forward_metadata_cuda_graph(
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/dsa_backend.py", line 971, in _build_forward_metadata_cuda_graph
real_page_table = self._transform_table_1_to_real(page_table_1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/attention/dsa_backend.py", line 441, in _transform_table_1_to_real
return page_table[:, strided_indices] // page_size
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^~~~~~~~~~~
File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 47, in wrapped
return f(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 1161, in __floordiv__
return torch.floor_divide(self, other)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 1 has a total capacity of 139.72 GiB of which 2.06 MiB is free. Process 1814564 has 139.71 GiB memory in use. Of the allocated memory 135.57 GiB is allocated by PyTorch, and 271.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)
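
For reference, the figures in the error message can be broken down with some simple arithmetic. The sketch below only restates numbers from the log for GPU 1; the variable and function names are made up for illustration. What the memory outside PyTorch's allocator is used for (for example the CUDA context or communication buffers) is an assumption, since the log does not say.

```python
# Figures copied from the torch.OutOfMemoryError message for GPU 1.
MIB_PER_GIB = 1024

total_gib = 139.72             # total capacity
in_use_gib = 139.71            # memory in use by process 1814564
allocated_gib = 135.57         # allocated by PyTorch
reserved_unalloc_mib = 271.58  # reserved by PyTorch but unallocated
requested_mib = 32.00          # size of the allocation that failed
free_mib = 2.06                # free memory at the time of the failure


def summarize_oom():
    """Derive a few quantities from the OOM message (illustrative helper)."""
    reserved_unalloc_gib = reserved_unalloc_mib / MIB_PER_GIB
    # Memory held by the process but not tracked by PyTorch's caching allocator.
    # Assumption: this covers things such as the CUDA context and other buffers.
    non_torch_gib = in_use_gib - allocated_gib - reserved_unalloc_gib
    # How far the failed request exceeded the memory that was still free.
    shortfall_mib = requested_mib - free_mib
    # Share of PyTorch's reserved memory that was sitting unused.
    unused_share = reserved_unalloc_gib / (allocated_gib + reserved_unalloc_gib)
    return {
        "non_torch_gib": non_torch_gib,
        "shortfall_mib": shortfall_mib,
        "unused_share": unused_share,
    }


if __name__ == "__main__":
    for name, value in summarize_oom().items():
        print(f"{name}: {value:.4f}")
```

Under these numbers, well under 1% of PyTorch's reserved memory was unallocated, so the `expandable_segments:True` hint in the message may not help much on its own: the GPU appears to have been almost completely full before CUDA graph capture started (the log reports `avail mem=2.33 GB` at that point).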
