Getting the model to run?
Hello all, wondering if any of you had to do something special to get the model to run. I have 2x H100 GPUs, each with 80 GB of VRAM, but I keep getting CUDA out-of-memory errors when loading the model:

"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB. GPU 0 has a total capacity of 79.32 GiB of which 964.44 MiB is free. Process 18230 has 78.37 GiB memory in use. Of the allocated memory 77.79 GiB is allocated by PyTorch, and 511.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) when instantiating TELayerNormColumnParallelLinear when instantiating MLP when instantiating TransformerLayer"
Setting PYTORCH_CUDA_ALLOC_CONF as the error message suggests does not help.
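For what it's worth, here is roughly how I am setting it (a minimal sketch; my actual launch script differs). My understanding is the variable has to be in the environment before CUDA is initialized, so I set it at the very top of the script:

```python
import os

# The allocator config has to be in the environment before CUDA is initialized
# (i.e. before the first allocation), otherwise it is ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the env var is set
```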
Has anyone run into the same issue?
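For reference, the only way I know to fit a checkpoint of this size on two cards from the Hugging Face side is to let Accelerate shard it across both GPUs. Below is a minimal sketch (the model id is a placeholder, and my traceback comes from the NeMo / Transformer Engine load path, so this may not be the relevant code path):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id; substitute the checkpoint you are actually loading.
model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 halves the footprint relative to fp32
    device_map="auto",           # requires `accelerate`; splits layers across both GPUs
)
```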
Here is my environment:
- huggingface_hub version: 0.23.4
- Platform: Linux-4.18.0-553.16.1.el8_10.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /opt/modeling/vkg/Llama-3.1-Nemotron/data/token
- Has saved token ?: True
- Who am I ?: vkg
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.0a0+40ec155e58.nv24.3
- Jinja2: 3.1.3
- Graphviz: 0.20.3
- keras: N/A
- Pydot: N/A
- Pillow: 10.2.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.24.4
- pydantic: 2.8.2
- aiohttp: 3.9.3
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /opt/modeling/vkg/Llama-3.1-Nemotron/data/hub
- HF_ASSETS_CACHE: /opt/modeling/vkg/Llama-3.1-Nemotron/data/assets
- HF_TOKEN_PATH: /opt/modeling/vkg/Llama-3.1-Nemotron/data/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
Thanks in advance.
It took me a month to figure out how to load its cousin (the reward model). It only worked with a Docker image for the Triton Inference Server.
Take a look at this issue: https://github.com/NVIDIA/NeMo-Aligner/issues/351