1M inference error on A100 80GBx4 System
#1 opened by shi3z
Thank you for the excellent results. I immediately tried to run it on my 8x A100 80GB system, but I encountered this error. Do you know of any solutions?
>>> pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
Fetching 20 files: 100%|██████████| 20/20 [00:00<00:00, 6944.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
Exception in thread Thread-35:
Traceback (most recent call last):
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 398, in _create_model_instance
model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231
[The same out-of-memory traceback follows for Thread-36, Thread-37, and Thread-38, each preceded by another "gemm_config.in is not found" warning.]
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/api.py", line 89, in pipeline
return pipeline_class(model_path,
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 217, in __init__
self.gens_set.add(self.engine.create_instance())
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 358, in create_instance
return TurboMindInstance(self, cuda_stream_id)
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py", line 390, in __init__
t.join()
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 1060, in join
self._wait_for_tstate_lock()
File "/home/shi3z/.pyenv/versions/anaconda3-2023.09-0/envs/vllm/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
Can you share the backend_config?
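For context, the OOM is raised while TurboMind allocates KV-cache blocks for each model instance, so the usual levers in the config are `tp`, `session_len`, `max_batch_size`, and `cache_max_entry_count`. Below is a minimal sketch of a 1M-context setup, assuming lmdeploy's `TurbomindEngineConfig`; the specific values (`rope_scaling_factor=2.5`, `cache_max_entry_count=0.7`, `tp=4`) are assumptions to adjust for your hardware, not a verified recipe:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch only: the values below are assumptions for 4x A100 80GB,
# not a verified recipe -- tune them for your machine.
backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,    # assumed RoPE scaling for the 1M variant
    session_len=1048576,        # 1M-token context window
    max_batch_size=1,           # small batch leaves more memory for the KV cache
    cache_max_entry_count=0.7,  # fraction of free GPU memory given to the KV cache
    tp=4,                       # tensor parallelism; try tp=8 on an 8x A100 box
)

pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
```

Lowering `cache_max_entry_count` (or `session_len`) is usually the quickest way past this particular allocator error, since it directly shrinks the KV-cache pool each instance tries to reserve at startup.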