lyraChatGLM does not release GPU memory after finishing a conversation

#39
by xtaos - opened

ChatGLM-6B automatically releases GPU memory after a conversation ends, but I've found that lyraChatGLM does not: the memory keeps accumulating until the GPU is full. Right now I call torch.cuda.empty_cache() after each conversation to free the memory, but that means multiple users cannot chat concurrently.
Is there any way to solve this?
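
A minimal sketch of the workaround described above, assuming a FastAPI endpoint shaped roughly like the create_item handler visible in the traceback below; the route, request fields, and model-loading step are illustrative, not lyraChatGLM's documented API:

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = ...  # lyraChatGLM model loaded at startup (loading call omitted)

class ChatRequest(BaseModel):
    prompt: str
    max_output_length: int = 512

@app.post("/chat")
def create_item(req: ChatRequest):
    # generate() takes a list of prompts, as in the traceback below.
    output_texts = model.generate([req.prompt],
                                  output_length=req.max_output_length,
                                  do_sample=True)
    # Workaround: return cached blocks to the driver after every conversation.
    # This only helps if requests are handled one at a time, so users cannot
    # chat concurrently while the cache is being cleared.
    torch.cuda.empty_cache()
    return {"response": output_texts[0]}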

I'd also like to ask: does lyraChatGLM support concurrent access? I've found that when two requests arrive concurrently, only one succeeds and the other fails with the error below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.8/dist-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "lyrachat.py", line 50, in create_item
    output_texts = model.generate(prompts, output_length=max_output_length, top_k=top_k, top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty, do_sample=True)
  File "/lyrachatglm/lyraChatGLM/lyra_glm.py", line 145, in generate
    outputs = self.model(start_ids=input_token_ids,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1480, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lyrachatglm/lyraChatGLM/model.py", line 594, in forward
    outputs = self.model.forward(start_ids,
RuntimeError: [FT][ERROR] Assertion fail: /group/30063/users/vanewu/FasterTransformer/src/fastertransformer/th_op/glm/GlmOp.h:466

Can't you just put multiple requests into a single batch?
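
A rough sketch of that batching idea, assuming a single worker thread that drains a queue of pending prompts and issues one model.generate() call per batch; only the generate(prompts, ...) call itself comes from the traceback above, everything else (queue, worker, parameter values) is illustrative:

import queue
import threading

request_queue = queue.Queue()  # holds (prompt, reply_queue) tuples

def batching_worker(model, max_batch_size=8):
    while True:
        # Wait for the first pending request, then drain whatever else arrived.
        items = [request_queue.get()]
        while len(items) < max_batch_size:
            try:
                items.append(request_queue.get_nowait())
            except queue.Empty:
                break
        prompts = [prompt for prompt, _ in items]
        outputs = model.generate(prompts, output_length=512, do_sample=True)
        # Hand each caller its own completion.
        for (_, reply_q), text in zip(items, outputs):
            reply_q.put(text)

def handle_request(prompt):
    # Called from each endpoint handler; blocks until the batch containing
    # this prompt has been generated.
    reply_q = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()

# Start once at startup:
# threading.Thread(target=batching_worker, args=(model,), daemon=True).start()

Funneling every request through one worker also means only one forward pass runs at a time, which sidesteps the FasterTransformer assertion triggered by concurrent calls.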

May I ask what your base environment is?

My environment:
V100-SXM2-32GB
Ubuntu 22.04
Driver Version: 525.105.17
CUDA Driver Version: 12.0