Error loading with text-generation-inference server version

#3
by hollinwilkins - opened

I am using this docker container: ghcr.io/huggingface/text-generation-inference:0.9.2 with these arguments:

--model-id=huggingface/falcon-40b-gptq
--quantize=gptq
--num-shard=1
--max-input-length=10
--max-total-tokens=20
--max-batch-total-tokens=20
--max-batch-prefill-tokens=10
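
For context, the container is launched roughly like this (the --gpus/--shm-size flags, port mapping, and volume mount below are illustrative of my setup, not exact):

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:0.9.2 \
  --model-id=huggingface/falcon-40b-gptq \
  --quantize=gptq \
  --num-shard=1 \
  --max-input-length=10 \
  --max-total-tokens=20 \
  --max-batch-total-tokens=20 \
  --max-batch-prefill-tokens=10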

Resource requests for the Docker container are:
resources:
  limits:
    cpu: "11"
    memory: 120Gi

This is on an A100 with 80GB memory, but for some reason I get the following error:
Error: Warmup(Generation("Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease --max-batch-total-tokens or --max-batch-prefill-tokens"))

The tiiuae/falcon-40b model works fine on this hardware with the default max_* arguments as long as I use --quantize bitsandbytes, so I don't think the GPU is actually out of memory.
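
For comparison, the invocation that does work on the same GPU looks roughly like this (same container image, default max_* limits; the docker flags are again illustrative):

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:0.9.2 \
  --model-id=tiiuae/falcon-40b \
  --quantize=bitsandbytes \
  --num-shard=1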

Full Stack Trace:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 175, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
return await self.intercept(

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 63, in Warmup
self.model.warmup(batch, request.max_total_tokens)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 729, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease --max-batch-total-tokens or --max-batch-prefill-tokens
rank=0
2023-07-17T20:09:34.579624Z ERROR warmup{max_input_length=10 max_prefill_tokens=10 max_total_tokens=20}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease --max-batch-total-tokens or --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease --max-batch-total-tokens or --max-batch-prefill-tokens"))
