Can this run on a SageMaker g5 instance?

#4
by nsegev - opened

Hi,

I know the full Falcon-180B runs on SageMaker on a p4de.24xlarge instance (8 × A100-80GB).
An 8-bit variant runs on SageMaker on a p4d.24xlarge instance (8 × A100-40GB).
I'm trying to see whether the 4-bit GPTQ variant will work on a g5.48xlarge instance (8 × A10G-24GB).

I'm using Hugging Face TGI; any ideas why I'm seeing the following error: "NotImplementedError: Tensor Parallelism is not implemented for 14 not divisible by 8"?
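
For reference, this is roughly the deployment I'm attempting (a sketch only; SAGE_ROLE, LLM_CONTAINER, MODEL_ID and the Hub token are placeholders for my actual values):

import json
from sagemaker.huggingface import HuggingFaceModel

# Sketch of the attempted deployment; 'gptq' tells TGI to load the pre-quantized weights.
config = {
    'SM_NUM_GPUS': json.dumps(8),              # shard across all 8 A10G GPUs
    'HF_MODEL_ID': MODEL_ID,                   # this repo's GPTQ weights
    'HF_MODEL_QUANTIZE': 'gptq',
    'MAX_INPUT_LENGTH': json.dumps(1024),
    'MAX_TOTAL_TOKENS': json.dumps(2048),
    'HUGGING_FACE_HUB_TOKEN': HUGGING_FACE_HUB_TOKEN,
}

hf_model = HuggingFaceModel(role=SAGE_ROLE, image_uri=LLM_CONTAINER, env=config)
predictor = hf_model.deploy(initial_instance_count=1, instance_type='ml.g5.48xlarge',
                            container_startup_health_check_timeout=900)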

Thanks

With the release of TGI 1.1.0 it is possible to load this model onto a g5.48xlarge on AWS.
But there seems to be a memory error when the model loads and tries to prefill.
The error repeats for both the 4-bit and 3-bit versions, which is odd.

Has anyone managed to deploy via TGI 1.1.0 on g5.48xlarge? I am seeing the same CUDA illegal memory access error during prefill:

2023-10-20T03:26:32.395175Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-10-20T03:26:34.571922Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 672, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 753, in generate_token
    raise e
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 750, in generate_token
    out = self.forward(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 717, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 643, in forward
    hidden_states = self.transformer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 603, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 521, in forward
    attn_output = self.self_attention(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 353, in forward
    return self.dense(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 349, in forward
    out = QuantLinearFunction.apply(
  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 244, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/quant_linear.py", line 216, in matmul248
    matmul_248_kernel[grid](
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/custom_autotune.py", line 110, in run
    timings = {
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
    config: self._bench(*args, config=config, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/custom_autotune.py", line 90, in _bench
    return triton.testing.do_bench(
  File "/opt/conda/lib/python3.9/site-packages/triton/testing.py", line 144, in do_bench
    torch.cuda.synchronize()
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 688, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 83, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 207, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 72, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 674, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
2023-10-20T03:26:34.573224Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Unexpected <class 'RuntimeError'>: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Decreasing max-batch-prefill-tokens does not help resolve this error.
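
For completeness, I'm lowering it via the container environment, assuming the LLM container forwards MAX_BATCH_PREFILL_TOKENS to text-generation-launcher's --max-batch-prefill-tokens flag:

# Assumption: MAX_BATCH_PREFILL_TOKENS maps to --max-batch-prefill-tokens.
config['MAX_BATCH_PREFILL_TOKENS'] = json.dumps(1024)   # down from the default 4096
config['MAX_INPUT_LENGTH'] = json.dumps(512)            # keep prompts under the prefill budget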

It seems that you need to adjust the max-batch-prefill-tokens value since you ran out of memory on the instance.

Any update on this? I'm still getting the same error on a g5.48xlarge (8 × 24 GB VRAM) with TGI 1.1.0 and the GPTQ version of Falcon-180B.
I tried going down to 100 max prefill tokens and still get "You need to decrease --max-batch-prefill-tokens".
How can I estimate the extra memory required after the model is loaded?
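
My own back-of-envelope so far (rough only, with layer/head counts taken from the model's config.json as I read it):

# Rough estimate: 4-bit weights plus fp16 KV cache; activations, CUDA contexts
# and the GPTQ kernel workspaces come on top of this, spread unevenly per GPU.
n_params    = 180e9
n_layers    = 80      # Falcon-180B config values as I read them
n_kv_heads  = 8
head_dim    = 64
dtype_bytes = 2       # fp16 KV cache

weights_gb = n_params * 4 / 8 / 1e9                         # ~90 GB spread over 8 GPUs
kv_per_token_mb = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes / 1e6
kv_warmup_gb = 4096 * kv_per_token_mb / 1e3                 # ~0.7 GB for the 4096-token warmup

print(f"weights ~ {weights_gb:.0f} GB, KV cache ~ {kv_per_token_mb:.2f} MB/token, "
      f"~ {kv_warmup_gb:.1f} GB at 4096 tokens; g5.48xlarge has 8 x 24 = 192 GB")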

I'll be honest, I couldn't get the GPTQ version working on TGI 1.1.0, but 1.1.0 does support bitsandbytes-nf4, which did work for me on a g5.48xlarge.
My configuration is:

import json
from sagemaker.huggingface import HuggingFaceModel

config = {
    'SM_NUM_GPUS': json.dumps(8),                      # shard across all 8 A10G GPUs
    'MAX_TOTAL_TOKENS': json.dumps(2048 + 512),        # input + output budget
    'MAX_INPUT_LENGTH': json.dumps(2048),
    'HUGGING_FACE_HUB_TOKEN': HUGGING_FACE_HUB_TOKEN,
    'HF_MODEL_ID': 'tiiuae/falcon-180B-chat',
    'HF_MODEL_QUANTIZE': 'bitsandbytes-nf4',           # on-the-fly NF4 quantization
}

hf_model = HuggingFaceModel(role=SAGE_ROLE, image_uri=LLM_CONTAINER, env=config)
predictor = hf_model.deploy(initial_instance_count=1, instance_type='ml.g5.48xlarge',
                            container_startup_health_check_timeout=900)
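
Once the endpoint is up, invocation is the usual TGI-style payload through the SageMaker predictor, something like:

# Example request; the response format follows TGI's /generate output.
response = predictor.predict({
    "inputs": "Write a short poem about GPUs.",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
})
print(response[0]["generated_text"])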

I think there was a problem with TGI 1.1.0 (https://github.com/huggingface/text-generation-inference/issues/1000); you could try 1.1.1 and see if that resolves it.
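
If you're deploying through the SageMaker SDK, you can pin the TGI container version when resolving the image URI (assuming a 1.1.1 DLC has been published for your region; otherwise fall back to 1.1.0):

from sagemaker.huggingface import get_huggingface_llm_image_uri

# Assumes a 1.1.1 TGI Deep Learning Container exists in your region.
LLM_CONTAINER = get_huggingface_llm_image_uri("huggingface", version="1.1.1")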
