Running in HuggingFace TGI container

by bbuehl

Has anyone attempted to run this model, or any of the NousResearch variants, using the Hugging Face text-generation-inference (TGI) container? I've been running the meta-llama/Llama-2-13b-chat-hf and meta-llama/Llama-2-70b-chat-hf models in the container on g5.12xlarge and g5.48xlarge EC2 instances, respectively, without issues. However, when I try to run the NousResearch models, I hit out-of-memory errors whenever I set `--max-input-length` and `--max-batch-prefill-tokens` at or near the fine-tuned context size. Any guidance on which parameters are required to make it work?
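For reference, here is a minimal sketch of the kind of launch I'm describing, wrapped in Python via the Docker CLI. The model ID, volume path, shard count, and token limits are illustrative placeholders (not values I've verified for these checkpoints); the OOM appears when the two token limits are raised toward the model's extended context length:

```python
import subprocess

# Illustrative assumptions, not verified values:
model_id = "NousResearch/Yarn-Llama-2-13b-64k"  # example NousResearch variant
volume = "/home/ec2-user/tgi-data"              # host directory for the model cache

cmd = [
    "docker", "run", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80",
    "-v", f"{volume}:/data",
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", model_id,
    "--num-shard", "4",                    # g5.12xlarge has 4x A10G (24 GB each)
    "--max-input-length", "4096",          # fine at 4096; OOMs when raised toward 64k
    "--max-total-tokens", "8192",
    "--max-batch-prefill-tokens", "4096",  # raising this alongside input length also OOMs
]
subprocess.run(cmd, check=True)
```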
