Running in HuggingFace TGI container

by bbuehl

Has anyone attempted to run this model, or any of the NousResearch variants, using the Hugging Face text-generation-inference (TGI) container? I've been running the meta-llama/Llama-2-13b-chat-hf and meta-llama/Llama-2-70b-chat-hf models in the container on g5.12xlarge and g5.48xlarge EC2 instances, respectively, without issues. However, when I try to run the NousResearch models, I hit out-of-memory errors whenever I set `--max-input-length` and `--max-batch-prefill-tokens` at or near the fine-tuned context size. Any guidance on which parameters are required to make it work?
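For reference, here is a minimal sketch of the kind of launch I'm describing, wrapped in Python via the Docker CLI. The model ID, volume path, shard count, and token limits are illustrative placeholders (not values I've verified for these checkpoints); the OOM appears when the two token limits are raised toward the model's extended context length:

```python
import subprocess

# Illustrative assumptions, not verified values:
model_id = "NousResearch/Yarn-Llama-2-13b-64k"  # example NousResearch variant
volume = "/home/ec2-user/tgi-data"              # host directory for the model cache

cmd = [
    "docker", "run", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80",
    "-v", f"{volume}:/data",
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", model_id,
    "--num-shard", "4",                    # g5.12xlarge has 4x A10G (24 GB each)
    "--max-input-length", "4096",          # fine at 4096; OOMs when raised toward 64k
    "--max-total-tokens", "8192",
    "--max-batch-prefill-tokens", "4096",  # raising this alongside input length also OOMs
]
subprocess.run(cmd, check=True)
```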
