It now works on Text Generation Inference Engine via LiteLLM Proxy!
This instruct model still needs work in order to behave similarly to Mistral-7B-Instruct.
I ran it with Hugging Face TGI on a local machine with 4 x RTX 3090 GPUs, each with 24 GB of VRAM.
Unfortunately, it mixes Greek and English in its answers, and most of the time the answers were irrelevant to the question.
Hi Spyros @ssakel, thanks for the feedback.
Would you be willing to share (some of) the chats and the deployment hyperparameters (e.g., temperature) you used with us?
Yes of course!
My setup is the following:
- TGI in docker: docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Mistral-7B-Instruct-v0.2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
- litellm OpenAI Proxy with the following YAML:
  - model_name: Meltemi-TGI
    litellm_params:
      model: huggingface/Meltemi-7B-Instruct-v1
      api_base: http://0.0.0.0:8080
- A Gradio application written in Python (a rough sketch is shown below).
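For context, a rough sketch of how the Gradio piece can talk to the proxy's OpenAI-compatible endpoint is below; the proxy port (4000), the dummy API key and the prompts are placeholders rather than my exact values.

```python
# Rough sketch of a Gradio chat UI talking to the LiteLLM proxy through its
# OpenAI-compatible API. Assumptions (not my exact setup): the proxy listens
# on port 4000 and no master key is configured, so any api_key string works.
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4000", api_key="dummy-key")

def answer(message, history):
    # For brevity this ignores the chat history and sends a single-turn request;
    # "Meltemi-TGI" is the model_name exposed by the LiteLLM YAML above.
    response = client.chat.completions.create(
        model="Meltemi-TGI",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

gr.ChatInterface(answer).launch()
```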
Following are a couple of screenshots:
- TGI in docker: docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Mistral-7B-Instruct-v0.2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
Why are you using --model-id /data/Mistral-7B-Instruct-v0.2?
Yes, the --model-id /data/Mistral-7B-Instruct-v0.2 looks curious. Maybe Mistral is being used instead of Meltemi?
Also, I'm not sure about the templating mechanism used by TGI.
Meltemi-instruct works well with something like the following: <|system|>\n{{ SYSTEM_PROMPT }}\n</s><|user|>\n{{ USER_PROMPT }}\n</s><|assistant|>, where </s> is the EOS token.
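For reference, assuming the chat template is also shipped in the tokenizer config on the Hub, you can check what the final prompt should look like with something along these lines (the example messages are just placeholders):

```python
# Sanity check: render the chat template carried by the tokenizer and compare
# it against the format described above. Assumes the tokenizer config on the
# Hub includes the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-Instruct-v1")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of Greece?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should end with <|assistant|>, ready for generation
```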
In Ollama I tested with the following template code. Can you try something similar in TGI?
{{- if .System }}
<|system|>
{{ .System }}
</s>
{{- end }}
<|user|>
{{ .Prompt }}
</s>
<|assistant|>
Sorry, wrong copy-paste. Here is the TGI command:
docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Meltemi-7B-Instruct-v1 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
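As a quick sanity check once the container is up, something like the sketch below can be used to hit TGI directly and bypass the proxy; the prompt string follows the format from the previous comment, and the question and generation parameters are only placeholders.

```python
# Rough sketch: query the TGI container directly on port 8080 to verify the
# model itself responds sensibly before putting the LiteLLM proxy in front.
import requests

prompt = (
    "<|system|>\nYou are a helpful assistant.\n</s>"
    "<|user|>\nWhat is the capital of Greece?\n</s>"
    "<|assistant|>"
)
resp = requests.post(
    "http://0.0.0.0:8080/generate",
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, "temperature": 0.3},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```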
I will also try it with Ollama and the appropriate template.
Just pushed a quantized version to Ollama. Try:
ollama run ilsp/meltemi-instruct
If you are using it through the console, I have observed some issues where words get cut off (probably an Ollama issue). If you use it through Open WebUI, this should be fixed.
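If it helps, you can also query the local Ollama server over HTTP instead of the console, so the template from the Modelfile is applied server-side; a rough sketch, assuming the default port 11434 and with placeholder messages:

```python
# Rough sketch: ask the locally running Ollama server so the Modelfile
# template shown earlier is applied server-side. Assumes the default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "ilsp/meltemi-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of Greece?"},
        ],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```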
I just tried it with Ollama and it seems to work very well (so far I have tried it from the command line). Thanks George Paraskevopoulos @geopar!
When I run it through the LiteLLM OpenAI proxy, I get an exception, so it could be a template issue as mentioned by @geopar. I will try to put a custom template, as per @geopar's instructions, in the LiteLLM proxy and see if it works. Hopefully this will let me access Meltemi via both Ollama and TGI without worrying about which model/template I use.
Using @geopar's template in the LiteLLM configuration, it worked perfectly with TGI. Thank you guys for this model that speaks Greek! And thanks to @geopar for his help!
BTW: I changed the comment's title, as it was misleading; this was a template configuration issue.
FYI:
This is the Litellm YAML config for running Meltemi on TGI:
- model_name: Meltemi-TGI
  litellm_params:
    model: huggingface/Meltemi-7B-Instruct-v1
    api_base: http://0.0.0.0:8080
    roles: {"system":{"pre_message":"<|system|>system\n", "post_message":""}, "user":{"pre_message":"<|user|>user\n","post_message":""}, "assistant":{"pre_message":"<|assistant|>assistant\n","post_message":" "}}
I tested it on 4 x NVIDIA RTX 3090 cards (96 GB total VRAM) and on an older single Quadro M6000 24 GB card, and it worked exactly the same except for the speed, of course. Here are the TGI logs and statistics for each configuration, asking the same question:
INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-quadro-m6000-24gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.3), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(2020), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="95.490662472s" validation_time="412.864µs" queue_time="43.181µs" inference_time="95.490206667s" time_per_token="134.115458ms" seed="Some(1510580074201680330)"}: text_generation_router::server: router/src/server.rs:489: Success
INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-geforce-rtx-3090"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.3), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(2020), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="9.178317375s" validation_time="774.753µs" queue_time="48.772µs" inference_time="9.177494051s" time_per_token="14.613844ms" seed="Some(12459441799522371169)"}: text_generation_router::server: router/src/server.rs:489: Success
Thank you @ssakel for sharing this integration and your configuration!
If you haven't seen them yet, we have also uploaded quantized versions that can help with deployment:
https://huggingface.co/ilsp/Meltemi-7B-Instruct-v1-AWQ
https://huggingface.co/ilsp/Meltemi-7B-Instruct-v1-GGUF
Closing the issue for now as resolved.