It now works on Text Generation Inference Engine via LiteLLM Proxy!
This instruct model still needs work in order to behave similarly to Mistral-7B-Instruct.
I ran it with Hugging Face TGI on a local machine with 4 x RTX 3090 GPUs, each with 24 GB of VRAM.
Unfortunately, it mixes Greek and English in its answers, and most of the time the answers were irrelevant to the question.
Hi Spyros @ssakel, thanks for the feedback.
Would you be willing to share (some of) the chats and the deployment hyperparameters (e.g., temperature) you used with us?
Yes of course!
My setup is the following:
- TGI in docker: docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Mistral-7B-Instruct-v0.2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
- litellm OpenAI Proxy with the following YAML:
  - model_name: Meltemi-TGI
    litellm_params:
      model: huggingface/Meltemi-7B-Instruct-v1
      api_base: http://0.0.0.0:8080
- A Gradio application written in Python (a rough sketch is shown below).
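For context, a rough sketch of how the Gradio piece can talk to the proxy's OpenAI-compatible endpoint is below; the proxy port (4000), the dummy API key and the prompts are placeholders rather than my exact values.

```python
# Rough sketch of a Gradio chat UI talking to the LiteLLM proxy through its
# OpenAI-compatible API. Assumptions (not my exact setup): the proxy listens
# on port 4000 and no master key is configured, so any api_key string works.
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4000", api_key="dummy-key")

def answer(message, history):
    # For brevity this ignores the chat history and sends a single-turn request;
    # "Meltemi-TGI" is the model_name exposed by the LiteLLM YAML above.
    response = client.chat.completions.create(
        model="Meltemi-TGI",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

gr.ChatInterface(answer).launch()
```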
Following are a couple of screenshots:
- TGI in docker: docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Mistral-7B-Instruct-v0.2 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
Why are you using --model-id /data/Mistral-7B-Instruct-v0.2?
Yes, the --model-id /data/Mistral-7B-Instruct-v0.2 looks curious. Maybe Mistral is being used instead of Meltemi?
Also, I'm not sure about the templating mechanism used by TGI.
Meltemi-instruct works well with something like the following: <|system|>\n{{ SYSTEM_PROMPT }}\n</s><|user|>\n{{ USER_PROMPT }}\n</s><|assistant|>, where </s> is the EOS token.
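For reference, assuming the chat template is also shipped in the tokenizer config on the Hub, you can check what the final prompt should look like with something along these lines (the example messages are just placeholders):

```python
# Sanity check: render the chat template carried by the tokenizer and compare
# it against the format described above. Assumes the tokenizer config on the
# Hub includes the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-Instruct-v1")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of Greece?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # should end with <|assistant|>, ready for generation
```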
In Ollama I tested with the following template code. Can you try something similar in TGI?
{{- if .System }}
<|system|>
{{ .System }}
</s>
{{- end }}
<|user|>
{{ .Prompt }}
</s>
<|assistant|>
Sorry, wrong copy-paste. Here is the TGI command:
docker run --gpus '"device=0,1,3,4"' --shm-size 1g -p 8080:80 -v /mnt/vault/fastmodels:/data --name ssake_tgi ghcr.io/huggingface/text-generation-inference:1.4 --model-id /data/Meltemi-7B-Instruct-v1 --max-input-length 4096 --max-total-tokens 8192 --max-batch-prefill-tokens 4096
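As a quick sanity check once the container is up, something like the sketch below can be used to hit TGI directly and bypass the proxy; the prompt string follows the format from the previous comment, and the question and generation parameters are only placeholders.

```python
# Rough sketch: query the TGI container directly on port 8080 to verify the
# model itself responds sensibly before putting the LiteLLM proxy in front.
import requests

prompt = (
    "<|system|>\nYou are a helpful assistant.\n</s>"
    "<|user|>\nWhat is the capital of Greece?\n</s>"
    "<|assistant|>"
)
resp = requests.post(
    "http://0.0.0.0:8080/generate",
    json={
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, "temperature": 0.3},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```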
I will also try it with Ollama and the appropriate template.
Just pushed a quantized version to Ollama. Try:
ollama run ilsp/meltemi-instruct
If you are using it through the console, I have observed some issues where words get cut off (probably an Ollama issue). If you use it through Open WebUI, this should be fixed.
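If it helps, you can also query the local Ollama server over HTTP instead of the console, so the template from the Modelfile is applied server-side; a rough sketch, assuming the default port 11434 and with placeholder messages:

```python
# Rough sketch: ask the locally running Ollama server so the Modelfile
# template shown earlier is applied server-side. Assumes the default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "ilsp/meltemi-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of Greece?"},
        ],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```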
I just tried it with Ollama and it seems to work very well (so far I have tried it from the command line). Thanks George Paraskevopoulos @geopar!
When I run it through the LiteLLM OpenAI proxy, I get an exception, so it could be a template issue as mentioned by @geopar. I will try to put a custom template, as per @geopar's instructions, in the LiteLLM proxy and see if it works. Hopefully this will let me access Meltemi via both Ollama and TGI without worrying about which model/template I use.
Using @geopar's template in the LiteLLM configuration, it worked perfectly with TGI. Thank you guys for this model that speaks Greek! And thanks to @geopar for his help!
BTW: I changed the comment's title, as it was misleading; this was a template configuration issue.
FYI:
This is the Litellm YAML config for running Meltemi on TGI:
- model_name: Meltemi-TGI
  litellm_params:
    model: huggingface/Meltemi-7B-Instruct-v1
    api_base: http://0.0.0.0:8080
    roles: {"system":{"pre_message":"<|system|>system\n", "post_message":""}, "user":{"pre_message":"<|user|>user\n","post_message":""}, "assistant":{"pre_message":"<|assistant|>assistant\n","post_message":" "}}
I tested it on 4 x NVIDIA RTX 3090 cards (96 GB total VRAM) and on an older single Quadro M6000 24 GB card, and it worked exactly the same except for the speed, of course. Here are the TGI logs and statistics for each configuration, asking the same question:
INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-quadro-m6000-24gb"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.3), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(2020), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="95.490662472s" validation_time="412.864µs" queue_time="43.181µs" inference_time="95.490206667s" time_per_token="134.115458ms" seed="Some(1510580074201680330)"}: text_generation_router::server: router/src/server.rs:489: Success
INFO compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("4-nvidia-geforce-rtx-3090"))}:generate_stream{parameters=GenerateParameters { best_of: None, temperature: Some(0.3), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(2020), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="9.178317375s" validation_time="774.753µs" queue_time="48.772µs" inference_time="9.177494051s" time_per_token="14.613844ms" seed="Some(12459441799522371169)"}: text_generation_router::server: router/src/server.rs:489: Success
Thank you @ssakel for sharing this integration and your configuration!
If you haven't seen them yet, we have also uploaded quantized versions that can help with deployment:
https://huggingface.co/ilsp/Meltemi-7B-Instruct-v1-AWQ
https://huggingface.co/ilsp/Meltemi-7B-Instruct-v1-GGUF
Closing the issue for now as resolved.