vLLM serving Tulu3 Llama 405B

#3
by JettLam - opened

Hi,

I have tested the same command shown on the Hugging Face page to host the model through vLLM. I have 2 nodes with 8 GPUs each set up, with sufficient VRAM.

In the CLI, I ran the following line:
vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2

The model is able to be hosted, but the response is only exclamation marks regardless of the input. I would love to hear how your team manages to serve it through vLLM.
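For context, a minimal client check along these lines is enough to reproduce it (a sketch assuming vLLM's default port 8000 and that the served model name is the same local path passed to vllm serve):

```python
# Sketch: query the OpenAI-compatible endpoint that `vllm serve` exposes.
# Assumes the default port 8000 and that the served model name matches
# the local path given on the command line.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/path/to/model",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)

# Regardless of the prompt, the content comes back as a run of "!".
print(resp.choices[0].message.content)
```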

Thanks!

Hey @JettLam, what dtype are you using? If your vocab size is > 2**16, make sure you're using uint32 for token indices.
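For example, a quick sanity check along these lines (a sketch assuming transformers is installed and the tokenizer lives at the same local path you serve from) shows how a 16-bit index type silently wraps ids from a vocab this large:

```python
# Sketch: confirm the vocab is too large for 16-bit token ids.
# Assumes the tokenizer is at the same local path passed to `vllm serve`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/model")
vocab_size = len(tok)
print("vocab size:", vocab_size, "| fits in uint16:", vocab_size <= 2**16)

# What a 16-bit unsigned cast silently does to a high token id:
high_id = vocab_size - 1
wrapped = high_id % 2**16  # same effect as storing the id in uint16
print(high_id, "->", wrapped, "decodes to", repr(tok.decode([wrapped])))
```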
