vLLM serving Tulu3 Llama 405B

#3
by JettLam - opened

Hi,

I have tested the same command shown on the Hugging Face page to host the model through vLLM. I have 2 nodes with 8 GPUs each set up, with sufficient VRAM.

In the CLI, I ran the following line:
vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2

The model is able to be hosted, but the response is only exclamation marks regardless of the input. I would love to hear how your team manages to serve it through vLLM.
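For context, a minimal client check along these lines is enough to reproduce it (a sketch assuming vLLM's default port 8000 and that the served model name is the same local path passed to vllm serve):

```python
# Sketch: query the OpenAI-compatible endpoint that `vllm serve` exposes.
# Assumes the default port 8000 and that the served model name matches
# the local path given on the command line.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/path/to/model",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)

# Regardless of the prompt, the content comes back as a run of "!".
print(resp.choices[0].message.content)
```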

Thanks!

Hey @JettLam, what dtype are you using? If your vocab size is > 2**16, make sure you're using uint32 for token indices.
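For example, a quick sanity check along these lines (a sketch assuming transformers is installed and the tokenizer lives at the same local path you serve from) shows how a 16-bit index type silently wraps ids from a vocab this large:

```python
# Sketch: confirm the vocab is too large for 16-bit token ids.
# Assumes the tokenizer is at the same local path passed to `vllm serve`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/model")
vocab_size = len(tok)
print("vocab size:", vocab_size, "| fits in uint16:", vocab_size <= 2**16)

# What a 16-bit unsigned cast silently does to a high token id:
high_id = vocab_size - 1
wrapped = high_id % 2**16  # same effect as storing the id in uint16
print(high_id, "->", wrapped, "decodes to", repr(tok.decode([wrapped])))
```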
