seems broken..

#2
by Boffy - opened

Using text-generation-webui with this model, and comparing against https://huggingface.co/spaces/HuggingFaceH4/starchat-playground, it doesn't even seem remotely on the same level in the questions and responses generated. Not sure if I'm missing some setting (instruct mode?) or setup step to run it locally. What model is the online version using?

What model are you testing? You've posted in the StarCoder Plus repo but linked StarChat Beta, and those are different models with different capabilities and prompting methods.

I have a StarChat Beta model here: https://huggingface.co/TheBloke/starchat-beta-GPTQ

If you are using StarChat Beta like you linked, are you using the right prompt template and tokens? I just edited the README to make it clearer what the prompt template is:

Prompt template

<|system|> system message goes here <|end|>
<|user|> prompt goes here <|end|>
<|assistant|>

Example:

<|system|> Below is a conversation between a human user and a helpful AI coding assistant. <|end|>
<|user|> How do I sort a list in Python? <|end|>
<|assistant|>
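If you're scripting this yourself rather than going through text-generation-webui, here's a minimal sketch of feeding that template to the model with the transformers library. It assumes the HuggingFaceH4/starchat-beta checkpoint and that <|end|> is a registered token; exact whitespace around the special tokens can matter, so double-check against the README.

```python
# Minimal sketch: build the StarChat Beta prompt shown above and generate a
# reply with transformers. Model id and generation settings are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HuggingFaceH4/starchat-beta"  # or a quantized variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

system = "Below is a conversation between a human user and a helpful AI coding assistant."
user = "How do I sort a list in Python?"

# Assemble the template exactly as documented above.
prompt = f"<|system|> {system} <|end|>\n<|user|> {user} <|end|>\n<|assistant|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    # Stop at <|end|>, assuming it is in the tokenizer's vocabulary.
    eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```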

If you are using StarCoder Plus then please be aware that it is not an instruction tuned model. From its README:

[screenshot from the StarCoder Plus README]

So it should be able to auto-complete, or fill in the middle. But it's not going to work with "How do I sort a list in Python?". That's what StarChat Beta is for.
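For completeness, here's a rough sketch of what fill-in-the-middle usage looks like, assuming StarCoder Plus keeps StarCoderBase's FIM special tokens (<fim_prefix>, <fim_suffix>, <fim_middle>); verify against its tokenizer before relying on this.

```python
# Sketch of fill-in-the-middle (FIM) with a StarCoder-family model.
# Assumption: StarCoder Plus inherits StarCoderBase's FIM special tokens.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bigcode/starcoderplus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The model fills in the gap between prefix and suffix.
prefix = "def sort_numbers(nums):\n    "
suffix = "\n    return nums"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
# Print only the completion (the "middle").
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```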

OK, I might have got confused on that. Downloading starchat-beta.ggmlv3.q5_1.bin now; hopefully I get it working. The online demo at https://huggingface.co/spaces/HuggingFaceH4/starchat-playground worked pretty well, so hopefully it will be the same locally. It's definitely faster online (is that just much faster GPU hardware behind the scenes?). I'm not even sure I'm getting the right speed out of my local setup. On an RTX 4090, what average token speed should I expect with GGML or GPTQ? I'm on Windows, using the one-click installers and text-generation-webui, all up to date with the git repos. I'm just not sure if I'm missing something; I assume the one-click installer pulled all the correct libraries, and I did specify NVIDIA during install and update.

On an RTX 4090, what average token speed should I expect with GGML or GPTQ?

It's going to be slower compared to something like Google Colab or HF; they're running this infrastructure on farms of machines with GPUs like the A100. For me (TITAN RTX), it takes anywhere from ~2-15 seconds on average to generate a full response, depending on length.
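If you want a number that's independent of the webui, a quick sanity check is to time a generate() call directly and divide new tokens by elapsed seconds. This is just a sketch; results vary a lot with quantization, context length, and backend.

```python
# Rough tokens/s measurement. Swap in whatever model you're actually running.
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HuggingFaceH4/starchat-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "<|user|> How do I sort a list in Python? <|end|>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s = {new_tokens / elapsed:.1f} tokens/s")
```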
