Very short responses on SillyTavern
Hey, I'm hosting this model on Oobabooga and trying to use it for RP with SillyTavern, but the default Llama 3 presets (or the ones linked in Unholy) return very short responses (50-100 tokens), even if I set the response length to something crazy or set a minimum token amount, which gets ignored.
Is that normal?
I'm loading the GGUF with llama.cpp using my GPU.
Note that the Alpaca-Roleplay presets return normal-sized responses, though they come out a bit unhinged due to the difference in prompt structure. I've been fooling around with the Llama 3 formatting (mostly blindly, as I don't fully get it), but couldn't get any improvement.
Hello, I've received a bunch of positive feedback on this model and I've used it myself, so I assume it's solid. The following may sound obvious or dumb to you, but I can give you some tips:
- Make sure the greeting message is long enough and describes the beginning of the conversation in detail; a short greeting with short answers right at the start tends to make the model copy that short writing style
- If you use example messages or dialogue, whether in the ST settings or in the character card, make them long enough too, to avoid overly short replies from the character
- You can also modify the system prompt to your liking to force the model to write more
- Also be sure ST and your backend are both on the latest version, and stick to the official L3 prompt format, to avoid token issues (an early stopping token?)
- As a last resort, you could (if you use it) uncheck the ST option that collapses multiple newlines, since L3 uses two \n in its official prompt format (sketched below) and was trained on that. Might desperately fix some issues too? Haha
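For reference, here's a rough sketch of the official Llama 3 Instruct template those last two tips refer to (the special tokens are from Meta's release; the system/user text is just placeholder):

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{model reply}<|eot_id|>
```

Note the blank line (two \n) after every `<|end_header_id|>`, and that generation normally stops at `<|eot_id|>`; if ST or the backend mishandles that token, you can get early cutoffs.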
Please report back if needed!
@arnorex you could also consider using these https://huggingface.co/Virt-io/SillyTavern-Presets or these https://huggingface.co/Lewdiculous/Model-Requests/tree/main/data/presets/cope-llama-3-0.1 presets for SillyTavern. Especially for Llama 3 models, the quality got a lot better for me once I started using these presets in SillyTavern. I also recommend adjusting the prompts to your liking and specific use case.
edit: -wrong thread- sorry. Listen to WesPro: attention is all you need, thus the input determines the output. Ban the EOS token (and the BOS, why not?) and go ham with smooth sampling (0.2 or less at T=2 on a 70b 2.4 quant; this model will behave very differently).
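If anyone wants to sanity-check whether those sampler settings actually reach the backend, here's a minimal sketch hitting oobabooga's OpenAI-compatible text-completion endpoint directly; the port and the extra parameters (`smoothing_factor`, `ban_eos_token`) are assumptions based on recent text-generation-webui builds, so adjust for your setup:

```python
# Minimal sketch: query the backend directly, bypassing SillyTavern, to see
# whether temperature / smoothing / EOS banning actually change the output.
# URL, port and extra parameter names are assumptions; check your webui version.
import requests

payload = {
    "prompt": (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "Write a long, detailed scene.<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
    "max_tokens": 512,
    "temperature": 2.0,       # high temp, reined in by smoothing below
    "smoothing_factor": 0.2,  # quadratic ("smooth") sampling
    "ban_eos_token": True,    # keep the model from ending the reply early
}

resp = requests.post("http://127.0.0.1:5000/v1/completions", json=payload, timeout=300)
print(resp.json()["choices"][0]["text"])
```

If a raw call like this respects the samplers but the same values set through ST change nothing, the problem is in the ST connection/preset rather than the model.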
@Undi95 @WesPro Thanks for your recommendations; unfortunately, none of them resolved the situation... The issue I've got is most likely caused by low-quality character cards. I didn't know the model bases its further responses that heavily on the first message.
Swapped the character for a more verbose one and immediately got better results (strictly matching the format of the initial message).
It's just a different experience from what I've had with other models; this one seems less flexible. Not ideal, but the quality of Llama 3 + your modifications is excellent.
Edit: One thing I found weird is that the model doesn't respond to Text Completion presets at all. I can set Temp to 0.5 or 5 and the responses are almost identical. Same for the aforementioned Min Length, which is also ignored (I set it to 200 with max at 300, and the output is still 50-100 tokens).