Manual settings for best output?

#11
by Reign2294 - opened

I am getting 0.5 tokens/s on a RTX 4090. Can anyone tell me is this normal or am I doing something wrong? Cheers!

cmd flags: none

Warnings on loading:
"UserWarning: do_sample is set to False. However, temperature is set to 0.9 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed."

"UserWarning: do_sample is set to False. However, top_p is set to 0.6 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed."

"WARNING:models\Gryphe_MythoMax-L2-13b\tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated."

"WARNING:models\Gryphe_MythoMax-L2-13b\special_tokens_map.json is different from the original LlamaTokenizer file. It is either customized or outdated."

Settings:
Model loader: Transformers
gpu-memory in MiB for device: 0
cpu-memory in MiB: 0
load-in-4bit params:

  • compute_dtype: float16
  • quant_type: nf4

alpha_value: 1
rope_freq_base: 0
compress_pos_emb: 1
cpu: unselected
load-in-8bit: unselected
bf16: unselected
auto-devices: unselected
disk: unselected
load-in-4bit: unselected
use-double-quant: unselected
trust-remote-code: unselected

Parameters:
Preset: simple-1
max_new_tokens: 200
temp: 0.7
top_p: 0.9
top_k: 20
typical_p: 1
epsilon_cutoff: 0
eta_cutoff: 0
tfs: 1
top_a: 0
repetition_penalty: 1.15
repetition_penalty_range: 0
encoder_rep_penalty: 1
no_repeat_ngram_size: 0
min_length: 0
seed: -1
do_sample: selected

If it helps, I intend to use this model in SillyTavern... unsure if anyone has specific settings for there too.
Thanks!

So it looks like the software you're using doesn't have sampling enabled, so it's basically ignoring the sampler settings you've set. There should be a checkbox or something similar; just make sure do_sample is enabled and checked or set somewhere. oobabooga's webui is a good one to use with tavern.
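If you'd rather fix it at the config level, something like this should clear those warnings (a minimal sketch using transformers' GenerationConfig; the values just mirror the simple-1 preset listed above, so adjust to taste):

```python
from transformers import GenerationConfig

# do_sample=True is the key bit: temperature/top_p/top_k are only
# used in sample-based generation modes, which is exactly what the
# two UserWarnings are complaining about.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.15,
    max_new_tokens=200,
)
```

You can then pass `generation_config=gen_config` to `model.generate(...)`, or save it over the model folder's generation_config.json so the warning stops appearing at load time.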

I wasn't able to get it to fit within 24GiB on the 4090 without checking either load-in-4bit or load-in-8bit. Even when it did load, as soon as the context started to build it would run out of VRAM and start spilling into system RAM, slowing things way down.

But with load-in-4bit, it works great. Also set the GPU memory slider to the max, and set the compute dtype to bfloat16; native bf16 support is one of the major benefits of the 4090.
