Context size

#7 by thedarktrumpet

One thing I ran into challenges with here was the context size, so I wanted to share what I found in case anyone else hits the same issue.

The native context size for this model is 8k. If you're running it through text-generation-webui, you'll need to adjust compress_pos_emb or alpha_value to scale the context up; the descriptions in the UI explain those settings well enough.
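
If you want to sanity-check the numbers before loading the model, here's a rough sketch of how the two settings relate to context length. It's my own back-of-the-envelope math, not from the model card: it assumes Llama 3's native 8192 context, a head dimension of 128, a default rope base of 500000, and that alpha_value applies the usual NTK-aware formula. Treat it as an approximation, not gospel.

```python
# Back-of-the-envelope helpers for picking context-scaling values.
# Assumptions (not from the original post): Llama 3's native context is 8192,
# its head dimension is 128, its default rope base is 500000, and alpha_value
# applies NTK-aware scaling of the form base * alpha ** (dim / (dim - 2)).

def linear_scale_factor(target_ctx: int, native_ctx: int = 8192) -> float:
    """compress_pos_emb is just the ratio of target to native context."""
    return target_ctx / native_ctx

def ntk_rope_base(alpha: float, base: float = 500_000.0, head_dim: int = 128) -> float:
    """Approximate effective rope_freq_base produced by a given alpha_value."""
    return base * alpha ** (head_dim / (head_dim - 2))

if __name__ == "__main__":
    print(linear_scale_factor(32768))   # 4.0 -> compress_pos_emb for a 32k window
    print(round(ntk_rope_base(4.0)))    # ~2.0M effective rope base at alpha_value = 4
```
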

If you're using LocalAI instead, those settings aren't exposed, but rope_freq_base is. The YAML I use for this is below:

context_size: 32768
f16: true
threads: 4
gpu_layers: 90
name: rp-llama3-lewdplay-8b
tensor_split: "90,0"
main_gpu: "0"
backend: llama-cpp
prompt_cache_all: false
parameters:
  model: Llama-3-LewdPlay-8B-evo.q8_0.gguf
  temperature: 0.6
  top_k: 40
  top_p: 0.95
  batch: 512
  tfz: 1.0
  n_keep: 0
  rope_freq_base: 8000000

The important one is rope_freq_base: scaled up from the model's default, it works well at a 32k context size.
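
If you want to confirm the larger window actually works once the model is loaded, a quick smoke test against LocalAI's OpenAI-compatible API does the job. This is a minimal sketch assuming LocalAI's defaults (listening on localhost:8080) and the model name from the YAML above; adjust host, port, and prompt length to your setup.

```python
# Minimal smoke test against LocalAI's OpenAI-compatible endpoint.
# Assumes LocalAI is listening on localhost:8080 (its default) and that the
# model name matches the `name:` field in the YAML above.
import requests

# Pad the prompt well past the native 8k window to exercise the scaled context.
long_prompt = "lorem ipsum " * 3000

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "rp-llama3-lewdplay-8b",
        "messages": [
            {
                "role": "user",
                "content": long_prompt + "\n\nSummarize the text above in one sentence.",
            },
        ],
        "temperature": 0.6,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```
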
