license: llama2

My exllamav2-based quantization of Xwin-LM-70B-V0.1, targeted at 48G of VRAM. It seems to have hit a sweet spot in evaluations.

  • Original model: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
  • Exllamav2 4.8bpw conversion from https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-fp16-safetensors.
  • Fits in 48G (2x24G) VRAM with 4k or 8k context with or without the 8bit cache enabled.
  • Recommended settings: 6400 context, alpha_value 1.6, gpu_split 20,23.5
  • An alpha_value at or above 1.75 seems to cause an occasional 'stutter', which is most obvious when the model outputs dates, e.g. "The Sixth Sense (19999)".
  • The 4.800b seems to have hit a lucky quantization: its wikitext perplexity was better than the 4bit-128g, 4bit-32g, Q4_K_S, 4.650b, 4.900b and even the 5.000b!
  • Experimentation has shown that an alpha_value of 1.6 instead of 1.75 seems better at 1.5x context (6144) and even at 1.5625x (6400).
  • Maybe obvious to some, but using the 8bit cache has no perplexity impact.
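To make the context multiples and alpha_value settings above concrete, here is a minimal Python sketch. It assumes a native context of 4096 tokens (the unscaled rows in the evaluation use 4096), and it assumes the usual NTK-style scaling in which alpha_value raises the RoPE base as `base * alpha ** (d / (d - 2))`, with a base of 10000 and a head dimension of 128. Those constants and the helper names are illustrative assumptions, not taken from this model card.

```python
# Assumptions (not stated in this model card): native context 4096,
# RoPE base 10000, head dimension 128, NTK-style alpha scaling.
NATIVE_CTX = 4096
ROPE_BASE = 10000.0
HEAD_DIM = 128


def context_multiple(ctx_len, native_ctx=NATIVE_CTX):
    """Return how many times larger ctx_len is than the native context."""
    return ctx_len / native_ctx


def ntk_rope_base(alpha, base=ROPE_BASE, head_dim=HEAD_DIM):
    """Return the RoPE base after NTK-style alpha scaling (assumed formula)."""
    return base * alpha ** (head_dim / (head_dim - 2))


# (context length, alpha_value) pairs that were evaluated for this model.
evaluated = [(5120, 1.375), (6144, 1.75), (6400, 1.6),
             (6656, 1.95), (7168, 2.125), (8192, 2.5)]

for ctx, alpha in evaluated:
    print(f"ctx={ctx:5d}  multiple={context_multiple(ctx):.4f}x  "
          f"alpha={alpha:<5}  scaled_base={ntk_rope_base(alpha):.0f}")
```

The 6400-token row is the recommended setting: a 1.5625x context multiple run with alpha_value 1.6, which stays below the 1.75 level where the stutter was observed.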

Made using exllamav2/convert.py with the following command:

python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
 -cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
 -o tmp/ \
 -c parquet/wikitext-test.parquet \
 -b 4.800

Perplexity (wikitext) evaluated as:

| Model | Perplexity | Comment (alpha_value) |
|---|---|---|
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.21780776977539 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.900b | 3.2188525199890137 | 4096 ctx (not released) |
| firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw | 3.22019362449646 | 4096 ctx (8b cache) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.239454746246338 | 5120 ctx (1.375) |
| LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2 | 3.2419090270996094 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6, 8b cache) |
| xwin-lm-70b-v0.1.Q4_K_S.gguf | 3.2480294704437256 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.253002405166626 | 6144 ctx (1.75) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True | 3.266364574432373 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.278069496154785 | 6656 ctx (1.95) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True | 3.2803425788879395 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.304278612136841 | 7168 ctx (2.125) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.359946727752685 | 8192 ctx (2.5) |

*) It should also be better than xwin-lm-70b-v0.1.Q4_K_M.gguf, which reports 4.8bpw, but so far my perplexity evaluation of that file has not succeeded.
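
The 4096-context results above can be ranked with a few lines of Python to show how small the gaps between quantizations are. This is only an illustrative sketch: the numbers are copied from the evaluation listing, while the variable names and the percentage-gap metric are my own choices for presentation.

```python
# Wikitext perplexity at 4096 context, copied from the evaluation above.
ppl_4096 = {
    "matatonic_Xwin-LM-70B-V0.1-exl2-4.800b": 3.21780776977539,
    "matatonic_Xwin-LM-70B-V0.1-exl2-4.900b": 3.2188525199890137,
    "firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw": 3.22019362449646,
    "LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2": 3.2419090270996094,
    "xwin-lm-70b-v0.1.Q4_K_S.gguf": 3.2480294704437256,
    "TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True": 3.266364574432373,
    "TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True": 3.2803425788879395,
}

# Lower perplexity is better, so sort ascending and take the first entry.
ranked = sorted(ppl_4096.items(), key=lambda item: item[1])
best_name, best_ppl = ranked[0]

for name, ppl in ranked:
    # Relative gap to the best model, in percent.
    gap_pct = 100.0 * (ppl - best_ppl) / best_ppl
    print(f"{ppl:.4f}  +{gap_pct:.2f}%  {name}")
```

The differences at the top of the ranking are well under one percent, so the "lucky quantization" observation should be read as a small but consistent edge rather than a large margin.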