
superhot-13b-16k-4bit-32g-safetensors

Note: the maximum sequence length (max_seq_len) must be set to 16384 (or lower) and the compression factor (compress_pos_emb) to 8.
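A small sketch of why these two settings pair up: with compress_pos_emb set to 8, RoPE position indices are divided by 8, so a 16384-token sequence is squeezed back into the base LLaMA's original 2048-position training window (the function name here is illustrative, not part of any library):

```python
# Illustration only: how compressed RoPE positions map a 16k context
# back into the base model's 2048-position range.
def scaled_position(pos: int, compress_pos_emb: float) -> float:
    """Position index as seen by the rotary embedding after compression."""
    return pos / compress_pos_emb

# The last position of a full 16384-token context:
# 16383 / 8 = 2047.875, still inside the base model's 0..2047 range.
last = scaled_position(16384 - 1, 8)
```

Setting max_seq_len above 16384 with a factor of 8 would push scaled positions past what the LoRA was fine-tuned for, which is why 16384 is the ceiling.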

The base LLaMA model and the SuperHOT LoRA were merged with https://github.com/tloen/alpaca-lora

Base LLaMA 13B: https://huggingface.co/huggyllama/llama-13b

SuperHOT 13B 16k no-rlhf-test LoRA: https://huggingface.co/kaiokendev/superhot-13b-16k-no-rlhf-test

BASE_MODEL=huggyllama_llama-13b LORA=kaiokendev_superhot-13b-16k-no-rlhf-test python export_hf_checkpoint.py
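Numerically, the merge step folds the LoRA's low-rank update into each adapted base weight as W' = W + (alpha / r) * B @ A. A toy sketch with made-up shapes and values (the export script applies this per weight matrix; d, r, and alpha here are illustrative):

```python
import numpy as np

# Toy LoRA merge: the adapter stores a rank-r update (A, B) which the
# merge bakes into the frozen base weight W.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16           # hidden size, LoRA rank, LoRA alpha (illustrative)
W = rng.standard_normal((d, d))  # frozen base weight
A = rng.standard_normal((r, d))  # LoRA down-projection
B = rng.standard_normal((d, r))  # LoRA up-projection

W_merged = W + (alpha / r) * (B @ A)
```

After merging, the adapter is no longer needed at inference time; the result is a plain checkpoint that can be quantized like any other.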

Quantized with AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ

python quant_with_alpaca.py --pretrained_model_dir superhot-13b-16k-safetensors --quantized_model_dir superhot-13b-16k-4bit-32g-safetensors --bits 4 --group_size 32 --desc_act --num_samples 256 --save_and_reload
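The --bits 4 --group_size 32 flags mean each run of 32 weights shares its own 4-bit quantization grid. A minimal round-trip sketch of group-wise min-max quantization (GPTQ additionally applies Hessian-based error correction, which this deliberately omits):

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int = 4, group_size: int = 32) -> np.ndarray:
    """Round-trip each group of `group_size` weights through a
    `bits`-bit asymmetric min-max grid. Illustrative only: real GPTQ
    also corrects rounding error using second-order information."""
    qmax = (1 << bits) - 1  # 15 levels above zero for 4-bit
    out = np.empty_like(w)
    for i in range(0, w.size, group_size):
        g = w[i:i + group_size]
        lo, hi = g.min(), g.max()
        scale = (hi - lo) / qmax if hi > lo else 1.0
        q = np.clip(np.round((g - lo) / scale), 0, qmax)  # integer codes
        out[i:i + group_size] = q * scale + lo            # dequantize
    return out

weights = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
restored = quantize_dequantize(weights)
```

Smaller groups (32 vs. the more common 128) mean more scale/zero parameters stored but a tighter grid per group, trading a little extra model size for accuracy.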

Perplexity:

CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-13b-16k-4bit-32g-safetensors \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 16384 \
         -cpe 8 \
         -ppl_cn 40 \
         -ppl_cs 16384 \
         -ppl_ct 16384
 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 16384 -> 16384
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-13b-16k-4bit-32g-safetensors/tokenizer.model
 -- Model config: /workspace/models/superhot-13b-16k-4bit-32g-safetensors/config.json
 -- Model: /workspace/models/superhot-13b-16k-4bit-32g-safetensors/4bit-32g.safetensors
 -- Sequence length: 16384
 -- RoPE compression factor: 8.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 2.50 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 32
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 7,952.62 MB
 -- Loading dataset...
 -- Testing 21 chunks...
 ** Perplexity: 6.8223
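The perplexity figure above is the exponential of the mean per-token negative log-likelihood over the tested chunks. A minimal sketch of the computation (the per-token NLLs here are synthetic, not taken from the benchmark):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: if every token were predicted with probability
# 1/6.8223, the perplexity would come out to exactly 6.8223.
uniform_nll = math.log(6.8223)
```

Lower is better; 6.82 on wikitext2 at a 16k sequence length indicates the 4-bit quantization and position compression leave the model's language modelling largely intact.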