
superhot-13b-16k-4bit-32g-safetensors

Note: the maximum sequence length (max_seq_len) must be set to 16384 (or lower) and the compression factor (compress_pos_emb) to 8.
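A small sketch of why these two settings pair up: with compress_pos_emb set to 8, RoPE position indices are divided by 8, so a 16384-token sequence is squeezed back into the base LLaMA's original 2048-position training window (the function name here is illustrative, not part of any library):

```python
# Illustration only: how compressed RoPE positions map a 16k context
# back into the base model's 2048-position range.
def scaled_position(pos: int, compress_pos_emb: float) -> float:
    """Position index as seen by the rotary embedding after compression."""
    return pos / compress_pos_emb

# The last position of a full 16384-token context:
# 16383 / 8 = 2047.875, still inside the base model's 0..2047 range.
last = scaled_position(16384 - 1, 8)
```

Setting max_seq_len above 16384 with a factor of 8 would push scaled positions past what the LoRA was fine-tuned for, which is why 16384 is the ceiling.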

The base LLaMA model and the SuperHOT LoRA were merged with https://github.com/tloen/alpaca-lora

Base LLaMA 13B: https://huggingface.co/huggyllama/llama-13b

SuperHOT 13B 16k no-rlhf-test LoRA: https://huggingface.co/kaiokendev/superhot-13b-16k-no-rlhf-test

BASE_MODEL=huggyllama_llama-13b LORA=kaiokendev_superhot-13b-16k-no-rlhf-test python export_hf_checkpoint.py
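Numerically, the merge step folds the LoRA's low-rank update into each adapted base weight as W' = W + (alpha / r) * B @ A. A toy sketch with made-up shapes and values (the export script applies this per weight matrix; d, r, and alpha here are illustrative):

```python
import numpy as np

# Toy LoRA merge: the adapter stores a rank-r update (A, B) which the
# merge bakes into the frozen base weight W.
rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16           # hidden size, LoRA rank, LoRA alpha (illustrative)
W = rng.standard_normal((d, d))  # frozen base weight
A = rng.standard_normal((r, d))  # LoRA down-projection
B = rng.standard_normal((d, r))  # LoRA up-projection

W_merged = W + (alpha / r) * (B @ A)
```

After merging, the adapter is no longer needed at inference time; the result is a plain checkpoint that can be quantized like any other.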

Quantized with AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ

python quant_with_alpaca.py --pretrained_model_dir superhot-13b-16k-safetensors --quantized_model_dir superhot-13b-16k-4bit-32g-safetensors --bits 4 --group_size 32 --desc_act --num_samples 256 --save_and_reload
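The --bits 4 --group_size 32 flags mean each run of 32 weights shares its own 4-bit quantization grid. A minimal round-trip sketch of group-wise min-max quantization (GPTQ additionally applies Hessian-based error correction, which this deliberately omits):

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int = 4, group_size: int = 32) -> np.ndarray:
    """Round-trip each group of `group_size` weights through a
    `bits`-bit asymmetric min-max grid. Illustrative only: real GPTQ
    also corrects rounding error using second-order information."""
    qmax = (1 << bits) - 1  # 15 levels above zero for 4-bit
    out = np.empty_like(w)
    for i in range(0, w.size, group_size):
        g = w[i:i + group_size]
        lo, hi = g.min(), g.max()
        scale = (hi - lo) / qmax if hi > lo else 1.0
        q = np.clip(np.round((g - lo) / scale), 0, qmax)  # integer codes
        out[i:i + group_size] = q * scale + lo            # dequantize
    return out

weights = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
restored = quantize_dequantize(weights)
```

Smaller groups (32 vs. the more common 128) mean more scale/zero parameters stored but a tighter grid per group, trading a little extra model size for accuracy.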

Perplexity:

CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-13b-16k-4bit-32g-safetensors \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 16384 \
         -cpe 8 \
         -ppl_cn 40 \
         -ppl_cs 16384 \
         -ppl_ct 16384
 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 16384 -> 16384
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-13b-16k-4bit-32g-safetensors/tokenizer.model
 -- Model config: /workspace/models/superhot-13b-16k-4bit-32g-safetensors/config.json
 -- Model: /workspace/models/superhot-13b-16k-4bit-32g-safetensors/4bit-32g.safetensors
 -- Sequence length: 16384
 -- RoPE compression factor: 8.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 2.50 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 32
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 7,952.62 MB
 -- Loading dataset...
 -- Testing 21 chunks...
 ** Perplexity: 6.8223
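The perplexity figure above is the exponential of the mean per-token negative log-likelihood over the tested chunks. A minimal sketch of the computation (the per-token NLLs here are synthetic, not taken from the benchmark):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: if every token were predicted with probability
# 1/6.8223, the perplexity would come out to exactly 6.8223.
uniform_nll = math.log(6.8223)
```

Lower is better; 6.82 on wikitext2 at a 16k sequence length indicates the 4-bit quantization and position compression leave the model's language modelling largely intact.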