superhot-13b-8k-4bit--1g-safetensors

Note: the maximum sequence length (max_seq_len) must be set to 8192 (or lower) and the compression factor (compress_pos_emb) to 4.
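For context, compress_pos_emb applies linear RoPE position interpolation: token positions are divided by the factor, so an 8192-token context is squeezed into the 2048 positions the base model was trained on. A minimal sketch of the arithmetic (scaled_position is illustrative, not an API; the constants mirror the settings above):

```python
# Linear RoPE interpolation: divide position indices by the compression
# factor so 8192 positions fit inside the original 2048-token range.
MAX_SEQ_LEN = 8192
COMPRESS_POS_EMB = 4

def scaled_position(pos: int, factor: int = COMPRESS_POS_EMB) -> float:
    """Map a token position into the pretrained position range."""
    return pos / factor

assert scaled_position(MAX_SEQ_LEN - 1) < 2048  # stays inside the trained range
```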

Merged the base LLaMA model with the LoRA using export_hf_checkpoint.py from alpaca-lora: https://github.com/tloen/alpaca-lora

Base LLaMA 13B: https://huggingface.co/huggyllama/llama-13b

SuperHOT 13B 8k no-rlhf-test LoRA: https://huggingface.co/kaiokendev/superhot-13b-8k-no-rlhf-test

BASE_MODEL=huggyllama_llama-13b LORA=kaiokendev_superhot-13b-8k-no-rlhf-test python export_hf_checkpoint.py
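That script folds the LoRA deltas into the base weights and saves a standalone checkpoint. A roughly equivalent sketch using peft directly (a sketch under the repo IDs linked above, not the exact export_hf_checkpoint.py logic):

```python
# Hypothetical merge equivalent: load base LLaMA, apply the LoRA, and
# fold the adapter deltas into the base weights before saving.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "kaiokendev/superhot-13b-8k-no-rlhf-test")
model = model.merge_and_unload()  # merge LoRA into the base weights

model.save_pretrained("superhot-13b-8k-safetensors", safe_serialization=True)
AutoTokenizer.from_pretrained("huggyllama/llama-13b").save_pretrained(
    "superhot-13b-8k-safetensors"
)
```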

Quantized with AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ

python quant_with_alpaca.py --pretrained_model_dir superhot-13b-8k-safetensors --quantized_model_dir superhot-13b-8k-no-rlhf-test-GPTQ --bits 4 --group_size -1 --desc_act --num_samples 256 --save_and_reload
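The same flags map onto AutoGPTQ's Python API. A hedged sketch of the quantization step (the calibration text below is a placeholder; the real script samples 256 alpaca prompts, per --num_samples 256):

```python
# Rough AutoGPTQ equivalent of the command above.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_dir = "superhot-13b-8k-safetensors"
quantized_dir = "superhot-13b-8k-no-rlhf-test-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)
# Placeholder calibration data; the real run uses 256 alpaca samples.
examples = [tokenizer("Below is an instruction that describes a task.")]

quantize_config = BaseQuantizeConfig(
    bits=4,         # --bits 4
    group_size=-1,  # --group_size -1 (per-column, no grouping)
    desc_act=True,  # --desc_act (activation-order quantization)
)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)
```

The quantized checkpoint can then be loaded back with AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0", use_safetensors=True).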

Perplexity, measured with ExLlama's test_benchmark_inference.py:

CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
         -d /workspace/models/superhot-13b-8k-no-rlhf-test-GPTQ \
         -ppl \
         -ppl_ds datasets/wikitext2.txt \
         -l 8192 \
         -cpe 4 \
         -ppl_cn 40 \
         -ppl_cs 8192 \
         -ppl_ct 8192
Output:

 -- Perplexity:
 -- - Dataset: datasets/wikitext2.txt
 -- - Chunks: 40
 -- - Chunk size: 8192 -> 8192
 -- - Chunk overlap: 0
 -- - Min. chunk size: 50
 -- - Key: text
 -- Tokenizer: /workspace/models/superhot-13b-8k-no-rlhf-test-GPTQ/tokenizer.model
 -- Model config: /workspace/models/superhot-13b-8k-no-rlhf-test-GPTQ/config.json
 -- Model: /workspace/models/superhot-13b-8k-no-rlhf-test-GPTQ/4bit.safetensors
 -- Sequence length: 8192
 -- RoPE compression factor: 4.0
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: ['perplexity']
 ** Time, Load model: 3.58 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
 !! Model has empty group index (discarded)
 ** VRAM, Model: [cuda:0] 6,754.74 MB
 -- Loading dataset...
 -- Testing 40 chunks....
 ** Perplexity: 5.7766
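For reference, chunked perplexity of this kind is just the exponentiated mean per-token negative log-likelihood over fixed-size, non-overlapping chunks. A generic sketch, not ExLlama's implementation (model and tokenizer are assumed to be an already-loaded transformers causal LM and its tokenizer; the defaults mirror the run above):

```python
import math
import torch

def perplexity(model, tokenizer, text, chunk_size=8192, max_chunks=40):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    nll, count = 0.0, 0
    # Non-overlapping chunks, capped at max_chunks (-ppl_cn 40).
    for i in range(0, min(len(ids), chunk_size * max_chunks), chunk_size):
        chunk = ids[i : i + chunk_size].unsqueeze(0).to(model.device)
        if chunk.shape[1] < 50:  # skip below the minimum chunk size
            continue
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL per token
        nll += loss.item() * (chunk.shape[1] - 1)
        count += chunk.shape[1] - 1
    return math.exp(nll / count)
```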