tmpupload/superhot-30b-8k-no-rlhf-test-GPTQ

	---
	license: other
	---
	# superhot-30b-8k-4bit--1g-safetensors

	Note: Set the maximum sequence length (`max_seq_len`) to 8192 or lower and the compression factor (`compress_pos_emb`) to 4.
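
These two settings go together: the original LLaMA models were trained with a 2048-token context, and a compression factor of 4 scales position indices down so that 8192 positions fit into that range (8192 / 4 = 2048). The sketch below illustrates linear position compression for rotary embeddings. It is a simplified illustration, not any loader's actual code; the function name `rope_angles` and its default arguments are assumptions chosen for the example.

```python
import math

def rope_angles(position, head_dim=128, base=10000.0, compress_pos_emb=1.0):
    """Rotary angles for one position, with linear position compression (illustrative)."""
    scaled = position / compress_pos_emb  # compress the position index
    return [scaled * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

TRAINED_CONTEXT = 2048  # context length of the original LLaMA models
max_seq_len = 8192
compress_pos_emb = max_seq_len / TRAINED_CONTEXT  # 4.0

# The last position of an 8192-token sequence lands inside the trained range.
last = max_seq_len - 1
effective_position = last / compress_pos_emb
print(compress_pos_emb, effective_position)  # 4.0 2047.75
```

With a smaller `max_seq_len` the same factor of 4 still applies, since it matches how the LoRA was trained rather than the length of a particular prompt.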

	The base LLaMA model and the LoRA were merged with the export script from alpaca-lora:
	https://github.com/tloen/alpaca-lora

	Base LLaMA 30B:
	https://huggingface.co/huggyllama/llama-30b

	SuperHOT 30B 8k no-rlhf-test LoRA:
	https://huggingface.co/kaiokendev/superhot-30b-8k-no-rlhf-test

	``` sh
	BASE_MODEL=huggyllama_llama-30b LORA=kaiokendev_superhot-30b-8k-no-rlhf-test python export_hf_checkpoint.py
	```
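
Conceptually, merging folds each LoRA update into the corresponding base weight, so the exported checkpoint no longer needs the adapter at load time. A minimal sketch of that arithmetic on tiny matrices follows. The helper names and values are hypothetical, and this is not the export script's code, which works on full model tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Hypothetical 2x2 base weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]     # shape (rank, in_features)
lora_b = [[0.5], [0.25]]  # shape (out_features, rank)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.5, 2.0]]
```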

	Quantized with AutoGPTQ:
	https://github.com/PanQiWei/AutoGPTQ

	``` sh
	python quant_with_alpaca.py --pretrained_model_dir superhot-30b-8k-safetensors --quantized_model_dir superhot-30b-8k-4bit--1g-safetensors --bits 4 --group_size -1 --desc_act --num_samples 256 --save_and_reload
	```
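
The `--bits 4 --group_size -1` options are what the `4bit--1g` part of the model name refers to. With a group size of -1, each weight row shares a single set of quantization parameters instead of one set per block of weights. The sketch below shows that difference using plain round-to-nearest quantization. This is a simplification: GPTQ additionally compensates for rounding error, and the function names here are hypothetical.

```python
def quantize_group(values, bits=4):
    """Round-to-nearest quantization of one group; returns dequantized values."""
    levels = 2 ** bits - 1             # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    return [round((v - lo) / scale) * scale + lo for v in values]

def quantize_row(row, bits=4, group_size=-1):
    """group_size=-1 treats the whole row as a single group."""
    size = len(row) if group_size == -1 else group_size
    out = []
    for start in range(0, len(row), size):
        out.extend(quantize_group(row[start:start + size], bits))
    return out

def total_error(a, b):
    """Sum of absolute differences between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

# A row whose two halves have very different value ranges.
row = [0.01 * i for i in range(8)] + [1.0 * i for i in range(8)]
whole_row = quantize_row(row, group_size=-1)
per_group = quantize_row(row, group_size=8)
print(total_error(row, whole_row), total_error(row, per_group))
```

In this toy case the single-group setting loses the small values entirely, while storing fewer scales. That is the general trade-off between group size and file size, not a measurement of this model.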

	Perplexity on wikitext2 (sequence length 8192, compression factor 4):

	``` sh
	CUDA_VISIBLE_DEVICES=0 python test_benchmark_inference.py \
	-d /workspace/models/superhot-30b-8k-4bit--1g-safetensors \
	-ppl \
	-ppl_ds datasets/wikitext2.txt \
	-l 8192 \
	-cpe 4 \
	-ppl_cn 40 \
	-ppl_cs 8192 \
	-ppl_ct 8192
	```

	Output:

	```
	-- Perplexity:
	-- - Dataset: datasets/wikitext2.txt
	-- - Chunks: 40
	-- - Chunk size: 8192 -> 8192
	-- - Chunk overlap: 0
	-- - Min. chunk size: 50
	-- - Key: text
	-- Tokenizer: /workspace/models/superhot-30b-8k-4bit--1g-safetensors/tokenizer.model
	-- Model config: /workspace/models/superhot-30b-8k-4bit--1g-safetensors/config.json
	-- Model: /workspace/models/superhot-30b-8k-4bit--1g-safetensors/4bit.safetensors
	-- Sequence length: 8192
	-- RoPE compression factor: 4.0
	-- Tuning:
	-- --matmul_recons_thd: 8
	-- --fused_mlp_thd: 2
	-- --sdp_thd: 8
	-- Options: ['perplexity']
	** Time, Load model: 3.34 seconds
	** Time, Load tokenizer: 0.01 seconds
	-- Groupsize (inferred): None
	-- Act-order (inferred): no
	!! Model has empty group index (discarded)
	** VRAM, Model: [cuda:0] 16,447.66 MB
	-- Loading dataset...
	-- Testing 40 chunks....
	** Perplexity: 4.9434
	```
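
For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows that calculation on toy values and works backwards from the reported figure. The function name and the toy log-probabilities are hypothetical, and this is not the benchmark script's code.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example: a model that gives every token probability 1/5 has perplexity 5.
uniform = [math.log(1 / 5)] * 12
print(round(perplexity(uniform), 4))

# Working backwards, the reported 4.9434 implies this average loss in nats per token.
reported = 4.9434
avg_nll = math.log(reported)
print(round(avg_nll, 3))
```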