---
license: other
---

# Model Card for llama-13b-hf-35q_4bit-128g_WVU
|
|
|
## Model Description
|
|
|
`llama-13b-hf-35q_4bit-128g_WVU` is a model based on the Llama architecture with 13 billion parameters.
This model uses a mixed quantization scheme in which the first 35 decoder layers have been quantized with the [`gptq`](https://github.com/qwopqwop200/GPTQ-for-LLaMa) method, at 4-bit precision with a group size of 128.
The last 5 decoder layers (1/8 of the 40 decoder layers) and the `lm_head` have then been fine-tuned on the [wizard_vicuna_70k_unfiltered dataset](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered) for 1 epoch.
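For illustration, the sketch below sets up this kind of split with the standard `transformers` `LlamaForCausalLM`: everything is frozen except the last 5 decoder layers and the `lm_head`. This is a minimal sketch, not the exact training code; the checkpoint path is a placeholder, and the quantized layers themselves come from the custom code linked under "Using the Model" below.

```python
import torch
from transformers import LlamaForCausalLM

# Load a 13B Llama checkpoint (path is a placeholder); LLaMA-13B has 40
# decoder layers, so "the last 5 layers" below are layers 35-39.
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-13b-hf", torch_dtype=torch.float16
)

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze the last 5 decoder layers and the lm_head, mirroring the
# split described above (layers 0-34 quantized and frozen, layers 35-39 and
# lm_head trainable).
for layer in model.model.layers[-5:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_trainable / 1e9:.2f}B")
```

Only the last few decoder layers and the output head receive gradients, which is what keeps the training memory footprint low.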
|
|
|
## Note
|
|
|
Quantization effectively reduces memory usage; however, it introduces small errors into the parameters.
Additionally, fine-tuning only the last few layers lowers the memory requirements for training, but could lead to minor performance degradation.
|
|
|
Several alternatives exist for fine-tuning and quantizing Llama models. The specific method used here, quantizing most of the layers and then fine-tuning the last few, is designed to compensate for the errors introduced during quantization (which can sometimes result in unexpected answers): the last few layers are fine-tuned with respect to both the quantization error and the dataset.
|
|
|
It is worth mentioning that other methods may yield superior performance. For instance:

1. Fine-tuning the entire model for `X` epochs
2. Quantizing the first `K` layers
3. Fine-tuning the remaining layers for `Y` epochs
|
|
|
Nonetheless, as fine-tuning the entire model requires considerable resources (for example, 4 GPUs with 80GB of VRAM are required for the 7B LLaMa), this model omits the first step of the method described above, and it still works.
|
|
|
## Using the Model
|
|
|
To load the model, a custom `LlamaForCausalLM` class is required.
You can find the quantized Llama code [here](https://github.com/LearnItAnyway/quantized_llama).
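As a starting point, a minimal loading sketch is shown below. The import of the custom class is hypothetical (the exact module path and loading call depend on the quantized_llama code above), and the repository id is assumed from this model card's name, so treat this only as a template.

```python
import torch
from transformers import AutoTokenizer

# Hypothetical import: the custom LlamaForCausalLM must come from the
# LearnItAnyway/quantized_llama code; the exact module path may differ.
from quantized_llama import LlamaForCausalLM

# Assumed Hugging Face repository id for this model card.
model_id = "LearnItAnyway/llama-13b-hf-35q_4bit-128g_WVU"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

prompt = "Briefly explain what 4-bit quantization does to a language model."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```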
|
|
|
## References
|
|
|
1. Meta - LLaMA
2. [WizardLM](https://github.com/nlpxucan/WizardLM)
3. [GPTQ for LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa)
4. [Wizard Vicuna Unfiltered Dataset](https://huggingface.co/datasets/ehartford/wizard_vicuna_70k_unfiltered)
5. Various other great works, research efforts, and projects that are not listed here.
|
|