shahidul034/KUETLLM_Zephyr7b_gguf

KUETLLM_zephyr7b_gguf

KUETLLM is a fine-tune of zephyr-7b-beta, trained on a dataset of prompts and answers about Khulna University of Engineering and Technology (KUET). The base model was loaded with 8-bit quantization using bitsandbytes, and LoRA was used to train an adapter on top of it. The adapter was later merged into the unquantized base model. The fine-tuned unquantized model can be found here. The merged model was then converted to GGUF format and quantized using llama.cpp.
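The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the script used for this model: the function name `merge_lora_adapter`, the directory arguments, the base model ID `HuggingFaceH4/zephyr-7b-beta`, and the choice of half precision are all assumptions.

```python
def merge_lora_adapter(adapter_dir: str, output_dir: str,
                       base_model: str = "HuggingFaceH4/zephyr-7b-beta") -> None:
    """Merge a trained LoRA adapter into the unquantized base model and save it.

    All argument names and the default base model ID are illustrative assumptions.
    """
    # Imported lazily so that defining this function does not require the libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model unquantized (fp16 is an assumption), not in 8-bit,
    # so the adapter weights can be folded into full-precision matrices.
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

    # Attach the adapter, then fold its low-rank updates into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

    # Save model and tokenizer together so the directory can be converted to GGUF.
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
```

The saved directory is a regular Hugging Face checkpoint, which is the kind of input the llama.cpp conversion tooling expects.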

Below are the training configurations for the fine-tuning process:

LoraConfig:
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"

TrainingArguments:
per_device_train_batch_size=12,
gradient_accumulation_steps=1,
optim='paged_adamw_8bit',
learning_rate=5e-06,
fp16=True,
logging_steps=10,
num_train_epochs=1,
output_dir="zephyr_lora_output",
remove_unused_columns=False,
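As a rough sketch, the settings listed above could be passed to `peft.LoraConfig` and `transformers.TrainingArguments` like this. The dictionary names, the helper `build_configs`, and the two derived constants are illustrative and not part of the original training script. Treating `output_dir` as a literal directory name is also an assumption.

```python
# Values copied from the settings listed above.
LORA_KWARGS = dict(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

TRAINING_KWARGS = dict(
    per_device_train_batch_size=12,
    gradient_accumulation_steps=1,
    optim="paged_adamw_8bit",
    learning_rate=5e-06,
    fp16=True,
    logging_steps=10,
    num_train_epochs=1,
    output_dir="zephyr_lora_output",  # assumption: a literal directory name
    remove_unused_columns=False,
)

# Standard LoRA scales the adapter update by lora_alpha / r.
LORA_SCALING = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]

# Sequences consumed per optimizer step on each device.
EFFECTIVE_BATCH_PER_DEVICE = (TRAINING_KWARGS["per_device_train_batch_size"]
                              * TRAINING_KWARGS["gradient_accumulation_steps"])


def build_configs():
    """Instantiate the config objects (requires peft and transformers).

    Note: fp16=True generally needs a GPU to be available at this point.
    """
    from peft import LoraConfig
    from transformers import TrainingArguments
    return LoraConfig(**LORA_KWARGS), TrainingArguments(**TRAINING_KWARGS)
```

With `r` equal to `lora_alpha`, the adapter update is applied with a scaling factor of 1. With no gradient accumulation, each optimizer step sees 12 sequences per device.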

Llama.cpp quantization parameter = q4_k_m

Running inference with the llama.cpp command line:

Download the GGUF file manually or with huggingface_hub, then set up llama.cpp. Make sure you are using llama.cpp from commit d0cee0d or later.
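One possible way to fetch the file with huggingface_hub is sketched here. The file name is an assumption, taken from the `-m` argument of the llama.cpp command in this README, so check the repository's file list if the download fails. The helper name `download_gguf` is illustrative.

```python
REPO_ID = "shahidul034/KUETLLM_Zephyr7b_gguf"
# Assumption: the file in the repository is named as in the -m argument.
GGUF_FILENAME = "zephyr_q4km_kuetllm.gguf"


def download_gguf(local_dir: str = ".") -> str:
    """Download the GGUF file and return its local path."""
    # Imported lazily so the constants above are usable without the library.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=GGUF_FILENAME,
                           local_dir=local_dir)
```

Calling `download_gguf()` from the llama.cpp directory should place the file where the command expects it.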

./main -ngl 35 -m zephyr_q4km_kuetllm.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|system|>\nYou are a KUET authority managed chatbot, help users by answering their queries about KUET.\n<|user|>\nTell me about KUET.\n<|assistant|>\n"

Change -ngl 35 to the number of layers to offload to the GPU. Remove it if you don't have GPU acceleration.

Change -c 2048 to the desired sequence length. For extended sequence models (e.g. 8K, 16K, 32K), the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more memory and compute, so you may need to reduce this value.
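To get a feel for how memory grows with the context size, the key/value cache can be estimated with simple arithmetic. The architecture numbers below are assumptions based on the Mistral-7B design that zephyr-7b-beta derives from (32 layers, 8 key/value heads, head dimension 128), along with a 16-bit cache. None of these values come from this README, and actual llama.cpp memory use includes other buffers as well.

```python
# Assumed Mistral-7B-style architecture values (not stated in this README).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: 16-bit cache entries


def kv_cache_bytes(context_length: int) -> int:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_length


for ctx in (2048, 8192, 32768):
    print(f"-c {ctx}: ~{kv_cache_bytes(ctx) / 2**20:.0f} MiB of KV cache")
```

Under these assumptions the cache grows linearly with -c, from about 256 MiB at 2048 tokens to about 4096 MiB at 32768 tokens.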

If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins.
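For scripted, single-turn use rather than interactive chat, the command shown earlier can also be assembled from Python. This is a sketch only: the helper names `build_prompt` and `build_command` are illustrative, and the prompt layout simply mirrors the one in the command above.

```python
DEFAULT_SYSTEM = ("You are a KUET authority managed chatbot, "
                  "help users by answering their queries about KUET.")


def build_prompt(user_message: str, system_message: str = DEFAULT_SYSTEM) -> str:
    """Format a single-turn prompt in the layout used by the command above."""
    return (f"<|system|>\n{system_message}\n"
            f"<|user|>\n{user_message}\n"
            f"<|assistant|>\n")


def build_command(prompt: str, model_path: str = "zephyr_q4km_kuetllm.gguf",
                  gpu_layers: int = 35, context: int = 2048) -> list:
    """Return the llama.cpp argument list, using the same flags as the README."""
    return ["./main", "-ngl", str(gpu_layers), "-m", model_path, "--color",
            "-c", str(context), "--temp", "0.7", "--repeat_penalty", "1.1",
            "-n", "-1", "-p", prompt]
```

The resulting list can be passed to `subprocess.run(build_command(build_prompt("Tell me about KUET.")), check=True)` from the llama.cpp directory. Passing the prompt as a list element means the newlines reach llama.cpp as real line breaks rather than as literal `\n` sequences.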

For other parameters and how to use them, please refer to the llama.cpp documentation.