---
license: apache-2.0
pipeline_tag: text-generation
---
### KUETLLM_zyphyr7b_gguf
KUETLLM is a [zephyr7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) finetune trained on a dataset of prompts and answers about Khulna University of Engineering and Technology (KUET).
The base model was loaded in 8-bit quantization using [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), and [LoRA](https://huggingface.co/docs/diffusers/main/en/training/lora) was used to finetune an adapter, which was later merged with the unquantized base model.
The finetuned unquantized model can be found [here](https://huggingface.co/shahidul034/KUETLLM_zephyr_base). It was then quantized and converted to GGUF format using [llama.cpp](https://github.com/ggerganov/llama.cpp).
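As a minimal sketch of the adapter-merge step described above (assuming a peft-trained adapter; the adapter path and output directory are illustrative, and only the base model id comes from this card):

```python
# Sketch: merge the trained LoRA adapter back into the unquantized base model.
# "kuetllm-lora-adapter" is an illustrative local path, not a published repo.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "kuetllm-lora-adapter").merge_and_unload()

merged.save_pretrained("KUETLLM_zephyr_base")  # merged full-precision model, ready for GGUF conversion
AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta").save_pretrained("KUETLLM_zephyr_base")
```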
## Below are the training configurations for the finetuning process:
```
LoraConfig:
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
```
```
TrainingArguments:
per_device_train_batch_size=12,
gradient_accumulation_steps=1,
optim='paged_adamw_8bit',
learning_rate=5e-06,
fp16=True,
logging_steps=10,
num_train_epochs=1,
output_dir=zephyr_lora_output,
remove_unused_columns=False,
```
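For reference, a sketch of how these settings would plug into peft and transformers; this is not the original training script, and the dataset pipeline is omitted:

```python
# Sketch: the configs above as peft/transformers objects (not the original script).
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 8-bit loading via bitsandbytes, as described above.
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", load_in_8bit=True, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    per_device_train_batch_size=12,
    gradient_accumulation_steps=1,
    optim="paged_adamw_8bit",
    learning_rate=5e-06,
    fp16=True,
    logging_steps=10,
    num_train_epochs=1,
    output_dir="zephyr_lora_output",  # directory name assumed from the listing above
    remove_unused_columns=False,
)
```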
```
Llama.cpp quantization parameter = q4_k_m
```
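The conversion and quantization step can be sketched with llama.cpp's tools roughly as follows; the file names are illustrative, and the exact script name and flags depend on the llama.cpp revision:

```shell
# Sketch: convert the merged HF checkpoint to GGUF, then quantize to q4_k_m.
python convert.py KUETLLM_zephyr_base --outfile kuetllm-f16.gguf
./quantize kuetllm-f16.gguf zephyr_q4km_kuetllm.gguf q4_k_m
```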
## Inferencing using the llama.cpp command line
Download the GGUF file manually or with `huggingface_hub` (a download sketch follows these steps), then set up llama.cpp.
Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
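For the download step, a minimal sketch using `huggingface_hub`; the repo id is assumed from this model page, so verify it and the filename against the repository's file list:

```python
# Sketch: fetch the GGUF file; repo_id and filename are assumed, not verified.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="arbitropy/KUETLLM_zyphyr7b_gguf",
    filename="zephyr_q4km_kuetllm.gguf",
)
print(gguf_path)  # pass this path to llama.cpp via -m
```

The `./main` invocation below then runs inference against the downloaded file: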
```shell
./main -ngl 35 -m zephyr_q4km_kuetllm.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|system|>\nYou are a KUET authority managed chatbot, help users by answering their queries about KUET.\n<|user|>\nTell me about KUET.\n<|assistant|>\n"
```
Change `-ngl 35` to the number of layers to offload to the GPU. Remove it if you don't have GPU acceleration.
47
+
48
+ Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
49
+
50
+ If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`
For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).