---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: llama-3-8b
---

# Uploaded Model

- **Developed by:** AlberBshara
- **License:** apache-2.0
- **Finetuned from model:** llama-3-8b-bnb-4bit

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.

I fine-tuned Llama 3 8B to perform the matching task in my Scholara virtual assistant. It matches the given student information against the provided scholarship list (which comes from my vector DB and my AI web agent), then shows the student the most suitable scholarships based on their information and preferences.

- Context window: 4k tokens

## Example Usage

The following example demonstrates how to use the model. Inference requires at least one NVIDIA L4 GPU.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from typing import Tuple
import torch


class ScholaraMatcher:
    def __init__(self,
                 load_in_4bit: bool = True,
                 load_cpu_mem_usage: bool = True,
                 hf_model_path: str = "AlberBshara/scholara_matching",
                 k: int = 2):
        """
        Args:
            load_in_4bit (bool): Use 4-bit quantization. Defaults to True.
            load_cpu_mem_usage (bool): Reduce CPU memory usage. Defaults to True.
            hf_model_path (str): The path of your model on the Hugging Face Hub, like "your-user-name/model-name".
            k (int): The number of matched scholarships. Preferably 2 <= k <= 4.
        """
        assert torch.cuda.is_available(), "CUDA is not available. An NVIDIA GPU is required."
        assert any("L4" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())), \
            "An NVIDIA L4 GPU is required to initialize this class."

        # Specify the quantization config.
        self._bnb_config = BitsAndBytesConfig(load_in_4bit=load_in_4bit)

        # Load the model with the quantization config.
        self._model = AutoModelForCausalLM.from_pretrained(
            hf_model_path,
            low_cpu_mem_usage=load_cpu_mem_usage,
            quantization_config=self._bnb_config,
        )

        # Load the tokenizer.
        self._tokenizer = AutoTokenizer.from_pretrained(hf_model_path)

        self._hf_model_path = hf_model_path
        self._instruction = (
            f"Based on the student details, select the best {k} scholarships for them "
            "only from the following given scholarships"
        )
        self._EOS_TOKEN_ID = self._tokenizer.eos_token_id

        self._alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}
"""

    def invoke(self, student_info: str, scholarships: str) -> Tuple:
        if not student_info.strip():
            raise ValueError("student_info cannot be empty or None")
        if not scholarships.strip():
            raise ValueError("scholarships cannot be empty or None")

        # Build the prompt and tokenize it.
        input_text = f"student details: \n [{student_info}]. \n scholarships list: \n {scholarships}"
        inputs = self._tokenizer(
            [
                self._alpaca_prompt.format(
                    self._instruction,  # instruction
                    input_text,         # input
                    "",                 # output - leave this blank for generation.
                )
            ],
            return_tensors="pt",
        ).to("cuda")

        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        # Generate the response, passing the attention mask explicitly.
        output_ids = self._model.generate(
            input_ids,
            attention_mask=attention_mask,
            pad_token_id=self._EOS_TOKEN_ID,
        )
        output_text = self._tokenizer.decode(output_ids[0], skip_special_tokens=True)

        return output_text, output_ids, attention_mask, input_ids

    def extract_answer(self, output: torch.Tensor) -> str:
        """
        Returns the required answer after stripping the instruction and inputs.
""" decoded_outputs = self._tokenizer.batch_decode(output) response_text = decoded_outputs[0].split("### Response:")[1].strip() return response_text