---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: llama-3-8b
---

# Uploaded Model

- **Developed by:** AlberBshara
- **License:** apache-2.0
- **Finetuned from model:** llama-3-8b-bnb-4bit

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.

I fine-tuned Llama 3 8B to perform the matching task in my Scholara virtual assistant. It matches the given student information against the provided scholarship list (which comes from my vector DB and my AI web agent), then shows the student the most suitable scholarships based on their information and preferences.

- Context window: 4k tokens

## Example Usage

The following example demonstrates how to use the model. Inference requires at least one NVIDIA L4 GPU.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from typing import Tuple
import torch


class ScholaraMatcher:
    def __init__(self,
                 load_in_4bit: bool = True,
                 load_cpu_mem_usage: bool = True,
                 hf_model_path: str = "AlberBshara/scholara_matching",
                 k: int = 2):
        """
        Args:
            load_in_4bit (bool): Use 4-bit quantization. Defaults to True.
            load_cpu_mem_usage (bool): Reduce CPU memory usage. Defaults to True.
            hf_model_path (str): The path of your model on the Hugging Face Hub, like "your-user-name/model-name".
            k (int): The number of matched scholarships. Preferably 2 <= k <= 4.
        """
        assert torch.cuda.is_available(), "CUDA is not available. An NVIDIA GPU is required."
        assert any("L4" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())), \
            "An NVIDIA L4 GPU is required to initialize this class."

        # Specify the quantization config.
        self._bnb_config = BitsAndBytesConfig(load_in_4bit=load_in_4bit)

        # Load the model with the quantization config.
        self._model = AutoModelForCausalLM.from_pretrained(
            hf_model_path,
            low_cpu_mem_usage=load_cpu_mem_usage,
            quantization_config=self._bnb_config,
        )

        # Load the tokenizer.
        self._tokenizer = AutoTokenizer.from_pretrained(hf_model_path)

        self._hf_model_path = hf_model_path
        self._instruction = (
            f"Based on the student details, select the best {k} scholarships for them "
            "only from the following given scholarships"
        )
        self._EOS_TOKEN_ID = self._tokenizer.eos_token_id

        self._alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}
"""

    def invoke(self, student_info: str, scholarships: str) -> Tuple:
        if not student_info.strip():
            raise ValueError("student_info cannot be empty or None")
        if not scholarships.strip():
            raise ValueError("scholarships cannot be empty or None")

        # Build the prompt and tokenize it.
        input_text = f"student details: \n [{student_info}]. \n scholarships list: \n {scholarships}"
        inputs = self._tokenizer(
            [
                self._alpaca_prompt.format(
                    self._instruction,  # instruction
                    input_text,         # input
                    "",                 # output - leave this blank for generation.
                )
            ],
            return_tensors="pt",
        ).to("cuda")

        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        # Generate the response, passing the attention mask explicitly.
        output_ids = self._model.generate(
            input_ids,
            attention_mask=attention_mask,
            pad_token_id=self._EOS_TOKEN_ID,
        )
        output_text = self._tokenizer.decode(output_ids[0], skip_special_tokens=True)

        return output_text, output_ids, attention_mask, input_ids

    def extract_answer(self, output: torch.Tensor) -> str:
        """
        Returns the required answer after stripping the instruction and inputs.
""" decoded_outputs = self._tokenizer.batch_decode(output) response_text = decoded_outputs[0].split("### Response:")[1].strip() return response_text