---
base_model: unsloth/mistral-7b-v0.3-bnb-4bit
language:
  - en
license: apache-2.0
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - mistral
  - trl
---

# Mental Health Chatbot using Fine-Tuned 7B Mistral Model

## Inference

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ImranzamanML/7B_finetuned_Mistral",
    max_seq_length = 5020,
    dtype = None,          # None lets Unsloth auto-detect the best dtype for the GPU
    load_in_4bit = True,   # load the 4-bit quantized weights
)
```

Prompt template used to format the model's input and expected response:

```python
data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.

### Input:
{}

### Response:
{}"""
```
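As a quick illustration (plain Python, no model required), here is how the template gets filled. The example input is hypothetical, and the template is abbreviated for brevity; at inference time the response slot is left empty for the model to complete:

```python
# Abbreviated version of data_prompt, for illustration only
data_prompt = """Analyze the provided text from a mental health perspective.

### Input:
{}

### Response:
{}"""

# At inference time the response slot is left empty:
filled = data_prompt.format("I feel anxious before exams.", "")
print(filled)
```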

```python
EOS_TOKEN = tokenizer.eos_token

def formatting_prompt(examples):
    inputs  = examples["Context"]
    outputs = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        # Fill the template and append EOS so the model learns where to stop
        text = data_prompt.format(input_, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
```
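A self-contained sanity check of `formatting_prompt` on a toy batch (the `Context`/`Response` pair is made up, and `EOS_TOKEN` is stubbed to Mistral's `</s>`; in practice it comes from `tokenizer.eos_token`):

```python
EOS_TOKEN = "</s>"  # stand-in; in practice use tokenizer.eos_token

# Abbreviated template for the demo
data_prompt = """### Input:
{}

### Response:
{}"""

def formatting_prompt(examples):
    inputs  = examples["Context"]
    outputs = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        texts.append(data_prompt.format(input_, output) + EOS_TOKEN)
    return {"text": texts}

# Hypothetical one-row batch in the dataset's column format
batch = {"Context": ["I can't sleep lately."],
         "Response": ["It sounds like stress may be affecting your rest."]}
print(formatting_prompt(batch)["text"][0])
```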

Example input text to feed into the model:

```python
text = "I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"
```

Note: Let's use the fine-tuned model for inference to generate responses to mental health-related prompts.

A few key points:

- `FastLanguageModel.for_inference(model)` configures the model specifically for inference, optimizing its performance for generating responses.
- The input text is tokenized with the tokenizer, which converts it into a format the model can process. We use `data_prompt` to format the input text, leaving the response placeholder empty so the model fills it in. `return_tensors = "pt"` specifies that the output should be PyTorch tensors, which are then moved to the GPU with `.to("cuda")` for faster processing.
- `model.generate` produces a response from the tokenized inputs. The parameters `max_new_tokens = 5020` and `use_cache = True` let the model produce long, coherent responses efficiently by reusing the key/value cache from previous decoding steps.

```python
model = FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        data_prompt.format(
            # instruction / input text
            text,
            # answer left empty for the model to fill in
            "",
        )
    ],
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer = tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)
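The decoded answer still carries the EOS marker appended during training. A small post-processing helper (a sketch assuming Mistral's `</s>` EOS token; the `decoded` string below is a hypothetical example, not real model output) can strip it off along with surrounding whitespace:

```python
def clean_answer(decoded: str) -> str:
    # Keep only the text after the response header, drop the EOS marker
    answer = decoded.split("### Response:")[-1]
    return answer.replace("</s>", "").strip()

# Hypothetical decoded output, for illustration only
decoded = "### Input:\nsome text\n\n### Response:\nYou are not alone.</s>"
print(clean_answer(decoded))  # prints: You are not alone.
```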