---
base_model: unsloth/mistral-7b-v0.3-bnb-4bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- mistral
- trl
---
# Mental Health Chatbot using a Fine-Tuned Mistral 7B Model
## Inference
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ImranzamanML/7B_finetuned_Mistral",
    max_seq_length = 5020,
    dtype = None,
    load_in_4bit = True,
)
```
## Prompt template for model answers
```python
data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.
### Input:
{}
### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompt(examples):
    inputs = examples["Context"]
    outputs = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        # Fill the template and append the EOS token so generation stops cleanly
        text = data_prompt.format(input_, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
```
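To see what `formatting_prompt` produces, here is a minimal sketch run on a toy batch with made-up `Context`/`Response` values. `EOS_TOKEN` is stubbed as the string `"</s>"` here; in the real pipeline it comes from `tokenizer.eos_token`.

```python
# Stand-in for tokenizer.eos_token (Mistral uses "</s>")
EOS_TOKEN = "</s>"

data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.
### Input:
{}
### Response:
{}"""

def formatting_prompt(examples):
    inputs = examples["Context"]
    outputs = examples["Response"]
    texts = [data_prompt.format(i, o) + EOS_TOKEN for i, o in zip(inputs, outputs)]
    return {"text": texts}

# Toy batch shaped like the training dataset (hypothetical values)
batch = {
    "Context": ["I feel anxious before exams."],
    "Response": ["Exam anxiety is common; grounding techniques can help."],
}
formatted = formatting_prompt(batch)
print(formatted["text"][0])
```

Each formatted string is the full template with the input and answer filled in, terminated by the EOS token — the shape a `datasets.map(formatting_prompt, batched=True)` call would produce for training.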
## Feeding prompt text into the model
```python
text = "I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"
```
Now let's use the fine-tuned model to generate a response to this mental-health-related prompt. A few keys to note:

- `FastLanguageModel.for_inference(model)` configures the model specifically for inference, optimizing its performance for generating responses.
- The input text is tokenized with the tokenizer, which converts it into a format the model can process. `data_prompt` formats the input text, while the response placeholder is left empty so the model fills it in. `return_tensors = "pt"` returns PyTorch tensors, which are moved to the GPU with `.to("cuda")` for faster processing.
- `model.generate` produces the response from the tokenized inputs. `max_new_tokens = 5020` and `use_cache = True` let the model produce long, coherent responses efficiently by reusing cached key/value computations from previous decoding steps.
```python
model = FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        data_prompt.format(
            # input text
            text,
            # response left empty for the model to fill in
            "",
        )
    ],
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer = tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)
```
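The decoded string still contains the full prompt and the model's special tokens, so the split above keeps only the generated answer. A minimal post-processing sketch, run on a hypothetical decoded output (the string below is a mock, not real model output; `"<s>"`/`"</s>"` are Mistral's BOS/EOS tokens):

```python
# Mock decoded output: prompt echoed back, then the generated answer plus EOS
decoded = (
    "<s> Analyze the provided text from a mental health perspective. ...\n"
    "### Input:\n"
    "I barely sleep and I feel worthless.\n"
    "### Response:\n"
    "It sounds like you are carrying a heavy emotional load. ...</s>"
)

# Keep only the text after the response marker, then strip special tokens
answer = decoded.split("### Response:")[-1]
for token in ("<s>", "</s>"):
    answer = answer.replace(token, "")
answer = answer.strip()
print(answer)
```

Alternatively, `tokenizer.batch_decode(outputs, skip_special_tokens=True)` drops the special tokens during decoding, leaving only the prompt-marker split to do by hand.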