---
base_model: unsloth/mistral-7b-v0.3-bnb-4bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- mistral
- trl
---
# Mental Health Chatbot using a Fine-Tuned Mistral 7B Model
## Inference
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "ImranzamanML/7B_finetuned_Mistral",
    max_seq_length = 5020,
    dtype = None,
    load_in_4bit = True,
)
```
## Prompt template for model answers
```python
data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.
### Input:
{}
### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompt(examples):
    inputs = examples["Context"]
    outputs = examples["Response"]
    texts = []
    for input_, output in zip(inputs, outputs):
        # Fill the template and append the EOS token so generation stops cleanly
        text = data_prompt.format(input_, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
```
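To see what `formatting_prompt` produces, here is a minimal sketch run on a toy batch with made-up `Context`/`Response` values. `EOS_TOKEN` is stubbed as the string `"</s>"` here; in the real pipeline it comes from `tokenizer.eos_token`.

```python
# Stand-in for tokenizer.eos_token (Mistral uses "</s>")
EOS_TOKEN = "</s>"

data_prompt = """Analyze the provided text from a mental health perspective. Identify any indicators of emotional distress, coping mechanisms, or psychological well-being. Highlight any potential concerns or positive aspects related to mental health, and provide a brief explanation for each observation.
### Input:
{}
### Response:
{}"""

def formatting_prompt(examples):
    inputs = examples["Context"]
    outputs = examples["Response"]
    texts = [data_prompt.format(i, o) + EOS_TOKEN for i, o in zip(inputs, outputs)]
    return {"text": texts}

# Toy batch shaped like the training dataset (hypothetical values)
batch = {
    "Context": ["I feel anxious before exams."],
    "Response": ["Exam anxiety is common; grounding techniques can help."],
}
formatted = formatting_prompt(batch)
print(formatted["text"][0])
```

Each formatted string is the full template with the input and answer filled in, terminated by the EOS token — the shape a `datasets.map(formatting_prompt, batched=True)` call would produce for training.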
## Feeding prompt text into the model
```python
text = "I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"
```
Now let's use the fine-tuned model to generate a response to this mental-health-related prompt. A few keys to note:

- `FastLanguageModel.for_inference(model)` configures the model specifically for inference, optimizing its performance for generating responses.
- The input text is tokenized with the tokenizer, which converts it into a format the model can process. `data_prompt` formats the input text, while the response placeholder is left empty so the model fills it in. `return_tensors = "pt"` returns PyTorch tensors, which are moved to the GPU with `.to("cuda")` for faster processing.
- `model.generate` produces the response from the tokenized inputs. `max_new_tokens = 5020` and `use_cache = True` let the model produce long, coherent responses efficiently by reusing cached key/value computations from previous decoding steps.
```python
model = FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        data_prompt.format(
            # input text
            text,
            # response left empty for the model to fill in
            "",
        )
    ],
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer = tokenizer.batch_decode(outputs)
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)
```
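The decoded string still contains the full prompt and the model's special tokens, so the split above keeps only the generated answer. A minimal post-processing sketch, run on a hypothetical decoded output (the string below is a mock, not real model output; `"<s>"`/`"</s>"` are Mistral's BOS/EOS tokens):

```python
# Mock decoded output: prompt echoed back, then the generated answer plus EOS
decoded = (
    "<s> Analyze the provided text from a mental health perspective. ...\n"
    "### Input:\n"
    "I barely sleep and I feel worthless.\n"
    "### Response:\n"
    "It sounds like you are carrying a heavy emotional load. ...</s>"
)

# Keep only the text after the response marker, then strip special tokens
answer = decoded.split("### Response:")[-1]
for token in ("<s>", "</s>"):
    answer = answer.replace(token, "")
answer = answer.strip()
print(answer)
```

Alternatively, `tokenizer.batch_decode(outputs, skip_special_tokens=True)` drops the special tokens during decoding, leaving only the prompt-marker split to do by hand.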