---
library_name: transformers
tags:
- unsloth
datasets:
- Ayansk11/Mental_health_data_conversational
language:
- en
base_model:
- meta-llama/Llama-3.2-1B
Quantized:
- unsloth/Llama-3.2-1B-bnb-4bit
---

# Model Card for Mental Health Llama 3.2-1B Conversational Bot

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** Ayan Javeed Shaikh and Srushti Sonavane
- **Finetuned from model:** unsloth/Llama-3.2-1B-bnb-4bit

### Model Sources

- **Repository:** [Ayansk11/Mental_health_Llama3.2-1B_conversationalBot](https://huggingface.co/Ayansk11/Mental_health_Llama3.2-1B_conversationalBot)

## Inference

```python
from unsloth import FastLanguageModel

# Load the fine-tuned model and its tokenizer in 4-bit precision
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Ayansk11/Mental_health_Llama3.2-1B_conversationalBot",
    max_seq_length = 5020,
    dtype = None,
    load_in_4bit = True,
)
```

## Input text used to prompt the model for a response

```python
text = "I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here. I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it. How can I change my feeling of being worthless to everyone?"
```
Note: Let's use the fine-tuned model for inference to generate responses to mental health-related prompts.

Key Points to Note:

  1. The `model = FastLanguageModel.for_inference(model)` call prepares the model specifically for inference, ensuring it is optimized for generating responses efficiently.

  2. The input text is processed by the tokenizer, which converts it into a format the model can consume. The `data_prompt` template is used to structure the input text, leaving a placeholder for the model's response (a hypothetical sketch of this template follows the list). The `return_tensors = "pt"` argument returns PyTorch tensors, which are then moved to the GPU with `.to("cuda")` for faster processing.

  3. The `model.generate` function produces a response from the tokenized input. Parameters such as `max_new_tokens = 5020` and `use_cache = True` let the model generate long, coherent outputs efficiently by reusing cached key/value states from previous decoding steps.
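
The `data_prompt` template referenced above is not shown in this card. Below is a minimal, hypothetical sketch of an Alpaca-style template that is consistent with the `### Response:` marker used in the generation code; the exact wording is an assumption and should be replaced with the template actually used during fine-tuning.

```python
# Hypothetical prompt template (assumption): two slots, the second left empty
# so the model fills in the answer after the "### Response:" marker.
data_prompt = """Analyze the provided text from a mental health perspective and respond with guidance.

### Input:
{}

### Response:
{}"""
```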

```python
# Switch the model into Unsloth's optimized inference mode
model = FastLanguageModel.for_inference(model)

# Build the prompt from the template and tokenize it, leaving the response slot empty
inputs = tokenizer(
    [
        data_prompt.format(
            text,  # instruction / input text
            "",    # answer left blank for the model to fill in
        )
    ],
    return_tensors = "pt",
).to("cuda")

# Generate a response and decode it back to text
outputs = model.generate(**inputs, max_new_tokens = 5020, use_cache = True)
answer = tokenizer.batch_decode(outputs)

# Keep only the text after the "### Response:" marker
answer = answer[0].split("### Response:")[-1]
print("Answer of the question is:", answer)
```
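
Note that `tokenizer.batch_decode(outputs)` returns the full decoded sequence, including the prompt and any special tokens; splitting on `### Response:` keeps only the newly generated answer. If you also want to drop special tokens such as the end-of-sequence marker, pass `skip_special_tokens = True` to `batch_decode` or strip `tokenizer.eos_token` from the result.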