These are ONLY the LoRA adapters, not the full model!
Base model: https://huggingface.co/mesolitica/malaysian-tinyllama-1.1b-16k-instructions-v4
Fine-tuned on this dataset: https://huggingface.co/datasets/kaiimran/malaysia-tweets-sentiment
Following this tutorial: https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing
Evaluation on test dataset
Accuracy: 0.9455
- Interpretation: Approximately 94.55% of the predictions made by the model are correct. This is a high accuracy rate, indicating that the model performs well on the test dataset overall.
Precision: 0.9936
- Interpretation: Out of all the positive predictions made by the model, 99.36% were correct. This suggests that the model is very good at identifying true positive cases and has a very low false positive rate.
Recall: 0.8980
- Interpretation: Out of all the actual positive cases in the dataset, the model correctly identified 89.80% of them. While this is a good recall rate, it is relatively lower compared to precision, indicating that there are some false negatives (i.e., positive cases that the model failed to identify).
F1 Score: 0.9434
- Interpretation: The F1 score is the harmonic mean of precision and recall, balancing the two. An F1 score of 0.9434 indicates that the model achieves a good balance between precision and recall.
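The four metrics above are all derived from the confusion matrix. A pure-Python sketch using hypothetical counts, chosen only so the derived values match the reported metrics to four decimal places (the actual test-set confusion matrix is not published in this card):

```python
# Hypothetical confusion-matrix counts (true/false positives/negatives);
# they are illustrative, not the real evaluation counts.
tp, fp, fn, tn = 933, 6, 106, 1010

accuracy  = (tp + tn) / (tp + fp + fn + tn)                # ~0.9455
precision = tp / (tp + fp)                                 # ~0.9936
recall    = tp / (tp + fn)                                 # ~0.8980
# F1 is the harmonic mean of precision and recall.
f1        = 2 * precision * recall / (precision + recall)  # ~0.9434

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

Note how the high precision (few false positives) and lower recall (more false negatives) combine into an F1 between the two.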
Overall Assessment
- High Precision: The model has an excellent precision score, meaning it is highly reliable in predicting positive sentiment without mistakenly labeling too many negative cases as positive.
- Good Recall: The recall score is also good, but slightly lower than precision, suggesting that there are some positive cases that the model misses.
- Balanced Performance: The F1 score indicates that the model maintains a good balance between precision and recall, which is crucial for tasks like sentiment analysis.
Considerations for Improvement
- Recall Improvement: Since recall is lower than precision, we might consider strategies to improve it, such as:
  - Data Augmentation: Adding more training data, particularly positive samples, might help the model learn to identify positive cases better.
  - Hyperparameter Tuning: Adjusting settings such as the number of training epochs or the learning rate.
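As a concrete illustration of the hyperparameter-tuning point, a small sketch; the parameter names and baseline values below are assumptions for illustration, not the actual configuration used to train this adapter (that is in the linked tutorial notebook):

```python
# Illustrative hyperparameters only; not the real training configuration.
baseline = {"num_train_epochs": 3, "learning_rate": 2e-4, "lora_rank": 16}

# Candidate change: train for more epochs so positive-class patterns are
# seen more often, which may lift recall (monitor validation metrics to
# avoid overfitting).
tuned = {**baseline, "num_train_epochs": 5}
print(tuned)
```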
Conclusion
The model shows strong performance, with particularly high precision and a good overall F1 score. The slightly lower recall suggests room for improvement, but the current metrics indicate that the model is very effective for binary sentiment analysis.