---
language:
- ms
library_name: transformers
---

Safe for Work Classifier Model for Malaysian Data

The current version supports Malay. We are working towards supporting Malay, English, and Indonesian.

Base model: fine-tuned from https://huggingface.co/mesolitica/malaysian-mistral-191M-MLM-512 on Malaysian NSFW data.

Data Source: https://huggingface.co/datasets/malaysia-ai/Malaysian-NSFW

GitHub Repo: https://github.com/malaysia-ai/sfw-classifier

Project Board: https://github.com/orgs/malaysia-ai/projects/6

![Image in a markdown cell](https://github.com/mesolitica/malaysian-llmops/raw/main/e2e.png)

Current Labels Available:

- religion insult
- sexist
- racist
- psychiatric or mental illness
- harassment
- safe for work
- porn
- self-harm
- violence

### How to use

```python
# `classifier` is a local module, not part of transformers; it is assumed here
# to come from the GitHub repo linked above and to be on the import path.
from classifier import MistralForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "malaysia-ai/malaysian-sfw-classifier"

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = MistralForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap both in a text-classification pipeline.
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Replace the placeholders with the texts to classify.
input_str = ["INSERT_INPUT_0", "INSERT_INPUT_1"]
print(pipe(input_str))
```
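
A text-classification pipeline returns one dictionary per input, with a `label` and a `score` key. The sketch below shows one possible way to turn that output into a flag decision. The helper name `flag_unsafe`, the 0.5 threshold, and the assumption that the model emits the label string `safe for work` exactly as listed above are all illustrative and not taken from the project. The predictions are hand-written stand-ins, so the block runs without downloading the model.

```python
# Assumption: the model's label string matches the "safe for work" entry
# in the label list above.
SAFE_LABEL = "safe for work"


def flag_unsafe(predictions, threshold=0.5):
    """Return one boolean per prediction; True means the text should be flagged.

    A text is treated as safe only when the top label is SAFE_LABEL and its
    score reaches the threshold. The threshold is a hypothetical default.
    """
    flags = []
    for pred in predictions:
        is_safe = pred["label"] == SAFE_LABEL and pred["score"] >= threshold
        flags.append(not is_safe)
    return flags


# Hand-written stand-ins shaped like pipeline output (not real model results).
example = [
    {"label": "safe for work", "score": 0.98},
    {"label": "harassment", "score": 0.91},
    {"label": "safe for work", "score": 0.42},
]
print(flag_unsafe(example))  # [False, True, True]
```

With the real model, `flag_unsafe(pipe(input_str))` would apply the same rule to actual predictions. Treating a low-confidence "safe for work" prediction as flagged is a conservative choice; a deployment may prefer a different threshold.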

```
                               precision    recall  f1-score   support

                       racist    0.87619   0.91390   0.89465      1719
              religion insult    0.88533   0.85813   0.87152      3320
psychiatric or mental illness    0.94224   0.87020   0.90479      5624
                       sexist    0.77146   0.82234   0.79609      1486
                   harassment    0.81935   0.87460   0.84608       949
                         porn    0.95047   0.97546   0.96280      1141
                safe for work    0.83471   0.90741   0.86954      3456
                    self-harm    0.81796   0.95906   0.88291       342
                     violence    0.84317   0.78786   0.81457      1433

                     accuracy                        0.87684     19470
                    macro avg    0.86010   0.88544   0.87144     19470
                 weighted avg    0.87960   0.87684   0.87718     19470
```
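
The summary rows of the report can be re-derived from the per-class rows, which helps when reading them: the macro average weights every label equally, while the weighted average weights each label by its support. The following is a minimal sketch using only the figures printed above; the names `report`, `f1`, `macro_f1`, and `weighted_f1` are illustrative.

```python
# Per-class (precision, recall, f1, support), copied from the report above.
report = {
    "racist": (0.87619, 0.91390, 0.89465, 1719),
    "religion insult": (0.88533, 0.85813, 0.87152, 3320),
    "psychiatric or mental illness": (0.94224, 0.87020, 0.90479, 5624),
    "sexist": (0.77146, 0.82234, 0.79609, 1486),
    "harassment": (0.81935, 0.87460, 0.84608, 949),
    "porn": (0.95047, 0.97546, 0.96280, 1141),
    "safe for work": (0.83471, 0.90741, 0.86954, 3456),
    "self-harm": (0.81796, 0.95906, 0.88291, 342),
    "violence": (0.84317, 0.78786, 0.81457, 1433),
}


def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(s for _, _, _, s in report.values())

# Macro average: plain mean of the per-class F1 scores.
macro_f1 = sum(f for _, _, f, _ in report.values()) / len(report)

# Weighted average: per-class F1 weighted by the number of examples.
weighted_f1 = sum(f * s for _, _, f, s in report.values()) / total_support

print(total_support, round(macro_f1, 5), round(weighted_f1, 5))
# 19470 0.87144 0.87718
```

The macro F1 is lower than the weighted F1 because the weaker classes, such as sexist and violence, count as much as the largest class in the macro average.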

```
@misc{razak2024adaptingsafeforworkclassifiermalaysian,
      title={Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework},
      author={Aisyah Razak and Ariff Nazhan and Kamarul Adha and Wan Adzhar Faiq Adzlan and Mas Aisyah Ahmad and Ammar Azman},
      year={2024},
      eprint={2407.20729},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.20729},
}
```