---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---

The v1 (TensorFlow) model is available on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).
The v1 model is licensed under Apache 2.0.

<br>

|          |    v3    |    v1    |
|----------|----------|----------|
| Base Model   | bert-base-multilingual-cased   |  nlpaueb/legal-bert-small-uncased   |
| Base Tokenizer   |  bert-base-multilingual-cased   |  bert-base-multilingual-cased  |
| Framework  | PyTorch   |  TensorFlow   |
| Dataset Size  |  3.0M |  2.68M   |
| Train Split | 80% English<br>20% English + 100% Multilingual |  None  |
| English Train Accuracy  |  99.5% |  N/A (≈97.5%)  |
| Other Train Accuracy  | 98.6%  |  96.6%  |
| Final Val Accuracy  |  96.8%  |  94.6%  |
| Languages |  55  |  N/A (≈35)  |
| Hyperparameters  | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss()  |  maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy"  |
| Training Stopped |  7/20/2023  |  9/05/2022  |
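
The v3 column lists `loss=BCEWithLogitsLoss()`. As a rough illustration of what that loss computes, here is a minimal pure-Python sketch of the numerically stable form for a single logit and a 0/1 target, averaged over a batch (PyTorch's default reduction). It is not the training code used for this model, and the logits and targets below are hypothetical values chosen for illustration.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy for one raw logit and a target in {0.0, 1.0}."""
    # Stable form of -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def mean_bce_with_logits(logits, targets):
    """Average the per-example loss over a batch (mean reduction)."""
    losses = [bce_with_logits(x, t) for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical logits and labels, for illustration only.
batch_logits = [2.0, -1.5, 0.0]
batch_targets = [1.0, 0.0, 1.0]
print(mean_bce_with_logits(batch_logits, batch_targets))
```

A confident correct prediction (large positive logit with target 1) contributes a loss near zero, while a confident wrong one is penalized roughly linearly in the logit.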

<br>

I manually annotated additional data on top of Toxi Text 3M and added it to the training set.
Training on Toxi Text 3M alone results in a biased model that classifies short texts with lower precision.

<br>

Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
Of these, I chose bert-multilingual-cased because it performed better than the others on this task given the same amount of resources.

<br>

## PyTorch

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

text = "hello world!"

# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)

tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)

# Pad or truncate to 208 tokens, the max length listed for v3 in the table above.
encoding = tokenizer(
    text,
    add_special_tokens=True,
    max_length=208,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    # Index of the highest-scoring class for each input.
    predicted_labels = torch.argmax(logits, dim=1)

print(predicted_labels)
```
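
The snippet above prints only the index of the highest-scoring class. If a confidence score is also useful, the logits can be passed through a softmax. The sketch below does this in pure Python on a hypothetical pair of logits standing in for `logits[0].tolist()`; with PyTorch, `torch.softmax(logits, dim=1)` computes the same thing. The helper names `softmax` and `top_class` are illustrative, and this card does not state which index corresponds to which class, so check `model.config.id2label` (if it is populated) before interpreting the result.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    # Subtract the max logit first so exp() cannot overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_class(logits):
    """Return (class index, probability) for the highest-scoring class."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, probs[index]

# Hypothetical logits for one input, standing in for logits[0].tolist().
example_logits = [2.0, -1.0]
index, confidence = top_class(example_logits)
print(index, round(confidence, 4))
```

A threshold on the returned probability, rather than a plain argmax, may be a reasonable way to trade precision against recall in a moderation pipeline.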

## Attribution
- If you distribute, remix, adapt, or build upon One-for-all Toxicity v3, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website.