---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---

[Link to the distilbert spam defender](https://huggingface.co/FredZhang7/distilbert-spam-defender)

Find the v1 (TensorFlow) model in SavedModel format on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification). The license for the v1 model is Apache 2.0.

| | v3 | v1 |
|----------|----------|----------|
| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
| Framework | PyTorch | TensorFlow |
| Dataset Size | 3.0M | 2.68M |
| Train Split | 80% English<br>20% English + 100% Multilingual | None |
| English Train Accuracy | 99.5% | N/A (≈97.5%) |
| Other Train Accuracy | 98.6% | 96.6% |
| Final Val Accuracy | 96.8% | 94.6% |
| Languages | 55 | N/A (≈35) |
| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
| Training Stopped | 7/20/2023 | 9/05/2022 |
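For concreteness, the v3 hyperparameters above map onto a standard Hugging Face fine-tuning loop roughly as follows. This is a minimal sketch, not the original training script: the toy data, one-hot label encoding, and single-epoch loop are assumptions, while the tokenizer settings, batch size, optimizer, learning rate, and loss follow the table.

```python
# Hypothetical sketch of the v3 fine-tuning setup, reconstructed from the
# hyperparameter table above. Dataset loading and label encoding are assumptions.
import torch
from torch.nn import BCEWithLogitsLoss
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

base = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2).to(device)

# Toy stand-in for the training rows; label 1 = toxic, 0 = not toxic (assumed).
texts = ["hello world!", "an example of a toxic comment"]
labels = [0, 1]

enc = tokenizer(
    texts,
    add_special_tokens=True,
    max_length=208,          # maxlen=208
    padding="max_length",    # padding='max_length'
    truncation=True,
    return_tensors="pt",
)
# One-hot float targets so the two class logits can be scored with BCEWithLogitsLoss.
targets = torch.nn.functional.one_hot(torch.tensor(labels), num_classes=2).float()

loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], targets),
    batch_size=112,          # batch_size=112
    shuffle=True,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # optimizer=AdamW, learning_rate=1e-5
loss_fn = BCEWithLogitsLoss()                               # loss=BCEWithLogitsLoss()

model.train()
for input_ids, attention_mask, target in loader:
    optimizer.zero_grad()
    logits = model(input_ids.to(device), attention_mask=attention_mask.to(device)).logits
    loss = loss_fn(logits, target.to(device))
    loss.backward()
    optimizer.step()
```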
I manually annotated additional data on top of Toxi Text 3M and added it to the training set. Training on Toxi Text 3M alone produces a biased model that classifies short text with lower precision.
Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2. Of these, I chose bert-multilingual-cased because it performed best on this particular task given the same amount of resources as the others.
## PyTorch

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

text = "hello world!"

# Load the v3 tokenizer and model, and move the model to GPU if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('device:', device)
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)

# Tokenize with the same settings used for training (maxlen=208, padded to max_length).
encoding = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    max_length=208,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# Run inference and take the argmax over the class logits.
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

predicted_labels = torch.argmax(logits, dim=1)
print(predicted_labels)
```

## Attribution
- If you distribute, remix, adapt, or build upon One-for-all Toxicity v3, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website.
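## Batch Inference

The single-text example above extends directly to moderating many texts at once, since the tokenizer and model both accept batches. Below is a minimal sketch under one assumption: that label index 1 is the toxic class. This mapping is not stated in this card, so check `model.config.id2label` before relying on it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)
model.eval()

texts = ["hello world!", "I can't believe you did that"]

# Tokenize the whole batch in one call, using the same settings as the
# single-text example (maxlen=208, padding to max_length, truncation).
batch = tokenizer(
    texts,
    add_special_tokens=True,
    max_length=208,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    logits = model(**batch).logits

# Assumption: index 1 = toxic; verify against model.config.id2label.
predicted_labels = torch.argmax(logits, dim=1)
for text, label in zip(texts, predicted_labels.tolist()):
    print(f"{text!r} -> {model.config.id2label.get(label, label)}")
```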