README.md · FredZhang7/one-for-all-toxicity-v3 at main

metadata

license: cc-by-4.0
datasets:
  - FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
  - ar
  - es
  - pa
  - th
  - et
  - fr
  - fi
  - hu
  - lt
  - ur
  - so
  - pl
  - el
  - mr
  - sk
  - gu
  - he
  - af
  - te
  - ro
  - lv
  - sv
  - ne
  - kn
  - it
  - mk
  - cs
  - en
  - de
  - da
  - ta
  - bn
  - pt
  - sq
  - tl
  - uk
  - bg
  - ca
  - sw
  - hi
  - zh
  - ja
  - hr
  - ru
  - vi
  - id
  - sl
  - cy
  - ko
  - nl
  - ml
  - tr
  - fa
  - 'no'
  - multilingual
tags:
  - nlp
  - moderation

Link to the distilbert spam defender

Find the v1 (TensorFlow) model in SavedModel format on this page. The license for the v1 model is Apache 2.0

	v3	v1
Base Model	bert-base-multilingual-cased	nlpaueb/legal-bert-small-uncased
Base Tokenizer	bert-base-multilingual-cased	bert-base-multilingual-cased
Framework	PyTorch	TensorFlow
Dataset Size	3.0M	2.68M
Train Split	80% English 20% English + 100% Multilingual	None
English Train Accuracy	99.5%	N/A (≈97.5%)
Other Train Accuracy	98.6%	96.6%
Final Val Accuracy	96.8%	94.6%
Languages	55	N/A (≈35)
Hyperparameters	maxlen=208 padding='max_length' batch_size=112 optimizer=AdamW learning_rate=1e-5 loss=BCEWithLogitsLoss()	maxlen=192 padding='max_length' batch_size=16 optimizer=Adam learning_rate=1e-5 loss="binary_crossentropy"
Training Stopped	7/20/2023	9/05/2022

I manually annotated more data on top of Toxi Text 3M and added them to the training set. Training on Toxi Text 3M alone results in a biased model that classifies short text with lower precision.

Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and alberta-large-v2. Of these, I chose bert-multilingual-cased because it performs better with the same amount of resources as the others for this particular task.

PyTorch

text = "hello world!"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("FredZhang7/one-for-all-toxicity-v3")
model = AutoModelForSequenceClassification.from_pretrained("FredZhang7/one-for-all-toxicity-v3").to(device)

encoding = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    max_length=208,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
print('device:', device)
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    predicted_labels = torch.argmax(logits, dim=1)

print(predicted_labels)

Attribution

If you distribute, remix, adapt, or build upon One-for-all Toxicity v3, please credit "AIstrova Technologies Inc." in your README.md, application description, research, or website.