---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---

Find the v1 (TensorFlow) model on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).
| | v3 | v1 |
|----------|----------|----------|
| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
| Framework | PyTorch | TensorFlow |
| Dataset Size | 3.0M | 2.68M |
| Train Split | 80% English<br>20% English + 100% Multilingual | None |
| English Train Accuracy | 99.5% | N/A (≈97.5%) |
| Other Train Accuracy | 98.6% | 96.6% |
| Final Val Accuracy | 96.8% | 94.6% |
| Languages | 55 | N/A (≈35) |
| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
| Training Stopped | 7/20/2023 | 9/05/2022 |
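Because v3 is trained with `BCEWithLogitsLoss()`, the model outputs a raw logit rather than a probability, so at inference time you pass the logit through a sigmoid and threshold it. A minimal sketch in plain Python (the `classify` helper and the 0.5 threshold are illustrative assumptions, not part of the released model):

```python
import math

def classify(logit: float, threshold: float = 0.5) -> bool:
    """Map a raw model logit to a toxic/not-toxic decision."""
    # BCEWithLogitsLoss fuses the sigmoid into the loss during training,
    # so the model emits raw logits; apply the sigmoid here to recover
    # a probability before thresholding.
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability >= threshold  # True -> flagged as toxic

# A positive logit maps to a probability above 0.5, e.g. sigmoid(2.0) ≈ 0.88.
print(classify(2.0))   # True
print(classify(-1.5))  # False
```

The same thresholding applies to v1, except that its `binary_crossentropy` head already outputs a probability, so no sigmoid step is needed there.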
I manually annotated additional data on top of Toxi Text 3M and added it to the training set.
Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2. Of these, I chose bert-multilingual-cased because it offered better resource efficiency and performance than the rest for this particular task.