---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---

Find the v1 (TensorFlow) model on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).

<br>

|          |    v3    |    v1    |
|----------|----------|----------|
| Base Model   | bert-base-multilingual-cased   |  nlpaueb/legal-bert-small-uncased   |
| Base Tokenizer   |  bert-base-multilingual-cased   |  bert-base-multilingual-cased  |
| Framework  | PyTorch   |  TensorFlow   |
| Dataset Size  |  3.0M |  2.68M   |
| Train Split | 80% English<br>20% English + 100% Multilingual |  None  |
| English Train Accuracy  |  99.5% |  N/A (≈97.5%)  |
| Other Train Accuracy  | 98.6%  |  96.6%  |
| Final Val Accuracy  |  96.8%  |  94.6%  |
| Languages |  55  |  N/A (≈35)  |
| Hyperparameters  | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss()  |  maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy"  |
| Training Stopped |  7/20/2023  |  9/05/2022  |

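As a rough illustration of two v3 hyperparameters from the table, the dependency-free sketch below shows what `padding='max_length'` with `maxlen=208` does to a list of token IDs, and what `BCEWithLogitsLoss()` computes with its default mean reduction. The helper names, the example token IDs, the pad ID of 0, and the truncation of over-long inputs are assumptions made for illustration; this is not the actual training code.

```python
import math

MAX_LEN = 208  # v3 maxlen from the table above


def pad_to_max_length(token_ids, max_len=MAX_LEN, pad_id=0):
    """Right-pad (or cut) token IDs to exactly max_len.

    Returns (input_ids, attention_mask). pad_id=0 and the plain cut-off of
    long inputs are assumptions; a real tokenizer would also keep the
    final special token when truncating.
    """
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits.

    Uses the numerically stable form
        max(x, 0) - x * y + log(1 + exp(-|x|))
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Illustrative token IDs only; real IDs come from the tokenizer.
input_ids, attention_mask = pad_to_max_length([101, 31178, 102])

# One toxic (1.0) and one non-toxic (0.0) target with made-up logits.
loss = bce_with_logits([2.0, -1.5], [1.0, 0.0])
```

Because the loss takes raw logits, a model trained this way needs a sigmoid applied to its output at inference time to obtain a probability.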
<br>

I manually annotated additional data on top of Toxi Text 3M and added it to the training set.

<br>

Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
Of these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the others on this task.