---
license: cc-by-4.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa
- 'no'
- multilingual
tags:
- nlp
- moderation
---
Find the v1 (TensorFlow) model on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).
<br>
| | v3 | v1 |
|----------|----------|----------|
| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
| Framework | PyTorch | TensorFlow |
| Dataset Size | 3.0M | 2.68M |
| Train Split | 80% English<br>20% English + 100% Multilingual | None |
| English Train Accuracy | 99.5% | N/A (≈97.5%) |
| Other Train Accuracy | 98.6% | 96.6% |
| Final Val Accuracy | 96.8% | 94.6% |
| Languages | 55 | N/A (≈35) |
| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
| Training Stopped | 7/20/2023 | 9/05/2022 |
<br>
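
As a rough illustration of what the v3 hyperparameters in the table mean in practice, the sketch below reproduces `padding='max_length'` with `maxlen=208` and the per-batch value that `BCEWithLogitsLoss()` computes, in plain Python. It is not the training code: the helper names, the example token ids, the pad id of 0, and the truncation of longer inputs are all assumptions made for illustration.

```python
import math

MAX_LEN = 208  # maxlen from the v3 column of the table


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=0):
    """Mimic padding='max_length': return (input_ids, attention_mask).

    Truncating inputs longer than max_len and pad_id=0 are assumptions.
    """
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Hypothetical token ids for a short input, padded out to 208 positions.
ids, mask = pad_or_truncate([101, 31178, 102])

# Two hypothetical examples: one toxic (target 1.0), one non-toxic (target 0.0).
loss = bce_with_logits([2.0, -1.0], [1.0, 0.0])
```

Because the loss takes raw logits, a model trained this way needs a sigmoid applied to its output at inference time to obtain a probability.
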
I manually annotated more data on top of Toxi Text 3M and added it to the training set.
<br>
Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
From these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the others on this task.