FredZhang7 committed
Commit 95a5efe
1 Parent(s): bee5fba

finalize upload

Files changed (1)
  1. README.md +73 -10
README.md CHANGED
@@ -3,24 +3,87 @@ license: cc-by-nc-3.0
  datasets:
  - FredZhang7/toxi-text-3M
  pipeline_tag: text-classification
  ---

  **I have decided to release all auto-moderation models at once sometime in July. The curated datasets for training these models will be available first.**

  <br>

- Finished training: 6/30/2023
-
- Final Train & Validation Accuracy: 95-98%
-
- Large model (v2) will be available for PyTorch
-
- Lightweight model and tokenizer (v1) will be available for transformers.js

  <br>

  <br>

- Models tested: roberta, xlm-roberta, bert-tiny, bert-base-cased/uncased, bert-multilingual-cased/uncased, albert-large-v2
-
- Model chosen based on cost-efficiency and performance: bert-multilingual-cased

  datasets:
  - FredZhang7/toxi-text-3M
  pipeline_tag: text-classification
+ language:
+ - ar
+ - es
+ - pa
+ - th
+ - et
+ - fr
+ - fi
+ - no
+ - hu
+ - lt
+ - ur
+ - so
+ - pl
+ - el
+ - mr
+ - sk
+ - gu
+ - he
+ - af
+ - te
+ - ro
+ - lv
+ - sv
+ - ne
+ - kn
+ - it
+ - mk
+ - cs
+ - en
+ - de
+ - da
+ - ta
+ - bn
+ - pt
+ - sq
+ - tl
+ - uk
+ - bg
+ - ca
+ - sw
+ - hi
+ - zh
+ - ja
+ - hr
+ - ru
+ - vi
+ - id
+ - sl
+ - cy
+ - ko
+ - nl
+ - ml
+ - tr
+ - fa
+
+ tags:
+ - nlp
  ---

  **I have decided to release all auto-moderation models at once sometime in July. The curated datasets for training these models will be available first.**

  <br>

+ | | v2 | v1 |
+ |----------|----------|----------|
+ | Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
+ | Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
+ | Framework | PyTorch | TensorFlow |
+ | Dataset Size | 2.95M | 2.68M |
+ | Train Split | 80% English<br>20% English + 100% Multilingual | None |
+ | English Train Accuracy | 99.4% | N/A (≈98%) |
+ | Final Train Accuracy | 96.5% | 96.6% |
+ | Final Val Accuracy | 95.0% | 94.6% |
+ | Languages | 55 | N/A (≈35) |
+ | Hyperparameters | maxlen=208<br>batch_size=112<br>optimizer=Adam<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
+ | Training Stopped | 6/30/2023 | 9/05/2022 |
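
To make the v2 hyperparameters in the table concrete, here is a minimal PyTorch training sketch. It is an illustration, not the author's actual training script: the tiny `train_pairs` placeholder list and the `num_labels=1` single-logit head (implied by BCEWithLogitsLoss, but not confirmed by the card) are assumptions.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, BertForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
# num_labels=1 pairs with BCEWithLogitsLoss (one toxic/non-toxic logit);
# this head size is an assumption, not stated in the model card.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # learning_rate=1e-5
criterion = torch.nn.BCEWithLogitsLoss()                   # loss=BCEWithLogitsLoss()

# Placeholder data; in practice this would come from FredZhang7/toxi-text-3M.
train_pairs = [("have a great day", 0.0), ("I hate you", 1.0)]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=208, return_tensors="pt")   # maxlen=208
    return enc, torch.tensor(labels)

loader = DataLoader(train_pairs, batch_size=112,           # batch_size=112
                    shuffle=True, collate_fn=collate)

model.train()
for enc, labels in loader:
    enc = {k: v.to(device) for k, v in enc.items()}
    logits = model(**enc).logits.squeeze(-1)
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```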

  <br>

  <br>

+ Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
+ Of these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the rest on this particular task.
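
Once the v2 checkpoint is published, inference could look roughly like the sketch below. The repository ID is a placeholder (the actual repo name is not stated in this commit), and the single-logit sigmoid readout is an assumption based on the BCEWithLogitsLoss entry in the table above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder repo ID -- substitute the real repository once v2 is released.
repo = "FredZhang7/placeholder-toxicity-v2"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)
model.eval()

text = "some user-generated comment"
enc = tokenizer(text, truncation=True, max_length=208, return_tensors="pt")
with torch.no_grad():
    logit = model(**enc).logits.squeeze()
# Assumes one BCE-trained logit, matching loss=BCEWithLogitsLoss() above.
print(f"toxic probability: {torch.sigmoid(logit).item():.3f}")
```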