FredZhang7 committed on
Commit 95a5efe · 1 Parent(s): bee5fba

finalize upload

  1. README.md +73 -10
README.md CHANGED
@@ -3,24 +3,87 @@ license: cc-by-nc-3.0
  datasets:
  - FredZhang7/toxi-text-3M
  pipeline_tag: text-classification
  ---

  **I have decided to release all auto-moderation models at once sometime in July. The curated datasets for training these models will be available first.**

  <br>

- Finished training: 6/30/2023
-
- Final Train & Validation Accuracy: 95-98%
-
- Large model (v2) will be available for PyTorch
-
- Lightweight model and tokenizer (v1) will be available for transformers.js
 
  <br>

  <br>

- Models tested: roberta, xlm-roberta, bert-tiny, bert-base-cased/uncased, bert-multilingual-cased/uncased, albert-large-v2
-
- Model chosen based on cost-efficiency and performance: bert-multilingual-cased
 
  datasets:
  - FredZhang7/toxi-text-3M
  pipeline_tag: text-classification
+ language:
+ - ar
+ - es
+ - pa
+ - th
+ - et
+ - fr
+ - fi
+ - no
+ - hu
+ - lt
+ - ur
+ - so
+ - pl
+ - el
+ - mr
+ - sk
+ - gu
+ - he
+ - af
+ - te
+ - ro
+ - lv
+ - sv
+ - ne
+ - kn
+ - it
+ - mk
+ - cs
+ - en
+ - de
+ - da
+ - ta
+ - bn
+ - pt
+ - sq
+ - tl
+ - uk
+ - bg
+ - ca
+ - sw
+ - hi
+ - zh
+ - ja
+ - hr
+ - ru
+ - vi
+ - id
+ - sl
+ - cy
+ - ko
+ - nl
+ - ml
+ - tr
+ - fa
+
+ tags:
+ - nlp
  ---
 
  **I have decided to release all auto-moderation models at once sometime in July. The curated datasets for training these models will be available first.**

  <br>
 
+ | | v2 | v1 |
+ |----------|----------|----------|
+ | Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
+ | Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
+ | Framework | PyTorch | TensorFlow |
+ | Dataset Size | 2.95M | 2.68M |
+ | Train Split | 80% English<br>20% English + 100% Multilingual | None |
+ | English Train Accuracy | 99.4% | N/A (≈98%) |
+ | Final Train Accuracy | 96.5% | 96.6% |
+ | Final Val Accuracy | 95.0% | 94.6% |
+ | Languages | 55 | N/A (≈35) |
+ | Hyperparameters | maxlen=208<br>batch_size=112<br>optimizer=Adam<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
+ | Training Stopped | 6/30/2023 | 9/05/2022 |
 
  <br>

  <br>

+ Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-base-multilingual-cased/uncased, and albert-large-v2.
+ From these models, I chose bert-base-multilingual-cased because it was more resource-efficient and performed better than the others on this particular task.
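
The Hyperparameters row of the table lists `loss=BCEWithLogitsLoss()` for v2 and `loss="binary_crossentropy"` for v1. As a minimal sketch in plain Python (not the author's training code), the snippet below illustrates why these can describe the same binary cross-entropy objective: PyTorch's `BCEWithLogitsLoss` takes raw logits and applies the sigmoid internally, while Keras's `"binary_crossentropy"` expects probabilities by default. It assumes the v1 model ended in a sigmoid output, which the README does not state; the helper names and sample values are hypothetical.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy computed directly from a raw logit.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which avoids overflow for large positive or negative logits.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(prob: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy computed from a probability (post-sigmoid output).

    The probability is clipped to [eps, 1 - eps] so that log() never sees 0.
    """
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


# Hypothetical logits and binary toxicity labels, for illustration only.
logits = [-3.0, -0.5, 0.0, 2.0]
targets = [0.0, 1.0, 1.0, 1.0]

for x, y in zip(logits, targets):
    from_logits = bce_with_logits(x, y)
    from_prob = bce_from_probability(sigmoid(x), y)
    print(f"logit={x:+.1f} target={y:.0f} "
          f"from_logits={from_logits:.6f} from_prob={from_prob:.6f}")
```

For moderate logits the two columns agree; the logit-based form is generally preferred because it stays stable when the sigmoid would saturate.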