---
license: cc-by-nc-3.0
datasets:
- FredZhang7/toxi-text-3M
pipeline_tag: text-classification
language:
- ar
- es
- pa
- th
- et
- fr
- fi
- no
- hu
- lt
- ur
- so
- pl
- el
- mr
- sk
- gu
- he
- af
- te
- ro
- lv
- sv
- ne
- kn
- it
- mk
- cs
- en
- de
- da
- ta
- bn
- pt
- sq
- tl
- uk
- bg
- ca
- sw
- hi
- zh
- ja
- hr
- ru
- vi
- id
- sl
- cy
- ko
- nl
- ml
- tr
- fa

tags:
- nlp
---

**I have decided to release all auto-moderation models at once sometime in July. The curated datasets used to train these models will be available first.**

<br>

|          |    v2    |    v1    |
|----------|----------|----------|
| Base Model   | bert-base-multilingual-cased   |  nlpaueb/legal-bert-small-uncased   |
| Base Tokenizer   |  bert-base-multilingual-cased   |  bert-base-multilingual-cased  |
| Framework  | PyTorch   |  TensorFlow   |
| Dataset Size  |  2.95M |  2.68M   |
| Train Split | 80% English<br>20% English + 100% Multilingual |  None  |
| English Train Accuracy  |  99.4% |  N/A (≈98%)  |
| Final Train Accuracy  | 96.5%  |  96.6%  |
| Final Val Accuracy  |  95.0%  |  94.6%  |
| Languages |  55  |  N/A (≈35)  |
| Hyperparameters  | maxlen=208<br>batch_size=112<br>optimizer=Adam<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss()  |  maxlen=192<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy"  |
| Training Stopped |  6/30/2023  |  9/05/2022  |

<br>

<br>
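Below is a minimal inference sketch using 🤗 Transformers. The model id is a placeholder for this repository's id on the Hub, and the single-logit sigmoid at the end is an assumption consistent with the `BCEWithLogitsLoss()` objective listed above; `max_length=208` mirrors the v2 `maxlen` hyperparameter.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder: replace with this repository's model id on the Hugging Face Hub.
model_id = "<this-repo-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "You are a wonderful person!"
# maxlen=208 mirrors the v2 hyperparameter in the table above.
inputs = tokenizer(text, truncation=True, max_length=208, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assuming a single toxicity logit (trained with BCEWithLogitsLoss),
# a sigmoid turns the logit into a toxicity probability.
print(torch.sigmoid(logits)[0])
```

<br>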

Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
Of these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the rest on this particular task.
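
For completeness, here is a minimal sketch of how the v2 fine-tuning setup might look, given the hyperparameters in the table above (maxlen=208, batch_size=112, Adam at 1e-5, BCEWithLogitsLoss). The dataset split and column names ("text", "is_toxic"), and the single-logit head, are assumptions for illustration, not the exact training script.

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=1,  # single toxicity logit (assumption, matches BCEWithLogitsLoss)
).to(device)

# Split and column names are assumptions about the dataset schema.
dataset = load_dataset("FredZhang7/toxi-text-3M", split="train")

def collate(batch):
    enc = tokenizer(
        [row["text"] for row in batch],
        truncation=True, max_length=208, padding=True, return_tensors="pt",
    )
    enc["labels"] = torch.tensor([float(row["is_toxic"]) for row in batch])
    return enc

loader = DataLoader(dataset, batch_size=112, shuffle=True, collate_fn=collate)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for batch in loader:
    labels = batch.pop("labels").to(device)
    outputs = model(**{k: v.to(device) for k, v in batch.items()})
    loss = loss_fn(outputs.logits.squeeze(-1), labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```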