FredZhang7/one-for-all-toxicity-v3

	---
	license: cc-by-4.0
	datasets:
	- FredZhang7/toxi-text-3M
	pipeline_tag: text-classification
	language:
	- ar
	- es
	- pa
	- th
	- et
	- fr
	- fi
	- hu
	- lt
	- ur
	- so
	- pl
	- el
	- mr
	- sk
	- gu
	- he
	- af
	- te
	- ro
	- lv
	- sv
	- ne
	- kn
	- it
	- mk
	- cs
	- en
	- de
	- da
	- ta
	- bn
	- pt
	- sq
	- tl
	- uk
	- bg
	- ca
	- sw
	- hi
	- zh
	- ja
	- hr
	- ru
	- vi
	- id
	- sl
	- cy
	- ko
	- nl
	- ml
	- tr
	- fa
	- 'no'
	- multilingual
	tags:
	- nlp
	- moderation
	---

	Find the v1 (TensorFlow) model on [this page](https://github.com/FredZhang7/tfjs-node-tiny/releases/tag/text-classification).

	<br>

	| | v3 | v1 |
	|----------|----------|----------|
	| Base Model | bert-base-multilingual-cased | nlpaueb/legal-bert-small-uncased |
	| Base Tokenizer | bert-base-multilingual-cased | bert-base-multilingual-cased |
	| Framework | PyTorch | TensorFlow |
	| Dataset Size | 3.0M | 2.68M |
	| Train Split | 80% English<br>20% English + 100% Multilingual | None |
	| English Train Accuracy | 99.5% | N/A (≈97.5%) |
	| Other Train Accuracy | 98.6% | 96.6% |
	| Final Val Accuracy | 96.8% | 94.6% |
	| Languages | 55 | N/A (≈35) |
	| Hyperparameters | maxlen=208<br>padding='max_length'<br>batch_size=112<br>optimizer=AdamW<br>learning_rate=1e-5<br>loss=BCEWithLogitsLoss() | maxlen=192<br>padding='max_length'<br>batch_size=16<br>optimizer=Adam<br>learning_rate=1e-5<br>loss="binary_crossentropy" |
	| Training Stopped | 7/20/2023 | 9/05/2022 |
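
The v3 column lists `loss=BCEWithLogitsLoss()`, which means the model outputs a raw logit and the sigmoid is folded into the loss. As a rough illustration of what that loss computes, here is a minimal pure-Python sketch. The function names `sigmoid` and `bce_with_logits` are hypothetical helpers written for this example; this is not the training code used for the model, and it only assumes the standard definition of binary cross-entropy on logits.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in (0, 1) without overflowing."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits.

    Uses the numerically stable form
        max(x, 0) - x * y + log(1 + exp(-|x|)),
    which equals -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    if not logits:
        raise ValueError("at least one example is required")
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical logits for two texts; label 1.0 = toxic, 0.0 = not toxic.
example_logits = [2.0, -1.5]
example_targets = [1.0, 0.0]
loss = bce_with_logits(example_logits, example_targets)
probabilities = [sigmoid(x) for x in example_logits]
```

At inference time, a logit would typically be passed through the sigmoid and compared against a threshold (0.5 is a common default, though this card does not state which threshold to use).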

	<br>

	I manually annotated additional data beyond Toxi Text 3M and added it to the training set.

	<br>

	Models tested for v2: roberta, xlm-roberta, bert-small, bert-base-cased/uncased, bert-multilingual-cased/uncased, and albert-large-v2.
	Of these, I chose bert-multilingual-cased because it was more resource-efficient and performed better than the others on this particular task.