IMJONEZZ/SlovenBERTcina


# Slovak RoBERTa Masked Language Model

### 83 million parameters in the small model

Medium and large models coming soon!

The pretrained RoBERTa tokenizer's vocab and merges files are included.
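
For readers curious how vocab and merges files like these come about, here is a minimal sketch using the `tokenizers` library's `ByteLevelBPETokenizer`. It follows the preprocessing notes in this card (uncased, RoBERTa-style special tokens), but `train_tokenizer`, `out_dir`, and the `vocab_size` default are illustrative assumptions, not the settings actually used for this model.

```python
import os
from typing import Iterable, List

# Special tokens as listed in this card's preprocessing notes.
SPECIAL_TOKENS: List[str] = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def train_tokenizer(
    texts: Iterable[str],
    out_dir: str,
    vocab_size: int = 52_000,  # hypothetical value; the card does not state it
) -> List[str]:
    """Train an uncased byte-level BPE tokenizer and save vocab.json / merges.txt."""
    # Imported lazily so this module loads even without `tokenizers` installed.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer(lowercase=True)  # lowercase=True -> "uncased"
    tokenizer.train_from_iterator(
        texts,
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=SPECIAL_TOKENS,
    )
    os.makedirs(out_dir, exist_ok=True)
    # save_model writes the vocab and merges files and returns their paths.
    return tokenizer.save_model(out_dir)
```

Calling `train_tokenizer(lines, "tokenizer_out")` on an iterable of Slovak sentences should leave the two files in `tokenizer_out`.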

---

## Training params:
- Dataset:
An 8 GB Slovak monolingual dataset, including ParaCrawl (monolingual), OSCAR, and several gigabytes of additional text that I found and cleaned myself.
- Preprocessing:
Tokenized with a ByteLevelBPETokenizer pretrained on the same dataset. Uncased, with `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens.
- Evaluation results (fill-mask predictions for each prompt):
  - Mnoho ľudí tu `<mask>`
    - žije.
    - žijú.
    - je.
    - trpí.
  - Ako sa `<mask>`
    - máte
    - máš
    - má
    - hovorí
  - Plážová sezóna pod Zoborom patrí medzi `<mask>` obdobia.
    - ročné
    - najkrajšie
    - najobľúbenejšie
    - najnáročnejšie
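
Predictions like the ones above can probably be reproduced with the `transformers` fill-mask pipeline. The sketch below assumes the model is published on the Hugging Face Hub under the id `IMJONEZZ/SlovenBERTcina` (taken from the repository path). The helper names `load_fill_mask` and `top_tokens` are hypothetical, and the predictions you get may differ from the ones listed here.

```python
from typing import Callable, List

MODEL_ID = "IMJONEZZ/SlovenBERTcina"  # assumed Hub id, taken from the repository path
MASK_TOKEN = "<mask>"

# Prompts from the evaluation examples in this card.
PROMPTS: List[str] = [
    "Mnoho ľudí tu <mask>",
    "Ako sa <mask>",
    "Plážová sezóna pod Zoborom patrí medzi <mask> obdobia.",
]


def load_fill_mask(model_id: str = MODEL_ID) -> Callable:
    """Build a fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below work without `transformers` installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask: Callable, prompt: str, k: int = 4) -> List[str]:
    """Return the k most likely fillers for a prompt with exactly one mask."""
    if prompt.count(MASK_TOKEN) != 1:
        raise ValueError(f"prompt must contain exactly one {MASK_TOKEN} token")
    predictions = fill_mask(prompt, top_k=k)
    # Byte-level BPE tokens often carry a leading space; strip it for display.
    return [pred["token_str"].strip() for pred in predictions]


if __name__ == "__main__":
    fill_mask = load_fill_mask()
    for prompt in PROMPTS:
        print(prompt, "->", top_tokens(fill_mask, prompt))
```

Keeping `top_tokens` separate from model loading means the pipeline is built once and reused across prompts.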

- Limitations:
The current model is fairly small (83 million parameters), although it works well on the fill-mask examples in this card. It is meant to be fine-tuned on downstream tasks such as part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE.
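
As a starting point for the fine-tuning mentioned above, the following sketch loads the checkpoint with a token-classification head for part-of-speech tagging. It assumes the Hub id `IMJONEZZ/SlovenBERTcina` and uses the 17 Universal POS tags as a hypothetical label set. `label_maps` and `load_for_tagging` are illustrative names, and the training loop and dataset are left to you.

```python
from typing import Dict, List, Sequence, Tuple

MODEL_ID = "IMJONEZZ/SlovenBERTcina"  # assumed Hub id, taken from the repository path

# Universal POS tags, used here only as a hypothetical label set.
UPOS_TAGS: List[str] = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def label_maps(labels: Sequence[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build the id<->label mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_tagging(labels: Sequence[str] = UPOS_TAGS, model_id: str = MODEL_ID):
    """Load the tokenizer and the encoder with a fresh token-classification head."""
    # Imported lazily so label_maps can be used without `transformers` installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = label_maps(labels)
    # add_prefix_space=True is needed by RoBERTa-style tokenizers when the
    # input is already split into words (is_split_into_words=True).
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The classification head is randomly initialized, so the returned model needs training on labeled data before its tags mean anything.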

- Credit:
If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in that work.