mideind/IceBERT-igc

	---
	language: is
	widget:
	- text: Má bjóða þér <mask> í kvöld?
	- text: Forseti <mask> er ágæt.
	- text: Súpan var <mask> á bragðið.
	tags:
	- roberta
	- icelandic
	- masked-lm
	- pytorch
	license: agpl-3.0
	---

	# IceBERT-igc

	This model was trained with fairseq using the RoBERTa-base architecture. It is one of many models we have trained for Icelandic; see the paper cited below for further details. The training data is listed in the following table.

	| Dataset | Size | Tokens |
	|----------------------------------------|--------|--------|
	| Icelandic Gigaword Corpus v20.05 (IGC) | 8.2 GB | 1,388M |
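
As a rough illustration of masked-token prediction with this model, the sketch below queries it with one of the widget sentences from the metadata above. It assumes the checkpoint loads through the Hugging Face `transformers` fill-mask pipeline (as the `roberta` and `masked-lm` tags suggest) under the repository ID `mideind/IceBERT-igc`. The helper names `load_fill_mask` and `predict_mask` are hypothetical and not part of any library.

```python
from typing import Callable, List, Optional, Tuple


def load_fill_mask(model_id: str = "mideind/IceBERT-igc") -> Callable:
    """Build a fill-mask pipeline; the checkpoint is downloaded on first use."""
    # Imported lazily so the module can be imported without transformers installed.
    # Assumption: `transformers` and `torch` are available in the environment.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def predict_mask(
    text: str,
    fill_mask: Optional[Callable] = None,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to k (token, score) pairs for the single <mask> token in `text`."""
    if fill_mask is None:
        fill_mask = load_fill_mask()
    # For a single-mask input the pipeline returns a list of dicts
    # containing, among other keys, "token_str" and "score".
    predictions = fill_mask(text, top_k=k)
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), float(p["score"])) for p in ranked[:k]]
```

Calling `predict_mask("Má bjóða þér <mask> í kvöld?")` would download the model and return the three highest-scoring fillers. Passing an existing pipeline as `fill_mask` avoids reloading the model for each sentence.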


	## Citation

	The model is described in the paper [https://arxiv.org/abs/2201.05601](https://arxiv.org/abs/2201.05601). Please cite it if you use the model.

	```
	@article{DBLP:journals/corr/abs-2201-05601,
	author = {V{\'{e}}steinn Sn{\ae}bjarnarson and
	Haukur Barri S{\'{\i}}monarson and
	P{\'{e}}tur Orri Ragnarsson and
	Svanhv{\'{\i}}t Lilja Ing{\'{o}}lfsd{\'{o}}ttir and
	Haukur P{\'{a}}ll J{\'{o}}nsson and
	Vilhj{\'{a}}lmur {\TH}orsteinsson and
	Hafsteinn Einarsson},
	title = {A Warm Start and a Clean Crawled Corpus - {A} Recipe for Good Language
	Models},
	journal = {CoRR},
	volume = {abs/2201.05601},
	year = {2022},
	url = {https://arxiv.org/abs/2201.05601},
	eprinttype = {arXiv},
	eprint = {2201.05601},
	timestamp = {Thu, 20 Jan 2022 14:21:35 +0100},
	biburl = {https://dblp.org/rec/journals/corr/abs-2201-05601.bib},
	bibsource = {dblp computer science bibliography, https://dblp.org}
	}
	```