vesteinn committed on
Commit
7e0f4b8
1 Parent(s): c71cced

Update README.md

Files changed (1)
  1. README.md +34 -4
README.md CHANGED
@@ -1,6 +1,5 @@
 ---
 language: is
-thumbnail: https://mideind.is/pro_greynir_logo.2bce8b4e1ce0bbe535c5.svg
 widget:
 - text: Má bjóða þér <mask> í kvöld?
 - text: Forseti <mask> er ágæt.
@@ -13,8 +12,39 @@ tags:
 license: agpl-3.0
 ---
 
-# IceBERT
+# IceBERT-igc
 
-IceBERT was trained with fairseq using the RoBERTa model on the Icelandic Giga Word corpus.
+This model was trained with fairseq using the RoBERTa-base architecture. It is one of many models we have trained for Icelandic, see the paper referenced below for further details. The training data used is shown in the table below.
+
+| Dataset | Size | Tokens |
+|------------------------------------------------------|---------|--------|
+| Icelandic Gigaword Corpus v20.05 (IGC) | 8.2 GB | 1,388M |
+
+
+## Scitation
+
+The model is described in this paper [https://arxiv.org/abs/2201.05601](https://arxiv.org/abs/2201.05601). Please cite the paper if you make use of the model.
+
+```
+@article{DBLP:journals/corr/abs-2201-05601,
+author = {V{\'{e}}steinn Sn{\ae}bjarnarson and
+Haukur Barri S{\'{\i}}monarson and
+P{\'{e}}tur Orri Ragnarsson and
+Svanhv{\'{\i}}t Lilja Ing{\'{o}}lfsd{\'{o}}ttir and
+Haukur P{\'{a}}ll J{\'{o}}nsson and
+Vilhj{\'{a}}lmur {\TH}orsteinsson and
+Hafsteinn Einarsson},
+title = {A Warm Start and a Clean Crawled Corpus - {A} Recipe for Good Language
+Models},
+journal = {CoRR},
+volume = {abs/2201.05601},
+year = {2022},
+url = {https://arxiv.org/abs/2201.05601},
+eprinttype = {arXiv},
+eprint = {2201.05601},
+timestamp = {Thu, 20 Jan 2022 14:21:35 +0100},
+biburl = {https://dblp.org/rec/journals/corr/abs-2201-05601.bib},
+bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
 
-This model will soon be replaced with another version trained on a larged dataset, this model will then be renamed to IceBERT-IGC