aapot committed
Commit fa9a284
1 Parent(s): c7e52d3

Create README.md

Files changed (1): README.md (+154, -0)
 
---
language:
- fi
license: apache-2.0
tags:
- finnish
- roberta
datasets:
- mc4
- wikipedia
pipeline_tag: fill-mask
widget:
- text: "Moikka olen <mask> kielimalli."
---

# RoBERTa large model for Finnish

Pretrained model on the Finnish language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
makes a difference between finnish and Finnish.

## Model description

RoBERTa is a transformers model pretrained on a large corpus of Finnish data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict
the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one
after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to
learn a bidirectional representation of the sentence.

This way, the model learns an inner representation of the Finnish language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT-2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Finnish-NLP/roberta-large-finnish')
>>> unmasker("Moikka olen <mask> kielimalli.")

[{'sequence': 'Moikka olen hyvä kielimalli.',
  'score': 0.1535797119140625,
  'token': 767,
  'token_str': ' hyvä'},
 {'sequence': 'Moikka olen paras kielimalli.',
  'score': 0.04795042425394058,
  'token': 2888,
  'token_str': ' paras'},
 {'sequence': 'Moikka olen huono kielimalli.',
  'score': 0.04251479730010033,
  'token': 3217,
  'token_str': ' huono'},
 {'sequence': 'Moikka olen myös kielimalli.',
  'score': 0.027469098567962646,
  'token': 520,
  'token_str': ' myös'},
 {'sequence': 'Moikka olen se kielimalli.',
  'score': 0.013878575526177883,
  'token': 358,
  'token_str': ' se'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('Finnish-NLP/roberta-large-finnish')
model = RobertaModel.from_pretrained('Finnish-NLP/roberta-large-finnish')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('Finnish-NLP/roberta-large-finnish')
model = TFRobertaModel.from_pretrained('Finnish-NLP/roberta-large-finnish', from_pt=True)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions.

## Training data

This Finnish RoBERTa model was pretrained on the combination of five datasets:
- [mc4](https://huggingface.co/datasets/mc4): mC4 is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset.
- [wikipedia](https://huggingface.co/datasets/wikipedia): we used the Finnish subset of the Wikipedia (August 2021) dataset.
- [Yle Finnish News Archive](http://urn.fi/urn:nbn:fi:lb-2017070501)
- [Finnish News Agency Archive (STT)](http://urn.fi/urn:nbn:fi:lb-2018121001)
- [The Suomi24 Sentences Corpus](http://urn.fi/urn:nbn:fi:lb-2020021803)

Raw datasets were cleaned to filter out low-quality and non-Finnish examples. Together these cleaned datasets were around 78GB of text.

## Training procedure

### Preprocessing

The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
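
For a quick look at this preprocessing, you can load the tokenizer and inspect the byte-level BPE pieces together with the `<s>`/`</s>` markers (the exact pieces printed depend on the tokenizer files shipped with the model):

```python
from transformers import AutoTokenizer

# Tokenizer of this model: byte-level BPE with a 50265-token vocabulary.
tokenizer = AutoTokenizer.from_pretrained('Finnish-NLP/roberta-large-finnish')

encoded = tokenizer("Moikka olen kielimalli.")
# The id sequence starts with '<s>' and ends with '</s>'; the pieces in between
# are byte-level BPE subwords.
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
```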

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- In the 10% remaining cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed).
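
This 15% / 80-10-10 dynamic masking scheme matches what the `DataCollatorForLanguageModeling` class in the `transformers` library implements. The snippet below is only an illustration of how inputs and labels are generated on the fly, not the actual pretraining script:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('Finnish-NLP/roberta-large-finnish')

# Masks 15% of the tokens in each batch; of those, 80% become <mask>, 10% are
# replaced by a random token and 10% are left unchanged. Because this runs at
# batching time, the masking pattern changes from epoch to epoch (dynamic masking).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("Moikka olen kielimalli.")])
print(batch["input_ids"])
print(batch["labels"])  # -100 marks positions that were not masked (ignored in the loss)
```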

### Pretraining

The model was trained on a TPUv3-8 VM, sponsored by the Google TPU Research Cloud, for 2 epochs with a sequence length of 128 and then for one more epoch with a sequence length of 512. The optimizer used is Adafactor with a learning rate of 2e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), learning rate warmup for 1500 steps and linear decay of the learning rate after.
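
The original TPU training script is not reproduced here, but a rough PyTorch sketch of the optimizer and learning rate schedule described above could look as follows (the total number of training steps is only a placeholder, and the \\(\beta_{2}\\)/\\(\epsilon\\) values above do not map one-to-one onto the constructor arguments of the `transformers` Adafactor implementation):

```python
from transformers import Adafactor, RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained('Finnish-NLP/roberta-large-finnish')

# Adafactor with a fixed peak learning rate of 2e-4 and first-moment decay 0.9.
optimizer = Adafactor(
    model.parameters(),
    lr=2e-4,
    beta1=0.9,
    relative_step=False,   # use the explicit learning rate instead of Adafactor's built-in schedule
    scale_parameter=False,
    warmup_init=False,
)

# 1500 warmup steps followed by linear decay of the learning rate.
num_training_steps = 100_000  # placeholder; depends on dataset size and batch size
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1500, num_training_steps=num_training_steps
)
```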

## Evaluation results

Evaluation was done by fine-tuning the model on a downstream text classification task with two different labeled datasets: [Yle News](https://github.com/spyysalo/yle-corpus) and [Eduskunta](https://github.com/aajanki/eduskunta-vkk). Yle News classification fine-tuning was done with two different sequence lengths, 128 and 512, while Eduskunta fine-tuning used only the 128 sequence length.
When fine-tuned on those datasets, this model (the first row of the table) achieves the following accuracy results compared to the [FinBERT (Finnish BERT)](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) model and to our previous [Finnish RoBERTa-large](https://huggingface.co/flax-community/RoBERTa-large-finnish) trained during the Hugging Face JAX/Flax community week:

| Model                                | Average   | Yle News 128 length | Yle News 512 length | Eduskunta 128 length |
|--------------------------------------|-----------|---------------------|---------------------|----------------------|
| Finnish-NLP/roberta-large-finnish    | 88.02     | 94.53               | 95.23               | 74.30                |
| TurkuNLP/bert-base-finnish-cased-v1  | **88.82** | **94.90**           | **95.49**           | **76.07**            |
| flax-community/RoBERTa-large-finnish | 87.72     | 94.42               | 95.06               | 73.67                |

To conclude, this model improves on our previous [Finnish RoBERTa-large](https://huggingface.co/flax-community/RoBERTa-large-finnish) model trained during the Hugging Face JAX/Flax community week, but still trails the [FinBERT (Finnish BERT)](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) model slightly (by about 1%).
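
The fine-tuning scripts themselves are not included in this repository. As a rough, minimal sketch (with a placeholder toy dataset, label count and hyperparameters, not the exact setup used for the results above), a sequence classification fine-tuning could be set up like this:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained('Finnish-NLP/roberta-large-finnish')
model = AutoModelForSequenceClassification.from_pretrained(
    'Finnish-NLP/roberta-large-finnish',
    num_labels=2,  # placeholder: set to the number of classes in your dataset
)

# Placeholder toy data; substitute your own labeled Finnish texts (e.g. news categories).
data = Dataset.from_dict({
    "text": ["Uutinen taloudesta.", "Urheilu-uutinen jääkiekosta."],
    "label": [0, 1],
})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-large-finnish-classifier", num_train_epochs=3),
    train_dataset=data,
    tokenizer=tokenizer,  # enables dynamic padding of the batches
)
trainer.train()
```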

## Team Members

- Aapo Tanskanen ([aapot](https://huggingface.co/aapot))
- Rasmus Toivanen ([RASMUS](https://huggingface.co/RASMUS))
- Tommi Vehviläinen ([Tommi](https://huggingface.co/Tommi))