Commit 4e05e88 by aapot (1 parent: 308d107)

Update README.md

Files changed (1)
  1. README.md +112 -107
README.md CHANGED
@@ -1,108 +1,113 @@
1
- ---
2
- language:
3
- - fi
4
- license: apache-2.0
5
- tags:
6
- - finnish
7
- - convbert
8
- datasets:
9
- - Finnish-NLP/mc4_fi_cleaned
10
- - wikipedia
11
- widget:
12
- - text: "Moikka olen [MASK] kielimalli."
13
-
14
- ---
15
-
16
- # ConvBERT for Finnish
17
-
18
- Pretrained ConvBERT model on Finnish language using a replaced token detection (RTD) objective. ConvBERT was introduced in
19
- [this paper](https://arxiv.org/abs/2008.02496)
20
- and first released at [this page](https://github.com/yitu-opensource/ConvBert).
21
-
22
- **Note**: this model is the ConvBERT generator model intented to be used for the fill-mask task. The ConvBERT discriminator model intented to be used for fine-tuning on downstream tasks like text classification is released here [Finnish-NLP/convbert-base-finnish](https://huggingface.co/Finnish-NLP/convbert-base-finnish)
23
-
24
- ## Model description
25
-
26
- Finnish ConvBERT is a transformers model pretrained on a very large corpus of Finnish data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
27
-
28
- More precisely, it was pretrained with the replaced token detection (RTD) objective. Instead of masking the input like in BERT's masked language modeling (MLM) objective, this approach corrupts the input by replacing some tokens with plausible alternatives sampled from a small generator model. Then, instead of training a model that predicts the original identities of the corrupted tokens, a discriminative model is trained that predicts whether each token in the corrupted input was replaced by a generator model's sample or not. Thus, this training approach resembles Generative Adversarial Nets (GAN).
29
-
30
- This way, the model learns an inner representation of the Finnish language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the ConvBERT model as inputs.
31
-
32
- Compared to BERT and ELECTRA models, ConvBERT model utilizes a span-based
33
- dynamic convolution to replace some of the global self-attention heads for modeling local input sequence
34
- dependencies. These convolution heads, together with the rest of the self-attention
35
- heads, form a new mixed attention block that should be more efficient at both global
36
- and local context learning.
37
-
38
- ## Intended uses & limitations
39
-
40
- You can use this generator model mainly just for the fill-mask task. For other tasks, check the [Finnish-NLP/convbert-base-finnish](https://huggingface.co/Finnish-NLP/convbert-base-finnish) model instead.
41
-
42
- ### How to use
43
-
44
- Here is how to use this model directly with a pipeline for fill-mask task:
45
-
46
- ```python
47
- >>> from transformers import pipeline
48
- >>> unmasker = pipeline('fill-mask', model='Finnish-NLP/convbert-base-generator-finnish')
49
- >>> unmasker("Moikka olen [MASK] kielimalli.")
50
- [{'score': 0.08341152966022491,
51
- 'token': 4619,
52
- 'token_str': 'suomalainen',
53
- 'sequence': 'Moikka olen suomalainen kielimalli.'},
54
- {'score': 0.02831297740340233,
55
- 'token': 25583,
56
- 'token_str': 'ranskalainen',
57
- 'sequence': 'Moikka olen ranskalainen kielimalli.'},
58
- {'score': 0.027857203036546707,
59
- 'token': 37714,
60
- 'token_str': 'kiinalainen',
61
- 'sequence': 'Moikka olen kiinalainen kielimalli.'},
62
- {'score': 0.027701903134584427,
63
- 'token': 21614,
64
- 'token_str': 'ruotsalainen',
65
- 'sequence': 'Moikka olen ruotsalainen kielimalli.'},
66
- {'score': 0.026388710364699364,
67
- 'token': 591,
68
- 'token_str': 'hyvä',
69
- 'sequence': 'Moikka olen hyvä kielimalli.'}]
70
- ```
71
-
72
- ### Limitations and bias
73
-
74
- The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
75
-
76
- ## Training data
77
-
78
- This Finnish ConvBERT model was pretrained on the combination of five datasets:
79
- - [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned), the dataset mC4 is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset and further cleaned it with our own text data cleaning codes (check the dataset repo).
80
- - [wikipedia](https://huggingface.co/datasets/wikipedia) We used the Finnish subset of the wikipedia (August 2021) dataset
81
- - [Yle Finnish News Archive 2011-2018](http://urn.fi/urn:nbn:fi:lb-2017070501)
82
- - [Finnish News Agency Archive (STT)](http://urn.fi/urn:nbn:fi:lb-2018121001)
83
- - [The Suomi24 Sentences Corpus](http://urn.fi/urn:nbn:fi:lb-2020021803)
84
-
85
- Raw datasets were cleaned to filter out bad quality and non-Finnish examples. Together these cleaned datasets were around 84GB of text.
86
-
87
- ## Training procedure
88
-
89
- ### Preprocessing
90
-
91
- The texts are tokenized using WordPiece and a vocabulary size of 50265. The inputs are sequences of 512 consecutive tokens. Texts are not lower cased so this model is case-sensitive: it makes a difference between finnish and Finnish.
92
-
93
- ### Pretraining
94
-
95
- The model was trained on TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 1M steps. The optimizer used was a AdamW with learning rate 1e-4, learning rate warmup for 20000 steps and linear decay of the learning rate after.
96
-
97
- Training code was from the official [ConvBERT repository](https://github.com/yitu-opensource/ConvBert) and also some instructions was used from [here](https://github.com/stefan-it/turkish-bert/blob/master/convbert/CHEATSHEET.md).
98
-
99
- ## Evaluation results
100
-
101
- For evaluation results, check the [Finnish-NLP/convbert-base-finnish](https://huggingface.co/Finnish-NLP/convbert-base-finnish) model repository instead.
102
-
103
- ## Team Members
104
-
105
- - Aapo Tanskanen, [Hugging Face profile](https://huggingface.co/aapot), [LinkedIn profile](https://www.linkedin.com/in/aapotanskanen/)
106
- - Rasmus Toivanen, [Hugging Face profile](https://huggingface.co/RASMUS), [LinkedIn profile](https://www.linkedin.com/in/rasmustoivanen/)
107
-
108
  Feel free to contact us for more details 🤗
1
+ ---
2
+ language:
3
+ - fi
4
+ license: apache-2.0
5
+ tags:
6
+ - finnish
7
+ - convbert
8
+ datasets:
9
+ - Finnish-NLP/mc4_fi_cleaned
10
+ - wikipedia
11
+ widget:
12
+ - text: "Moikka olen [MASK] kielimalli."
13
+
14
+ ---
15
+
16
+ # ConvBERT for Finnish
17
+
18
+ ConvBERT model pretrained on the Finnish language using a replaced token detection (RTD) objective. ConvBERT was introduced in
19
+ [this paper](https://arxiv.org/abs/2008.02496)
20
+ and first released at [this page](https://github.com/yitu-opensource/ConvBert).
21
+
22
+ **Note**: this model is the ConvBERT generator model intended to be used for the fill-mask task. The ConvBERT discriminator model intended to be used for fine-tuning on downstream tasks like text classification is released here: [Finnish-NLP/convbert-base-finnish](https://huggingface.co/Finnish-NLP/convbert-base-finnish).
23
+
24
+ ## Model description
25
+
26
+ Finnish ConvBERT is a transformers model pretrained on a very large corpus of Finnish data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts.
27
+
28
+ More precisely, it was pretrained with the replaced token detection (RTD) objective. Instead of masking the input like in BERT's masked language modeling (MLM) objective, this approach corrupts the input by replacing some tokens with plausible alternatives sampled from a small generator model. Then, instead of training a model that predicts the original identities of the corrupted tokens, a discriminative model is trained that predicts whether each token in the corrupted input was replaced by a generator model's sample or not. Thus, this training approach resembles Generative Adversarial Nets (GAN).
29
+
30
+ This way, the model learns an inner representation of the Finnish language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the ConvBERT model as inputs.
31
+
32
+ Compared to BERT and ELECTRA models, the ConvBERT model uses a span-based
33
+ dynamic convolution to replace some of the global self-attention heads for modeling local input sequence
34
+ dependencies. These convolution heads, together with the rest of the self-attention
35
+ heads, form a new mixed attention block that should be more efficient at both global
36
+ and local context learning.
37
+
38
+ ## Intended uses & limitations
39
+
40
+ This generator model is mainly intended for the fill-mask task. For other tasks, check the [Finnish-NLP/convbert-base-finnish](https://huggingface.co/Finnish-NLP/convbert-base-finnish) model instead.
41
+
42
+ ### How to use
43
+
44
+ Here is how to use this model directly with a pipeline for the fill-mask task:
45
+
46
+ ```python
47
+ >>> from transformers import pipeline
48
+ >>> unmasker = pipeline('fill-mask', model='Finnish-NLP/convbert-base-generator-finnish')
49
+ >>> unmasker("Moikka olen [MASK] kielimalli.")
50
+ [{'score': 0.08341152966022491,
51
+ 'token': 4619,
52
+ 'token_str': 'suomalainen',
53
+ 'sequence': 'Moikka olen suomalainen kielimalli.'},
54
+ {'score': 0.02831297740340233,
55
+ 'token': 25583,
56
+ 'token_str': 'ranskalainen',
57
+ 'sequence': 'Moikka olen ranskalainen kielimalli.'},
58
+ {'score': 0.027857203036546707,
59
+ 'token': 37714,
60
+ 'token_str': 'kiinalainen',
61
+ 'sequence': 'Moikka olen kiinalainen kielimalli.'},
62
+ {'score': 0.027701903134584427,
63
+ 'token': 21614,
64
+ 'token_str': 'ruotsalainen',
65
+ 'sequence': 'Moikka olen ruotsalainen kielimalli.'},
66
+ {'score': 0.026388710364699364,
67
+ 'token': 591,
68
+ 'token_str': 'hyvä',
69
+ 'sequence': 'Moikka olen hyvä kielimalli.'}]
70
+ ```
71
+
72
+ ### Limitations and bias
73
+
74
+ The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions. This bias will also affect all fine-tuned versions of this model.
75
+
76
+ ## Training data
77
+
78
+ This Finnish ConvBERT model was pretrained on the combination of five datasets:
79
+ - [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned): mC4 is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset and cleaned it further with our own text-cleaning code (see the dataset repo).
80
+ - [wikipedia](https://huggingface.co/datasets/wikipedia): we used the Finnish subset of the Wikipedia (August 2021) dataset.
81
+ - [Yle Finnish News Archive 2011-2018](http://urn.fi/urn:nbn:fi:lb-2017070501)
82
+ - [Finnish News Agency Archive (STT)](http://urn.fi/urn:nbn:fi:lb-2018121001)
83
+ - [The Suomi24 Sentences Corpus](http://urn.fi/urn:nbn:fi:lb-2020021803)
84
+
85
+ The raw datasets were cleaned to filter out low-quality and non-Finnish examples. Together, the cleaned datasets amount to around 84GB of text.
86
+
87
+ ## Training procedure
88
+
89
+ ### Preprocessing
90
+
91
+ The texts are tokenized using WordPiece with a vocabulary size of 50265. The inputs are sequences of 512 consecutive tokens. Texts are not lowercased, so this model is case-sensitive: it distinguishes between finnish and Finnish.
92
+
93
+ ### Pretraining
94
+
95
+ The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/), for 1M steps. The optimizer was AdamW with a learning rate of 1e-4, learning rate warmup for 20000 steps, and linear decay of the learning rate afterwards.
96
+
97
+ The training code was from the official [ConvBERT repository](https://github.com/yitu-opensource/ConvBert), and some instructions from [here](https://github.com/stefan-it/turkish-bert/blob/master/convbert/CHEATSHEET.md) were also used.
98
+
99
+ ## Evaluation results
100
+
101
+ For evaluation results, check the [Finnish-NLP/convbert-base-finnish](https://huggingface.co/Finnish-NLP/convbert-base-finnish) model repository instead.
102
+
103
+ ## Acknowledgements
104
+
105
+ This project would not have been possible without compute generously provided by Google through the
106
+ [TPU Research Cloud](https://sites.research.google/trc/).
107
+
108
+ ## Team Members
109
+
110
+ - Aapo Tanskanen, [Hugging Face profile](https://huggingface.co/aapot), [LinkedIn profile](https://www.linkedin.com/in/aapotanskanen/)
111
+ - Rasmus Toivanen, [Hugging Face profile](https://huggingface.co/RASMUS), [LinkedIn profile](https://www.linkedin.com/in/rasmustoivanen/)
112
+
113
  Feel free to contact us for more details 🤗
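
As a reading aid for the Pretraining paragraph in this README, the learning rate schedule it describes (peak 1e-4, 20000 warmup steps, linear decay afterwards, 1M steps in total) could be sketched roughly as below. This is only an illustration, not the training code: the function name `learning_rate` is hypothetical, and the README does not say where warmup starts or where the decay ends, so the sketch assumes the rate rises linearly from zero and falls linearly back to zero at the final step.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=20_000, total_steps=1_000_000):
    """Hypothetical schedule: linear warmup to peak_lr, then linear decay.

    Assumptions not stated in the README: warmup starts at 0 and the
    decay reaches 0 exactly at total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the peak rate by the fraction of post-warmup steps left.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 10_000, 20_000, 510_000, 1_000_000):
        print(step, learning_rate(step))
```

Under these assumptions the rate peaks at step 20000 and is back to half the peak value at step 510000, the midpoint of the decay phase.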