Updated README
README.md
CHANGED
```diff
@@ -6,16 +6,16 @@ language:
 
 # Latvian BERT base model (cased)
 
-A BERT model pretrained on
-It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via [
+A BERT model pretrained on Latvian language data using the masked language modeling and next sentence prediction objectives.
+It was introduced in [this paper](http://ebooks.iospress.nl/volumearticle/55531) and first released via a [GitHub repository](https://github.com/LUMII-AILab/LVBERT).
+The current HF repository contains an improved version of LVBERT.
 
-This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding
-
-Developed at [AiLab.lv](https://ailab.lv)
+This model is case-sensitive. It is primarily intended to be fine-tuned on downstream natural language understanding tasks like text classification, named entity recognition, and question answering.
+However, the model can also be used as-is to compute contextual embeddings for tasks like text similarity, clustering, and semantic search.
 
 ## Training data
 
-LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); 500M tokens in total.
+LVBERT was pretrained on texts from the [Balanced Corpus of Modern Latvian](https://korpuss.lv/en/id/LVK2018), [Latvian Wikipedia](https://korpuss.lv/en/id/Vikipēdija), [Corpus of News Portal Articles](https://korpuss.lv/en/id/Ziņas), as well as [Corpus of News Portal Comments](https://korpuss.lv/en/id/Barometrs); around 500M tokens in total.
 
 ## Tokenization
 
```
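The updated README states the model can be used as-is to compute contextual embeddings. A minimal sketch of that usage with the `transformers` library, mean-pooling the last hidden states into sentence embeddings; the model id below is a placeholder assumption, since the diff does not name the Hub repository:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder model id -- substitute the actual LVBERT id on the Hugging Face Hub.
MODEL_ID = "AiLab-IMCS-UL/lvbert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentences = [
    "Rīga ir Latvijas galvaspilsēta.",
    "Parīze ir Francijas galvaspilsēta.",
]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# Mean-pool token embeddings, ignoring padding positions via the attention mask.
mask = enc["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
```

For similarity search or clustering, the pooled `embeddings` tensor can be fed directly into any vector index or clustering algorithm.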