---
license: apache-2.0
---

<center><img src="https://i.imgur.com/f1OSa3c.png" /></center>

Qra is a series of LLMs adapted to the Polish language, resulting from a collaboration between the National Information Processing Institute (OPI) and Gdańsk University of Technology (PG). The models were trained on the infrastructure of the PG TASK Computing Center using 21 Nvidia A100 cards. The published versions of the Qra models were initialized with the weights of English Llama 2 checkpoints and then further trained on a carefully cleaned, filtered, and deduplicated corpus of Polish texts, totaling about 90 billion tokens. The original corpus consisted primarily of web data, including CommonCrawl dumps, the MADLAD-400 corpus, and other crawls of Polish websites.

⚠️ **Important: Qra models are foundation language models trained with a causal language modeling objective on a large corpus of texts. They are therefore not intended for conversational or instruction-following use, and should be further fine-tuned before being applied to such tasks.** ⚠️
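
As base models, the Qra checkpoints are best used for plain text completion (or as a starting point for fine-tuning). Below is a minimal completion sketch using the Hugging Face `transformers` library; the repository id is an assumption made here for illustration and should be replaced with the id of the actual Qra checkpoint you intend to use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id for illustration only; replace it with the id
# of the specific Qra checkpoint you want to load.
model_id = "OPI-PG/Qra-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# As a base (non-instruct) model, Qra continues the given text rather than
# answering it the way a chat assistant would.
prompt = "Najważniejszym celem sztucznej inteligencji jest"  # "The most important goal of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```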

The final distribution of documents by topic is shown in the chart below:

<center><img src="https://i.imgur.com/NtWZJ4J.png" /></center>

## Model details

The models were trained for one epoch on sequences of 4096 tokens. During training, we used many modern optimizations such as: