---
license: apache-2.0
---

<center><img src="https://huggingface.co/OPI-PG/Qra-1b/resolve/main/images/1b-logo.png"></img></center>

Qra is a series of LLMs adapted to the Polish language, developed in collaboration between the National Information Processing Institute (OPI) and Gdańsk University of Technology (PG). The models were trained on the infrastructure of the PG TASK Computing Center using 21 Nvidia A100 cards. The published Qra models were initialized with the weights of English Llama 2 checkpoints and then further trained on a carefully cleaned, filtered, and deduplicated corpus of Polish texts, totaling about 90 billion tokens. The original corpus consisted primarily of web data, including CommonCrawl dumps and the MADLAD-400 corpus.

⚠️ **Important: Qra models are foundation language models trained with a causal language modeling objective on a large corpus of texts. They are therefore not intended for conversational or instruction-following purposes, and should be further fine-tuned before being used for such tasks.** ⚠️

The preprocessing pipeline included the following steps:

- Filtering documents with a quality classifier trained on several thousand documents manually labeled as high or low quality. The classifier's input is a set of statistics ("quality signals") such as the percentage of Polish words, average word and sentence length, the number of word and character duplications, and the proportion of different character classes in the text.
- Filtering documents based on the perplexity value computed by a lightweight KenLM language model.
- Assigning each document to one of 18 topical domains using a trained classifier.
- Fuzzy deduplication with the MinHash algorithm within each topical domain.
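
The MinHash step estimates the Jaccard similarity between documents from compact signatures instead of comparing full shingle sets. A minimal stdlib sketch of the idea (the shingle size, signature length, and threshold here are illustrative, not the values used for Qra):

```python
import hashlib

NUM_HASHES = 64  # signature length (illustrative)

def shingles(text, n=3):
    """Return the set of word n-grams ("shingles") of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh
        )
        for seed in range(NUM_HASHES)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

def deduplicate(docs, threshold=0.8):
    """Greedy near-duplicate removal: keep a document only if its signature is
    below the similarity threshold against every document kept so far."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, ks) < threshold for ks in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

At corpus scale, the all-pairs comparison above would be replaced by locality-sensitive hashing, which buckets similar signatures so that only candidate pairs are compared.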

The final distribution of documents by topic is shown in the chart below:

<center><img src="https://huggingface.co/OPI-PG/Qra-1b/resolve/main/images/topics.png"></img></center>

## Model details

The models were trained for one epoch on sequences of 4096 tokens. During training, we used the following techniques:

- [Flash Attention 2](https://github.com/Dao-AILab/flash-attention)
- [Mixed precision](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#bf16) (`--bf16` and `--tf32` options)
- [Gradient accumulation](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#gradient-accumulation)
- [Fully Sharded Data Parallel (FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) with the SHARD_GRAD_OP mode
- [Gradient checkpointing](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#gradient-checkpointing) (only for the 13B model)
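
With the Hugging Face `Trainer`, a configuration combining these options might look roughly like the sketch below. This is an assumption for illustration, not the actual Qra training configuration: the output directory and accumulation steps are placeholders.

```python
from transformers import TrainingArguments

# Sketch only: values other than the options listed above are illustrative.
args = TrainingArguments(
    output_dir="qra-checkpoints",   # hypothetical path
    bf16=True,                      # bfloat16 mixed precision (--bf16)
    tf32=True,                      # TF32 matmuls on Ampere GPUs (--tf32)
    gradient_accumulation_steps=8,  # illustrative value
    fsdp="shard_grad_op",           # FSDP with the SHARD_GRAD_OP strategy
    gradient_checkpointing=True,    # only used for the 13B model
)
# Flash Attention 2 is enabled on the model side, e.g.
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
```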

Below is a summary of the Qra-1B model:

| Attribute | Value |
| ---- | ---- |
| Adapted from | [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) |
| License | Apache 2.0 |
| Batch size | 1344 |
| Context length | 4096 |
| Learning rate | 2e-5 |
| Learning rate decay | cosine |
| Warmup steps | 0 |
| Training time | 2 days |

## Evaluation

### Long documents (2024)

Currently, LLMs support contexts of thousands of tokens, and their practical applications usually involve processing long documents. Evaluating perplexity on a sentence-based dataset such as PolEval-2018 may therefore not be meaningful. Additionally, the PolEval corpus has been publicly available on the internet for the past few years, raising the possibility that the training sets of some models have been contaminated with this data. For this reason, we have prepared a new collection of long papers published exclusively in 2024, which allows us to more reliably test the perplexity of the models on new knowledge that was not available to them at training time. The corpus consists of 5,000 documents ranging from several hundred to about 20,000 tokens. Half of the set consists of press texts from Polish news portals from February 2024; the other half are scientific articles published since January 2024. Most of the documents exceed the context size of the evaluated models. To calculate perplexity for these documents, we divided them into chunks of size equal to the model's context length, with a stride of 512 tokens, following [this example](https://huggingface.co/docs/transformers/en/perplexity).
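
The chunking scheme from the linked example can be sketched in isolation: windows of the model's context length advance by the stride, and each window contributes only its not-yet-scored tokens to the loss, so every token is scored exactly once. A minimal sketch of the indexing logic (context length and stride here are illustrative):

```python
def sliding_window_chunks(n_tokens, context_len, stride=512):
    """Yield (start, end, n_target) windows covering a document of n_tokens.

    Each window spans at most context_len tokens; consecutive windows start
    stride tokens apart, and n_target counts the trailing tokens of the window
    that earlier windows have not yet scored.
    """
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + context_len, n_tokens)
        n_target = end - prev_end  # tokens not already scored earlier
        yield start, end, n_target
        prev_end = end
        if end == n_tokens:
            break
```

In the full evaluation, each window's token IDs are fed to the model with the first `context_len - n_target` labels masked out, and the per-token losses are aggregated across windows into a document-level perplexity.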

<table>
<thead>
|