Update README.md
README.md
CHANGED
@@ -17,6 +17,7 @@ Training was done on [Karolina](https://www.it4i.cz/en) cluster.
 - [BUT-FIT/csmpt7b](https://huggingface.co/BUT-FIT/csmpt7b)
 
 # <span style="color:blue">Latest Updates</span>
+- 01/10/2024 We released [BenCzechMark](https://huggingface.co/spaces/CZLC/BenCzechMark), the first Czech evaluation suite for fair open-weights model comparison.
 - 18/04/2024 We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at [czechllm.fit.vutbr.cz/csmpt7b/checkpoints/](https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/)
 - 06/05/2024 We released a small, manually annotated [dataset of adult content](https://huggingface.co/datasets/BUT-FIT/adult_content_classifier_dataset). We used a classifier trained on this dataset to filter our corpus.
 -
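Not shown in the diff: how to consume these artifacts. Below is a minimal sketch, not the authors' tooling — the archive name is hypothetical (check the directory index for the real files), and `trust_remote_code=True` is an assumption based on MPT-style models shipping custom modelling code.

```python
import subprocess
import urllib.request

from transformers import AutoModelForCausalLM, AutoTokenizer

# Training checkpoints are ZPAQ-packed MosaicML (Composer) states.
# The archive name below is hypothetical; browse the index at
# https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/ for the real files.
BASE = "https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/"
ARCHIVE = "checkpoint.zpaq"  # hypothetical name
urllib.request.urlretrieve(BASE + ARCHIVE, ARCHIVE)
# `zpaq x <archive>` extracts the archive; needs the zpaq CLI on PATH.
subprocess.run(["zpaq", "x", ARCHIVE], check=True)

# The released model itself is an ordinary Hugging Face repo.
tokenizer = AutoTokenizer.from_pretrained("BUT-FIT/csmpt7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("BUT-FIT/csmpt7b", trust_remote_code=True)
```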
@@ -35,7 +36,6 @@ Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
 However, we ran validation over the course of training on CS-HellaSwag, and after 100k steps the improvements, if any, were very noisy.
 The improvement over mistral7b is not significant.
 
-We will release more evaluations together with our benchmark **BenCzechMark** soon (see release plan!).
 
 ## Loss
 We encountered loss spikes during training. As the model always recovered, and our budget for training the 7b model was very constrained, we kept on training. We had observed such loss spikes before in our ablations. In these ablations (with GPT-2 small), we found these to be
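For readers unfamiliar with how the CS-HellaSwag numbers above are produced: HellaSwag-style benchmarks are usually scored by having the model pick the candidate ending with the highest log-likelihood. A minimal sketch of that scoring rule follows — it is not the authors' evaluation harness, and splitting the ending off by context token count is a simplification that can misalign at token boundaries.

```python
import torch

def ending_log_likelihood(model, tokenizer, context: str, ending: str) -> float:
    """Sum of log-probs the model assigns to the ending tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # logits[0, i] predicts token i+1; score only the ending's positions.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos, ids[0, pos + 1]].item()
        for pos in range(ctx_len - 1, ids.shape[1] - 1)
    )

def predict(model, tokenizer, context: str, endings: list[str]) -> int:
    """Index of the ending the model considers most likely."""
    scores = [ending_log_likelihood(model, tokenizer, context, e) for e in endings]
    return max(range(len(endings)), key=scores.__getitem__)
```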
@@ -146,8 +146,8 @@ We release most (95.79%) of our training data corpus as [BUT-Large Czech Collection]
 | Stage | Description | Date |
 |---------------|----------------|----------------|
 | 1 | 'Best' model + training data | 13.03.2024
-| 2 | All checkpoints + training code| Checkpoints are released. Code won't be released. We've used LLM foundry with slight adjustments, but the version is outdated now.
-| 3 | __Benczechmark__ a collection of Czech datasets for few-shot LLM evaluation **Get in touch if you want to contribute!** |
+| 2 | All checkpoints + training code | 10.04.2024 Checkpoints are released. The code won't be released: we used LLM Foundry with slight adjustments, but that version is now outdated.
+| 3 | __BenCzechMark__, a collection of Czech datasets for few-shot LLM evaluation. **Get in touch if you want to contribute!** | 01.10.2024
 | 4 | Preprint Publication |
 
 ## Getting in Touch
@@ -168,7 +168,6 @@ by the Ministry of Education, Youth and Sports of the Czech Republic through the
   title = {BenCzechMark: Machine Language Understanding Benchmark for Czech Language},
   journal = {arXiv preprint arXiv:insert-arxiv-number-here},
   year = {2024},
-  month = {March},
   eprint = {insert-arxiv-number-here},
   archivePrefix = {arXiv},
   primaryClass = {cs.CL},