Update README.md
README.md CHANGED
@@ -4,7 +4,7 @@ license: apache-2.0
 # Introduction
 CSMPT7b is a large Czech language model continuously pretrained for 272b training steps from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) with a Czech tokenizer obtained using our vocabulary swap method (see below).
 
-#
+# Evaluation
 Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
 | Model | CS-HellaSwag Accuracy |
 |---------------|----------------|
@@ -19,7 +19,7 @@ Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
 However, we ran validation on CS-HellaSwag over the course of training, and after 100k steps the improvements, if any, were very noisy.
 The improvement over mistral7b is not significant.
 
-
+We will release more evaluations together with our benchmark **BenCzechMark** soon (see the release plan!).
 
 ## Loss
 We encountered loss spikes during training. As the model always recovered and our budget for training the 7b model was very constrained, we kept training. We had observed such loss spikes before in our ablations. In these ablations (with GPT-2 small), we found these to be
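
The introduction references a vocabulary swap used to move from the English MPT7b tokenizer to a Czech one; the authors' actual method is described further down the README. Purely as an illustrative sketch (not the authors' implementation), a common baseline for such a swap is to copy embedding vectors for tokens shared by the two vocabularies and mean-initialize the rest. The repo id for the Czech tokenizer below is an assumption.

```python
# Illustrative sketch only -- NOT the authors' vocabulary swap method.
# Copies embeddings for tokens present in both vocabularies; the remaining
# rows are initialized with the mean of the old embedding matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
new_tok = AutoTokenizer.from_pretrained("BUT-FIT/csmpt7b")  # assumed repo id for the Czech tokenizer

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
old_emb = model.get_input_embeddings().weight.data

# Start every new row from the mean of the old embeddings, then overwrite
# rows whose token string also exists in the old vocabulary.
new_emb = old_emb.mean(dim=0, keepdim=True).repeat(len(new_tok), 1)
old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```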
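
For the CS-HellaSwag figures in the table, accuracy on HellaSwag-style tasks is typically measured by scoring each candidate ending with the model's log-likelihood given the context and picking the highest-scoring one. The sketch below shows that generic scoring scheme under an assumed model id and simplified tokenization; it is not the evaluation harness behind the reported numbers.

```python
# Generic likelihood-based multiple-choice scoring (HellaSwag-style).
# Not the authors' evaluation code; model id and simplifications are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("BUT-FIT/csmpt7b")
model = AutoModelForCausalLM.from_pretrained(
    "BUT-FIT/csmpt7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

def ending_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities of the ending tokens, conditioned on the context.

    Simplification: context and context+ending are tokenized separately,
    ignoring possible token merges at the boundary.
    """
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    total = 0.0
    for pos in range(ctx_len, full_ids.shape[1]):
        total += logprobs[pos - 1, full_ids[0, pos]].item()
    return total

def predict(context: str, endings: list[str]) -> int:
    """Index of the ending the model assigns the highest log-likelihood."""
    return max(range(len(endings)), key=lambda i: ending_logprob(context, endings[i]))
```

Evaluation harnesses such as lm-evaluation-harness implement essentially this scheme, usually with additional normalizations (for example, length-normalized scores).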