Text Generation
Transformers
Safetensors
Czech
mpt
custom_code
text-generation-inference
Inference Endpoints
mfajcik committed on
Commit 941acb1
1 Parent(s): 6b9f7d9

Update README.md

Files changed (1)
  1. README.md +31 -7
README.md CHANGED
@@ -27,14 +27,24 @@ We encountered loss spikes during training. As the model always recovered, and o
  - (b) in preliminary ablations, they only appear for continuously pretrained models. While we do not know why they appear, we hypothesize this might be linked to the theory of [Adam instability in time-domain correlation of update vectors](https://arxiv.org/pdf/2304.09871.pdf). However,
  such instabilities were previously observed only for much larger models (larger than 65b).

- The model was trained on 3 corpora. Corpus #1 was the same we used for GPT-2 training (~16b tokens). <TBD MF>
+ ### Corpora
+ The model was trained on 3 corpora, which were hot-swapped during training. They were collected and filtered over the course of training.
+ - Corpus #1 is the same corpus we used for our [Czech GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) training (15,621,685,248 tokens).
+ - Corpus #2 contains 67,981,934,592 tokens, coming mostly from the HPLT and CulturaX corpora.
+ - Corpus #3 is Corpus #2 with a portion of inappropriate content (which had evaded our other checks) removed by a linear classifier (see the sketch after this hunk).
+

  <img src="figures/tloss_full.png" width="900"/>
  Figure 1: Training loss.
  <img src="figures/tloss_closeup.png" width="900"/>
- Figure 2: Training loss closeup. We mark two hotswap places, where the training corpus #1 was switched for internal-corpus #2 and internal-corpus #2.1 respectively. <TBD MF>
+ Figure 2: Training loss close-up. We mark the two hot-swap points, where training corpus #1 was switched for internal corpus #2 and internal corpus #2.1, respectively.
+ Additionally, we performed two ablations:
+
+ - (a) After the first hot swap, we continued training on corpus #1 for a while.
+ - (b) At step 94,000 the training loss stopped decreasing, then increased, and only started decreasing again around step 120,000 (near hot swap #2). To test whether this was an effect of the hot swap,
+ we resumed training from step 93,000 using corpus #3; the optimizer states were reinitialized (see the resume sketch after this hunk).
  <img src="figures/vloss_closeup.png" width="900"/>
+ Figure 3: Test loss close-up; testing was performed on a split of internal corpus #1. See the Figure 2 caption for an explanation of the ablations.


  ## Training Method
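
The Corpus #3 bullet above only names the technique. Below is a minimal, illustrative sketch of document-level filtering with a linear classifier; the TF-IDF features, the toy labeled examples, and the 0.5 threshold are our assumptions, not the released filtering pipeline.

```python
# Illustrative sketch only: train a linear classifier on a few labeled
# documents and drop corpus documents it flags as inappropriate. Features,
# labels, and the threshold are assumptions, not the actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

flagged_docs = ["toy example of unwanted text", "another unwanted snippet"]
clean_docs = ["toy example of ordinary text", "another ordinary snippet"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(flagged_docs + clean_docs)
y = [1] * len(flagged_docs) + [0] * len(clean_docs)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep the document unless the classifier flags it as inappropriate."""
    prob_flagged = clf.predict_proba(vectorizer.transform([document]))[0, 1]
    return prob_flagged < threshold

corpus_2 = ["some document", "another document"]   # stand-in for Corpus #2
corpus_3 = [doc for doc in corpus_2 if keep(doc)]  # filtered corpus
```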
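
Ablation (b) above resumes from the step-93,000 checkpoint with the optimizer states reinitialized. The following PyTorch sketch shows that pattern under assumed names: the checkpoint path, the use of `AutoModelForCausalLM`, and the AdamW hyperparameters are placeholders, not the actual training configuration.

```python
# Illustrative sketch: reload weights from a mid-training checkpoint but start
# with a fresh optimizer, so momentum/variance statistics are reset to zero.
# The checkpoint path, model class, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM

checkpoint_dir = "checkpoints/step_93000"  # hypothetical path

model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the model uses custom MPT code
)

# No optimizer state is loaded: AdamW starts from scratch at step 93,000.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

# ... resume the training loop on corpus #3 from here ...
```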
@@ -83,7 +93,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):

  ```
  # Training Data
- We release most of our training data here \[TBD MDocekal.\].
+ We release most (95.79%) of our training data as the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) corpus (see the loading sketch after this hunk).


  # Our Release Plan
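
The released corpus referenced in the hunk above can be pulled with the `datasets` library. A minimal sketch follows; streaming avoids downloading the full collection, and the split name and record fields are assumptions about the dataset layout.

```python
# Minimal sketch: stream the released BUT-Large Czech Collection.
# The split name and the printed field structure are assumptions.
from datasets import load_dataset

dataset = load_dataset("BUT-FIT/but_lcc", split="train", streaming=True)

for i, example in enumerate(dataset):
    print(example)  # inspect the record structure (field names may vary)
    if i >= 2:
        break
```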
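
The hunk above opens inside the README's own usage example (the `torch.autocast` line in the hunk header). For readers landing on this diff without the full file, here is a minimal generation sketch in the same spirit; the repository id is a placeholder and the sampling settings are not taken from the README.

```python
# Minimal generation sketch, analogous to the README's bfloat16 autocast usage.
# "BUT-FIT/<this-model>" is a placeholder; the diff does not name the repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "BUT-FIT/<this-model>"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to("cuda")

inputs = tokenizer("Nejznámějším českým spisovatelem je", return_tensors="pt").to("cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```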
@@ -98,10 +108,24 @@ We release most of our training data here \[TBD MDocekal.\].
  For further questions, email to `martin.fajcik@vut.cz`.

  # Disclaimer
- This is a probabilistic model, and authors are not responsible for the model outputs. Use at your own risk.
-
+ This is a probabilistic model; its outputs are stochastic. The authors are not responsible for the model outputs. Use at your own risk.

  # Acknowledgement
  This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT ---
  "Sémantický průzkumník textového kulturního dědictví" ("Semantic Explorer of Textual Cultural Heritage"), grant no. `DH23P03OVV060`, and
- by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:`90254`).
+ by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID: `90254`).
+
+ # Citation
+ ```bibtex
+ @article{benczechmark,
+   author        = {Martin Fajčík and Martin Dočekal and Jan Doležal and Karel Beneš and Michal Hradiš},
+   title         = {{BenCzechMark}: Machine Language Understanding Benchmark for Czech Language},
+   journal       = {arXiv preprint arXiv:insert-arxiv-number-here},
+   year          = {2024},
+   month         = {March},
+   eprint        = {insert-arxiv-number-here},
+   archivePrefix = {arXiv},
+   primaryClass  = {cs.CL},
+ }
+
+ ```