Text Generation
Transformers
Safetensors
Czech
mpt
custom_code
text-generation-inference
Inference Endpoints
mfajcik committed on
Commit 941acb1
1 Parent(s): 6b9f7d9

Update README.md

Files changed (1)
  1. README.md +31 -7
README.md CHANGED
@@ -27,14 +27,24 @@ We encountered loss spikes during training. As the model always recovered, and o
  - (b) in preliminary ablations, they only appear for continuously pretrained models. While we do not know why they appear, we hypothesize this might be linked to the theory of [Adam instability in time-domain correlation of update vectors](https://arxiv.org/pdf/2304.09871.pdf). However,
  such instabilities were previously observed only for much larger models (larger than 65b).

- The model was trained on 3 corpora. Corpus #1 was the same we used for GPT-2 training (~16b tokens). <TBD MF>
+ ### Corpora
+ The model was trained on 3 corpora, which were hot-swapped during training. They were collected and filtered over the course of training.
+ - Corpus #1 is the same corpus we used for our [Czech GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) training (15,621,685,248 tokens).
+ - Corpus #2 contains 67,981,934,592 tokens, coming mostly from the HPLT and CulturaX corpora.
+ - Corpus #3 is Corpus #2 with a portion of inappropriate content (which had evaded our other checks) removed by a linear classifier (see the sketch after this hunk).
+

  <img src="figures/tloss_full.png" width="900"/>
  Figure 1: Training loss.
  <img src="figures/tloss_closeup.png" width="900"/>
- Figure 2: Training loss closeup. We mark two hotswap places, where the training corpus #1 was switched for internal-corpus #2 and internal-corpus #2.1 respectively. <TBD MF>
+ Figure 2: Training loss close-up. We mark the two hot-swap points, where training corpus #1 was switched for internal corpus #2 and internal corpus #2.1, respectively.
+ Additionally, we performed two ablations:
+
+ - (a) After the first hot swap, we continued training on corpus #1 for a while.
+ - (b) At step 94,000 the training loss stopped decreasing, then increased, and only started decreasing again around step 120,000 (near hot swap #2). To test whether this was an effect of the hot swap,
+ we resumed training from step 93,000 using corpus #3; the optimizer states were reinitialized (see the resume sketch after this hunk).
  <img src="figures/vloss_closeup.png" width="900"/>
+ Figure 3: Test loss close-up; testing was performed on a split of internal corpus #1. See the Figure 2 caption for an explanation of the ablations.


  ## Training Method
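
The Corpus #3 bullet above only names the technique. Below is a minimal, illustrative sketch of document-level filtering with a linear classifier; the TF-IDF features, the toy labeled examples, and the 0.5 threshold are our assumptions, not the released filtering pipeline.

```python
# Illustrative sketch only: train a linear classifier on a few labeled
# documents and drop corpus documents it flags as inappropriate. Features,
# labels, and the threshold are assumptions, not the actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

flagged_docs = ["toy example of unwanted text", "another unwanted snippet"]
clean_docs = ["toy example of ordinary text", "another ordinary snippet"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(flagged_docs + clean_docs)
y = [1] * len(flagged_docs) + [0] * len(clean_docs)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep the document unless the classifier flags it as inappropriate."""
    prob_flagged = clf.predict_proba(vectorizer.transform([document]))[0, 1]
    return prob_flagged < threshold

corpus_2 = ["some document", "another document"]   # stand-in for Corpus #2
corpus_3 = [doc for doc in corpus_2 if keep(doc)]  # filtered corpus
```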
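
Ablation (b) above resumes from the step-93,000 checkpoint with the optimizer states reinitialized. The following PyTorch sketch shows that pattern under assumed names: the checkpoint path, the use of `AutoModelForCausalLM`, and the AdamW hyperparameters are placeholders, not the actual training configuration.

```python
# Illustrative sketch: reload weights from a mid-training checkpoint but start
# with a fresh optimizer, so momentum/variance statistics are reset to zero.
# The checkpoint path, model class, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM

checkpoint_dir = "checkpoints/step_93000"  # hypothetical path

model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the model uses custom MPT code
)

# No optimizer state is loaded: AdamW starts from scratch at step 93,000.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

# ... resume the training loop on corpus #3 from here ...
```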
@@ -83,7 +93,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):

  ```
  # Training Data
- We release most of our training data here \[TBD MDocekal.\].
+ We release most (95.79%) of our training data as the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) corpus (see the loading sketch after this hunk).


  # Our Release Plan
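
The released corpus referenced in the hunk above can be pulled with the `datasets` library. A minimal sketch follows; streaming avoids downloading the full collection, and the split name and record fields are assumptions about the dataset layout.

```python
# Minimal sketch: stream the released BUT-Large Czech Collection.
# The split name and the printed field structure are assumptions.
from datasets import load_dataset

dataset = load_dataset("BUT-FIT/but_lcc", split="train", streaming=True)

for i, example in enumerate(dataset):
    print(example)  # inspect the record structure (field names may vary)
    if i >= 2:
        break
```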
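
The hunk above opens inside the README's own usage example (the `torch.autocast` line in the hunk header). For readers landing on this diff without the full file, here is a minimal generation sketch in the same spirit; the repository id is a placeholder and the sampling settings are not taken from the README.

```python
# Minimal generation sketch, analogous to the README's bfloat16 autocast usage.
# "BUT-FIT/<this-model>" is a placeholder; the diff does not name the repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "BUT-FIT/<this-model>"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to("cuda")

inputs = tokenizer("Nejznámějším českým spisovatelem je", return_tensors="pt").to("cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```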
@@ -98,10 +108,24 @@ We release most of our training data here \[TBD MDocekal.\].
  For further questions, email to `martin.fajcik@vut.cz`.

  # Disclaimer
- This is a probabilistic model, and authors are not responsible for the model outputs. Use at your own risk.
-
+ This is a probabilistic model; its outputs are stochastic. The authors are not responsible for the model outputs. Use at your own risk.

  # Acknowledgement
  This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT ---
  "Sémantický průzkumník textového kulturního dědictví" ("Semantic Explorer of Textual Cultural Heritage"), grant no. `DH23P03OVV060`, and
- by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:`90254`).
+ by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID: `90254`).
+
+ # Citation
+ ```bibtex
+ @article{benczechmark,
+   author        = {Martin Fajčík and Martin Dočekal and Jan Doležal and Karel Beneš and Michal Hradiš},
+   title         = {{BenCzechMark}: Machine Language Understanding Benchmark for Czech Language},
+   journal       = {arXiv preprint arXiv:insert-arxiv-number-here},
+   year          = {2024},
+   month         = {March},
+   eprint        = {insert-arxiv-number-here},
+   archivePrefix = {arXiv},
+   primaryClass  = {cs.CL},
+ }
+
+ ```