# Czech GPT
This is our GPT-2 XL trained as a part of the research involved in [SemANT project](https://www.fit.vut.cz/research/project/1629/.en).

## Factsheet
- The model is trained on our  `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
- The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
- The model was trained on
- Our tokenizer size is 64k, and we use GPT-2 like `sentencepiece` encoding for tokenization.
- The model was trained by 133,000 update steps (~139B training tokens), before the end of the experiment.
- The model was adapted from the original GPT-2 XL, by: 
   - replacing the tokenizer,  
   - corresponding embeddings, and  
   - copying over 1,000 EN representations corresponding to the 1,000 most frequent tokens into new embeddings based on a bilingual dictionary.
- The training loss decreased steadily, and the model definitely didn't converge yet. We compare the loss to a small 124M model version.
  <img src="XL_vs_SMALL_train.png" width="600"/>
- The validation loss also decreased steadily. We had a bug in validation for early/late steps, so we released only validation from steps 46,000 to 100,000. Similarly, we compare the loss to the small 124M model version.
  <img src="XL_vs_SMALL_test.png" width="600"/>

## Training parameters
Not mentioned parameters are the same as for GPT-2.

| **Name**                   | **Value**     | **Note**                                                                                     |
|----------------------------|---------------|----------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Sequences at the model's input were concatenated up to `$max_seq_len`, divided by EOS token. |
| tokenizer_size             | 64k           |                                                                                              |
| max_seq_len                | 1024          |                                                                                              |
| batch_size                 | 1024          |                                                                                              |
| learning_rate              | 1.0e-4        |                                                                                              |
| optimizer                  | LionW         |                                                                                              |
| optimizer_betas            | 0.9/0.95      |                                                                                              |
| optimizer_weight_decay     | 0             |                                                                                              |
| optimizer_eps              | 1.0e-08       |                                                                                              |
| gradient_clipping_max_norm | 1.0           |                                                                                              |
| attn_impl                  | flash2        |                                                                                              |
| dropout                    | 0.1           | for residuals, attention, embeddings                                                         |
| fsdp                       | SHARD_GRAD_OP | (optimized for A100 40GB GPUs)                                                               |
| precision                  | bf16          |                                                                                              |
| scheduler                  | linear        |                                                                                              |
| scheduler_warmup           | 10,000 steps  |                                                                                              |
| scheduler_steps            | 200,000       |                                                                                              |
| scheduler_alpha            | 0.1           | So LR on last step is 0.1*(vanilla LR)                                                       |


## Evaluation
We observed 10-shot result improvement over the course of training for sentiment analysis, and hellaswag-like commonsense reasoning. 
There were some tasks where there was no such improvement, such as grammar error classification (does the sentence contain grammatical error?).
We will release the precise results once we advance with the work on our Czech evaluation kit.


## Disclaimer
This is a work-in-progress. [PH:Licensing Information]. For further questions, turn to `martin.fajcik@vut.cz`.