--- language: - cs metrics: - perplexity pipeline_tag: text-generation license: mit datasets: - BUT-FIT/BUT-LCC --- # Czech GPT This is our GPT-2 XL trained as a part of the research involved in [SemANT project](https://www.fit.vut.cz/research/project/1629/.en). # BUT LM Model Roster - [BUT-FIT/CSTinyLlama-1.2B](https://huggingface.co/BUT-FIT/CSTinyLlama-1.2B) - [BUT-FIT/Czech-GPT-2-XL-133k](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) - [BUT-FIT/csmpt7b](https://huggingface.co/BUT-FIT/csmpt7b) ## Factsheet - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling. - The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`. - Our tokenizer size is 64k, and we use GPT-2 like BPE encoding for tokenization. - The model is trained in GPT-2 style, the first token is an actual text token (not bos). Thus first token probability can't be computed. - Due to the feature of our code, our model was never trained to generate [EOS]. - The model was trained by 133,000 update steps (~139B training tokens), before the end of the experiment. - The model was adapted from the original GPT-2 XL, by: - replacing the tokenizer, - corresponding embeddings, and - copying over 1,000 EN representations corresponding to the 1,000 most frequent tokens into new embeddings based on a bilingual dictionary. - The training loss decreased steadily, and the model definitely didn't converge yet. We compare the loss to a small 124M model version. - The validation loss also decreased steadily. We had a bug in validation for early/late steps, so we released only validation from steps 46,000 to 100,000. Similarly, we compare the loss to the small 124M model version. ## Training parameters Not mentioned parameters are the same as for GPT-2. | **Name** | **Value** | **Note** | |----------------------------|---------------|----------------------------------------------------------------------------------------------| | dataset_type | Concat | Sequences at the model's input were concatenated up to `$max_seq_len`, divided by EOS token. | | tokenizer_size | 64k | | | max_seq_len | 1024 | | | batch_size | 1024 | | | learning_rate | 1.0e-4 | | | optimizer | LionW | | | optimizer_betas | 0.9/0.95 | | | optimizer_weight_decay | 0 | | | optimizer_eps | 1.0e-08 | | | gradient_clipping_max_norm | 1.0 | | | attn_impl | flash2 | | | dropout | 0.1 | for residuals, attention, embeddings | | fsdp | SHARD_GRAD_OP | (optimized for A100 40GB GPUs) | | precision | bf16 | | | scheduler | linear | | | scheduler_warmup | 10,000 steps | | | scheduler_steps | 200,000 | | | scheduler_alpha | 0.1 | So LR on last step is 0.1*(vanilla LR) | ## Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch t = AutoTokenizer.from_pretrained("BUT-FIT/Czech-GPT-2-XL-133k") m = AutoModelForCausalLM.from_pretrained("BUT-FIT/Czech-GPT-2-XL-133k").eval() # Try the model inference prompt = "Nejznámějším českým spisovatelem " input_ids = t.encode(prompt, return_tensors="pt") with torch.no_grad(): generated_text = m.generate(input_ids=input_ids, do_sample=True, top_p=0.95, repetition_penalty=1.0, temperature=0.8, max_new_tokens=64, num_return_sequences=1) print(t.decode(generated_text[0], skip_special_tokens=True)) ``` ## Evaluation We observed 10-shot result improvement over the course of training for sentiment analysis, and hellaswag-like commonsense reasoning. There were some tasks where there was no such improvement, such as grammar error classification (does the sentence contain grammatical error?). We will release the precise results once we advance with the work on our Czech evaluation kit. ## Disclaimer This is an intermediate result of our work-in-progress. This is a probabilistic model, and authors are not responsible for the model outputs. Use at your own risk. For further questions, turn to `martin.fajcik@vut.cz`. ## Acknowledgement This work was supported by NAKI III program of Ministry of Culture Czech Republic, project semANT --- "Sémantický průzkumník textového kulturního dědictví" grant no. `DH23P03OVV060` and by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:`90254`).