stefan-it committed commit 7a6a8a0 (verified · 1 parent: 955ece9)

readme: add initial version

Files changed (1): README.md (+99 −4)
@@ -14,10 +14,105 @@ tags:
 
 ![BERT5urk](bert5urk_logo.png)
 
- **New**: BERT5urk - an upcoming T5-based Turkish Language Model.
-
- **Update (20.01.2025)**: Official release will be between 27.01. and 02.02.2025 (evaluations are currently running...), stay tuned!
-
- **Update (27.01.2025)**: Release date will be 30.01.2025 (same day as the new RTX 5090 comes out, but this is just a coincidence, as I will be up all day to grab one...)
-
- **Update (29.01.2025)**: Evaluations are done \o/
+ This repository hosts the new 1.42B parameter Turkish T5 model named BERT5urk.
+
+ BERT5urk is part of the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) family and was pretrained using the awesome
+ [T5X](https://github.com/google-research/t5x) library with the [UL2](https://arxiv.org/abs/2205.05131) objective.
+
+ Inspired by the great [Finnish T5 and UL2 models](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) from the [Finnish NLP](https://huggingface.co/Finnish-NLP)
+ group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the ["Scale Efficiently"](https://arxiv.org/abs/2109.10686) paper. Many thanks
+ to the [Finnish NLP](https://huggingface.co/Finnish-NLP) group for open-sourcing the pretraining code and models!
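The UL2 objective mixes several denoising tasks; the best-known one is T5-style span corruption. As a toy illustration in plain Python (the tokens and span positions are made up for this sketch, and a real pretraining pipeline samples spans randomly over SentencePiece ids; the `<extra_id_N>` sentinel naming follows the T5 convention):

```python
# Toy T5-style span corruption: replace chosen spans with sentinels in the
# input and collect the dropped tokens as the target sequence.

def span_corrupt(tokens, spans):
    """Corrupt `tokens` by replacing each (start, end) span with a sentinel.

    Returns (corrupted_input, target): the input keeps the surrounding
    context plus sentinels, the target lists each sentinel followed by
    the tokens it replaced.
    """
    inputs, targets = [], []
    last = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[last:start])   # keep context before the span
        inputs.append(sentinel)             # mark the dropped span
        targets.append(sentinel)            # target: sentinel + dropped tokens
        targets.extend(tokens[start:end])
        last = end
    inputs.extend(tokens[last:])            # trailing context
    return inputs, targets

tokens = "BERT5urk is a Turkish T5 model pretrained with UL2".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (6, 8)])
print(inp)  # ['BERT5urk', 'is', '<extra_id_0>', 'T5', 'model', '<extra_id_1>', 'UL2']
print(tgt)  # ['<extra_id_0>', 'a', 'Turkish', '<extra_id_1>', 'pretrained', 'with']
```

UL2 extends this basic recipe with a mixture of denoisers that vary span length and corruption rate.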
 
+ # Pretraining Data
+
+ BERT5urk uses the Turkish part of the amazing [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) corpus.
+ Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
+
+ We train an SPM-based (SentencePiece) vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
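The document selection step can be sketched as a simple filter (the field names `text` and `language_score` are assumed to follow the FineWeb2 schema; the example documents are invented):

```python
# Keep only documents whose language-identification score exceeds 0.99,
# mirroring the selection criterion used for the pretraining corpus.

THRESHOLD = 0.99

def select_documents(docs):
    """Return the text of all documents above the language-score threshold."""
    return [d["text"] for d in docs if d["language_score"] > THRESHOLD]

docs = [
    {"text": "Çok temiz bir Türkçe belge.", "language_score": 0.998},
    {"text": "Mixed language document ...", "language_score": 0.62},
    {"text": "Bir başka Türkçe belge.", "language_score": 0.995},
]
print(select_documents(docs))  # only the two high-score documents survive
```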
+
+ # Pretraining
+
+ BERT5urk was pretrained with the awesome [T5X](https://github.com/google-research/t5x) library. Some pretraining highlights:
+
+ * One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
+ * The model was pretrained for 2M steps with an input & output sequence length of 512 and a batch size of 128
+ * The resulting model has 1.42B parameters
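From the numbers above, a quick back-of-the-envelope calculation gives the amount of data seen during pretraining (input side only; the target side adds further tokens):

```python
# Sanity-check the pretraining budget: 2M steps, batch size 128,
# sequence length 512, finished in 16.56 days on a v3-32 TPU Pod.

steps = 2_000_000
batch_size = 128
seq_len = 512

input_tokens = steps * batch_size * seq_len
print(f"{input_tokens / 1e9:.1f}B input tokens")  # 131.1B input tokens

days = 16.56
steps_per_second = steps / (days * 24 * 3600)
print(f"{steps_per_second:.2f} steps/s on average")  # 1.40 steps/s on average
```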
+
+ # Evaluation
+
+ Detailed evaluations can be found in the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) repository. Additionally, we also fine-tuned
+ [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA), another Turkish T5 model with 1.14B parameters, for comparison.
+
+ ## Encoder-only Results
+
+ For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we also use the awesome Flair library and fine-tune only the encoder of BERT5urk and TURNA.
+ The overall performance can be seen in the following table:
+
+ | Model Name | Overall Development | Overall Test |
+ |-----------------------------------------------------------------------------------------------------------|--------------------:|-------------:|
+ | [BERTurk (cased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased) | 89.72 | 90.05 |
+ | [BERTurk (uncased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) | 89.25 | 89.95 |
+ | [BERTurk (cased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-cased) | 88.98 | 89.49 |
+ | [BERTurk (uncased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-uncased) | 89.28 | 89.67 |
+ | [ConvBERTurk (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-cased) | **90.06** | **90.27** |
+ | [ConvBERTurk mC4 (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-cased) | 90.03 | 90.09 |
+ | [ConvBERTurk mC4 (uncased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-uncased) | 89.76 | 89.97 |
+ | [DistilBERTurk (cased)](https://huggingface.co/dbmdz/distilbert-base-turkish-cased) | 87.95 | 88.16 |
+ | [ELECTRA Base (cased)](https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator) | 89.08 | 89.91 |
+ | [ELECTRA Base mC4 (cased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-discriminator) | 89.24 | 90.03 |
+ | [ELECTRA Base mC4 (uncased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-discriminator) | 89.09 | 89.62 |
+ | [ELECTRA Small (cased)](https://huggingface.co/dbmdz/electra-small-turkish-cased-discriminator) | 87.27 | 88.28 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 89.96 | 90.26 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) | 88.81 | 89.36 |
+
+ ## Encoder-decoder Results
+
+ We tried to replicate the results from the [TURNA](https://arxiv.org/abs/2401.14373) paper using the [TURNA fine-tuning](https://github.com/boun-tabi-LMG/turkish-lm-tuner) library.
+
+ ### Paraphrasing - Tatoeba
+
+ We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average score. Additionally, the score from the TURNA paper
+ is also shown in the following table:
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 90.22 | 80.23 | 88.95 | 71.14 | 87.56 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 90.36 | 80.50 | 89.10 | 71.48 | 87.63 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 90.47 | 80.78 | 89.21 | 71.89 | 87.74 |
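The seed averaging behind the reported numbers amounts to a few lines (the per-seed BLEU values below are invented placeholders, not the actual runs):

```python
# Aggregate fine-tuning runs over several seeds, as done for the Tatoeba
# numbers above. The per-seed scores are hypothetical examples.
from statistics import mean, stdev

seed_bleu = [71.5, 72.2, 71.8, 72.1, 71.7]  # one score per fine-tuning seed
print(f"BLEU: {mean(seed_bleu):.2f} ± {stdev(seed_bleu):.2f}")
```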
+
+ ### Paraphrasing - OpenSubtitles
+
+ We fine-tune TURNA and BERT5urk for only one seed (due to resource limitations) and report scores (incl. the scores from the TURNA paper):
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 78.43 | 63.58 | 76.81 | 51.47 | 74.79 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 78.36 | 63.42 | 76.71 | 51.39 | 74.94 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 78.56 | 63.80 | 76.95 | 51.74 | 75.07 |
+
+ ### Title Generation - TrNews
+
+ We fine-tune TURNA and BERT5urk for only one seed (due to resource limitations) and report scores (incl. the scores from the TURNA paper):
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 36.47 | 22.88 | 35.47 | 12.64 | 23.62 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 41.65 | 27.60 | 36.77 | 18.60 | 34.55 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.79 | 27.77 | 37.00 | 19.08 | 34.69 |
+
+ ### Summarization - TrNews
+
+ We fine-tune TURNA and BERT5urk for only one seed (due to resource limitations) and report scores (incl. the scores from the TURNA paper):
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 41.77 | 27.81 | 36.99 | 19.05 | 34.61 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 40.75 | 26.82 | 35.88 | 18.00 | 33.91 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.00 | 27.08 | 36.24 | 18.78 | 23.96 |
+
+ # Acknowledgments
+
+ Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
+ Many thanks for providing access to the TPUs over many years ❤️
+
+ Made in the Bavarian Oberland with ❤️ and 🥨.