stefan-it committed commit 7a6a8a0 (verified · 1 parent: 955ece9)

readme: add initial version

Files changed (1): README.md (+99 −4)
@@ -14,10 +14,105 @@ tags:
 
 ![BERT5urk](bert5urk_logo.png)
 
- **New**: BERT5urk - an upcoming T5-based Turkish Language Model.
-
- **Update (20.01.2025)**: Official release will be between 27.01. and 02.02.2025 (evaluations are currently running...), stay tuned!
-
- **Update (27.01.2025)**: Release date will be 30.01.2025 (same day as the new RTX 5090 comes out, but this is just a coincidence, as I will be up all day to grab one...)
-
- **Update (29.01.2025)**: Evaluations are done \o/
+ This repository hosts the new 1.42B parameter Turkish T5 model named BERT5urk.
+
+ BERT5urk is part of the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) family and was pretrained using the awesome
+ [T5X](https://github.com/google-research/t5x) library with the [UL2](https://arxiv.org/abs/2205.05131) objective.
+
+ Inspired by the great [Finnish T5 and UL2 models](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) from the [Finnish NLP](https://huggingface.co/Finnish-NLP)
+ group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the ["Scale Efficiently"](https://arxiv.org/abs/2109.10686) paper. Many thanks
+ to the [Finnish NLP](https://huggingface.co/Finnish-NLP) group for open-sourcing the pretraining code and models!
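The UL2 objective mixes several denoising tasks; the best-known one is T5-style span corruption. As a toy illustration in plain Python (the tokens and span positions are made up for this sketch, and a real pretraining pipeline samples spans randomly over SentencePiece ids; the `<extra_id_N>` sentinel naming follows the T5 convention):

```python
# Toy T5-style span corruption: replace chosen spans with sentinels in the
# input and collect the dropped tokens as the target sequence.

def span_corrupt(tokens, spans):
    """Corrupt `tokens` by replacing each (start, end) span with a sentinel.

    Returns (corrupted_input, target): the input keeps the surrounding
    context plus sentinels, the target lists each sentinel followed by
    the tokens it replaced.
    """
    inputs, targets = [], []
    last = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[last:start])   # keep context before the span
        inputs.append(sentinel)             # mark the dropped span
        targets.append(sentinel)            # target: sentinel + dropped tokens
        targets.extend(tokens[start:end])
        last = end
    inputs.extend(tokens[last:])            # trailing context
    return inputs, targets

tokens = "BERT5urk is a Turkish T5 model pretrained with UL2".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (6, 8)])
print(inp)  # ['BERT5urk', 'is', '<extra_id_0>', 'T5', 'model', '<extra_id_1>', 'UL2']
print(tgt)  # ['<extra_id_0>', 'a', 'Turkish', '<extra_id_1>', 'pretrained', 'with']
```

UL2 extends this basic recipe with a mixture of denoisers that vary span length and corruption rate.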
 
+ # Pretraining Data
+
+ BERT5urk uses the Turkish part of the amazing [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) corpus.
+ Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
+
+ We train an SPM-based (SentencePiece) vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
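The document selection step can be sketched as a simple filter (the field names `text` and `language_score` are assumed to follow the FineWeb2 schema; the example documents are invented):

```python
# Keep only documents whose language-identification score exceeds 0.99,
# mirroring the selection criterion used for the pretraining corpus.

THRESHOLD = 0.99

def select_documents(docs):
    """Return the text of all documents above the language-score threshold."""
    return [d["text"] for d in docs if d["language_score"] > THRESHOLD]

docs = [
    {"text": "Çok temiz bir Türkçe belge.", "language_score": 0.998},
    {"text": "Mixed language document ...", "language_score": 0.62},
    {"text": "Bir başka Türkçe belge.", "language_score": 0.995},
]
print(select_documents(docs))  # only the two high-score documents survive
```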
+
+ # Pretraining
+
+ BERT5urk was pretrained with the awesome [T5X](https://github.com/google-research/t5x) library. Some pretraining highlights:
+
+ * One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
+ * The model was pretrained for 2M steps with an input & output sequence length of 512 and a batch size of 128
+ * The resulting model has 1.42B parameters
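From the numbers above, a quick back-of-the-envelope calculation gives the amount of data seen during pretraining (input side only; the target side adds further tokens):

```python
# Sanity-check the pretraining budget: 2M steps, batch size 128,
# sequence length 512, finished in 16.56 days on a v3-32 TPU Pod.

steps = 2_000_000
batch_size = 128
seq_len = 512

input_tokens = steps * batch_size * seq_len
print(f"{input_tokens / 1e9:.1f}B input tokens")  # 131.1B input tokens

days = 16.56
steps_per_second = steps / (days * 24 * 3600)
print(f"{steps_per_second:.2f} steps/s on average")  # 1.40 steps/s on average
```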
+
+ # Evaluation
+
+ Detailed evaluations can be found in the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) repository. Additionally, we also fine-tuned
+ [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA), another Turkish T5 model with 1.14B parameters, for comparison.
+
+ ## Encoder-only Results
+
+ For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we also use the awesome Flair library and fine-tune only the encoder of BERT5urk and TURNA.
+ The overall performance can be seen in the following table:
+
+ | Model Name | Overall Development | Overall Test |
+ |-----------------------------------------------------------------------------------------------------------|--------------------:|-------------:|
+ | [BERTurk (cased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased) | 89.72 | 90.05 |
+ | [BERTurk (uncased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) | 89.25 | 89.95 |
+ | [BERTurk (cased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-cased) | 88.98 | 89.49 |
+ | [BERTurk (uncased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-uncased) | 89.28 | 89.67 |
+ | [ConvBERTurk (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-cased) | **90.06** | **90.27** |
+ | [ConvBERTurk mC4 (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-cased) | 90.03 | 90.09 |
+ | [ConvBERTurk mC4 (uncased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-uncased) | 89.76 | 89.97 |
+ | [DistilBERTurk (cased)](https://huggingface.co/dbmdz/distilbert-base-turkish-cased) | 87.95 | 88.16 |
+ | [ELECTRA Base (cased)](https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator) | 89.08 | 89.91 |
+ | [ELECTRA Base mC4 (cased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-discriminator) | 89.24 | 90.03 |
+ | [ELECTRA Base mC4 (uncased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-discriminator) | 89.09 | 89.62 |
+ | [ELECTRA Small (cased)](https://huggingface.co/dbmdz/electra-small-turkish-cased-discriminator) | 87.27 | 88.28 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 89.96 | 90.26 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) | 88.81 | 89.36 |
+
+ ## Encoder-decoder Results
+
+ We tried to replicate the results from the [TURNA](https://arxiv.org/abs/2401.14373) paper using the [TURNA fine-tuning](https://github.com/boun-tabi-LMG/turkish-lm-tuner) library.
+
+ ### Paraphrasing - Tatoeba
+
+ We fine-tune five models each for TURNA and BERT5urk with different seeds and report the average score. Additionally, the score from the TURNA paper
+ is also shown in the following table:
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 90.22 | 80.23 | 88.95 | 71.14 | 87.56 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 90.36 | 80.50 | 89.10 | 71.48 | 87.63 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 90.47 | 80.78 | 89.21 | 71.89 | 87.74 |
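The seed averaging behind the reported numbers amounts to a few lines (the per-seed BLEU values below are invented placeholders, not the actual runs):

```python
# Aggregate fine-tuning runs over several seeds, as done for the Tatoeba
# numbers above. The per-seed scores are hypothetical examples.
from statistics import mean, stdev

seed_bleu = [71.5, 72.2, 71.8, 72.1, 71.7]  # one score per fine-tuning seed
print(f"BLEU: {mean(seed_bleu):.2f} ± {stdev(seed_bleu):.2f}")
```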
+
+ ### Paraphrasing - OpenSubtitles
+
+ We fine-tune TURNA and BERT5urk for only one seed (due to resource limitations) and report scores (incl. the scores from the TURNA paper):
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 78.43 | 63.58 | 76.81 | 51.47 | 74.79 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 78.36 | 63.42 | 76.71 | 51.39 | 74.94 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 78.56 | 63.80 | 76.95 | 51.74 | 75.07 |
+
+ ### Title Generation - TrNews
+
+ We fine-tune TURNA and BERT5urk for only one seed (due to resource limitations) and report scores (incl. the scores from the TURNA paper):
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 36.47 | 22.88 | 35.47 | 12.64 | 23.62 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 41.65 | 27.60 | 36.77 | 18.60 | 34.55 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.79 | 27.77 | 37.00 | 19.08 | 34.69 |
+
+ ### Summarization - TrNews
+
+ We fine-tune TURNA and BERT5urk for only one seed (due to resource limitations) and report scores (incl. the scores from the TURNA paper):
+
+ | Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
+ |:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
+ | [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 41.77 | 27.81 | 36.99 | 19.05 | 34.61 |
+ | [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 40.75 | 26.82 | 35.88 | 18.00 | 33.91 |
+ | [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.00 | 27.08 | 36.24 | 18.78 | 23.96 |
+
+ # Acknowledgments
+
+ Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
+ Many thanks for providing access to the TPUs over many years ❤️
+
+ Made in the Bavarian Oberland with ❤️ and 🥨.