readme: add initial version

README.md

This repository hosts the new 1.42B-parameter Turkish T5 model named BERT5urk.

BERT5urk is part of the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) family and was pretrained using the awesome
[T5X](https://github.com/google-research/t5x) library with the [UL2](https://arxiv.org/abs/2205.05131) objective.

Inspired by the great [Finnish T5 and UL2 models](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) from the [Finnish NLP](https://huggingface.co/Finnish-NLP)
group, BERT5urk also uses UL2 and the efficient T5 architecture proposed in the ["Scale Efficiently"](https://arxiv.org/abs/2109.10686) paper. Many thanks
to the [Finnish NLP](https://huggingface.co/Finnish-NLP) group for open-sourcing the pretraining code and models!

# Pretraining Data

BERT5urk uses the Turkish part of the amazing [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) corpus.
Only documents with a language score higher than 0.99 are selected for the final pretraining corpus, which has a total size of 262GB.
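
As a rough illustration, this filtering step could look like the following with the 🤗 Datasets library (a minimal sketch; the `tur_Latn` config name and the `language_score` field follow the FineWeb2 dataset card, and the actual filtering pipeline may differ):

```python
from datasets import load_dataset

# Stream the Turkish subset of FineWeb2 to avoid downloading the full corpus.
fineweb2_tr = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="tur_Latn",
    split="train",
    streaming=True,
)

# Keep only documents whose language identification score exceeds 0.99.
high_quality = fineweb2_tr.filter(lambda doc: doc["language_score"] > 0.99)

# Peek at a few surviving documents.
for doc in high_quality.take(3):
    print(doc["text"][:100])
```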

We train an SPM-based (SentencePiece) vocabulary on a 3GB corpus of randomly chosen documents from the pretraining corpus.
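
A minimal sketch of such a vocabulary training run with the [SentencePiece](https://github.com/google/sentencepiece) library (the vocabulary size and model type below are illustrative assumptions, not confirmed settings):

```python
import sentencepiece as spm

# Train a unigram SentencePiece model on the 3GB sampled corpus.
spm.SentencePieceTrainer.train(
    input="fineweb2_tr_sample.txt",  # hypothetical path to the sampled 3GB corpus
    model_prefix="bert5urk_spm",     # writes bert5urk_spm.model / bert5urk_spm.vocab
    vocab_size=32_000,               # assumed; the actual vocabulary size may differ
    model_type="unigram",            # T5-style vocabularies are typically unigram models
    character_coverage=1.0,          # keep full coverage of Turkish characters
)
```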

# Pretraining

BERT5urk was pretrained with the awesome [T5X](https://github.com/google-research/t5x) library. Some pretraining highlights:

* One-shot pretraining (pretraining without any training crashes) was possible on a v3-32 TPU Pod and took 16.56 days
* The model was pretrained for 2M steps with an input and output sequence length of 512 and a batch size of 128
* The resulting model has 1.42B parameters
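
Once pretraining is done, the model can be used via 🤗 Transformers; a minimal usage sketch (assuming the checkpoint is released in the T5 format under `stefan-it/bert5urk` and that the usual T5 sentinel tokens are available):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = AutoModelForSeq2SeqLM.from_pretrained("stefan-it/bert5urk")

# Fill in a masked span, T5-style: "The weather is very <extra_id_0> today."
inputs = tokenizer("Bugün hava çok <extra_id_0>.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```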

# Evaluation

Detailed evaluations can be found in the [Turkish Model Zoo](https://github.com/stefan-it/turkish-bert) repository. Additionally, we also fine-tuned
[TURNA](https://huggingface.co/boun-tabi-LMG/TURNA), another Turkish T5 model with 1.14B parameters, for comparison.

## Encoder-only Results

For experiments on named entity recognition (NER) and part-of-speech (PoS) tagging we also use the awesome [Flair](https://github.com/flairNLP/flair) library and fine-tune only the encoder of BERT5urk and TURNA.
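
A minimal sketch of such an encoder-only fine-tuning run with Flair, here for PoS tagging on a Universal Dependencies corpus (the dataset choice and all hyperparameters are illustrative assumptions; the actual experiment setup is documented in the Turkish Model Zoo repository):

```python
from flair.datasets import UD_TURKISH
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Load a Turkish UD corpus and build the label dictionary for UPOS tags.
corpus = UD_TURKISH()
label_dict = corpus.make_label_dictionary(label_type="upos")

# For T5-style checkpoints, Flair uses the encoder stack as word embeddings.
embeddings = TransformerWordEmbeddings(
    model="stefan-it/bert5urk",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
)

# Plain linear classification head on top of the fine-tuned encoder.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="upos",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/bert5urk-upos",
    learning_rate=5e-5,   # assumed
    mini_batch_size=16,   # assumed
    max_epochs=10,        # assumed
)
```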

The overall performance can be seen in the following table:

| Model Name | Overall Development | Overall Test |
|-----------------------------------------------------------------------------------------------------------|--------------------:|-------------:|
| [BERTurk (cased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased) | 89.72 | 90.05 |
| [BERTurk (uncased, 128k)](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) | 89.25 | 89.95 |
| [BERTurk (cased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-cased) | 88.98 | 89.49 |
| [BERTurk (uncased, 32k)](https://huggingface.co/dbmdz/bert-base-turkish-uncased) | 89.28 | 89.67 |
| [ConvBERTurk (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-cased) | **90.06** | **90.27** |
| [ConvBERTurk mC4 (cased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-cased) | 90.03 | 90.09 |
| [ConvBERTurk mC4 (uncased)](https://huggingface.co/dbmdz/convbert-base-turkish-mc4-uncased) | 89.76 | 89.97 |
| [DistilBERTurk (cased)](https://huggingface.co/dbmdz/distilbert-base-turkish-cased) | 87.95 | 88.16 |
| [ELECTRA Base (cased)](https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator) | 89.08 | 89.91 |
| [ELECTRA Base mC4 (cased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-cased-discriminator) | 89.24 | 90.03 |
| [ELECTRA Base mC4 (uncased)](https://huggingface.co/dbmdz/electra-base-turkish-mc4-uncased-discriminator) | 89.09 | 89.62 |
| [ELECTRA Small (cased)](https://huggingface.co/dbmdz/electra-small-turkish-cased-discriminator) | 87.27 | 88.28 |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 89.96 | 90.26 |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) | 88.81 | 89.36 |

## Encoder-decoder Results

We tried to replicate the results from the [TURNA](https://arxiv.org/abs/2401.14373) paper using the [TURNA fine-tuning](https://github.com/boun-tabi-LMG/turkish-lm-tuner) library.
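
The fine-tuning itself runs through the linked library; as a rough, library-agnostic sketch of the same setup with plain 🤗 Transformers (all names and hyperparameters below are illustrative assumptions, not the turkish-lm-tuner API):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "stefan-it/bert5urk"


def build_trainer(train_dataset, eval_dataset):
    """Build a seq2seq trainer for a paraphrasing-style task.

    Both datasets are expected to be tokenized into `input_ids`/`labels`;
    all hyperparameters are assumptions, not confirmed settings.
    """
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    args = Seq2SeqTrainingArguments(
        output_dir="bert5urk-paraphrasing",
        learning_rate=1e-4,             # assumed
        per_device_train_batch_size=8,  # assumed
        num_train_epochs=10,            # assumed
        predict_with_generate=True,
    )
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
```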

### Paraphrasing - Tatoeba

We fine-tune five models each for TURNA and BERT5urk, using different seeds, and report the average scores. Additionally, the scores from the TURNA paper
are also shown in the following table:

| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 90.22 | 80.23 | 88.95 | 71.14 | 87.56 |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 90.36 | 80.50 | 89.10 | 71.48 | 87.63 |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 90.47 | 80.78 | 89.21 | 71.89 | 87.74 |
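
For reference, the reported metrics can be computed with the 🤗 Evaluate library roughly as follows (a sketch with toy inputs; the exact metric configuration used for the tables above may differ):

```python
import evaluate

# Toy prediction/reference pair for illustration only.
predictions = ["bu harika bir film"]
references = ["bu mükemmel bir film"]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(meteor.compute(predictions=predictions, references=references))
```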

### Paraphrasing - OpenSubtitles

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report scores (including scores from the TURNA paper):

| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 78.43 | 63.58 | 76.81 | 51.47 | 74.79 |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 78.36 | 63.42 | 76.71 | 51.39 | 74.94 |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 78.56 | 63.80 | 76.95 | 51.74 | 75.07 |

### Title Generation - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report scores (including scores from the TURNA paper):

| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 36.47 | 22.88 | 35.47 | 12.64 | 23.62 |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 41.65 | 27.60 | 36.77 | 18.60 | 34.55 |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.79 | 27.77 | 37.00 | 19.08 | 34.69 |

### Summarization - TrNews

We fine-tune TURNA and BERT5urk with only one seed (due to resource limitations) and report scores (including scores from the TURNA paper):

| Model | test_rouge1 | test_rouge2 | test_rougeL | test_bleu | test_meteor |
|:-----------------------------------------------------------------|------------:|------------:|------------:|----------:|------------:|
| [TURNA](https://arxiv.org/abs/2401.14373) (paper) | 41.77 | 27.81 | 36.99 | 19.05 | 34.61 |
| [TURNA](https://huggingface.co/boun-tabi-LMG/TURNA) (replicated) | 40.75 | 26.82 | 35.88 | 18.00 | 33.91 |
| [BERT5urk](https://huggingface.co/stefan-it/bert5urk) | 41.00 | 27.08 | 36.24 | 18.78 | 23.96 |

# Acknowledgments

Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
Many thanks for providing access to the TPUs over many years ❤️

Made from Bavarian Oberland with ❤️ and 🥨.