---
language:
- tr
arXiv: 2403.01308
library_name: transformers
pipeline_tag: text2text-generation
inference:
  parameters:
    max_new_tokens: 32
    num_beams: 8
    do_sample: false
widget:
- text: >-
    Soru yarat: Alan Mathison Turing İngiliz matematikçi, bilgisayar bilimcisi
    ve kriptolog. II. Dünya Savaşı sırasında Alman şifrelerinin kırılmasında çok
    önemli bir rol oynadığı için savaş kahramanı sayılmıştır. Ayrıca Manchester
    Üniversitesi'nde çalıştığı yıllarda, Turing makinesi denilen algoritma
    tanımı ile modern bilgisayarların kavramsal temelini atmıştır.
  example_title: Question generation
- text: >-
    Soru cevapla: Turing makinesi denilen algoritma tanımı ile modern
    bilgisayarların kavramsal temelini atan bilim insanı kimdir? kaynak: Alan
    Mathison Turing İngiliz matematikçi, bilgisayar bilimcisi ve kriptolog. II.
    Dünya Savaşı sırasında Alman şifrelerinin kırılmasında çok önemli bir rol
    oynadığı için savaş kahramanı sayılmıştır. Ayrıca Manchester
    Üniversitesi'nde çalıştığı yıllarda, Turing makinesi denilen algoritma
    tanımı ile modern bilgisayarların kavramsal temelini atmıştır.
  example_title: Question answering
- text: >-
    yanıtları çıkar: Alan Mathison Turing İngiliz matematikçi, bilgisayar
    bilimcisi ve kriptolog.  II. Dünya Savaşı sırasında Alman şifrelerinin
    kırılmasında çok önemli bir rol oynadığı için savaş kahramanı sayılmıştır.
    <hl>  Ayrıca Manchester Üniversitesi'nde çalıştığı yıllarda, Turing makinesi
    denilen algoritma tanımı ile modern bilgisayarların kavramsal temelini
    atmıştır <hl> .
  example_title: Answer Extraction
license: cc-by-nc-sa-4.0
datasets:
- vngrs-ai/vngrs-web-corpus
---
# VBART Model Card

## Model Description  

VBART is the first sequence-to-sequence LLM pre-trained on Turkish corpora from scratch on a large scale. It was pre-trained by VNGRS in February 2023.  
The model is capable of conditional text generation tasks such as text summarization, paraphrasing, and title generation when fine-tuned.
It outperforms its multilingual counterparts despite being much smaller than those models.

VBART-XLarge was created by adding extra Transformer layers between the layers of VBART-Large, which allowed it to transfer the learned weights from the smaller model while doubling its number of layers.
VBART-XLarge improves on VBART-Large's results, albeit by small margins.

This repository contains the fine-tuned TensorFlow and Safetensors weights of VBART for the question answering and question generation tasks described in the [paper](https://doi.org/10.55730/1300-0632.3914).

- **Developed by:** [VNGRS-AI](https://vngrs.com/ai/)
- **Model type:** Transformer encoder-decoder, based on the mBART architecture
- **Language(s) (NLP):** Turkish
- **License:** CC BY-NC-SA 4.0
- **Fine-tuned from:** VBART-XLarge
- **Paper:** [arXiv](https://arxiv.org/abs/2403.01308)
## How to Get Started with the Model  
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(
    "vngrs-ai/VBART-XLarge-QAQG",
    model_input_names=['input_ids', 'attention_mask']
)
# Uncomment the device_map kwarg to run inference on GPU
model = AutoModelForSeq2SeqLM.from_pretrained(
    "vngrs-ai/VBART-XLarge-QAQG",
    # device_map="auto",
)

context = "..."
question = "..."
highlighted_context = "..."

# Prompt for question generation
qg_prompt = f'Soru yarat: cevap: {context}'
# Prompt for question answering
qa_prompt = f'Soru cevapla: {question} kaynak: {context}'
# Prompt for answer extraction
ae_prompt = f'yanıtları çıkar: {highlighted_context}'


token_input = tokenizer(ae_prompt, return_tensors="pt")  # add .to("cuda") if the model is on GPU
outputs = model.generate(**token_input)
print(tokenizer.decode(outputs[0]))
```
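
The card's inference widget (see the metadata at the top) decodes with beam search; as a follow-up usage sketch, the same settings can be passed to `generate` explicitly:

```python
# Continuing from the block above; settings mirror this card's widget metadata
outputs = model.generate(
    **token_input,
    max_new_tokens=32,  # cap on generated length
    num_beams=8,        # beam search width
    do_sample=False,    # deterministic decoding
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```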
  
## Training Details  
### Fine-tuning prompt
This model is fine-tuned on three tasks: 
- question answering: Answer a question in a given context. Prompted with
    ```Soru cevapla: <question> kaynak: <context>```
- question generation: Generate a question from a given context. A highlight token (`<hl>`, written without spaces) can be used to mark the answer that the generated question should target (see the sketch after this list). Prompted with
    ```Soru yarat: <context>```
- answer extraction: Extract possible answers from a highlighted range, marked with the same highlight token. Prompted with
    ```yanıtları çıkar: <context with highlighted parts>```
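
For illustration, reusing the Turing passage from the widget examples above (the chosen answer span is an assumption), a highlighted context for the answer-aware prompts can be built by wrapping the span in the highlight token:

```python
hl = "<hl>"
context = (
    "Alan Mathison Turing İngiliz matematikçi, bilgisayar bilimcisi ve kriptolog. "
    "Turing makinesi denilen algoritma tanımı ile modern bilgisayarların kavramsal temelini atmıştır."
)
answer_span = "Alan Mathison Turing"

# Wrap the chosen span in highlight tokens; the rest of the context stays untouched.
highlighted_context = context.replace(answer_span, f"{hl} {answer_span} {hl}", 1)

# Same prompt format as in the "How to Get Started" section
ae_prompt = f"yanıtları çıkar: {highlighted_context}"
```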

### Training Data  
The base model is pre-trained on [vngrs-web-corpus](https://huggingface.co/datasets/vngrs-ai/vngrs-web-corpus), which is curated by cleaning and filtering the Turkish portions of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and [mC4](https://huggingface.co/datasets/mc4) datasets. Both datasets consist of unstructured web-crawl documents; more information can be found on their respective pages. The data is filtered using a set of heuristics and rules, explained in the appendix of our [paper](https://arxiv.org/abs/2403.01308).

The fine-tuning dataset is [TQuAD](https://github.com/obss/turkish-question-generation), which has two versions. We have concatenated them and dropped duplicate samples. More information about this process can be found in Appendix B of our [paper](https://arxiv.org/abs/2403.01308). 
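
As a rough illustration of that step (the file names and the flat record layout below are assumptions, not the project's actual preprocessing code), the two versions could be merged and deduplicated like this:

```python
import pandas as pd

# Hypothetical paths; TQuAD v1/v2 are distributed as SQuAD-style JSON and are
# assumed here to have been flattened into (context, question, answer) records.
v1 = pd.read_json("tquad_v1_flat.json")
v2 = pd.read_json("tquad_v2_flat.json")

merged = pd.concat([v1, v2], ignore_index=True)
deduped = merged.drop_duplicates(subset=["context", "question", "answer"])
print(f"{len(merged)} -> {len(deduped)} samples after dropping duplicates")
```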

### Limitations
This model is fine-tuned for question-answering and question-generation tasks with specific prompts. It is not intended for any other use case, and fine-tuning it on another task cannot be expected to reach the full performance of the base model. It is also not guaranteed to work correctly without the specified prompts.

### Training Procedure  
The base model was pre-trained for 8 days on a total of 84B tokens and then fine-tuned for 55 epochs.
#### Hardware
- **GPUs**: 8 x Nvidia A100-80 GB
#### Software
- TensorFlow
#### Hyperparameters  
##### Pretraining
- **Training regime:** fp16 mixed precision
- **Training objective**: Sentence permutation and span masking (span lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens; see the sketch after this list)
- **Optimizer**: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- **Scheduler**: Custom scheduler from the original Transformer paper (20,000 warm-up steps)
- **Weight Initialization**: Model Enlargement from VBART-Large. See the related section in the [paper](https://arxiv.org/abs/2403.01308) for the details. 
- **Dropout**: 0.1 (dropped to 0.05 and then to 0 in the last 80k and 80k steps, respectively)
- **Initial Learning rate**: 5e-6
- **Training tokens**: 84B
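
For illustration only, a toy sketch of the span-masking part of this objective (not the actual training code; the function name and token handling are assumptions) could look like the following, sampling span lengths from a Poisson distribution with λ = 3.5 until roughly 30% of tokens are covered and collapsing each masked span into a single mask token, BART-style:

```python
import numpy as np

def span_mask(tokens, mask_token="<mask>", lam=3.5, mask_ratio=0.30, seed=0):
    """Toy span masking: mark spans (lengths ~ Poisson(lam)) until roughly
    mask_ratio of the tokens are covered, then collapse each masked span
    into a single mask token."""
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    masked = [False] * len(tokens)
    covered = 0
    while covered < budget:
        length = max(1, int(rng.poisson(lam)))
        start = int(rng.integers(0, len(tokens)))
        span = range(start, min(start + length, len(tokens)))
        covered += sum(1 for i in span if not masked[i])
        for i in span:
            masked[i] = True
    # Collapse each contiguous masked span into a single mask token
    out, i = [], 0
    while i < len(tokens):
        if masked[i]:
            out.append(mask_token)
            while i < len(tokens) and masked[i]:
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(span_mask("Turing makinesi tanımı ile modern bilgisayarların kavramsal temelini atmıştır .".split()))
```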

##### Fine-tuning
- **Training regime:** fp16 mixed precision
- **Optimizer**: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- **Scheduler**: Linear decay scheduler
- **Dropout**: 0.1 
- **Learning rate**: 5e-6
- **Fine-tune epochs**: 55

#### Metrics
![image/png](https://cdn-uploads.huggingface.co/production/uploads/62f8b3c84588fe31f435a92b/D-Epasj5C4icAu0ykqt10.png)

## Citation  
```
@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}
```