soksof committed on
Commit
7363203
1 Parent(s): 631cf57

Update README.md

Files changed (1)
  1. README.md +145 -1
README.md CHANGED
@@ -7,4 +7,148 @@ tags:
7
  - finetuned
8
  inference: true
9
  pipeline_tag: text-generation
10
- ---
+ ---
11
+
12
+ # Meltemi: A Large Foundation Language Model for the Greek Language
13
+
14
+ We introduce Meltemi, the first Greek Large Language Model (LLM) trained by the Institute for Language and Speech Processing at Athena Research & Innovation Center.
15
+ Meltemi is built on top of Mistral-7b and has been trained on a large corpus which includes high-quality Greek texts. We present Meltemi-7b, along with an instruction-tuned variant, Meltemi-Instruct-7b, which can be used for chat applications. Updated versions with enhanced chat and translation capabilities will be released shortly under version 2.
16
+ In the near future, we will also release a Mixture-of-Experts foundation model (MeltemiX-8x7b), as well as chat models based on real chats with human feedback.
17
+
18
+ Meltemi is the first open Greek Foundation Language Model, available for research and commercial purposes. It is built on top of [Mistral-7b](https://mistral.ai/news/announcing-mistral-7b/), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts.
19
+ The training was performed on AWS infrastructure thanks to a GRNET grant.
20
+ We release two models trained with 8k context length: Meltemi-7b-v1 (INSERT HF LINK) and Meltemi-Instruct-7b-v1 (INSERT HF LINK) under the [Apache 2.0 License](https://github.com/apache/.github/blob/main/LICENSE).
21
+ To assess the capabilities of Meltemi we constructed a standardized LLM evaluation suite for the Greek language, integrated with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).
22
+
23
+
24
+ # Introduction
25
+
26
+ Large foundation models have revolutionized the AI field, opening new opportunities for research and industry applications.
27
+ However, the development and adoption of large foundation models have been mostly limited to the English language, while digitally under-represented languages lag behind, because most open-source foundation models have been trained on mostly English monolingual data.
28
+ Recently, there have been efforts to extend the capabilities of open LLMs to other languages (e.g., [LeoLM](https://laion.ai/blog/leo-lm/) for German and [Aguila](https://huggingface.co/projecte-aina/aguila-7b) for Spanish).
29
+ This movement provides local communities with alternatives to commercial, siloed solutions, as well as more fine-grained control over the development of safe and application-optimized models.
30
+ We develop and release Meltemi with the current status quo in mind, seeking to alleviate these limitations for the Greek language. Meltemi is developed as a bilingual model, maintaining its capabilities for the English language, while being extended to understand and generate fluent text in Modern Greek using state-of-the-art techniques.
31
+
32
+
33
+ # Continual pretraining
34
+
35
+ The original version of Mistral-7b is trained on a large corpus of English text. The corpus for the publicly released versions is estimated to contain approximately 800 billion tokens.
36
+ We extend the pretraining of Mistral-7b with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately **40 billion tokens**.
37
+
38
+ This corpus includes 28.5 billion monolingual Greek tokens, constructed from publicly available resources. Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (10.5 billion tokens) and Greek-English parallel data (600 million tokens).
39
+
40
+ This corpus has been processed, filtered, and deduplicated to ensure data quality (a detailed description of our data processing pipeline will be published in our upcoming paper) and is outlined below:
41
+
42
+ <br/>
43
+ <br/>
44
+ Table 1: Pretraining Corpora
45
+
46
+ | Sub-corpus | # Tokens | Percentage |
47
+ |----------|------------------|------------|
48
+ | Greek | 28,555,902,360 | 72.0% |
49
+ | English | 10,478,414,033 | 26.4% |
50
+ | Parallel | 633,816,023 | 1.6% |
51
+ | **Total** | **39,668,132,416** | **100%** |
52
+ <br/>
53
+ <br/>
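The percentages in Table 1 can be checked with a few lines of arithmetic. The sketch below only re-derives the total and the per-corpus shares from the token counts in the table; the variable names are illustrative and not part of any released code.

```python
# Token counts copied from Table 1.
corpus_tokens = {
    "Greek": 28_555_902_360,
    "English": 10_478_414_033,
    "Parallel": 633_816_023,
}

total_tokens = sum(corpus_tokens.values())

# Share of each sub-corpus in percent, rounded to one decimal as in the table.
shares = {
    name: round(100 * count / total_tokens, 1)
    for name, count in corpus_tokens.items()
}

print(f"Total: {total_tokens:,}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The rounded shares match the table (72.0%, 26.4%, 1.6%), and the three counts sum to the stated total.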
54
+
55
+
56
+ Our pretraining procedure uses insights from works focused on continual pretraining for adapting English models to a language with a non-Latin script (Chinese), such as [Fast and efficient pretraining](https://arxiv.org/pdf/2304.08177.pdf) and [Colossal-AI](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base).
57
+ Our pretraining strategy consists of the following three stages:
58
+
59
+ 1. Vocabulary extension of the Mistral-7b tokenizer with Greek tokens
60
+ 2. Greek embedding initialization and fine-tuning on 10% of the corpus (all other model parameters are kept frozen)
61
+ 3. Continual pretraining of the whole model using the full corpus
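Stages 1 and 2 can be illustrated with a toy sketch in plain Python. Everything here is hypothetical: the vocabulary is a plain dictionary rather than the real Mistral tokenizer, the helper names `extend_vocab` and `init_new_rows` are invented for illustration, and initializing each new embedding as the mean of its sub-token embeddings is one common choice that this README does not confirm was used for Meltemi.

```python
def extend_vocab(vocab, new_tokens):
    """Stage 1 (toy): append unseen tokens to the vocabulary at fresh ids."""
    extended = dict(vocab)
    for token in new_tokens:
        if token not in extended:
            extended[token] = len(extended)
    return extended


def init_new_rows(embeddings, old_vocab, extended_vocab, split):
    """Stage 2 starting point (assumed scheme): each new row is the mean of
    the rows of the pieces the token splits into under the old vocabulary."""
    rows = [list(row) for row in embeddings]
    new_items = sorted(
        (idx, tok) for tok, idx in extended_vocab.items() if idx >= len(embeddings)
    )
    for _, token in new_items:
        pieces = [embeddings[old_vocab[p]] for p in split(token)]
        rows.append([sum(column) / len(pieces) for column in zip(*pieces)])
    return rows


# Toy data: four existing tokens with 2-dimensional embeddings.
old_vocab = {"a": 0, "b": 1, "α": 2, "β": 3}
old_embeddings = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [4.0, 0.0]]

# "a" already exists and is skipped; "αβ" gets the next free id.
new_vocab = extend_vocab(old_vocab, ["αβ", "a"])
new_embeddings = init_new_rows(old_embeddings, old_vocab, new_vocab, split=list)

print(new_vocab["αβ"], new_embeddings[new_vocab["αβ"]])
```

In stage 2 as described above, only the embedding parameters would then be updated while all other model parameters stay frozen; stage 3 unfreezes the whole model.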
62
+
63
+ We use the following hyperparameters and training settings for the continual pretraining stage:
64
+
65
+ <br/>
66
+
67
+ Table 2: Training settings
68
+
69
+ | Setting | Value |
70
+ |-----------------|---------|
71
+ | Training steps | 25340 |
72
+ | Warmup steps | 253 |
73
+ | Batch size | 512 |
74
+ | Context length | 8192 |
75
+ | Optimizer | AdamW |
76
+ | Learning rate | 2.5e-5 |
77
+ | Learning rate decay | Cosine down to 2.5e-6 |
78
+ | Adam β | (0.9, 0.95) |
79
+ | Weight decay | 0.0 |
80
+ | DeepSpeed | ZeRO Stage 2 |
81
+ | Precision | BF16 |
82
+ | GPUs | 8 x NVIDIA H100 (80GB) |
83
+ | Energy footprint | 2300 kWh |
84
+
85
+ <br/>
86
+ <br/>
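The learning-rate settings in Table 2 can be written as a schedule function. This is a minimal sketch, not the training code: the table gives the warmup length, the peak rate, and a cosine decay down to 2.5e-6, but the linear shape of the warmup and the function name `learning_rate` are assumptions.

```python
import math

# Values taken from Table 2.
TOTAL_STEPS = 25_340
WARMUP_STEPS = 253
PEAK_LR = 2.5e-5
MIN_LR = 2.5e-6


def learning_rate(step):
    """Learning rate at a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Assumed: linear warmup that reaches the peak at the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.3e}")
```

With these settings the warmup covers about 1% of training (253 of 25340 steps), and the final learning rate is one tenth of the peak.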
87
+
88
+ # Supervised fine-tuning
89
+
90
+
91
+ To create Meltemi-Instruct-7b, we used approximately 100k Greek instructions, which include machine-translated versions of existing single-turn and multi-turn conversation datasets. In particular, we used the following:
92
+
93
+ * [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) (only subsets with permissive licenses)
94
+ * [Evol-Instruct](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)
95
+ * [Capybara](https://huggingface.co/datasets/LDJnr/Capybara)
96
+ * A manually created Greek dataset with multi-turn examples steering the instruction-tuned model towards safe and harmless responses
97
+
98
+ The model is trained on the resulting instructions using full Supervised Fine-Tuning (SFT). Our SFT procedure is based on the [finetuning recipes](https://github.com/huggingface/alignment-handbook) provided by Hugging Face. We are extending and improving the instruction tuning dataset to enhance the model's chat and translation capabilities.
99
+
100
+
101
+ # Evaluation
102
+
103
+ The evaluation suite we created includes 6 test sets. The suite is integrated with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).
104
+
105
+ The six test sets are:
106
+ * Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
107
+ * An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
108
+ * A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
109
+
110
+ Our evaluation for Meltemi-7b is performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training improves performance on every Greek test set, raising the average score by **14.9 percentage points** (from 36.5% to 51.4%). The results for the Greek test sets are shown in Table 3 and Figure 1:
111
+
112
+ <br/>
113
+ <br/>
114
+
115
+ Table 3: Evaluation of Meltemi-7b on the Greek LLM benchmark
116
+
117
+ | | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
118
+ |----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
119
+ | Mistral 7B | 29.8% | 45.0% | 36.5% | 27.1% | 45.8% | 35% | 36.5% |
120
+ | Meltemi 7B | 41.0% | 63.6% | 61.6% | 43.2% | 52.1% | 47% | 51.4% |
121
+
122
+
123
+ <br/>
124
+ <br/>
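The averages in Table 3 can be reproduced from the per-benchmark scores. The sketch below copies the scores from the table (MMLU is reported there without a decimal, so it is entered as 35.0 and 47.0) and computes both the absolute gap in percentage points and the relative gain; the variable names are illustrative.

```python
# Scores in percent, in the column order of Table 3:
# Medical MCQA, Belebele, HellaSwag, ARC-Challenge, TruthfulQA MC2, MMLU.
scores = {
    "Mistral 7B": [29.8, 45.0, 36.5, 27.1, 45.8, 35.0],
    "Meltemi 7B": [41.0, 63.6, 61.6, 43.2, 52.1, 47.0],
}

# Unweighted mean over the six test sets.
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}

# Absolute difference in percentage points, and gain relative to Mistral 7B.
gain_points = averages["Meltemi 7B"] - averages["Mistral 7B"]
relative_gain = 100 * gain_points / averages["Mistral 7B"]

for model, avg in averages.items():
    print(f"{model}: {avg:.1f}%")
print(f"Gain: {gain_points:.1f} points ({relative_gain:.1f}% relative)")
```

The 14.9 figure is the difference between the two averages in percentage points; relative to the Mistral 7B average, the gain is roughly 40%.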
125
+
126
+ ![Comparison of Meltemi-7b and Mistral-7b on Greek test sets](./meltemi-mistral.png)
127
+
128
+ Figure 1: Comparison of Meltemi-7b and Mistral-7b on Greek test sets
129
+
130
+ <br/>
131
+ <br/>
132
+
133
+ # Try it yourself
134
+
135
+ You can try the released models yourself at the [following link]().
136
+
137
+ # Code availability
138
+
139
+ All the training and fine-tuning scripts, as well as our lm-evaluation-harness fork, will be made publicly available under a permissive license.
140
+
141
+ # Acknowledgements
142
+
143
+ The ILSP team utilized Amazon’s cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.
144
+
145
+ ## Contributions
146
+
147
+ * Project lead: Vassilis Katsouros
148
+ * Data acquisition and curation: Dimitris Roussis, Leon Voukoutis, Prokopis Prokopidis, Vassilis Papavassiliou
149
+ * Model training: Leon Voukoutis, Dimitris Roussis
150
+ * Model evaluation: Prokopis Prokopidis, Dimitris Roussis, Leon Voukoutis
151
+ * Infrastructure: Sokratis Sofianopoulos, George Paraskevopoulos
152
+ * Technical supervision: Nassos Katsamanis, Stelios Piperidis, Sokratis Sofianopoulos, George Paraskevopoulos
153
+
154
+ Special thanks to Sotiris Kotitsas, Petros Stavropoulos, Dimitris Pappas, and Dimitris Galanis for their input during the design and development process, to Olga Yannoutsou for her help in translating one of the evaluation datasets, and to all members of ILSP who participated in the internal evaluation.