|
--- |
|
license: llama3.1 |
|
language: |
|
- el |
|
- en |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
tags: |
|
- text-generation-inference |
|
--- |
|
|
|
# Llama-Krikri-8B-Base: A Large Foundation Language Model for the Greek Language
|
|
|
Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on March 26th, 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
|
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct). |
|
|
|
![image/png](llama-krikri-image.jpg) |
|
|
|
# Model Information |
|
|
|
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (illustrated in the tokenizer sketch below the corpus table)
|
- 128k context length (**approximately 80,000 Greek words**) |
|
- We extend the pretraining of Llama-3.1-8B to add proficiency in the Greek language by utilizing a large training corpus.
|
* This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources. |
|
* Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
|
* The training corpus also contains 7.8 billion math and code tokens. |
|
* This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below: |
|
|
|
|
|
| Sub-corpus | # Tokens | Percentage | |
|
|-----------|------------------|------------| |
|
| Greek | 56.7 B | 62.3 % | |
|
| English | 21.0 B | 23.1 % | |
|
| Parallel | 5.5 B | 6.0 % | |
|
| Math/Code | 7.8 B | 8.6 % | |
|
| **Total** | **91.0 B** | **100%** |
|
|
|
|
|
Chosen subsets of the 91-billion-token corpus were upsampled, resulting in a final training size of **110 billion tokens**.
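
To illustrate the effect of the vocabulary extension mentioned above, the sketch below compares how many tokens each tokenizer needs for the same Greek sentence (the base Llama-3.1 repository is gated, so it requires accepting Meta's license on Hugging Face first). Fewer tokens per Greek word is what makes the 128k context window correspond to roughly 80,000 Greek words:

```python
from transformers import AutoTokenizer

krikri_tok = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# "The kri-kri is a species of wild goat from Crete."
text = "Το κρικρί είναι ένα είδος αγριοκάτσικου της Κρήτης."

# The extended vocabulary should segment Greek into noticeably fewer tokens.
print("Krikri tokens:   ", len(krikri_tok(text)["input_ids"]))
print("Llama-3.1 tokens:", len(llama_tok(text)["input_ids"]))
```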
|
|
|
|
|
# How to use |
|
|
|
## With Transformers |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the base model and its Greek-extended tokenizer.
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

model.to(device)

# Tokenize a Greek prompt ("A kri-kri differs from a llama because")
# and move the tensors to the same device as the model.
inputs = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors="pt").to(device)

# Sample a continuation of up to 256 new tokens.
outputs = model.generate(inputs["input_ids"], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
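
If GPU memory is limited, the weights can also be loaded directly in half precision. A minimal variant, assuming a bfloat16-capable GPU and the `accelerate` package installed:

```python
import torch
from transformers import AutoModelForCausalLM

# bfloat16 roughly halves the memory footprint compared to fp32 weights;
# device_map="auto" places the layers on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```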
|
|
|
## With an OpenAI-compatible server via vLLM
|
|
|
```bash |
|
vllm serve ilsp/Llama-Krikri-8B-Base \ |
|
--enforce-eager \ |
|
--dtype 'bfloat16' \ |
|
--api-key token-abc123 |
|
``` |
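
A quick way to verify that the server is up (assuming the default port 8000):

```bash
# List the served models; the API key matches the one passed to `vllm serve`.
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer token-abc123"
```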
|
|
|
The model can then be queried from Python with the OpenAI client:
|
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

# Point the OpenAI client at the local vLLM server.
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# Prompt: "The training of large language models involves"
response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
)
print(response.choices[0].text)
```
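
The standard OpenAI completion parameters also apply. For example (the values below are illustrative, not tuned recommendations):

```python
response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
    max_tokens=256,   # cap the length of the completion
    temperature=0.8,  # > 0 enables sampling
)
print(response.choices[0].text)
```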
|
|
|
# Evaluation |
|
|
|
Below, we report improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English: |
|
- **+10.8%** on Greek benchmarks |
|
- **+0.8%** on English benchmarks |
|
|
|
Our evaluations for Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 are performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). |
|
|
|
## Greek Benchmarks |
|
|
|
|
|
The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval). |
|
|
|
Our evaluation suite includes: |
|
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
|
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
|
* A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)). |
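
With the fork cloned and installed, a run over one of the Greek tasks looks roughly like the sketch below. The task identifier and flags are placeholders following lighteval's general conventions, not the fork's exact invocation; consult the fork's README for the actual task names:

```bash
# Hypothetical invocation; the Greek task definitions live inside the fork.
lighteval accelerate \
    --model_args "pretrained=ilsp/Llama-Krikri-8B-Base" \
    --tasks "community|mmlu_greek|5|0" \
    --output_dir ./evals/
```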
|
|
|
Our training enhances performance across all Greek test sets, with an average improvement of **+10.8%**. The results for the Greek test sets are shown in the following table:
|
|
|
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average | |
|
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------| |
|
| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% | |
|
| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% | |
|
| Llama-Krikri-8B | **53.8%** | **82.7%** | **64.6%** | **49.4%** | **54.2%** | **52.0%** | **59.5%** | |
|
|
|
|
|
## English Benchmarks |
|
|
|
Our training methodology not only mitigates catastrophic forgetting effectively, but also improves average performance across the English test sets by **+0.8%**. The results for the English test sets are shown in the following table:
|
|
|
| | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average | |
|
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------| |
|
| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% | |
|
| Llama-3.1-8B | **74.6%** | 71.5% | **82.0%** | **58.5%** | 44.2% | **66.2%** | 66.2% | |
|
| Llama-Krikri-8B | 72.6% | **79.8%** | 80.7% | 57.8% | **44.8%** | 65.1% | **67.0%** | |
|
|
|
Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from past versions. This is why we report different scores for Meltemi-7B-v1.5.
|
|
|
|
|
# Ethical Considerations |
|
|
|
This model has not been aligned with human preferences and might therefore generate misleading, harmful, or toxic content.
|
|
|
|
|
# Acknowledgements |
|
|
|
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community. |