---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
base_model:
- ilsp/Llama-Krikri-8B-Base
---

🚨 **PLEASE USE THE OFFICIAL QUANTIZED VERSIONS** 🚨

🚨 *Third-party quantizations may not include our latest improvements, as we have updated the model's weights.* 🚨

# Llama-Krikri-8B-Instruct: An Instruction-tuned Large Language Model for the Greek language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on March 26th, 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present **Llama-Krikri-8B-Instruct**, along with the base model, [Llama-Krikri-8B-Base](https://huggingface.co/ilsp/Llama-Krikri-8B-Base).


![image/png](https://cdn-uploads.huggingface.co/production/uploads/639215a81bae0dde85842ab8/VMTgYzygHsarC9QRv2rGV.png)


# Model Information

## Base Model

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus. 
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:


| Sub-corpus   | # Tokens         | Percentage |
|-----------|------------------|------------|
| Greek     | 56.7 B   | 62.3 %      |
| English   | 21.0 B   | 23.1 %      |
| Parallel  |  5.5 B   | 6.0 %       |
| Math/Code |  7.8 B   | 8.6 %       |
| **Total** | 91.0 B   |  **100%**       |


Chosen subsets of the 91 billion token corpus were upsampled, resulting in a final size of **110 billion tokens**.

## Instruct Model

Llama-Krikri-8B-Instruct is the result of post-training Llama-Krikri-8B-Base and features:
- Enhanced chat capabilities and instruction-following in both Greek and English.
- Document translation from Greek to English, French, German, Italian, Portuguese, Spanish and vice versa.
- Great performance on generation, comprehension, and editing tasks, such as summarization, creative content creation, text modification, entity recognition, sentiment analysis, etc.
- Domain-specific expertise for specialized legal, financial, medical, and scientific applications.
- Retrieval-Augmented Generation (RAG) utilizing multiple documents with 128k context length. 
- Improved coding and agentic capabilities with correct formatting and tool use.
- Conversion and structured data extraction (e.g., XML, JSON) in data-to-text and text-to-data settings.
- Analytical thinking and Chain-of-Thought (CoT) reasoning for problem-solving.

🚨 **More information on the post-training corpus and methodology coming soon.** 🚨


# How to use

## With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")

model.to(device)

system_prompt = "Είσαι το Κρικρί, ένα εξαιρετικά ανεπτυγμένο μοντέλο Τεχνητής Νοημοσύνης για τα ελληνικα και εκπαιδεύτηκες από το ΙΕΛ του Ε.Κ. \"Αθηνά\"."
user_prompt = "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(input_prompt['input_ids'], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```
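
For an 8B model, loading the weights in half precision substantially reduces memory usage. Below is a minimal sketch, assuming a CUDA GPU, a recent `transformers` release, and the `accelerate` package (required for `device_map="auto"`); it also shows that `apply_chat_template` can tokenize directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in bfloat16 and let accelerate place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Instruct")

messages = [
    {"role": "user", "content": "Σε τι διαφέρει ένα κρικρί από ένα λάμα;"},
]
# apply_chat_template can tokenize directly and return PyTorch tensors.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```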

## With an OpenAI-compatible server via vLLM

```bash
vllm serve ilsp/Llama-Krikri-8B-Instruct \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```
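
Note that vLLM will try to reserve KV-cache memory for the model's full 128k context; if this does not fit on your GPU, you can cap the sequence length with the `--max-model-len` flag (e.g., `--max-model-len 32768`).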

Then, the model can be used from Python as follows:
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python. Δεν γράφεις τίποτα άλλο στις απαντήσεις σου πέρα από τις μεταφρασμένες λίστες."
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(model="ilsp/Llama-Krikri-8B-Instruct",
                                          messages=messages,
                                          temperature=0.0,
                                          top_p=0.95,
                                          max_tokens=8192,
                                          stream=False)

print(response.choices[0].message.content)
# ['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική', 'Συνεπειοκρατική ηθική', 'Ωφελιμιστική ηθική', 'Δεοντολογική ηθική', 'Ηθική αρετών', 'Σχετικιστική ηθική']
```
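
The same endpoint also supports streaming. A minimal sketch that reuses the `client` and `messages` defined above and only switches to `stream=True`:

```python
stream = client.chat.completions.create(
    model="ilsp/Llama-Krikri-8B-Instruct",
    messages=messages,
    temperature=0.0,
    top_p=0.95,
    max_tokens=8192,
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```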

# Evaluation

🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨

# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.