---
license: llama3.1
datasets:
- LVSTCK/macedonian-corpus-raw-dedup
language:
- mk
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- mk
- mkd
- macedonia
---

# 🐂 domestic-yak, a Macedonian LM (base version)

## Model Summary
This model is a Macedonian-language adaptation of the Llama 3.1 8B model. It underwent one epoch of continued pretraining on a deduplicated version of the Macedonian Corpus Raw dataset (approximately 1.6 billion tokens), making it well-suited for Macedonian-language tasks such as text classification, language generation, and translation.
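Below is a minimal usage sketch with 🤗 Transformers. It assumes the checkpoint is published as `LVSTCK/domestic-yak-8B` (the id used in the citation below); the dtype and generation settings are illustrative, not prescribed.

```python
# Minimal inference sketch; the model id is taken from the citation below
# and the generation parameters are only illustrative defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LVSTCK/domestic-yak-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is the base (non-chat) model, so use plain text completion
# rather than a chat template.
prompt = "Македонија е"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```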

### 📊 Results
The table below compares the performance of our model, domestic-yak-8B, with its foundational model, Llama 3.1-8B Instruct, evaluated using the [macedonian-llm-eval](https://github.com/LVSTCK/macedonian-llm-eval) benchmark.

As shown in the table, domestic-yak-8B consistently outperforms its foundational model on all tasks.

| **Task (mk-version)**               | **domestic-yak-8B** | **Llama 3.1-8B Instruct** |
|-------------------------|---------------------------|-----------------------|
| **ARC Easy**           | **0.5244 ± 0.0102**       | 0.4453 ± 0.0102      |
| **ARC Challenge**      | **0.3183 ± 0.0136**       | 0.2824 ± 0.0132      |
| **BoolQ**              | **0.7676 ± 0.0074**       | 0.7639 ± 0.0074      |
| **HellaSwag**          | **0.4324 ± 0.0049**       | 0.3740 ± 0.0048      |
| **Openbook QA**        | **0.2920 ± 0.0204**       | 0.2520 ± 0.0194      |
| **PIQA**               | **0.6687 ± 0.0110**       | 0.5865 ± 0.0115      |
| **NQ Open**            | **0.0416 ± 0.0033**       | 0.0335 ± 0.0030      |
| **WinoGrande**         | **0.6259 ± 0.0136**       | 0.5683 ± 0.0139      |

Quantitative evaluation is just one part of the story. We also conducted qualitative tests, interacting with the model to assess its ability to generate coherent and relevant responses in Macedonian. The results were satisfying: the model not only mastered the language's grammar but also demonstrated a natural and fluent style. 

Here’s a comparison showing the answers from **Llama-3.1-8B-Instruct** and **domestic-yak-8B**:

- **Llama-3.1-8B-Instruct**:  
  *"Здраво! Секој пат кога го користиш оваа платформа, со тобе си во контакт со мојот систем на вештачка интелигенција. Мојот задача е да ти го обезбедам информативен и корисен одговор на секое прашање што можеш да ми го поставиш. Се нарекувам 'Асистент', и сум тука за да ти го правам живот полесен и интересниот. Што твоето прашање е?"*

- **domestic-yak-8B**:  
  *"Здраво, јас сум асистент дизајниран да обезбеди информации и одговори на прашања од корисниците. Можам да одговарам на широк спектар на прашања, од општи факти и информации до специјализирани прашања во различни области. Мојата цел е да обезбедам точни и корисни одговори, како и да го олеснам процесот на пребарување и да заштедам време за корисниците. Слободно постави ми го твоето прашање и ќе се обидам да ти помогнам."*


## 🔑 Key Details
- **Language:** Macedonian (`mk`)
- **Base Model:** [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Dataset:** [LVSTCK/macedonian-corpus-raw-dedup](https://huggingface.co/datasets/LVSTCK/macedonian-corpus-cleaned-dedup) (deduplicated version)
- **Training Tokens:** ~1.6 billion
- **Pretraining Epochs:** 1 epoch
- **Pretraining Objective:** Causal language modeling (full-parameter continued pretraining; all weights were updated). An illustrative training sketch follows this list.
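As a rough illustration of the setup summarized above, full-parameter continued pretraining with a causal LM objective could be wired up with the Hugging Face `Trainer` as sketched below. The hyperparameters, sequence length, and the assumption that the corpus exposes a `text` column are placeholders, not the actual training configuration.

```python
# Hedged sketch of full-parameter continued pretraining (causal LM objective).
# Hyperparameters, sequence length, and the "text" column name are placeholders;
# the actual training configuration is not documented in this card.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_id)  # all weights remain trainable

corpus = load_dataset("LVSTCK/macedonian-corpus-raw-dedup", split="train")

def tokenize(batch):
    # Assumes the corpus stores raw documents under a "text" field.
    return tokenizer(batch["text"], truncation=True, max_length=4096)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domestic-yak-8B",
        num_train_epochs=1,               # one pass over the ~1.6B-token corpus
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```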

## ⚠️ Limitations
- **Biases:** The model may show biases present in the training dataset. Efforts were made to clean and deduplicate the corpus, but further bias mitigation might be necessary for sensitive applications.
- **Domain Specificity:** While the dataset covers diverse domains, performance may vary for niche or underrepresented topics. For example, the dataset is heavily skewed toward 'news'-themed texts, while domains such as 'science' or 'medicine' are less represented.
- **Chat Capabilities:** This is the base model, so its chat capabilities may be limited. For conversational use, see the [instruct version](https://huggingface.co/LVSTCK/domestic-yak-8B-instruct).

## 📬 Contact

For inquiries, feedback, or contributions, please feel free to reach out to the core team:

- [Stefan Krsteski](https://www.linkedin.com/in/stefan-krsteski-136abb235/) [📧](mailto:stefan.krsteski@gmail.com)
- [Matea Tashkovska](https://www.linkedin.com/in/matea-tashkovska-774603198/) [📧](mailto:matea_tas@yahoo.com)
- [Borjan Sazdov](https://www.linkedin.com/in/borjan-sazdov-4b2187211/) [📧](mailto:borjansazdov@yahoo.com)

## Citation
```
@misc{domestic-yak-8B,
  title={domestic-yak-8B: A Macedonian Language Model},
  author={Krsteski, Stefan and Tashkovska, Matea and Sazdov, Borjan},
  year={2024},
  url={https://huggingface.co/LVSTCK/domestic-yak-8B},
  note={Macedonian adaptation of Llama 3.1 8B.}
}
```