---
license: cc-by-nc-sa-4.0
language:
- mk
- en
tags:
- axolotl
---
# MKLLM-7B-Instruct

MKLLM-7B is an open-source Large Language Model for the Macedonian language. The model is built on top of the amazing [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) model by continued pretraining on a mix of Macedonian and English text.
The training corpus contains around 300M tokens, repeated over 2 epochs. Although this may be considered small compared to similar projects, the resulting model is very capable at understanding and processing the Macedonian language.
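
For illustration, continued pretraining of this kind can be set up with the Hugging Face `Trainer`. The sketch below is not the actual MKLLM training recipe; the corpus file, sequence length, and hyperparameters are all assumptions:

```python
# Minimal continued-pretraining sketch (illustrative, not the MKLLM recipe).
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1",
                                             torch_dtype=torch.bfloat16)

# Hypothetical mixed Macedonian/English corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "mk_en_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mkllm-7b-cpt",
                           num_train_epochs=2,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           bf16=True),
    train_dataset=tokenized,
    # mlm=False gives the causal language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```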

This is the instruction-tuned version of MKLLM-7B. It was created by taking MKLLM-7B and performing full instruction training with axolotl, using the ChatML format for conversations.

We tested the model against Meta's Llama3-8B-Instruct and Mistral's Mistral-7B-Instruct-v0.3 on a set of benchmarks we translated into Macedonian, and it performs better than both leading models in its category.
Note that these benchmarks primarily measure understanding rather than generation capabilities and fluency; in those areas we believe the difference is even larger, as MKLLM-7B-Instruct writes much more coherent Macedonian.
The benchmarking was done with https://github.com/N13T/mk-llm-eval
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f85631e019bdfd8cd83f10/k0ztAR-H8xdPZHNxhu35_.png)
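
As a rough sketch of how such an evaluation can be reproduced, assuming the mk-llm-eval fork keeps the Python API of EleutherAI's lm-evaluation-harness (the task names below are hypothetical placeholders; see the repo for the actual ones):

```python
# Hedged sketch: assumes mk-llm-eval exposes lm-evaluation-harness's simple_evaluate.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=trajkovnikola/MKLLM-7B-Instruct,dtype=bfloat16",
    tasks=["arc_challenge_mk", "hellaswag_mk"],  # hypothetical Macedonian task names
)
print(results["results"])
```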

To take advantage of the instruction tuning, your prompt should follow the ChatML format:
```
<|im_start|>system
Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот.<|im_end|>
<|im_start|>user
Која планета е позната како 'Црвената Планета'?<|im_end|>
<|im_start|>assistant
Марс<|im_end|>
```

This prompt is available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating), which means you can format messages using the
`tokenizer.apply_chat_template()` method:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "trajkovnikola/MKLLM-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [
    {"role": "system", "content": "Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот."},
    {"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"}
]

# Apply the ChatML chat template and move the tokenized prompt to the GPU.
gen_input = tokenizer.apply_chat_template(messages,
                                          tokenize=True,
                                          return_dict=True,
                                          return_tensors="pt",
                                          add_generation_prompt=True).to("cuda")

with torch.no_grad():
    generated_ids = model.generate(**gen_input,
                                   max_new_tokens=150,
                                   do_sample=True,
                                   temperature=0.1,
                                   repetition_penalty=1.1)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(generated_ids[0][gen_input["input_ids"].shape[1]:], skip_special_tokens=False))
```
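
If you just want to inspect the formatted prompt string, for example to confirm it matches the ChatML block above, the template can also be rendered without tokenizing:

```python
# Render the chat template to a plain string instead of token ids.
prompt_str = tokenizer.apply_chat_template(messages,
                                           tokenize=False,
                                           add_generation_prompt=True)
print(prompt_str)
```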

**Notes**

- MKLLM-7B-Instruct can hallucinate and produce factually incorrect output. This is especially pronounced when discussing Macedonian topics due to the smaller training dataset.

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)