RichardErkhov committed (verified)
Commit 5048966 · 1 Parent(s): 0ecbbf0

uploaded readme

Files changed (1): README.md (+127 lines)
README.md ADDED
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

emma-500-llama2-7b - GGUF
- Model creator: https://huggingface.co/MaLA-LM/
- Original model: https://huggingface.co/MaLA-LM/emma-500-llama2-7b/

| Name | Quant method | Size |
| ---- | ---- | ---- |
| [emma-500-llama2-7b.Q2_K.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q2_K.gguf) | Q2_K | 2.36GB |
| [emma-500-llama2-7b.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.IQ3_XS.gguf) | IQ3_XS | 2.6GB |
| [emma-500-llama2-7b.IQ3_S.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.IQ3_S.gguf) | IQ3_S | 2.75GB |
| [emma-500-llama2-7b.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q3_K_S.gguf) | Q3_K_S | 2.75GB |
| [emma-500-llama2-7b.IQ3_M.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.IQ3_M.gguf) | IQ3_M | 2.9GB |
| [emma-500-llama2-7b.Q3_K.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q3_K.gguf) | Q3_K | 3.07GB |
| [emma-500-llama2-7b.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q3_K_M.gguf) | Q3_K_M | 3.07GB |
| [emma-500-llama2-7b.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q3_K_L.gguf) | Q3_K_L | 3.35GB |
| [emma-500-llama2-7b.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.IQ4_XS.gguf) | IQ4_XS | 3.4GB |
| [emma-500-llama2-7b.Q4_0.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q4_0.gguf) | Q4_0 | 3.56GB |
| [emma-500-llama2-7b.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.IQ4_NL.gguf) | IQ4_NL | 3.58GB |
| [emma-500-llama2-7b.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q4_K_S.gguf) | Q4_K_S | 3.59GB |
| [emma-500-llama2-7b.Q4_K.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q4_K.gguf) | Q4_K | 3.8GB |
| [emma-500-llama2-7b.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q4_K_M.gguf) | Q4_K_M | 3.8GB |
| [emma-500-llama2-7b.Q4_1.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q4_1.gguf) | Q4_1 | 3.95GB |
| [emma-500-llama2-7b.Q5_0.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q5_0.gguf) | Q5_0 | 4.33GB |
| [emma-500-llama2-7b.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q5_K_S.gguf) | Q5_K_S | 4.33GB |
| [emma-500-llama2-7b.Q5_K.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q5_K.gguf) | Q5_K | 4.45GB |
| [emma-500-llama2-7b.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q5_K_M.gguf) | Q5_K_M | 4.45GB |
| [emma-500-llama2-7b.Q5_1.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q5_1.gguf) | Q5_1 | 4.72GB |
| [emma-500-llama2-7b.Q6_K.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q6_K.gguf) | Q6_K | 5.15GB |
| [emma-500-llama2-7b.Q8_0.gguf](https://huggingface.co/RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf/blob/main/emma-500-llama2-7b.Q8_0.gguf) | Q8_0 | 6.67GB |

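To run one of these GGUF files locally, the following is a minimal sketch, assuming `huggingface_hub` and `llama-cpp-python` are installed; the chosen quant file, context size, and prompt are illustrative, not a recommendation from this repo.

```python
# Minimal sketch: download one quant from this repo and run it with llama-cpp-python.
# Assumes `pip install huggingface_hub llama-cpp-python`; settings are illustrative.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Fetch the Q4_K_M file listed in the table above
model_path = hf_hub_download(
    repo_id="RichardErkhov/MaLA-LM_-_emma-500-llama2-7b-gguf",
    filename="emma-500-llama2-7b.Q4_K_M.gguf",
)

# Load the GGUF file and generate a short continuation
llm = Llama(model_path=model_path, n_ctx=2048)
out = llm("Once upon a time", max_tokens=64)
print(out["choices"][0]["text"])
```
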
Original model description:
---
license: llama2
datasets:
- MaLA-LM/mala-monolingual-split
base_model:
- meta-llama/Llama-2-7b-hf
library_name: transformers
---

# EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

## Model Description

**EMMA-500** is a state-of-the-art multilingual language model designed to improve language representation, especially in low-resource languages, through continual pre-training on the **Llama 2 7B** architecture. Leveraging the **MaLA Corpus**, which spans over 500 languages and 74 billion tokens, EMMA-500 excels in multilingual tasks like commonsense reasoning, machine translation, open-ended generation, and text classification.

**EMMA-500** outperforms other Llama 2-based models in diverse multilingual settings while maintaining robustness in specialized tasks.

---

## Model Details

- **Architecture**: Built on Llama 2 7B with enhanced language adaptation through continual pre-training.
- **Languages**: Supports **546 languages** with substantial training data (over 100k tokens each).
- **Data Mix**: A diverse mix of text from domains like code, books, instruction data, and more.
- **Key Tasks**: Commonsense reasoning, machine translation, text classification, natural language inference, code generation, and open-ended generation.

### Data Access
- [MaLA Corpus](https://huggingface.co/collections/MaLA-LM/mala-corpus-66e05127641a51de34d39529)
- [PolyWrite Benchmark](https://huggingface.co/datasets/MaLA-LM/PolyWrite)

---

## Usage

You can use **EMMA-500** for multilingual text generation. Below is an example of generating text with the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
model_name = "MaLA-LM/emma-500-llama2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a prompt and generate a continuation
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Model Performance

**EMMA-500** was evaluated across multiple benchmarks and tasks, demonstrating:

- **Lowest negative log-likelihood** in intrinsic evaluations.
- Significant improvements in **commonsense reasoning**, **machine translation**, and **open-ended generation**.
- **Outperformed** all Llama 2-based models in **text classification** and **natural language inference**.
- Enhanced performance in **code generation** and **machine reading comprehension (MRC)**.

Challenges remain in low-resource languages, where the model tends to have higher **Self-BLEU** scores, indicating reduced output diversity.

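For reference, Self-BLEU scores each generated sample as a hypothesis against the remaining samples as references, so higher averages mean less diverse output. Below is a minimal sketch of the metric, assuming `sacrebleu` is installed; the helper name and toy samples are illustrative and not taken from the paper's evaluation code.

```python
# Minimal Self-BLEU sketch (illustrative; not the paper's evaluation code).
# Each sample is scored as a hypothesis against all other samples as references;
# the average BLEU is the Self-BLEU, where higher means less diverse generations.
import sacrebleu

def self_bleu(samples):
    scores = []
    for i, hyp in enumerate(samples):
        refs = [s for j, s in enumerate(samples) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)
    return sum(scores) / len(scores)

print(self_bleu([
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Birds fly over the river.",
]))
```
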
---

## Citation

```bibtex
@article{ji2024emma500enhancingmassivelymultilingual,
  title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models},
  author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
  year={2024},
  journal={arXiv preprint arXiv:2409.17892},
  url={https://arxiv.org/abs/2409.17892},
}
```

## Acknowledgements

We extend our thanks to the language communities and contributors who helped source, clean, and validate the diverse data used in the MaLA Corpus. Their efforts are invaluable in supporting linguistic diversity in AI research.

This work was done by researchers at [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) in collaboration with partners from TU Darmstadt, the University of Edinburgh, and LMU Munich. It is funded by [HPLT](https://hplt-project.org) and [UTTER](https://he-utter.eu).