---
library_name: transformers
tags: []
---

# Lugha-Llama/Lugha-Llama-8B-wura

<!-- Provide a quick summary of what the model is/does. -->

Lugha-Llama is an Africa-centric language model developed through continual pretraining on the [WURA dataset](https://huggingface.co/datasets/castorini/wura), a large corpus covering sixteen low-resource African languages and four high-resource languages commonly spoken on the African continent. Using [UniMax sampling](https://openreview.net/forum?id=kXwdL1cWOAi), we sample as uniformly as possible across languages while limiting how often data is repeated, upsampling rare languages by at most four epochs.
We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which yields the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
On the challenging [IrokoBench](https://huggingface.co/collections/masakhane/irokobench-665a21b6d4714ed3f81af3b1) dataset, our models consistently achieve the best performance among similarly sized baselines. In a separate ablation experiment, we translate English education documents into Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language.

We present these findings in our paper [Adapting Large Language Models for African Languages: The Lugha-Llama Model]().

Authors: [Happy Buzaaba](https://buzaabah.github.io/)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [David Ifeoluwa Adelani](https://dadelani.github.io/), [Christiane Fellbaum](https://www.cs.princeton.edu/people/profile/fellbaum) (* equal contribution)

Contact: {happy.buzaaba@, awettig@cs}princeton.edu


* Translated Swahili data (200M tokens): [FineWeb_Edu-swahili-translated](https://huggingface.co/datasets/princeton-nlp/fineweb_edu-swahili-translated)


## Lugha-Llama models

* [Lugha-Llama/Lugha-Llama-8B-wura](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura)
* [Lugha-Llama/Lugha-Llama-8B-wura_edu](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura_edu)
* [Lugha-Llama/Lugha-Llama-8B-wura_math](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura_math)
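
## Usage

Since the card lists `transformers` as the library, the models can be loaded with the standard `AutoModelForCausalLM` API. A minimal sketch is below; the Swahili prompt and generation settings are illustrative assumptions, not taken from the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Lugha-Llama/Lugha-Llama-8B-wura"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Download the tokenizer and model weights from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    # Tokenize the prompt and move tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy/sampled continuation; tune max_new_tokens as needed.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Example Swahili prompt (illustrative): "The history of Africa is"
    print(generate("Historia ya Afrika ni"))
```

The same snippet works for the `wura_edu` and `wura_math` variants by swapping `MODEL_ID` for the corresponding repository name above.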