|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Lugha-Llama/Lugha-Llama-8B-wura |
|
|
|
|
|
|
Lugha-Llama is an Africa-centric language model developed through continual pretraining on the [WURA dataset](https://huggingface.co/datasets/castorini/wura), a large corpus covering sixteen low-resource African languages and four high-resource languages commonly spoken on the African continent.
|
|
|
To train the model, we sample as uniformly as possible across languages while limiting how often data is repeated: rare languages are upsampled for at most four epochs.
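
The sampling rule above can be read as a capped "water-filling" allocation. The following is a minimal sketch under that assumption, not the authors' actual data pipeline: the function name `allocate_tokens`, the language codes, and the token counts are all hypothetical and serve only to illustrate how a four-epoch cap interacts with uniform sampling.

```python
def allocate_tokens(sizes, budget, max_epochs=4.0):
    """Split a token budget as evenly as possible across languages.

    sizes: dict mapping language -> number of available tokens (hypothetical).
    budget: total number of training tokens to allocate.
    max_epochs: cap on how many times any language's data may be repeated.
    """
    # No language may contribute more than max_epochs passes over its data.
    caps = {lang: max_epochs * n for lang, n in sizes.items()}
    if budget >= sum(caps.values()):
        return caps  # the budget exceeds what the caps allow

    alloc = {}
    remaining = budget
    # Visit the smallest languages first: they hit their cap, and whatever
    # they cannot absorb is shared equally among the larger languages.
    pending = sorted(caps, key=caps.get)
    for i, lang in enumerate(pending):
        equal_share = remaining / (len(pending) - i)
        alloc[lang] = min(caps[lang], equal_share)
        remaining -= alloc[lang]
    return alloc


# Illustrative numbers only (not the real WURA statistics).
example = allocate_tokens({"a": 10, "b": 100, "c": 1000}, budget=600)
epochs = {lang: example[lang] / n for lang, n in {"a": 10, "b": 100, "c": 1000}.items()}
```

In this toy case the smallest language is repeated for the full four epochs, while the two larger languages receive equal token shares and are seen for fewer epochs.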
|
We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
|
Our models consistently achieve the best performance among similarly sized baselines on the AfriMMLU, AfriMGSM, and AfriXNLI tasks in IrokoBench.
|
|
|
In a separate ablation experiment, we translate English education documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language. The translated data: [FineWeb_Edu-swahili-translated](https://huggingface.co/datasets/princeton-nlp/fineweb_edu-swahili-translated).
|
|
|
|
|
We present the findings in our paper [coming soon]()
|
|
|
<!-- [Lugha-Llama: Adapting Large Language Models for African Languages]()--> |
|
|
|
Authors: [Happy Buzaaba](https://buzaabah.github.io/)\*, [Alexander Wettig](https://www.cs.princeton.edu/~awettig/)\*, [David Ifeoluwa Adelani](https://dadelani.github.io/), [Christiane Fellbaum](https://www.cs.princeton.edu/people/profile/fellbaum) (* equal contribution) |
|
|
|
Contact `{happy.buzaaba@, awettig@cs}princeton.edu` |
|
|
|
|
|
|
|
|
|
## Lugha-Llama models |
|
|
|
* [Lugha-Llama/Lugha-Llama-8B-wura](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura) |
|
* [Lugha-Llama/Lugha-Llama-8B-wura_edu](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura_edu) |
|
* [Lugha-Llama/Lugha-Llama-8B-wura_math](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura_math) |
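
Since the card lists `transformers` as the library, any of the checkpoints above can presumably be loaded with the standard causal-LM classes. The snippet below is a minimal sketch under that assumption rather than an official usage recipe: the helper name `generate`, the Swahili prompt, the `bfloat16` dtype, and the generation settings are illustrative choices, and `device_map="auto"` additionally requires the `accelerate` package.

```python
MODEL_ID = "Lugha-Llama/Lugha-Llama-8B-wura"


def generate(prompt, model_id=MODEL_ID, max_new_tokens=50):
    """Greedy text completion with a Lugha-Llama checkpoint (illustrative helper)."""
    # Imported lazily so that defining this helper does not load the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: hardware supports bfloat16
        device_map="auto",           # assumption: `accelerate` is installed
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the 8B checkpoint on first use):
# print(generate("Mji mkuu wa Kenya ni"))
```

These are continually pretrained base models, so plain-text continuation prompts like the one above are the natural way to query them; swapping `MODEL_ID` for one of the other two repository names selects the other variants.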
|
|
|
|
|
Our main result |
|
|
|
|
|
![main_result.png](https://cdn-uploads.huggingface.co/production/uploads/62a8501e1c396da5716cfca2/MZw0c4TAnRPYNVdhru7Uo.png) |
|
|
|
<p align="center"> |
|
<em>Performance of Lugha-Llama models and baselines on <a href="https://arxiv.org/abs/2406.03368">IrokoBench</a>. Languages in italics are not present in the continual pre-training data. †: We exclude the high-resource languages English (eng) and French (fra) from the average, as they would otherwise skew the results due to strong English base models.</em>
|
</p> |
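
The averaging rule in the caption can be sketched as follows. This is only an illustration of the stated exclusion: the helper name `average_excluding` is hypothetical, and the scores are made-up placeholders, not results from the figure.

```python
def average_excluding(scores, excluded=("eng", "fra")):
    """Mean score over languages, skipping the excluded high-resource ones."""
    kept = [value for lang, value in scores.items() if lang not in excluded]
    if not kept:
        raise ValueError("no languages left after exclusion")
    return sum(kept) / len(kept)


# Placeholder scores for illustration only (not taken from the paper).
placeholder_scores = {"eng": 80.0, "fra": 70.0, "swa": 40.0, "hau": 30.0}
avg = average_excluding(placeholder_scores)
```

With these placeholder values the average covers only `swa` and `hau`, so the high English and French scores do not inflate it.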
|
|
|
|
|
|