---
license: cc-by-nc-4.0
language:
- en
- de
- fr
- zh
- pt
- nl
- ru
- ko
- it
- es
metrics:
- comet
pipeline_tag: translation
---
# Model Card for TowerBase-7B-v0.1

## Model Details

### Model Description
TowerBase-7B is a language model that results from continuing the pretraining of Llama 2 on a mix of 20 billion tokens of non-English monolingual data and bilingual data. TowerBase-7B-v0.1 is the first model in the series. The resulting model shows improved performance on the supported languages while maintaining Llama 2's capabilities in English. It is particularly well-suited for fine-tuning on translation and related tasks: check out TowerInstruct.
We will release more details in the upcoming technical report.
- Developed by: Unbabel, Instituto Superior Técnico, CentraleSupélec University of Paris-Saclay
- Model type: A 7B parameter model built on top of Llama 2 by continuing pretraining on multilingual data.
- Language(s) (NLP): English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian
- License: CC-BY-NC-4.0
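
For reference, below is a minimal sketch of loading the model with the Hugging Face transformers library. The repository id `Unbabel/TowerBase-7B-v0.1` and the dtype/device settings are assumptions for illustration, not official usage instructions; check the model page for the canonical id and hardware requirements.

```python
# Minimal sketch: load TowerBase-7B with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/TowerBase-7B-v0.1"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 7B model on a single GPU
    device_map="auto",           # requires the accelerate package
)
```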
## Intended uses & limitations
The model is intended for research purposes in the 10 languages it supports. The model performs well on translation and related tasks (e.g., automatic post-editing (APE), grammatical error correction (GEC)) in a few-shot regime. It can also be fine-tuned to perform these tasks in a zero-shot fashion (see TowerInstruct), as well as other multilingual tasks. A sketch of few-shot prompting is shown below.
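
Since this is a base (non-instruct) model, few-shot use means prepending in-context translation pairs and letting the model continue the pattern. The sketch below reuses the `model` and `tokenizer` loaded above; the prompt format and example sentences are illustrative assumptions, not a prescribed template.

```python
# Illustrative few-shot English-to-Portuguese translation prompt.
few_shot_prompt = (
    "English: The book is on the table.\nPortuguese: O livro está na mesa.\n"
    "English: How are you today?\nPortuguese: Como você está hoje?\n"
    "English: The weather is nice.\nPortuguese:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
# Keep only the continuation, i.e. the model's translation of the last line.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```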
### Out-of-Scope Use
The model is not guaranteed to perform well for languages other than the 10 languages it supports.
## Bias, Risks, and Limitations

TowerBase-7B-v0.1 has not been aligned to human preferences, so the model may generate problematic outputs (e.g., hallucinations, harmful content, or false statements).
## Training Data

The training data consists of filtered versions of mC4 and bilingual data from various sources (e.g., OPUS).
## Citation
To be completed.