metadata
license: cc-by-nc-4.0
language:
  - en
  - de
  - fr
  - zh
  - pt
  - nl
  - ru
  - ko
  - it
  - es
metrics:
  - comet
pipeline_tag: translation

Model Card for TowerBase-7B-v0.1

Model Details

Model Description

TowerBase-7B is a language model that results from continuing the pretraining of Llama 2 on a 20-billion-token mix of non-English monolingual data and bilingual data. TowerBase-7B-v0.1 is the first model in the series. The resulting model shows improved performance in the supported languages while maintaining Llama 2's capabilities in English. It is particularly well suited for fine-tuning on translation and related tasks: check out TowerInstruct.

We will release more details in the upcoming technical report.

  • Developed by: Unbabel, Instituto Superior Técnico, CentraleSupélec University of Paris-Saclay
  • Model type: A 7B-parameter model built on top of Llama 2 by continuing pretraining on multilingual data.
  • Language(s) (NLP): English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian
  • License: CC-BY-NC-4.0

Intended uses & limitations

The model is intended for research purposes in the 10 languages it supports. It performs well on translation and related tasks (e.g., automatic post-editing (APE) and grammatical error correction (GEC)) in a few-shot regime. It can also be fine-tuned to perform these tasks in a zero-shot fashion (see TowerInstruct), as well as other multilingual tasks.
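As an illustration of the few-shot regime mentioned above, the sketch below builds a translation prompt from a handful of solved examples and leaves the last target line open for the model to complete. It is a minimal sketch under stated assumptions, not the authors' recommended setup: the `Language: text` layout, the helper names `build_few_shot_prompt` and `translate_few_shot`, and the Hub repository id `Unbabel/TowerBase-7B-v0.1` are all assumptions that this card does not confirm.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    source_text: str,
    src_lang: str = "English",
    tgt_lang: str = "Portuguese",
) -> str:
    """Format solved translation pairs followed by one unsolved source line."""
    # Assumption: a plain "Language: text" layout; the card does not specify one.
    blocks = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in examples]
    # The final block ends with the target-language label so that the model's
    # continuation is the translation itself.
    blocks.append(f"{src_lang}: {source_text}\n{tgt_lang}:")
    return "\n\n".join(blocks)


def translate_few_shot(
    prompt: str,
    model_id: str = "Unbabel/TowerBase-7B-v0.1",  # assumed repository id
    max_new_tokens: int = 40,
) -> str:
    """Generate a continuation with Hugging Face transformers (downloads the weights)."""
    # Imported lazily so the prompt builder works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


demo_prompt = build_few_shot_prompt(
    examples=[
        ("Good morning.", "Bom dia."),
        ("Thank you very much.", "Muito obrigado."),
    ],
    source_text="Where is the train station?",
)
print(demo_prompt)
```

Calling `translate_few_shot(demo_prompt)` would load the model and return its continuation. Because this is a base model rather than an instruction-tuned one, it may keep generating further example pairs, so keeping only the first line of the continuation is a reasonable post-processing step.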

Out-of-Scope Use

The model is not guaranteed to perform well for languages other than the 10 languages it supports.

Bias, Risks, and Limitations

TowerBase-7B-v0.1 has not been aligned to human preferences, so the model may generate problematic outputs (e.g., hallucinations, harmful content, or false statements).

Training Data

The training data consists of filtered versions of mC4 and bilingual data from various sources (e.g., OPUS).

Citation

To be completed.