We release the suite of models trained as part of our work on the scaling laws of decoder-only models for multilingual machine translation. This work was published at WMT24 and is available [here](https://aclanthology.org/2024.wmt-1.124/).
These models were trained on a mixture of general and financial sentences covering 11 language directions. They support 8 languages (English, French, German, Italian, Spanish, Dutch, Swedish, and Portuguese) and 9 domains (general plus 8 financial subdomains). They are not tailored for document-level translation.
A running demo of these models is available on our dedicated space.
## Evaluation
The table below details the performance of our models on general-domain translation.
| Model | BLEU | COMET | COMET-Kiwi |
|---|---|---|---|
| FinTranslate-70M | 29.62 | 81.31 | 80.72 |
| FinTranslate-160M | 32.43 | 84.00 | 83.45 |
| FinTranslate-410M | 33.60 | 84.81 | 84.14 |
| FinTranslate-Bronze | 34.08 | 85.10 | 84.35 |
| FinTranslate-Silver | 34.42 | 85.10 | 84.33 |
| FinTranslate-Gold | 36.07 | 85.88 | 84.82 |
| Llama 3.1 8B | 30.43 | 84.82 | 84.47 |
| Mistral 7B | 23.26 | 80.08 | 82.29 |
| Tower 7B | 33.50 | 85.91 | 85.02 |
The table below details the performance of our models on financial translation.
| Model | BLEU | COMET | COMET-Kiwi |
|---|---|---|---|
| FinTranslate-70M | 44.63 | 86.95 | 80.88 |
| FinTranslate-160M | 49.02 | 88.27 | 81.80 |
| FinTranslate-410M | 50.85 | 88.64 | 81.73 |
| FinTranslate-Bronze | 52.00 | 88.85 | 81.71 |
| FinTranslate-Silver | 53.28 | 89.98 | 81.61 |
| FinTranslate-Gold | 58.34 | 89.62 | 81.35 |
| Llama 3.1 8B | 34.99 | 84.42 | 81.75 |
| Mistral 7B | 38.93 | 76.52 | 76.17 |
| Tower 7B | 38.93 | 86.49 | 82.66 |
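The exact evaluation setup (test sets, COMET checkpoints) is described in the paper. As an illustrative pointer only, BLEU scores of this kind are typically computed with `sacrebleu`; a minimal sketch, assuming one reference per hypothesis:

```python
# Minimal BLEU scoring sketch with sacrebleu (illustrative only; the
# paper describes the actual test sets and COMET models used).
import sacrebleu

hypotheses = ["Lingua Custodia is a French company specialized in the field of generative AI."]
# sacrebleu expects one list of references per reference stream.
references = [["Lingua Custodia is a French company specialised in generative AI."]]

print(sacrebleu.corpus_bleu(hypotheses, references).score)
```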
## How to use it

The snippet below builds the prompt expected by the model (source text followed by language and domain control tokens) and translates a single sentence with `transformers`.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Supported languages (ISO codes) and domains. Domain names map to the
# domain tokens expected by the model.
LANGUAGES = ["en", "de", "es", "fr", "it", "nl", "sv", "pt"]
DOMAINS = {
    "Asset management": "am",
    "Annual report": "ar",
    "Corporate action": "corporateAction",
    "Equity research": "equi",
    "Fund fact sheet": "ffs",
    "Kiid": "kiid",
    "Life insurance": "lifeInsurance",
    "Regulatory": "regulatory",
    "General": "general",
}


def language_token(lang):
    return f"<lang_{lang}>"


def domain_token(dom):
    return f"<dom_{dom}>"


def format_input(src, tgt_lang, src_lang=None, domain=None):
    """Build the prompt: source text followed by the target language,
    source language, and domain control tokens."""
    assert tgt_lang in LANGUAGES
    tgt_lang_token = language_token(tgt_lang)
    # Please read our paper to understand why we need to prefix the input with <eos>
    base_input = f"<eos>{src}</src>{tgt_lang_token}"
    if src_lang is None:
        return base_input
    assert src_lang in LANGUAGES
    src_lang_token = language_token(src_lang)
    base_input = f"{base_input}{src_lang_token}"
    if domain is None:
        return base_input
    # Unknown domain names fall back to the general domain.
    domain = DOMAINS.get(domain, "general")
    dom_token = domain_token(domain)
    return f"{base_input}{dom_token}"
```
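For example, formatting a French source sentence for translation into English in the general domain produces the following prompt (derived directly from the helpers above):

```python
# The control tokens are appended after the source text:
print(format_input("Bonjour", "en", "fr", "General"))
# <eos>Bonjour</src><lang_en><lang_fr><dom_general>
```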
```python
model_id = "LinguaCustodia/FinTranslate-410M"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

source_sentence = "Lingua Custodia est une entreprise française spécialisée dans le domaine de l'IA générative."
formatted_sentence = format_input(source_sentence, "en", "fr", "General")

inputs = tokenizer(formatted_sentence, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, skipping the prompt.
input_size = inputs["input_ids"].size(1)
translated_sentence = tokenizer.decode(
    outputs[0, input_size:], skip_special_tokens=True
)
print(translated_sentence)
# Lingua Custodia is a French company specialized in the field of generative AI.
```
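To translate several sentences at once, a minimal batched sketch follows (not part of the original example). It assumes the tokenizer defines a pad token (or can reuse `<eos>` as one) and uses left padding, which decoder-only models generally require for batched generation:

```python
# Hedged batched-translation sketch; pad-token handling is an assumption.
sentences = [
    "Le fonds investit principalement en actions européennes.",
    "Les performances passées ne préjugent pas des performances futures.",
]
prompts = [format_input(s, "en", "fr", "Fund fact sheet") for s in sentences]

# Left padding keeps every prompt flush against the generated tokens.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse <eos> for padding

batch = tokenizer(prompts, return_tensors="pt", padding=True, return_token_type_ids=False)
outputs = model.generate(**batch, max_new_tokens=64)

# With left padding, all prompts end at the same position.
input_size = batch["input_ids"].size(1)
for row in outputs:
    print(tokenizer.decode(row[input_size:], skip_special_tokens=True))
```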
## Citing this work
If you use this model in your work, please cite it as:
```bibtex
@inproceedings{caillaut-etal-2024-scaling,
    title = "Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task",
    author = {Caillaut, Ga{\"e}tan and
      Nakhl{\'e}, Mariam and
      Qader, Raheel and
      Liu, Jingshu and
      Barth{\'e}lemy, Jean-Gabriel},
    editor = "Haddow, Barry and
      Kocmi, Tom and
      Koehn, Philipp and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.124/",
    doi = "10.18653/v1/2024.wmt-1.124",
    pages = "1318--1331"
}
```