We release the suite of models trained as part of our work on the scaling laws of decoder-only machine translation systems. This work was published at WMT 2024 and is available here.
These models were trained on a mixture of general and financial sentences covering 11 language directions. They support 8 languages (English, French, German, Italian, Spanish, Dutch, Swedish, and Portuguese) and 9 domains (general plus 8 financial subdomains). They are not tailored for document-level translation.
A running demo of these models is available on our dedicated space.
Evaluation
The table below details the performance of our models on general-domain translation.
Model | BLEU | COMET | COMET-Kiwi |
---|---|---|---|
FinTranslate-70M | 29.62 | 81.31 | 80.72 |
FinTranslate-160M | 32.43 | 84.00 | 83.45 |
FinTranslate-410M | 33.60 | 84.81 | 84.14 |
FinTranslate-Bronze | 34.08 | 85.10 | 84.35 |
FinTranslate-Silver | 34.42 | 85.10 | 84.33 |
FinTranslate-Gold | 36.07 | 85.88 | 84.82 |
Llama 3.1 8B | 30.43 | 84.82 | 84.47 |
Mistral 7B | 23.26 | 80.08 | 82.29 |
Tower 7B | 33.50 | 85.91 | 85.02 |
The table below details the performance of our models on financial translation.
Model | BLEU | COMET | COMET-Kiwi |
---|---|---|---|
FinTranslate-70M | 44.63 | 86.95 | 80.88 |
FinTranslate-160M | 49.02 | 88.27 | 81.80 |
FinTranslate-410M | 50.85 | 88.64 | 81.73 |
FinTranslate-Bronze | 52.00 | 88.85 | 81.71 |
FinTranslate-Silver | 53.28 | 89.98 | 81.61 |
FinTranslate-Gold | 58.34 | 89.62 | 81.35 |
Llama 3.1 8B | 34.99 | 84.42 | 81.75 |
Mistral 7B | 38.93 | 76.52 | 76.17 |
Tower 7B | 38.93 | 86.49 | 82.66 |
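For reference, scores of this kind can be reproduced with standard open-source tooling. The snippet below is only a minimal sketch, assuming the sacrebleu and unbabel-comet packages are installed and that `sources`, `hypotheses`, and `references` are parallel lists of strings; the exact test sets and COMET checkpoint used for the tables above are described in the paper and may differ from this example.

import sacrebleu
from comet import download_model, load_from_checkpoint

# sources, hypotheses, references: parallel lists of strings (not provided here).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
# COMET returns scores roughly in [0, 1]; multiply by 100 to compare with the tables above.
print("COMET:", 100 * comet_model.predict(data, batch_size=8, gpus=0).system_score)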
How to use it
from transformers import AutoTokenizer, AutoModelForCausalLM

# Languages supported by the models.
LANGUAGES = ["en", "de", "es", "fr", "it", "nl", "sv", "pt"]

# Mapping from human-readable domain names to the domain tags expected by the models.
DOMAINS = {
    "Asset management": "am",
    "Annual report": "ar",
    "Corporate action": "corporateAction",
    "Equity research": "equi",
    "Fund fact sheet": "ffs",
    "Kiid": "kiid",
    "Life insurance": "lifeInsurance",
    "Regulatory": "regulatory",
    "General": "general",
}


def language_token(lang):
    return f"<lang_{lang}>"


def domain_token(dom):
    return f"<dom_{dom}>"
def format_input(src, tgt_lang, src_lang=None, domain=None):
    """Build the prompt expected by the model:
    <eos>{src}</src><lang_tgt>[<lang_src>][<dom_domain>]"""
    assert tgt_lang in LANGUAGES
    tgt_lang_token = language_token(tgt_lang)
    # Please read our paper to understand why we need to prefix the input with <eos>
    base_input = f"<eos>{src}</src>{tgt_lang_token}"
    if src_lang is None:
        return base_input
    assert src_lang in LANGUAGES
    src_lang_token = language_token(src_lang)
    base_input = f"{base_input}{src_lang_token}"
    if domain is None:
        return base_input
    # Unknown domain names fall back to the general domain.
    domain = DOMAINS.get(domain, "general")
    dom_token = domain_token(domain)
    base_input = f"{base_input}{dom_token}"
    return base_input
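

# For illustration (derived from the string operations above), a fully specified call
# produces the following prompt:
#   format_input("Bonjour", "en", src_lang="fr", domain="General")
#   -> "<eos>Bonjour</src><lang_en><lang_fr><dom_general>"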
model_id = "LinguaCustodia/FinTranslate-Bronze"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

source_sentence = "Lingua Custodia est une entreprise française spécialisée dans le domaine de l'IA générative."
formatted_sentence = format_input(source_sentence, "en", "fr", "General")

inputs = tokenizer(formatted_sentence, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)

# The generated sequence starts with the prompt, so only decode the newly generated tokens.
input_size = inputs["input_ids"].size(1)
translated_sentence = tokenizer.decode(
    outputs[0, input_size:], skip_special_tokens=True
)
print(translated_sentence)
# Lingua Custodia is a French company specialized in the field of generative AI.
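For convenience, the steps above can be wrapped into a small helper. The function below is only a sketch we add for illustration, not part of the released code; it reuses the `model`, `tokenizer`, and `format_input` objects defined above, and `max_new_tokens` should be adjusted to the length of your inputs.

def translate(sentence, tgt_lang, src_lang=None, domain=None, max_new_tokens=128):
    # Hypothetical convenience wrapper around the snippet above, not an official API.
    prompt = format_input(sentence, tgt_lang, src_lang, domain)
    inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    input_size = inputs["input_ids"].size(1)
    return tokenizer.decode(outputs[0, input_size:], skip_special_tokens=True)

# Example: translate a financial sentence with an explicit domain tag.
print(translate("Le fonds vise une croissance du capital à long terme.", "en", src_lang="fr", domain="Fund fact sheet"))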
Usage Permissions:
Evaluation: You are encouraged to use this model for non-commercial evaluation purposes. Feel free to test and assess its performance in machine translation and various generative tasks.
Limitations:
Commercial Services: If you intend to use this model to build a commercial service, i.e. to use it for profit, you must contact Lingua Custodia to obtain proper authorization. This requirement ensures that any commercial use of the model is done in collaboration with Lingua Custodia, and helps maintain the quality and consistency of the model's use in commercial contexts.
Contact Information:
For inquiries regarding commercial use authorization, or for questions about larger models not available on Hugging Face, please contact us at support@linguacustodia.com.
We believe in the power of open source and collaborative efforts, and we are excited to contribute to the community's advancements in the field of natural language processing. Please respect the terms of the CC-BY-NC-SA-4.0 license when using our models for non-commercial purposes.
Citing this work
If you use this model in your work, please cite it as:
@inproceedings{caillaut-etal-2024-scaling,
    title = "Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task",
    author = {Caillaut, Ga{\"e}tan and
      Nakhl{\'e}, Mariam and
      Qader, Raheel and
      Liu, Jingshu and
      Barth{\'e}lemy, Jean-Gabriel},
    editor = "Haddow, Barry and
      Kocmi, Tom and
      Koehn, Philipp and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.124/",
    doi = "10.18653/v1/2024.wmt-1.124",
    pages = "1318--1331"
}