license: apache-2.0
language:
- fr
library_name: transformers
tags:
- NMT
- orféo
- pytorch
- pictograms
- translation
metrics:
- sacrebleu
inference: false
t2p-nmt-orfeo
t2p-nmt-orfeo is a text-to-pictograms translation model, built by training an NMT model from scratch on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from ARASAAC). The model is used only for inference.
Training details
The model was trained with Fairseq.
Datasets
The model was trained on the Propicto-orféo dataset, which was created from the CEFC-Orféo corpus. This dataset was presented in the research paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation" at LREC-Coling 2024. The dataset was split into training, validation, and test sets.
Split | Number of utterances |
---|---|
train | 231,374 |
valid | 28,796 |
test | 29,009 |
Parameters
The training pipeline was run with the following arguments:
fairseq-train \
data-bin/orfeo.tokenized.fr-frp \
--arch transformer_iwslt_de_en --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--save-dir exp_orfeo/checkpoints/nmt_fr_frp_orfeo \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--max-epoch 40 \
--keep-best-checkpoints 5 \
--keep-last-epochs 5
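The `--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000` arguments describe a learning rate that rises linearly during warmup and then decays with the inverse square root of the update number. The following is a minimal sketch of that schedule, not Fairseq's own code. The function name `inverse_sqrt_lr` is hypothetical, and the sketch assumes that warmup starts from a learning rate of 0 (no `--warmup-init-lr` is given in the command above).

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Sketch of an inverse-square-root schedule with linear warmup.

    Assumption: warmup starts at warmup_init_lr = 0.0, since the training
    command does not set this value explicitly.
    """
    if step < warmup_updates:
        # Linear increase from warmup_init_lr up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_updates / step) ** 0.5


if __name__ == "__main__":
    for step in (1000, 4000, 16000, 64000):
        print(step, inverse_sqrt_lr(step))
```

Under these assumptions the learning rate peaks at update 4000 and is halved each time the update number is multiplied by four.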
Evaluation
The model was evaluated with sacreBLEU, where we compared the reference pictogram translation with the model hypothesis.
fairseq-generate exp_orfeo/data-bin/orfeo.tokenized.fr-frp \
--path exp_orfeo/checkpoints/nmt_fr_frp_orfeo/checkpoint.best_bleu_87.2803.pt \
--batch-size 128 --beam 5 --remove-bpe > gen_orfeo.out
The output file contains entries in the following format:
S-16709 peut-être vous pouvez vous exprimer
T-16709 vous pouvoir exprimer
H-16709 -0.0769597738981247 vous pouvoir exprimer
D-16709 -0.0769597738981247 vous pouvoir exprimer
P-16709 -0.0936 -0.0924 -0.0065 -0.1154
Generate test with beam=5: BLEU4 = 87.43, 95.2/89.8/85.0/80.4 (BP=1.000, ratio=1.006, syslen=250949, reflen=249520)
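In this output, `S` lines hold the source sentence, `T` the reference, `H` the hypothesis preceded by its score, `D` the detokenized hypothesis, and `P` the per-token scores. The sketch below shows one way to collect the source, reference, and hypothesis for each sentence id, and checks that the reported BLEU is consistent with the listed n-gram precisions under the standard BLEU definition (brevity penalty times the geometric mean of the precisions). The helper names `parse_generate_output` and `bleu_from_precisions` are hypothetical and not part of Fairseq. Because the precisions in the log are rounded, the recomputed score is only approximate.

```python
import math


def parse_generate_output(lines):
    """Collect source (S), reference (T) and hypothesis (H) text per sentence id."""
    samples = {}
    for line in lines:
        fields = line.split()
        if not fields or "-" not in fields[0]:
            continue  # skip blank lines and the final summary line
        kind, _, sent_id = fields[0].partition("-")
        if kind in ("S", "T"):
            text = " ".join(fields[1:])
        elif kind == "H":
            text = " ".join(fields[2:])  # fields[1] is the hypothesis score
        else:
            continue  # D- and P- lines are ignored in this sketch
        samples.setdefault(sent_id, {})[kind] = text
    return samples


def bleu_from_precisions(precisions, brevity_penalty):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    sample = [
        "S-16709\tpeut-être vous pouvez vous exprimer",
        "T-16709\tvous pouvoir exprimer",
        "H-16709\t-0.0769597738981247\tvous pouvoir exprimer",
        "D-16709\t-0.0769597738981247\tvous pouvoir exprimer",
        "P-16709\t-0.0936 -0.0924 -0.0065 -0.1154",
    ]
    parsed = parse_generate_output(sample)
    print(parsed["16709"])
    # Values taken from the summary line above (BP=1.000).
    print(bleu_from_precisions([95.2, 89.8, 85.0, 80.4], 1.0))
```

The recomputed value should land close to the reported BLEU4 of 87.43.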
Results
Comparison with other translation models (BLEU scores on the validation and test sets):
Model | validation | test |
---|---|---|
t2p-t5-large-orfeo | 85.2 | 85.8 |
t2p-nmt-orfeo | 87.2 | 87.4 |
t2p-mbart-large-cc25-orfeo | 75.2 | 75.6 |
t2p-nllb-200-distilled-600M-orfeo | 86.3 | 86.9 |
Environmental Impact
Training was performed on a single Nvidia V100 GPU with 32 GB of memory and took around 2 hours in total.
Using the t2p-nmt-orfeo model
The scripts to use the t2p-nmt-orfeo model are located in the speech-to-pictograms GitHub repository.
Information
- Language(s): French
- License: Apache-2.0
- Developed by: Cécile Macaire
- Funded by
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- Authors
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab
Citation
If you use this model for your own research work, please cite as follows:
@inproceedings{macaire_jeptaln2024,
title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
url = {https://inria.hal.science/hal-04623007},
booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
address = {Toulouse, France},
publisher = {{ATALA \& AFPC}},
volume = {1 : articles longs et prises de position},
pages = {22-35},
year = {2024}
}