---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- NMT
- orféo
- pytorch
- pictograms
- translation
metrics:
- bleu
inference: false
---

# t2p-nmt-orfeo

*t2p-nmt-orfeo* is a text-to-pictograms translation model built by training the [NMT](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md) model from scratch on a dataset of pairs of transcriptions and pictogram-token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is intended for **inference** only.

## Training details

The model was trained with [Fairseq](https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md).

### Datasets

The [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto) is used, which was created from the CEFC-Orféo corpus.
This dataset was presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-COLING 2024.
The dataset was split into training, validation, and test sets:

| **Split** | **Number of utterances** |
|:---------:|:------------------------:|
| train     | 231,374                  |
| valid     | 28,796                   |
| test      | 29,009                   |

### Parameters

These are the arguments used in the training pipeline:

```bash
fairseq-train \
    data-bin/orfeo.tokenized.fr-frp \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --save-dir exp_orfeo/checkpoints/nmt_fr_frp_orfeo \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-epoch 40 \
    --keep-best-checkpoints 5 \
    --keep-last-epochs 5
```

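The `data-bin/orfeo.tokenized.fr-frp` directory passed to `fairseq-train` is a binarized Fairseq dataset. The card does not include the preprocessing command; a minimal sketch, assuming tokenized parallel files with hypothetical `train`/`valid`/`test` prefixes, would binarize them with `fairseq-preprocess`:

```bash
# Hypothetical preprocessing step: binarize tokenized parallel data
# (fr = French transcriptions, frp = pictogram token sequences).
# The file prefixes below are placeholders, not taken from this card.
fairseq-preprocess \
    --source-lang fr --target-lang frp \
    --trainpref corpus/train --validpref corpus/valid --testpref corpus/test \
    --destdir data-bin/orfeo.tokenized.fr-frp \
    --workers 4
```
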
### Evaluation

The model was evaluated with BLEU, comparing the reference pictogram translation with the model hypothesis.

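The exact scoring command is not given in the card; a minimal sketch, reusing the data and checkpoint paths from the training command above, is to decode the test split with `fairseq-generate`, which prints corpus BLEU against the reference pictogram sequences at the end of its output:

```bash
# Decode the test set and report BLEU against the reference pictogram sequences.
# Data and checkpoint paths are assumed from the training command above.
fairseq-generate data-bin/orfeo.tokenized.fr-frp \
    --path exp_orfeo/checkpoints/nmt_fr_frp_orfeo/checkpoint_best.pt \
    --gen-subset test \
    --beam 5 \
    --remove-bpe
```
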
### Results

Comparison to other translation models (BLEU scores):

| **Model** | **validation** | **test** |
|:---------------------------------:|:--------:|:--------:|
| t2p-t5-large-orféo                | 85.2     | 85.8     |
| **t2p-nmt-orféo**                 | **87.2** | **87.4** |
| t2p-mbart-large-cc25-orfeo        | 75.2     | 75.6     |
| t2p-nllb-200-distilled-600M-orfeo | 86.3     | 86.9     |

### Environmental Impact

Training was performed on a single NVIDIA V100 GPU with 32 GB of memory and took around 2 hours in total.

## Using the t2p-nmt-orfeo model

The scripts to use the *t2p-nmt-orfeo* model are located in the [speech-to-pictograms GitHub repository](https://github.com/macairececile/speech-to-pictograms).

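Independently of those scripts, a minimal command-line sketch for inference with the Fairseq CLI could look as follows, assuming the data and checkpoint paths from the training command above and an input tokenized the same way as the training data (the example sentence is purely illustrative):

```bash
# Translate a French utterance into pictogram tokens.
# Paths and the example input are assumptions, not taken from this card.
echo "je voudrais un verre d eau" | fairseq-interactive data-bin/orfeo.tokenized.fr-frp \
    --path exp_orfeo/checkpoints/nmt_fr_frp_orfeo/checkpoint_best.pt \
    --source-lang fr --target-lang frp \
    --beam 5
```
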
## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
    title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
    author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
    url = {https://inria.hal.science/hal-04623007},
    booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
    address = {Toulouse, France},
    publisher = {{ATALA \& AFPC}},
    volume = {1 : articles longs et prises de position},
    pages = {22-35},
    year = {2024}
}
```