---
license: apache-2.0
language:
- fr
library_name: transformers
tags:
- nllb
- commonvoice
- pytorch
- pictograms
- translation
metrics:
- bleu
inference: false
---

# t2p-nllb-200-distilled-600M-orfeo

*t2p-nllb-200-distilled-600M-orfeo* is a text-to-pictograms translation model built by fine-tuning the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model on a dataset of paired transcriptions and pictogram token sequences (each token is linked to a pictogram image from [ARASAAC](https://arasaac.org/)).
The model is used only for **inference**. 

## Training details

### Datasets

The model was fine-tuned on the [Propicto-orféo dataset](https://www.ortolang.fr/market/corpora/propicto), which was created from the CEFC-Orféo corpus.
This dataset was presented in the research paper ["A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation"](https://aclanthology.org/2024.lrec-main.76/) at LREC-Coling 2024. The dataset was split into training, validation, and test sets.

| **Split** | **Number of utterances** |
|:-----------:|:-----------------------:|
| train | 231,374 |
| valid | 28,796 |
| test | 29,009 |

### Parameters

A full list of the parameters is available in the config.json file. The training pipeline used the following arguments:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="checkpoints_orfeo/",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=40,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True
)
```
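As a rough back-of-the-envelope sketch, the number of optimizer steps implied by these arguments can be estimated from the size of the train split. This assumes a single device (consistent with the single GPU mentioned under Environmental Impact), no gradient accumulation, and that all 40 epochs run to completion; none of these is confirmed by the arguments shown.

```python
import math

train_utterances = 231_374  # size of the train split
batch_size = 32             # per_device_train_batch_size
epochs = 40                 # num_train_epochs

# Assumption: one device and no gradient accumulation, so one batch = one step.
steps_per_epoch = math.ceil(train_utterances / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 7231 289240
```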

### Evaluation

The model was evaluated with [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu/blob/d94719691d29f7adf7151c8b1471de579a78a280/sacrebleu.py), where we compared the reference pictogram translation with the model hypothesis.
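To illustrate what such a comparison measures, here is a minimal, dependency-free sketch of a corpus-level BLEU score. It is a simplified stand-in and not sacreBLEU itself: it splits on whitespace, applies no smoothing, and supports a single reference per hypothesis, so its scores will not necessarily match the reported ones. The function names and the example pictogram token sequences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous n-gram in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_corpus_bleu(hypotheses, references, max_n=4):
    # Simplified BLEU: clipped n-gram precision (n = 1..max_n), brevity penalty,
    # geometric mean, no smoothing. Returns a score between 0 and 100.
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any empty n-gram order gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

# Hypothetical reference and hypothesis pictogram token sequences.
references = ["je manger un pomme"]
perfect = simple_corpus_bleu(["je manger un pomme"], references)
imperfect = simple_corpus_bleu(["je manger pomme rouge"], references)
```

An exact match scores 100, while the second hypothesis scores 0 here because it shares no trigram with the reference and no smoothing is applied.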

### Results

BLEU scores (sacreBLEU) compared with other translation models:

| **Model** | **validation** | **test** |
|:-----------:|:-----------------------:|:-----------------------:|
| t2p-t5-large-orféo | 85.2 | 85.8 |
| t2p-nmt-orféo | **87.2** | **87.4** | 
| t2p-mbart-large-cc25-orfeo | 75.2 | 75.6 |
| **t2p-nllb-200-distilled-600M-orfeo** | 86.3 | 86.9 |

### Environmental Impact

Fine-tuning ran on a single Nvidia V100 GPU with 32 GB of memory and took 30 hours in total.

## Using the t2p-nllb-200-distilled-600M-orfeo model with Hugging Face Transformers

```python
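import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Language codes and length limits (defined for reference; not used below).
source_lang = "fr"
target_lang = "frp"
max_input_length = 128
max_target_length = 128

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-orfeo")
# The model and its inputs must live on the same device.
model = AutoModelForSeq2SeqLM.from_pretrained("Propicto/t2p-nllb-200-distilled-600M-orfeo").to(device)

inputs = tokenizer("Je mange une pomme", return_tensors="pt").input_ids
outputs = model.generate(inputs.to(device), max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)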
pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Linking the predicted sequence of tokens to the corresponding ARASAAC pictograms

```python
import pandas as pd

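def process_output_trad(pred):
    # Split the decoded prediction into individual pictogram tokens.
    return pred.split()

def read_lexicon(lexicon):
    # The lexicon is a tab-separated file with 'lemma' and 'id_picto' columns.
    df = pd.read_csv(lexicon, sep='\t')
    # Drop the ' #...' suffix and join multi-word lemmas with underscores.
    df['keyword_no_cat'] = df['lemma'].str.split(' #').str[0].str.strip().str.replace(' ', '_')
    return df

def get_id_picto_from_predicted_lemma(df_lexicon, lemma):
    # Return the first matching pictogram ID, or 0 when the lemma is not in the lexicon.
    id_picto = df_lexicon.loc[df_lexicon['keyword_no_cat'] == lemma, 'id_picto'].tolist()
    return (id_picto[0], lemma) if id_picto else (0, lemma)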

lexicon = read_lexicon("lexicon.csv")
sentence_to_map = process_output_trad(pred)
pictogram_ids = [get_id_picto_from_predicted_lemma(lexicon, lemma) for lemma in sentence_to_map]
```
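The same lookup logic can be tried without the real lexicon file. The following is a minimal sketch using only the standard library and a toy in-memory lexicon; the pictogram IDs, the lemma rows, and the helper names `build_lookup` and `map_tokens` are made up for illustration and do not come from the actual `lexicon.csv`.

```python
import csv
import io

# Toy tab-separated lexicon; IDs and lemmas are hypothetical.
TOY_LEXICON_TSV = (
    "id_picto\tlemma\n"
    "101\tje #1\n"
    "102\tmanger\n"
    "103\tpomme de terre\n"
)

def build_lookup(tsv_file):
    # Map each normalized keyword to the first pictogram ID listed for it.
    lookup = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        keyword = row["lemma"].split(" #")[0].strip().replace(" ", "_")
        lookup.setdefault(keyword, int(row["id_picto"]))
    return lookup

def map_tokens(lookup, prediction):
    # Pair every predicted token with its pictogram ID (0 when unknown).
    return [(lookup.get(token, 0), token) for token in prediction.split()]

lookup = build_lookup(io.StringIO(TOY_LEXICON_TSV))
pairs = map_tokens(lookup, "je manger pomme_de_terre inconnu")
print(pairs)
# [(101, 'je'), (102, 'manger'), (103, 'pomme_de_terre'), (0, 'inconnu')]
```

Tokens missing from the lexicon map to ID 0, matching the convention used by `get_id_picto_from_predicted_lemma`.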

## Viewing the predicted sequence of ARASAAC pictograms in an HTML file

```python
def generate_html(ids):
    html_content = '<html><body>'
    for picto_id, lemma in ids:
        if picto_id != 0:  # ignore invalid IDs
            img_url = f"https://static.arasaac.org/pictograms/{picto_id}/{picto_id}_500.png"
            html_content += f'''
            <figure style="display:inline-block; margin:1px;">
                <img src="{img_url}" alt="{lemma}" width="200" height="200" />
                <figcaption>{lemma}</figcaption>
            </figure>
            '''
    html_content += '</body></html>'
    return html_content
    
html = generate_html(pictogram_ids)
with open("pictograms.html", "w", encoding="utf-8") as file:  # lemmas may contain accented characters
    file.write(html)
```

## Information

- **Language(s):** French
- **License:** Apache-2.0
- **Developed by:** Cécile Macaire
- **Funded by:**
  - GENCI-IDRIS (Grant 2023-AD011013625R1)
  - PROPICTO ANR-20-CE93-0005
- **Authors:**
  - Cécile Macaire
  - Chloé Dion
  - Emmanuelle Esperança-Rodier
  - Benjamin Lecouteux
  - Didier Schwab


## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{macaire_jeptaln2024,
  title = {{Approches cascade et de bout-en-bout pour la traduction automatique de la parole en pictogrammes}},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Schwab, Didier and Lecouteux, Benjamin and Esperan{\c c}a-Rodier, Emmanuelle},
  url = {https://inria.hal.science/hal-04623007},
  booktitle = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  address = {Toulouse, France},
  publisher = {{ATALA \& AFPC}},
  volume = {1 : articles longs et prises de position},
  pages = {22-35},
  year = {2024}
}
```