---
language: fr
pipeline_tag: "token-classification"
widget:
 - text: "je voudrais réserver une chambre à paris pour demain et lundi"
 - text: "d'accord pour l'hôtel à quatre vingt dix euros la nuit"
 - text: "deux nuits s'il vous plait"
 - text: "dans un hôtel avec piscine à marseille"
tags:
- bert
- flaubert
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
- MEDIA
---

# vpelloin/MEDIA_NLU-flaubert_oral_ft
This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
It maps each input word to an output concept tag (76 tags available).

This model was trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft) as its initial checkpoint. It obtained an 11.98% Concept Error Rate (CER, *lower is better*) on the MEDIA test set, as reported in [our Interspeech 2022 publication](http://doi.org/10.21437/Interspeech.2022-352), using Kaldi ASR transcriptions.
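To give an intuition for the CER figure, here is a minimal sketch that assumes CER is computed like a word error rate, but over concept sequences: substitutions, deletions, and insertions divided by the number of reference concepts. The function names and the concept labels in the example are illustrative assumptions, and the official MEDIA scoring may differ in details such as which concepts are ignored.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two concept sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference concept
                curr[j - 1] + 1,         # insertion of a spurious concept
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def concept_error_rate(refs: List[Sequence[str]],
                       hyps: List[Sequence[str]]) -> float:
    """Corpus-level error rate in percent (assumed definition, see above)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference contains no concepts")
    return 100.0 * errors / total


# Hypothetical concept sequences: one reference concept is missed.
refs = [["command-tache", "localisation-ville", "temps-date"]]
hyps = [["command-tache", "temps-date"]]
print(f"{concept_error_rate(refs, hyps):.2f}% CER")
```

Errors are summed over the whole corpus before dividing, so long utterances weigh more than short ones.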

## Available MEDIA NLU models:
- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on MEDIA test.

## Usage with Pipeline
```python
from transformers import pipeline

# Load the fine-tuned model as a token-classification pipeline.
nlu = pipeline(
    task="token-classification",
    model="vpelloin/MEDIA_NLU-flaubert_oral_ft",
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille",
]

# Print one (token, concept tag) pair per token of each sentence.
for sentence in sentences:
    print([(tok["word"], tok["entity"]) for tok in nlu(sentence)])
```
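
The pipeline returns one dictionary per token, with `word` and `entity` keys. The sketch below groups consecutive tokens into concept spans. It assumes the tags follow a BIO scheme (`B-<concept>`, `I-<concept>`, `O`), which this card does not state; check `model.config.id2label` before relying on it. The helper name and the tags in the example are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def group_concepts(tagged: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Merge token-level tags into (concept, text) spans.

    Assumes BIO-style tags; any tag without a B-/I- prefix closes the span.
    """
    spans: List[Tuple[str, List[str]]] = []
    current: Optional[str] = None  # concept of the span being built, if any
    for tok in tagged:
        prefix, _, concept = tok["entity"].partition("-")
        if prefix not in ("B", "I") or not concept:
            current = None  # outside any concept
            continue
        if prefix == "I" and current == concept:
            spans[-1][1].append(tok["word"])  # continue the open span
        else:
            # "B-" tag, or a stray "I-" tag: start a new span.
            spans.append((concept, [tok["word"]]))
            current = concept
    return [(concept, " ".join(words)) for concept, words in spans]


# Hand-written input in the shape of the pipeline output (tags are made up).
example = [
    {"word": "à", "entity": "O"},
    {"word": "quatre", "entity": "B-paiement-montant"},
    {"word": "vingt", "entity": "I-paiement-montant"},
    {"word": "dix", "entity": "I-paiement-montant"},
    {"word": "euros", "entity": "B-paiement-monnaie"},
]
print(group_concepts(example))
```

Words are joined with spaces, so if the tokenizer splits words into subword units, the span text may need further cleanup.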
## Usage with AutoTokenizer/AutoModel
```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)

tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille",
]

# Tokenize the batch, padding every sentence to the longest one.
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring tag id at each position and map it to its label.
# Positions holding special or padding tokens receive a label as well.
print([
    [model.config.id2label[i] for i in b]
    for b in logits.argmax(dim=-1).tolist()
])
```
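
The snippet above prints one label per position, including padding and special tokens. A minimal sketch of how to drop them and pair each remaining token with its label follows. The helper name `decode_batch` and the toy values are assumptions for illustration. With the real model, the inputs would come from `tokenizer.convert_ids_to_tokens(...)`, `inputs["attention_mask"].tolist()`, `model.config.id2label`, and `tokenizer.all_special_tokens`.

```python
from typing import Dict, Iterable, List, Tuple


def decode_batch(
    tokens_batch: List[List[str]],
    label_ids_batch: List[List[int]],
    attention_mask_batch: List[List[int]],
    id2label: Dict[int, str],
    special_tokens: Iterable[str],
) -> List[List[Tuple[str, str]]]:
    """Pair tokens with predicted labels, skipping padding and special tokens."""
    special = set(special_tokens)
    decoded = []
    for tokens, label_ids, mask in zip(
        tokens_batch, label_ids_batch, attention_mask_batch
    ):
        decoded.append([
            (tok, id2label[label_id])
            for tok, label_id, keep in zip(tokens, label_ids, mask)
            if keep == 1 and tok not in special  # real, non-special position
        ])
    return decoded


# Toy batch with made-up tokens, ids, and labels (not actual model output).
toy_id2label = {0: "O", 1: "B-nombre-chambre"}
toy_tokens = [["<s>", "deux", "nuits", "</s>", "<pad>"]]
toy_label_ids = [[0, 1, 0, 0, 0]]
toy_mask = [[1, 1, 1, 1, 0]]
print(decode_batch(toy_tokens, toy_label_ids, toy_mask,
                   toy_id2label, ["<s>", "</s>", "<pad>"]))
```

Working on plain lists keeps the helper independent of the tensor framework; call `.tolist()` on the argmax output and the attention mask before passing them in.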

## Reference

If you use this model in a scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
```
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine Laurent and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}
```