File size: 4,464 Bytes
4868c82
d677166
 
528b70e
d677166
0b2c0aa
d677166
 
 
528b70e
 
 
 
 
 
 
 
 
d677166
 
4868c82
cc370a7
 
 
cfe4b13
4868c82
 
 
 
 
 
 
 
cc370a7
 
 
 
 
4868c82
 
 
 
cc370a7
4868c82
 
 
 
 
 
cc370a7
4868c82
 
 
cc370a7
 
 
 
 
 
4868c82
 
 
 
 
 
cc370a7
 
 
 
 
 
 
 
 
4868c82
 
 
 
 
 
 
cc370a7
4868c82
cc370a7
4868c82
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95

---
language: fr
pipeline_tag: "token-classification"
widget:
 - text: "je voudrais réserver une chambre à paris pour demain et lundi"
 - text: "d'accord pour l'hôtel à quatre vingt dix euros la nuit"
 - text: "deux nuits s'il vous plait"
 - text: "dans un hôtel avec piscine à marseille"
tags:
- bert
- flaubert 
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
- MEDIA
---

# vpelloin/MEDIA_NLU-flaubert_base_uncased
This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
It maps each input words into outputs concepts tags (76 available).

This model is trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased) as its inital checkpoint. It obtained 12.40% CER (*lower is better*) in the MEDIA test set, in [our Interspeech 2023 publication](http://doi.org/10.21437/Interspeech.2022-352), using Kaldi ASR transcriptions.

## Available MEDIA NLU models:
- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on MEDIA test.
- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on MEDIA test.

## Usage with Pipeline
```python
from transformers import pipeline

generator = pipeline(
    model="vpelloin/MEDIA_NLU-flaubert_base_uncased",
    task="token-classification"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
 ]

for sentence in sentences:
    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
```
## Usage with AutoTokenizer/AutoModel
```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification
)
tokenizer = AutoTokenizer.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)
model = AutoModelForTokenClassification.from_pretrained(
    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
)

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
    "deux nuits s'il vous plait",
    "dans un hôtel avec piscine à marseille"
 ]
inputs = tokenizer(sentences, padding=True, return_tensors='pt')
outptus = model(**inputs).logits
print([
    [model.config.id2label[i] for i in b]
    for b in outptus.argmax(dim=-1).tolist()
])
```

## Reference

If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
```
@inproceedings{pelloin22_interspeech,
  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine LAURENT and Laurent Besacier},
  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={3453--3457},
  doi={10.21437/Interspeech.2022-352}
}
```