vpelloin committed
Commit a3043d2
1 Parent(s): 17c4388

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +47 -11
README.md CHANGED
@@ -1,3 +1,4 @@
+
 ---
 language: fr
 pipeline_tag: "token-classification"
@@ -17,32 +18,51 @@ tags:
 - MEDIA
 ---
 
-# vpelloin/MEDIA_NLU_flaubert_finetuned (FT)
-
+# vpelloin/MEDIA_NLU-flaubert_oral_ft
 This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
 It maps each input word to one of 76 output concept tags.
 
-This model is a fine-tuning of [`flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft) (FlauBERT finetuned on ASR data).
+This model is trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft) as its initial checkpoint. It obtained 11.98% CER (*lower is better*) on the MEDIA test set in [our Interspeech 2022 publication](http://doi.org/10.21437/Interspeech.2022-352).
 
+## Available MEDIA NLU models:
+- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on MEDIA test.
+- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on MEDIA test.
+- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on MEDIA test.
+- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on MEDIA test.
+- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on MEDIA test.
+- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on MEDIA test.
 
 ## Usage with Pipeline
 ```python
 from transformers import pipeline
 
-generator = pipeline(model="vpelloin/MEDIA_NLU_flaubert_finetuned", task="token-classification")
+generator = pipeline(
+    model="vpelloin/MEDIA_NLU-flaubert_oral_ft",
+    task="token-classification"
+)
 
-print(generator)
-```
+sentences = [
+    "je voudrais réserver une chambre à paris pour demain et lundi",
+    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
+    "deux nuits s'il vous plait",
+    "dans un hôtel avec piscine à marseille"
+]
 
+for sentence in sentences:
+    print([(tok['word'], tok['entity']) for tok in generator(sentence)])
+```
 ## Usage with AutoTokenizer/AutoModel
 ```python
 from transformers import (
     AutoTokenizer,
     AutoModelForTokenClassification
 )
-
-tokenizer = AutoTokenizer.from_pretrained("vpelloin/MEDIA_NLU_flaubert_finetuned")
-model = AutoModelForTokenClassification.from_pretrained("vpelloin/MEDIA_NLU_flaubert_finetuned")
+tokenizer = AutoTokenizer.from_pretrained(
+    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
+)
+model = AutoModelForTokenClassification.from_pretrained(
+    "vpelloin/MEDIA_NLU-flaubert_oral_ft"
+)
 
 sentences = [
     "je voudrais réserver une chambre à paris pour demain et lundi",
@@ -51,8 +71,24 @@ sentences = [
     "dans un hôtel avec piscine à marseille"
 ]
 inputs = tokenizer(sentences, padding=True, return_tensors='pt')
-
 outputs = model(**inputs).logits
+print([
+    [model.config.id2label[i] for i in b]
+    for b in outputs.argmax(dim=-1).tolist()
+])
+```
+
+## Reference
 
-print([[model.config.id2label[i] for i in b] for b in outputs.argmax(dim=-1).tolist()])
+If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
 ```
+@inproceedings{pelloin22_interspeech,
+  author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine Laurent and Laurent Besacier},
+  title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
+  year=2022,
+  booktitle={Proc. Interspeech 2022},
+  pages={3453--3457},
+  doi={10.21437/Interspeech.2022-352}
+}
+```
+
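One detail worth noting about the AutoModel example added in this commit: with a padded batch, `argmax` also yields a tag for every padding position. Below is a minimal post-processing sketch (an editorial illustration, not part of the committed README; it reuses the checkpoint name from the card) that pairs each sub-word token with its predicted tag and skips padding via the attention mask:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("vpelloin/MEDIA_NLU-flaubert_oral_ft")
model = AutoModelForTokenClassification.from_pretrained("vpelloin/MEDIA_NLU-flaubert_oral_ft")

sentences = [
    "je voudrais réserver une chambre à paris pour demain et lundi",
    "deux nuits s'il vous plait"
]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():  # inference only; no gradients needed
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)

for i, sentence in enumerate(sentences):
    # convert_ids_to_tokens recovers the sub-word strings for this row
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
    pairs = [
        (tok, model.config.id2label[tag])
        for tok, tag, keep in zip(
            tokens, predictions[i].tolist(), inputs["attention_mask"][i].tolist()
        )
        if keep  # attention_mask == 0 marks padding positions
    ]
    print(sentence)
    print(pairs)
```

Special tokens such as `<s>` and `</s>` still appear in the output; if needed, they can be filtered by checking the ids against `tokenizer.all_special_ids`.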