	---
	language: fr
	pipeline_tag: "token-classification"
	widget:
	- text: "je voudrais réserver une chambre à paris pour demain et lundi"
	- text: "d'accord pour l'hôtel à quatre vingt dix euros la nuit"
	- text: "deux nuits s'il vous plait"
	- text: "dans un hôtel avec piscine à marseille"
	tags:
	- bert
	- flaubert
	- natural language understanding
	- NLU
	- spoken language understanding
	- SLU
	- understanding
	- MEDIA
	---

	# vpelloin/MEDIA_NLU-flaubert_base_uncased
	This is a Natural Language Understanding (NLU) model for the French [MEDIA benchmark](https://catalogue.elra.info/en-us/repository/browse/ELRA-S0272/).
	It maps each input word to an output concept tag (76 tags available).

	This model was trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased) as its initial checkpoint. It obtained a Concept Error Rate (CER, lower is better) of 12.40% on the MEDIA test set, using Kaldi ASR transcriptions, as reported in [our Interspeech 2022 publication](http://doi.org/10.21437/Interspeech.2022-352).
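
CER (Concept Error Rate) is typically computed like a word error rate, but over sequences of concepts: substitutions, deletions and insertions divided by the number of reference concepts. The following is a minimal sketch under that assumption, not the scoring script used in the paper. The function name and the example concept labels are hypothetical, and a corpus-level score would sum the errors over all utterances before dividing.

```python
def concept_error_rate(reference, hypothesis):
    """Levenshtein distance between two concept sequences, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one concept")
    # prev[j] is the edit distance between the reference prefix seen so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_concept in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_concept in enumerate(hypothesis, start=1):
            cost = 0 if ref_concept == hyp_concept else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical concept labels, for illustration only.
reference = ["command-tache", "objet", "localisation-ville", "temps-date"]
hypothesis = ["command-tache", "localisation-ville", "temps-date", "temps-date"]
print(f"{concept_error_rate(reference, hypothesis):.2%}")
```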

	## Available MEDIA NLU models:
	- [`vpelloin/MEDIA_NLU-flaubert_base_cased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_cased): MEDIA NLU model trained using [`flaubert/flaubert_base_cased`](https://huggingface.co/flaubert/flaubert_base_cased). Obtains 13.20% CER on MEDIA test.
	- [`vpelloin/MEDIA_NLU-flaubert_base_uncased`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_base_uncased): MEDIA NLU model trained using [`flaubert/flaubert_base_uncased`](https://huggingface.co/flaubert/flaubert_base_uncased). Obtains 12.40% CER on MEDIA test.
	- [`vpelloin/MEDIA_NLU-flaubert_oral_ft`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_ft): MEDIA NLU model trained using [`nherve/flaubert-oral-ft`](https://huggingface.co/nherve/flaubert-oral-ft). Obtains 11.98% CER on MEDIA test.
	- [`vpelloin/MEDIA_NLU-flaubert_oral_mixed`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_mixed): MEDIA NLU model trained using [`nherve/flaubert-oral-mixed`](https://huggingface.co/nherve/flaubert-oral-mixed). Obtains 12.47% CER on MEDIA test.
	- [`vpelloin/MEDIA_NLU-flaubert_oral_asr`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr): MEDIA NLU model trained using [`nherve/flaubert-oral-asr`](https://huggingface.co/nherve/flaubert-oral-asr). Obtains 12.43% CER on MEDIA test.
	- [`vpelloin/MEDIA_NLU-flaubert_oral_asr_nb`](https://huggingface.co/vpelloin/MEDIA_NLU-flaubert_oral_asr_nb): MEDIA NLU model trained using [`nherve/flaubert-oral-asr_nb`](https://huggingface.co/nherve/flaubert-oral-asr_nb). Obtains 12.24% CER on MEDIA test.
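
To compare these checkpoints programmatically, the scores listed above can be put in a small table. This sketch only restates the numbers from the list; the names `MEDIA_CER` and `best_model` are illustrative.

```python
# CER (%) on the MEDIA test set, copied from the list of available models.
MEDIA_CER = {
    "vpelloin/MEDIA_NLU-flaubert_base_cased": 13.20,
    "vpelloin/MEDIA_NLU-flaubert_base_uncased": 12.40,
    "vpelloin/MEDIA_NLU-flaubert_oral_ft": 11.98,
    "vpelloin/MEDIA_NLU-flaubert_oral_mixed": 12.47,
    "vpelloin/MEDIA_NLU-flaubert_oral_asr": 12.43,
    "vpelloin/MEDIA_NLU-flaubert_oral_asr_nb": 12.24,
}

# Print the models from lowest to highest CER.
for name, cer in sorted(MEDIA_CER.items(), key=lambda item: item[1]):
    print(f"{cer:5.2f}  {name}")

# The lowest CER identifies the best-scoring checkpoint on this benchmark.
best_model = min(MEDIA_CER, key=MEDIA_CER.get)
```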

	## Usage with Pipeline
	```python
	from transformers import pipeline

	# Load the fine-tuned model as a token-classification pipeline.
	tagger = pipeline(
	    task="token-classification",
	    model="vpelloin/MEDIA_NLU-flaubert_base_uncased",
	)

	sentences = [
	    "je voudrais réserver une chambre à paris pour demain et lundi",
	    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
	    "deux nuits s'il vous plait",
	    "dans un hôtel avec piscine à marseille",
	]

	# Print each token together with its predicted concept tag.
	for sentence in sentences:
	    print([(tok['word'], tok['entity']) for tok in tagger(sentence)])
	```
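The pipeline returns one dictionary per token, with at least the `word` and `entity` keys used above. A small helper can merge consecutive tokens that share a concept into spans. This is a sketch under assumptions: the helper name is hypothetical, the sample labels are made up for illustration, and the code tolerates labels with or without `B-`/`I-` prefixes because the exact label scheme is not stated here. Words are simply joined with spaces, which ignores any subword markers the tokenizer may emit.

```python
def group_concepts(tokens):
    """Merge consecutive tokens sharing a concept into (concept, text) spans."""
    spans = []
    for tok in tokens:
        label = tok["entity"]
        # Assumption: labels may carry a BIO prefix such as "B-" or "I-".
        concept = label[2:] if label[:2] in ("B-", "I-") else label
        continues_span = (
            spans
            and spans[-1][0] == concept
            and not label.startswith("B-")
        )
        if continues_span:
            spans[-1][1].append(tok["word"])
        else:
            spans.append((concept, [tok["word"]]))
    return [(concept, " ".join(words)) for concept, words in spans]


# Hypothetical pipeline output, for illustration only.
sample = [
    {"word": "réserver", "entity": "B-command-tache"},
    {"word": "à", "entity": "B-localisation-ville"},
    {"word": "paris", "entity": "I-localisation-ville"},
    {"word": "demain", "entity": "B-temps-date"},
]
print(group_concepts(sample))
```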
	## Usage with AutoTokenizer/AutoModel
	```python
	from transformers import (
	    AutoTokenizer,
	    AutoModelForTokenClassification,
	)

	# Load the tokenizer and the fine-tuned token-classification model.
	tokenizer = AutoTokenizer.from_pretrained(
	    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
	)
	model = AutoModelForTokenClassification.from_pretrained(
	    "vpelloin/MEDIA_NLU-flaubert_base_uncased"
	)

	sentences = [
	    "je voudrais réserver une chambre à paris pour demain et lundi",
	    "d'accord pour l'hôtel à quatre vingt dix euros la nuit",
	    "deux nuits s'il vous plait",
	    "dans un hôtel avec piscine à marseille",
	]

	# Tokenize the batch, padding shorter sentences to a common length.
	inputs = tokenizer(sentences, padding=True, return_tensors='pt')
	outputs = model(**inputs).logits

	# Map the highest-scoring class id of each token position to its label.
	print([
	    [model.config.id2label[i] for i in b]
	    for b in outputs.argmax(dim=-1).tolist()
	])
	```
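
Because the batch is padded, the printed label lists also contain predictions for padding positions. One way to drop them is to filter with the attention mask. The helper below is a sketch that works on plain Python lists; its name and the toy label map are hypothetical. It assumes the tokenizer returns an `attention_mask`, in which case it could be called with `outputs.argmax(dim=-1).tolist()`, `inputs["attention_mask"].tolist()` and `model.config.id2label`. Labels predicted for any special tokens the tokenizer adds are kept.

```python
def labels_without_padding(label_ids, attention_mask, id2label):
    """Convert predicted ids to labels, keeping only positions where the mask is 1."""
    return [
        [id2label[i] for i, keep in zip(ids, mask) if keep]
        for ids, mask in zip(label_ids, attention_mask)
    ]


# Toy inputs, for illustration only: two sequences, the second padded by one position.
toy_id2label = {0: "O", 1: "B-x", 2: "I-x"}
toy_ids = [[0, 1, 2, 0], [0, 2, 0, 0]]
toy_mask = [[1, 1, 1, 1], [1, 1, 1, 0]]
print(labels_without_padding(toy_ids, toy_mask, toy_id2label))
```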

	## Reference

	If you use this model for your scientific publication, or if you find the resources in this repository useful, please cite the [following paper](http://doi.org/10.21437/Interspeech.2022-352):
	```
	@inproceedings{pelloin22_interspeech,
	author={Valentin Pelloin and Franck Dary and Nicolas Hervé and Benoit Favre and Nathalie Camelin and Antoine Laurent and Laurent Besacier},
	title={ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks},
	year=2022,
	booktitle={Proc. Interspeech 2022},
	pages={3453--3457},
	doi={10.21437/Interspeech.2022-352}
	}
	```