	---
	license: apache-2.0
	language: fr
	library_name: transformers
	thumbnail: null
	tags:
	- automatic-speech-recognition
	- hf-asr-leaderboard
	- robust-speech-event
	- CTC
	- Wav2vec2
	datasets:
	- common_voice
	- mozilla-foundation/common_voice_11_0
	- facebook/multilingual_librispeech
	- facebook/voxpopuli
	- gigant/african_accented_french
	metrics:
	- wer
	model-index:
	- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 11.0
	      type: mozilla-foundation/common_voice_11_0
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 11.44
	    - name: Test WER (+LM)
	      type: wer
	      value: 9.66
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Multilingual LibriSpeech (MLS)
	      type: facebook/multilingual_librispeech
	      args: french
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 5.93
	    - name: Test WER (+LM)
	      type: wer
	      value: 5.13
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: VoxPopuli
	      type: facebook/voxpopuli
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 9.33
	    - name: Test WER (+LM)
	      type: wer
	      value: 8.51
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: African Accented French
	      type: gigant/african_accented_french
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 16.22
	    - name: Test WER (+LM)
	      type: wer
	      value: 15.39
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Robust Speech Event - Dev Data
	      type: speech-recognition-community-v2/dev_data
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 16.56
	    - name: Test WER (+LM)
	      type: wer
	      value: 12.96
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Fleurs
	      type: google/fleurs
	      args: fr_fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 10.10
	    - name: Test WER (+LM)
	      type: wer
	      value: 8.84
	---

	# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

	<style>
	img {
	  display: inline;
	}
	</style>

	![Model architecture](https://img.shields.io/badge/Model_Architecture-Wav2Vec2--CTC-lightgrey)
	![Model size](https://img.shields.io/badge/Params-315M-lightgrey)
	![Language](https://img.shields.io/badge/Language-French-lightgrey)

	This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset of over 2,200 hours of French speech audio, built from the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.
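Because the model expects 16 kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch that uses only the Python standard library. The helper name `needs_resampling` and the constant `TARGET_SAMPLE_RATE` are illustrative assumptions, not part of this model's code, and the check only works for uncompressed WAV files.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def needs_resampling(wav_path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True if the WAV file's sample rate differs from target_rate."""
    # Hypothetical helper: reads only the WAV header, not the audio samples.
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the helper returns `True`, the audio should be resampled before it is passed to the processor, for example with `torchaudio.transforms.Resample` as in the usage snippets in this card.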

	## Usage

	1. To use on a local audio file with the language model

	```python
	import torch
	import torchaudio

	from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

	device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

	model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
	processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
	model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

	wav_path = "example.wav" # path to your audio file
	waveform, sample_rate = torchaudio.load(wav_path)
	waveform = waveform.squeeze(axis=0)  # drop the channel dimension (assumes mono audio)

	# resample
	if sample_rate != model_sample_rate:
	    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
	    waveform = resampler(waveform)

	# normalize
	input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

	with torch.inference_mode():
	    logits = model(input_dict.input_values.to(device)).logits

	predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
	```

	2. To use on a local audio file without the language model

	```python
	import torch
	import torchaudio

	from transformers import AutoModelForCTC, Wav2Vec2Processor

	device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

	model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
	processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
	model_sample_rate = processor.feature_extractor.sampling_rate

	wav_path = "example.wav" # path to your audio file
	waveform, sample_rate = torchaudio.load(wav_path)
	waveform = waveform.squeeze(axis=0)  # drop the channel dimension (assumes mono audio)

	# resample
	if sample_rate != model_sample_rate:
	    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
	    waveform = resampler(waveform)

	# normalize
	input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

	with torch.inference_mode():
	    logits = model(input_dict.input_values.to(device)).logits

	# decode
	predicted_ids = torch.argmax(logits, dim=-1)
	predicted_sentence = processor.batch_decode(predicted_ids)[0]
	```
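The decoding step without the language model (`torch.argmax` followed by `batch_decode`) is greedy CTC decoding. The sketch below illustrates the idea in plain Python: take the most likely token id per frame, collapse consecutive repeats, then drop the blank token. The function name `ctc_greedy_decode`, the toy vocabulary, and the blank id are hypothetical. The real tokenizer uses its own vocabulary and typically also maps a word-delimiter token to spaces.

```python
from itertools import groupby
from typing import Dict, List


def ctc_greedy_decode(
    predicted_ids: List[int], id_to_token: Dict[int, str], blank_id: int = 0
) -> str:
    """Collapse repeated ids, drop CTC blanks, then map ids to tokens."""
    # Step 1: merge runs of identical ids (one id per run).
    collapsed = [token_id for token_id, _ in groupby(predicted_ids)]
    # Step 2: remove the blank token that separates genuine repeats.
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    return "".join(id_to_token[token_id] for token_id in kept)


# Hypothetical toy vocabulary; the real model's vocabulary and blank id differ.
toy_vocab = {0: "<pad>", 1: "o", 2: "u", 3: "i"}
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]  # per-frame argmax ids
print(ctc_greedy_decode(frame_ids, toy_vocab))  # oui
```

A blank between two identical ids keeps both of them, which is how CTC represents doubled letters.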

	## Evaluation

	1. To evaluate on `mozilla-foundation/common_voice_11_0`

	```bash
	python eval.py \
	  --model_id "bhuang/asr-wav2vec2-french" \
	  --dataset "mozilla-foundation/common_voice_11_0" \
	  --config "fr" \
	  --split "test" \
	  --log_outputs \
	  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
	```

	2. To evaluate on `speech-recognition-community-v2/dev_data`

	```bash
	python eval.py \
	  --model_id "bhuang/asr-wav2vec2-french" \
	  --dataset "speech-recognition-community-v2/dev_data" \
	  --config "fr" \
	  --split "validation" \
	  --chunk_length_s 30.0 \
	  --stride_length_s 5.0 \
	  --log_outputs \
	  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
	```
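The metric reported for this model is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below shows that computation for a single sentence pair in plain Python. It is an illustration only: the function names are hypothetical, and the text normalization and metric implementation that `eval.py` actually uses are not shown in this card and may differ. Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-sentence scores.

```python
from typing import List


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion of a reference word
                    current[j - 1] + 1,      # insertion of a hypothesis word
                    previous[j - 1] + cost,  # substitution or exact match
                )
            )
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one reference/hypothesis pair."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    errors = word_edit_distance(ref_words, hypothesis.split())
    return 100.0 * errors / len(ref_words)


# One substitution ("dort" -> "dors") and one deletion ("le") over 6 reference words.
print(round(wer("le chat dort sur le canapé", "le chat dors sur canapé"), 2))
```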