bp-mls100-xlsr / README.md

Update README.md

c0ab580 over 2 years ago

No virus

9.27 kB

	---
	language: pt
	datasets:
	- common_voice
	- mls
	- cetuc
	- lapsbm
	- voxforge
	- tedx
	- sid
	metrics:
	- wer
	tags:
	- audio
	- speech
	- wav2vec2
	- pt
	- portuguese-speech-corpus
	- automatic-speech-recognition
	- speech
	- PyTorch
	license: apache-2.0
	---

	# mls100-xlsr: Wav2vec 2.0 with MLS Dataset

	This is a the demonstration of a fine-tuned Wav2vec model for Brazilian Portuguese using the [Multilingual Librispeech in Portuguese (MLS)](http://www.openslr.org/94/) dataset.

	In this notebook the model is tested against other available Brazilian Portuguese datasets.

	\| Dataset \| Train \| Valid \| Test \|
	\|--------------------------------\|-------:\|------:\|------:\|
	\| CETUC \| \| -- \| 5.4h \|
	\| Common Voice \| \| -- \| 9.5h \|
	\| LaPS BM \| \| -- \| 0.1h \|
	\| MLS \| 161h \| -- \| 3.7h \|
	\| Multilingual TEDx (Portuguese) \| \| -- \| 1.8h \|
	\| SID \| \| -- \| 1.0h \|
	\| VoxForge \| \| -- \| 0.1h \|
	\| Total \| 161h \| -- \| 21.6h \|


	#### Summary

	\| \| CETUC \| CV \| LaPS \| MLS \| SID \| TEDx \| VF \| AVG \|
	\|----------------------\|---------------\|----------------\|----------------\|----------------\|----------------\|----------------\|----------------\|----------------\|
	\| mls100 (demonstration below) \| 0.192 \| 0.260 \| 0.162 \| 0.163 \| 0.268 \| 0.492 \| 0.268 \| 0.258 \|
	\| mls100 + 4-gram (demonstration below) \| 0.087 \| 0.173 \| 0.077 \| 0.126 \| 0.245 \| 0.415 \| 0.218 \| 0.191 \|

	## Demonstration


	```python
	MODEL_NAME = "lgris/mls100-xlsr"
	```

	### Imports and dependencies


	```python
	%%capture
	!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
	!pip install datasets
	!pip install jiwer
	!pip install transformers
	!pip install soundfile
	!pip install pyctcdecode
	!pip install https://github.com/kpu/kenlm/archive/master.zip
	```


	```python
	import jiwer
	import torchaudio
	from datasets import load_dataset, load_metric
	from transformers import (
	Wav2Vec2ForCTC,
	Wav2Vec2Processor,
	)
	from pyctcdecode import build_ctcdecoder
	import torch
	import re
	import sys
	```

	### Helpers


	```python
	chars_to_ignore_regex = '[\,\?\.\!\;\:\"]' # noqa: W605

	def map_to_array(batch):
	speech, _ = torchaudio.load(batch["path"])
	batch["speech"] = speech.squeeze(0).numpy()
	batch["sampling_rate"] = 16_000
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
	batch["target"] = batch["sentence"]
	return batch
	```


	```python
	def calc_metrics(truths, hypos):
	wers = []
	mers = []
	wils = []
	for t, h in zip(truths, hypos):
	try:
	wers.append(jiwer.wer(t, h))
	mers.append(jiwer.mer(t, h))
	wils.append(jiwer.wil(t, h))
	except: # Empty string?
	pass
	wer = sum(wers)/len(wers)
	mer = sum(mers)/len(mers)
	wil = sum(wils)/len(wils)
	return wer, mer, wil
	```


	```python
	def load_data(dataset):
	data_files = {'test': f'{dataset}/test.csv'}
	dataset = load_dataset('csv', data_files=data_files)["test"]
	return dataset.map(map_to_array)
	```

	### Model


	```python
	class STT:

	def __init__(self,
	model_name,
	device='cuda' if torch.cuda.is_available() else 'cpu',
	lm=None):
	self.model_name = model_name
	self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
	self.processor = Wav2Vec2Processor.from_pretrained(model_name)
	self.vocab_dict = self.processor.tokenizer.get_vocab()
	self.sorted_dict = {
	k.lower(): v for k, v in sorted(self.vocab_dict.items(),
	key=lambda item: item[1])
	}
	self.device = device
	self.lm = lm
	if self.lm:
	self.lm_decoder = build_ctcdecoder(
	list(self.sorted_dict.keys()),
	self.lm
	)

	def batch_predict(self, batch):
	features = self.processor(batch["speech"],
	sampling_rate=batch["sampling_rate"][0],
	padding=True,
	return_tensors="pt")
	input_values = features.input_values.to(self.device)
	attention_mask = features.attention_mask.to(self.device)
	with torch.no_grad():
	logits = self.model(input_values, attention_mask=attention_mask).logits
	if self.lm:
	logits = logits.cpu().numpy()
	batch["predicted"] = []
	for sample_logits in logits:
	batch["predicted"].append(self.lm_decoder.decode(sample_logits))
	else:
	pred_ids = torch.argmax(logits, dim=-1)
	batch["predicted"] = self.processor.batch_decode(pred_ids)
	return batch
	```

	### Download datasets


	```python
	%%capture
	!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
	!mkdir bp_dataset
	!unzip bp_dataset -d bp_dataset/
	```


	```python
	%cd bp_dataset/
	```

	/content/bp_dataset


	### Tests


	```python
	stt = STT(MODEL_NAME)
	```

	#### CETUC


	```python
	ds = load_data('cetuc_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("CETUC WER:", wer)
	```
	CETUC WER: 0.192586382955233


	#### Common Voice


	```python
	ds = load_data('commonvoice_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("CV WER:", wer)
	```
	CV WER: 0.2604333640312866


	#### LaPS


	```python
	ds = load_data('lapsbm_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("Laps WER:", wer)
	```
	Laps WER: 0.16259469696969692


	#### MLS


	```python
	ds = load_data('mls_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("MLS WER:", wer)
	```
	MLS WER: 0.16343014413283674


	#### SID


	```python
	ds = load_data('sid_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("Sid WER:", wer)
	```
	Sid WER: 0.2682880375992515


	#### TEDx


	```python
	ds = load_data('tedx_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("TEDx WER:", wer)
	```
	TEDx WER: 0.49252836581485837


	#### VoxForge


	```python
	ds = load_data('voxforge_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("VoxForge WER:", wer)
	```
	VoxForge WER: 0.2686972402597403


	### Tests with LM


	```python
	!rm -rf ~/.cache
	%cd /content/
	# !gdown --id '1d13Onxy9ubmJZORZ8FO2vnsnl36QMiUc' # trained with wikipedia;
	stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
	# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg # trained with bp
	# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
	%cd bp_dataset/
	```

	/content/bp_dataset


	#### CETUC


	```python
	ds = load_data('cetuc_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("CETUC WER:", wer)
	```
	CETUC WER: 0.0878818926974661


	#### Common Voice


	```python
	ds = load_data('commonvoice_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("CV WER:", wer)
	```
	CV WER: 0.173303354010221


	#### LaPS


	```python
	ds = load_data('lapsbm_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("Laps WER:", wer)
	```
	Laps WER: 0.07691919191919189


	#### MLS


	```python
	ds = load_data('mls_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("MLS WER:", wer)
	```
	MLS WER: 0.12624377042839321


	#### SID


	```python
	ds = load_data('sid_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("Sid WER:", wer)
	```
	Sid WER: 0.24545473435776916


	#### TEDx


	```python
	ds = load_data('tedx_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("TEDx WER:", wer)
	```
	TEDx WER: 0.4156272215612955


	#### VoxForge


	```python
	ds = load_data('voxforge_dataset')
	result = ds.map(stt.batch_predict, batched=True, batch_size=8)
	wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
	print("VoxForge WER:", wer)
	```
	VoxForge WER: 0.21832386363636366