---
language:
- sv
tags:
- multi-task
---

The best-performing multi-task wav2vec 2.0 model for Swedish from [__Getman, Y., Al-Ghezi, R., Grósz, T., Kurimo, M. (2023) Multi-task wav2vec2 Serving as a Pronunciation Training System for Children__](https://www.isca-speech.org/archive/slate_2023/getman23_slate.html). It performs automatic speech recognition (ASR) and pronunciation rating simultaneously.

## Usage

You must first install [aalto-speech/multitask-wav2vec2](https://github.com/aalto-speech/multitask-wav2vec2) to use this model. The model can then be used directly as follows:

```python
import torch
import librosa
import datasets
from transformers import Wav2Vec2ForMultiTask, Wav2Vec2Processor

def map_to_array(batch):
    speech, _ = librosa.load(batch["file"], sr=16000, mono=True)
    batch["speech"] = speech
    return batch

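def map_to_pred_multitask(batch):
    # Convert the raw waveform into padded model inputs
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    # logits[1] comes from the CTC (ASR) head: greedy-decode it to text
    predicted_ids_ctc = torch.argmax(logits[1], dim=-1)
    transcription = processor.batch_decode(predicted_ids_ctc)
    batch["transcription"] = transcription
    # logits[0] comes from the pronunciation rating head: take the most likely class
    predicted_ids = torch.argmax(logits[0], dim=-1)
    batch["predictions"] = predicted_ids.cpu().tolist()
    return batch

# MODEL_PATH must be set to the location of this model before running
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained(MODEL_PATH)
# Move the model to the same device as the inputs and switch to inference mode
model = Wav2Vec2ForMultiTask.from_pretrained(MODEL_PATH).to(device).eval()

# test_dataset is expected to be a datasets.Dataset with a "file" column of audio paths
test_dataset = test_dataset.map(map_to_array)
result = test_dataset.map(map_to_pred_multitask)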
```
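
For readers unfamiliar with CTC output, the sketch below illustrates roughly what happens when the per-frame `argmax` IDs from the ASR head are passed to `processor.batch_decode`: consecutive repeats are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The vocabulary, the blank ID, and the function name `greedy_ctc_decode` are hypothetical and chosen only for illustration; the real vocabulary is loaded with the processor, and this is not the library's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only.
# The actual vocabulary is loaded together with the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j"}
BLANK_ID = 0          # assumption: the padding token serves as the CTC blank
WORD_DELIMITER = "|"  # assumption: "|" marks word boundaries


def greedy_ctc_decode(frame_ids):
    """Turn a sequence of per-frame token IDs into text."""
    # Step 1: collapse runs of identical consecutive IDs into one ID
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop blanks, which only separate genuine repeated characters
    chars = [VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID]
    # Step 3: replace the word delimiter with a space
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame IDs as they might come out of torch.argmax(..., dim=-1)
frame_ids = [0, 2, 2, 0, 3, 3, 3, 4, 0, 1, 2, 3, 3, 0, 4, 4]
print(greedy_ctc_decode(frame_ids))
```

Because blanks are removed only after repeats are collapsed, a blank between two identical IDs keeps both characters, which is how CTC represents doubled letters.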

## Citation

If you use our models or training scripts, please cite our article as:

```bibtex
@inproceedings{getman23_slate,
  author={Yaroslav Getman and Ragheb Al-Ghezi and Tamas Grosz and Mikko Kurimo},
  title={{Multi-task wav2vec2 Serving as a Pronunciation Training System for Children}},
  year=2023,
  booktitle={Proc. 9th Workshop on Speech and Language Technology in Education (SLaTE)},
  pages={36--40},
  doi={10.21437/SLaTE.2023-8}
}
```