---
license: apache-2.0
language:
- eu
library_name: nemo
datasets:
- mozilla-foundation/common_voice_16_1
- gttsehu/basque_parliament_1
- openslr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- Conformer
- NeMo
- pytorch
- Transformer
model-index:
- name: stt_eu_conformer_transducer_large
  results:
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Mozilla Common Voice 16.1
      type: mozilla-foundation/common_voice_16_1
      config: eu
      split: test
      args:
        language: eu
    metrics:
    - name: Test WER
      type: wer
      value: 2.79
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Basque Parliament
      type: gttsehu/basque_parliament_1
      config: eu
      split: test
      args:
        language: eu
    metrics:
    - name: Test WER
      type: wer
      value: 3.83
  - task:
      type: Automatic Speech Recognition
      name: speech-recognition
    dataset:
      name: Basque Parliament
      type: gttsehu/basque_parliament_1
      config: eu
      split: validation
      args:
        language: eu
    metrics:
    - name: Dev WER
      type: wer
      value: 4.17
---

# HiTZ/Aholab's Basque Speech-to-Text model Conformer-Transducer
## Model Description

<style>
img {
 display: inline;
}
</style>

| [![Model architecture](https://img.shields.io/badge/Model_Arch-Conformer--Transducer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-119M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-eu-lightgrey#model-badge)](#datasets)

This model transcribes speech in the lowercase Basque alphabet, including spaces, and was trained on a composite dataset comprising 548 hours of Basque speech. The model was fine-tuned from the pre-trained Spanish [stt_es_conformer_transducer_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_transducer_large) model using the [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) toolkit. It is an autoregressive "large" variant of Conformer with around 119 million parameters.
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

## Usage
To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.

```bash
pip install nemo_toolkit['all']
```

### Transcribing using Python
Clone the repository to download the model:

```bash
git clone https://huggingface.co/asierhv/stt_eu_conformer_transducer_large
```

In the following, `NEMO_MODEL_FILEPATH` is the path to the downloaded `stt_eu_conformer_transducer_large.nemo` file.

```python
import nemo.collections.asr as nemo_asr

# Load the model
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(NEMO_MODEL_FILEPATH)

# Create a list with the paths to the audio files
audio = ["audio_1.wav", "audio_2.wav", ..., "audio_n.wav"]

# Set the batch size to whatever number suits your purpose
batch_size = 8

# Transcribe the audio files
transcriptions = asr_model.transcribe(audio=audio, batch_size=batch_size)

# Visualize the transcriptions
print(transcriptions)
```

## Input
This model accepts 16 kHz mono-channel audio (WAV files) as input.
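
Audio in other sample rates or with multiple channels should be converted before transcription. Below is a minimal sketch using `torchaudio` (not part of this model card; assumed to be available alongside PyTorch) to downmix and resample a file; the input and output file names are hypothetical.

```python
import torchaudio

# Hypothetical input file; replace with your own recording.
waveform, sample_rate = torchaudio.load("input_audio.wav")

# Downmix to mono if the recording has more than one channel.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to the 16 kHz rate expected by the model.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Write a mono 16 kHz WAV that can be passed to asr_model.transcribe().
torchaudio.save("audio_1.wav", waveform, 16000)
```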

## Output
This model provides transcribed speech as a string for a given audio sample.

## Model Architecture
The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition that uses Transducer loss/decoding instead of CTC loss. You can find more details about this model here: [Conformer-Transducer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer).
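
As a quick sanity check of the reported model size (around 119 million parameters), you can count the parameters after loading the model; this assumes `asr_model` has been restored as shown in the usage section above.

```python
# Assumes `asr_model` was loaded as in the "Transcribing using Python" example.
num_params = sum(p.numel() for p in asr_model.parameters())
print(f"Total parameters: {num_params / 1e6:.1f}M")  # roughly 119M expected
```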

## Training
### Data preparation
This model has been trained on a composite dataset comprising 548 hours of Basque speech that contains:
- A processed subset of the `validated` split of the Basque portion of the public dataset [Mozilla Common Voice 16.1](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1): the `validated` split, which originally contains the `train`, `dev`, and `test` splits, was filtered to exclude any sentence that also appears in the `test` split, to avoid data leakage.
- The `train_clean` split of the Basque portion of the public dataset [Basque Parliament](https://huggingface.co/datasets/gttsehu/basque_parliament_1).
- A processed subset of the Basque portion of the public dataset [OpenSLR](https://huggingface.co/datasets/openslr#slr76-crowdsourced-high-quality-basque-speech-data-set): this subset has been cleaned of numerical characters and acronyms.

The composite training dataset has been carefully cleaned of any sentence that also appears in the `test` sets on which the WER metrics are computed.

### Training procedure
This model was trained starting from the pre-trained Spanish model [stt_es_conformer_transducer_large](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_es_conformer_transducer_large) for several hundred epochs on a GPU, using the NeMo toolkit [3].
The tokenizer for this model was built from the text transcripts of the composite training dataset with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py), resulting in a total of 256 Basque tokens.
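
The full fine-tuning configuration is not included in this card. A minimal sketch of the starting point, assuming the standard NeMo fine-tuning workflow, could look like the following; the tokenizer directory path is hypothetical and argument names may differ slightly between NeMo versions.

```python
import nemo.collections.asr as nemo_asr

# Start from the pre-trained Spanish model (fetched by name from NGC).
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="stt_es_conformer_transducer_large"
)

# Swap in the Basque SentencePiece tokenizer built with the script above
# (hypothetical local directory containing the 256-token tokenizer).
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizers/eu_spe_unigram_v256",
    new_tokenizer_type="bpe",
)

# From here, the Basque train/validation manifests and a PyTorch Lightning
# trainer are configured as in the standard NeMo ASR examples, followed by
# trainer.fit(asr_model).
```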

## Performance
The performance of the ASR model is reported in terms of Word Error Rate (WER%) with greedy decoding in the following table.
| Tokenizer             | Vocabulary Size | MCV 16.1 Test | Basque Parliament Test | Basque Parliament Dev | Train Dataset                |
|-----------------------|-----------------|---------------|------------------------|-----------------------|------------------------------|
| SentencePiece Unigram | 256             | 2.79          | 3.83                   | 4.17                  | Composite Dataset (548 h)    |
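
For reference, WER compares hypothesis and reference transcripts at the word level. Below is a minimal sketch of how it can be computed with the third-party `jiwer` library; this library was not used to produce the numbers above, which come from NeMo's own evaluation, and the example sentences are made up.

```python
import jiwer  # third-party WER library, assumed here for illustration only

# Made-up ground-truth transcripts and model outputs.
references = ["kaixo mundua", "eskerrik asko"]
hypotheses = ["kaixo mundua", "eskerrik asko eta agur"]

print(f"WER: {jiwer.wer(references, hypotheses) * 100:.2f}%")
```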

## Limitations
Since this model was trained almost exclusively on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.

# Additional Information
## Author
HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

## Copyright
Copyright (c) 2024 HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

## Licensing Information
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Funding
This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU ([ILENIA](https://proyectoilenia.es/)), and by the project [IkerGaitu](https://www.hitz.eus/iker-gaitu/), funded by the Basque Government.
This model was trained at [Hyperion](https://scc.dipc.org/docs/systems/hyperion/overview/), one of the high-performance computing (HPC) systems hosted by the DIPC Supercomputing Center.

## References
- [1] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)
- [2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## Disclaimer
<details>
<summary>Click to expand</summary>
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may contain bias and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models (HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.) be liable for any results arising from the use made by third parties of these models.