README.md · voidful/wav2vec2-xlsr-multilingual-56 at main

File size: 12,129 Bytes


---
language:
- multilingual
- ar
- as
- br
- ca
- cnh
- cs
- cv
- cy
- de
- dv
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- hi
- hsb
- hu
- ia
- id
- ja
- ka
- ky
- lg
- lt
- ly
- mn
- mt
- nl
- or
- pl
- pt
- ro
- ru
- sah
- sl
- ta
- th
- tr
- tt
- uk
- vi
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- speech
- xlsr-fine-tuning-week
datasets:
- common_voice
language_bcp47:
- fy-NL
- ga-IE
- pa-IN
- rm-sursilv
- rm-vallader
- sy-SE
- zh-CN
- zh-HK
- zh-TW
model-index:
- name: XLSR Wav2Vec2 for 56 language by Voidful
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice
      type: common_voice
    metrics:
    - type: cer
      value: 23.21
      name: Test CER
---

# Model Card for wav2vec2-xlsr-multilingual-56
 
 
# Model Details
 
## Model Description
 
- **Developed by:** voidful
- **Shared by [Optional]:** Hugging Face
- **Model type:** automatic-speech-recognition
- **Language(s) (NLP):** multilingual (*56 language, 1 model Multilingual ASR*)
- **License:** Apache-2.0
- **Related Models:**
  - **Parent Model:** wav2vec
- **Resources for more information:** 
    - [GitHub Repo](https://github.com/voidful/wav2vec2-xlsr-multilingual-56)
 	- [Model Space](https://huggingface.co/spaces/Kamtera/Persian_Automatic_Speech_Recognition_and-more)
 
 
# Uses
 
 
## Direct Use
 
This model can be used for the task of automatic-speech-recognition
 
## Downstream Use [Optional]
 
More information needed
 
## Out-of-Scope Use
 
The model should not be used to intentionally create hostile or alienating environments for people.
 
# Bias, Risks, and Limitations
 
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
 
 
## Recommendations
 
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 
# Training Details
 
## Training Data
 
See the [common_voice dataset card](https://huggingface.co/datasets/common_voice)
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on 56 language using the [Common Voice](https://huggingface.co/datasets/common_voice).  
 
## Training Procedure
 
 
### Preprocessing
 
More information needed
 
### Speeds, Sizes, Times
 
 
When using this model, make sure that your speech input is sampled at 16kHz.
 
 
# Evaluation
 
 
## Testing Data, Factors & Metrics
 
### Testing Data
 
More information needed
 
### Factors
 
 
### Metrics
 
More information needed
## Results 
<details>
 <summary> Click to expand </summary>
 
| Common Voice Languages | Num. of data | Hour   | WER    | CER   |
|------------------------|--------------|--------|--------|-------|
| ar                     | 21744        | 81.5   | 75.29  | 31.23 |
| as                     | 394          | 1.1    | 95.37  | 46.05 |
| br                     | 4777         | 7.4    | 93.79  | 41.16 |
| ca                     | 301308       | 692.8  | 24.80  | 10.39 |
| cnh                    | 1563         | 2.4    | 68.11  | 23.10 |
| cs                     | 9773         | 39.5   | 67.86  | 12.57 |
| cv                     | 1749         | 5.9    | 95.43  | 34.03 |
| cy                     | 11615        | 106.7  | 67.03  | 23.97 |
| de                     | 262113       | 822.8  | 27.03  | 6.50  |
| dv                     | 4757         | 18.6   | 92.16  | 30.15 |
| el                     | 3717         | 11.1   | 94.48  | 58.67 |
| en                     | 580501       | 1763.6 | 34.87  | 14.84 |
| eo                     | 28574        | 162.3  | 37.77  | 6.23  |
| es                     | 176902       | 337.7  | 19.63  | 5.41  |
| et                     | 5473         | 35.9   | 86.87  | 20.79 |
| eu                     | 12677        | 90.2   | 44.80  | 7.32  |
| fa                     | 12806        | 290.6  | 53.81  | 15.09 |
| fi                     | 875          | 2.6    | 93.78  | 27.57 |
| fr                     | 314745       | 664.1  | 33.16  | 13.94 |
| fy-NL                  | 6717         | 27.2   | 72.54  | 26.58 |
| ga-IE                  | 1038         | 3.5    | 92.57  | 51.02 |
| hi                     | 292          | 2.0    | 90.95  | 57.43 |
| hsb                    | 980          | 2.3    | 89.44  | 27.19 |
| hu                     | 4782         | 9.3    | 97.15  | 36.75 |
| ia                     | 5078         | 10.4   | 52.00  | 11.35 |
| id                     | 3965         | 9.9    | 82.50  | 22.82 |
| it                     | 70943        | 178.0  | 39.09  | 8.72  |
| ja                     | 1308         | 8.2    | 99.21  | 62.06 |
| ka                     | 1585         | 4.0    | 90.53  | 18.57 |
| ky                     | 3466         | 12.2   | 76.53  | 19.80 |
| lg                     | 1634         | 17.1   | 98.95  | 43.84 |
| lt                     | 1175         | 3.9    | 92.61  | 26.81 |
| lv                     | 4554         | 6.3    | 90.34  | 30.81 |
| mn                     | 4020         | 11.6   | 82.68  | 30.14 |
| mt                     | 3552         | 7.8    | 84.18  | 22.96 |
| nl                     | 14398        | 71.8   | 57.18  | 19.01 |
| or                     | 517          | 0.9    | 90.93  | 27.34 |
| pa-IN                  | 255          | 0.8    | 87.95  | 42.03 |
| pl                     | 12621        | 112.0  | 56.14  | 12.06 |
| pt                     | 11106        | 61.3   | 53.24  | 16.32 |
| rm-sursilv             | 2589         | 5.9    | 78.17  | 23.31 |
| rm-vallader            | 931          | 2.3    | 73.67  | 21.76 |
| ro                     | 4257         | 8.7    | 83.84  | 21.95 |
| ru                     | 23444        | 119.1  | 61.83  | 15.18 |
| sah                    | 1847         | 4.4    | 94.38  | 38.46 |
| sl                     | 2594         | 6.7    | 84.21  | 20.54 |
| sv-SE                  | 4350         | 20.8   | 83.68  | 30.79 |
| ta                     | 3788         | 18.4   | 84.19  | 21.60 |
| th                     | 4839         | 11.7   | 141.87 | 37.16 |
| tr                     | 3478         | 22.3   | 66.77  | 15.55 |
| tt                     | 13338        | 26.7   | 86.80  | 33.57 |
| uk                     | 7271         | 39.4   | 70.23  | 14.34 |
| vi                     | 421          | 1.7    | 96.06  | 66.25 |
| zh-CN                  | 27284        | 58.7   | 89.67  | 23.96 |
| zh-HK                  | 12678        | 92.1   | 81.77  | 18.82 |
| zh-TW                  | 6402         | 56.6   | 85.08  | 29.07 |
 
 </details>
# Model Examination
 
More information needed
 
# Environmental Impact
 
 
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed
 
# Technical Specifications [optional]
 
## Model Architecture and Objective
 
More information needed
 
## Compute Infrastructure
 
More information needed
 
### Hardware
 
More information needed
 
### Software
More information needed
 
# Citation
 
 
**BibTeX:**
 ```
More information needed
```
 
**APA:**
 ```
More information needed
```
 
# Glossary [optional]
More information needed
 
# More Information [optional]
 
More information needed
 
# Model Card Authors [optional]
 
voidful  in collaboration with Ezi Ozoani and the Hugging Face team
 
# Model Card Contact
 
More information needed
 
# How to Get Started with the Model
 
Use the code below to get started with the model.
 
<details>
<summary> Click to expand </summary>
 
 
## Env setup:
```
!pip install torchaudio
!pip install datasets transformers
!pip install asrp
!wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
```
 
## Usage

```
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
    AutoTokenizer, 
    AutoModelWithLMHead 
)
import torch
import re
import sys
import soundfile as sf
model_name = "voidful/wav2vec2-xlsr-multilingual-56"
device = "cuda"
processor_name = "voidful/wav2vec2-xlsr-multilingual-56"
 
import pickle
with open("lang_ids.pk", 'rb') as output:
    lang_ids = pickle.load(output)
    
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)
 
model.eval()
 
def load_file_to_data(file,sampling_rate=16_000):
    batch = {}
    speech, _ = torchaudio.load(file)
    if sampling_rate != '16_000' or sampling_rate != '16000':
        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
        batch["speech"] = resampler.forward(speech.squeeze(0)).numpy()
        batch["sampling_rate"] = resampler.new_freq
    else:
        batch["speech"] = speech.squeeze(0).numpy()
        batch["sampling_rate"] = '16000'
    return batch
 
 
def predict(data):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        decoded_results = []
        for logit in logits:
            pred_ids = torch.argmax(logit, dim=-1)
            mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
            vocab_size = logit.size()[-1]
            voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
            comb_pred_ids = torch.argmax(voice_prob, dim=-1)
            decoded_results.append(processor.decode(comb_pred_ids))
 
    return decoded_results
 
def predict_lang_specific(data,lang_code):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        decoded_results = []
        for logit in logits:
            pred_ids = torch.argmax(logit, dim=-1)
            mask = ~pred_ids.eq(processor.tokenizer.pad_token_id).unsqueeze(-1).expand(logit.size())
            vocab_size = logit.size()[-1]
            voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
            filtered_input = pred_ids[pred_ids!=processor.tokenizer.pad_token_id].view(1,-1).to(device)
            if len(filtered_input[0]) == 0:
                decoded_results.append("")
            else:
                lang_mask = torch.empty(voice_prob.shape[-1]).fill_(0)
                lang_index = torch.tensor(sorted(lang_ids[lang_code]))
                lang_mask.index_fill_(0, lang_index, 1)
                lang_mask = lang_mask.to(device)
                comb_pred_ids = torch.argmax(lang_mask*voice_prob, dim=-1)
                decoded_results.append(processor.decode(comb_pred_ids))
                
    return decoded_results
 
 
predict(load_file_to_data('audio file path',sampling_rate=16_000)) # beware of the audio file sampling rate
 
predict_lang_specific(load_file_to_data('audio file path',sampling_rate=16_000),'en') # beware of the audio file sampling rate
 
```
 
```python
{{ get_started_code | default("More information needed", true)}}
```
</details>