voidful nazneen commited on
Commit
5eb0674
1 Parent(s): 8aa4511

model documentation (#3)

Browse files

- model documentation (d7847b3dd5b31b0f9850a79b8fe319b528db8ee5)


Co-authored-by: Nazneen Rajani <nazneen@users.noreply.huggingface.co>

Files changed (1) hide show
  1. README.md +243 -92
README.md CHANGED
@@ -1,3 +1,4 @@
 
1
  ---
2
  language:
3
  - multilingual
@@ -47,7 +48,17 @@ language:
47
  - tt
48
  - uk
49
  - vi
50
- language_bcp47:
 
 
 
 
 
 
 
 
 
 
51
  - fy-NL
52
  - ga-IE
53
  - pa-IN
@@ -57,40 +68,232 @@ language_bcp47:
57
  - zh-CN
58
  - zh-HK
59
  - zh-TW
60
- datasets:
61
- - common_voice
62
- tags:
63
- - audio
64
- - automatic-speech-recognition
65
- - hf-asr-leaderboard
66
- - robust-speech-event
67
- - speech
68
- - xlsr-fine-tuning-week
69
- license: apache-2.0
70
  model-index:
71
  - name: XLSR Wav2Vec2 for 56 languages by Voidful
72
  results:
73
  - task:
74
- name: Speech Recognition
75
  type: automatic-speech-recognition
 
76
  dataset:
77
  name: Common Voice
78
  type: common_voice
79
  metrics:
80
- - name: Test CER
81
- type: cer
82
  value: 23.21
 
83
  ---
84
 
85
- # wav2vec2-xlsr-multilingual-56
86
-
87
- *56 language, 1 model Multilingual ASR*
88
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
89
  Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on 56 languages using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
 
 
 
 
 
 
 
 
 
 
 
90
  When using this model, make sure that your speech input is sampled at 16kHz.
91
-
92
- For more detail: [https://github.com/voidful/wav2vec2-xlsr-multilingual-56](https://github.com/voidful/wav2vec2-xlsr-multilingual-56)
93
-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94
  ## Env setup:
95
  ```
96
  !pip install torchaudio
@@ -98,8 +301,9 @@ For more detail: [https://github.com/voidful/wav2vec2-xlsr-multilingual-56](http
98
  !pip install asrp
99
  !wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
100
  ```
101
-
102
  ## Usage
 
103
  ```
104
  import torchaudio
105
  from datasets import load_dataset, load_metric
@@ -116,16 +320,16 @@ import soundfile as sf
116
  model_name = "voidful/wav2vec2-xlsr-multilingual-56"
117
  device = "cuda"
118
  processor_name = "voidful/wav2vec2-xlsr-multilingual-56"
119
-
120
  import pickle
121
  with open("lang_ids.pk", 'rb') as output:
122
  lang_ids = pickle.load(output)
123
 
124
  model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
125
  processor = Wav2Vec2Processor.from_pretrained(processor_name)
126
-
127
  model.eval()
128
-
129
  def load_file_to_data(file,sampling_rate=16_000):
130
  batch = {}
131
  speech, _ = torchaudio.load(file)
@@ -137,8 +341,8 @@ def load_file_to_data(file,sampling_rate=16_000):
137
  batch["speech"] = speech.squeeze(0).numpy()
138
  batch["sampling_rate"] = '16000'
139
  return batch
140
-
141
-
142
  def predict(data):
143
  features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
144
  input_values = features.input_values.to(device)
@@ -153,9 +357,9 @@ def predict(data):
153
  voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
154
  comb_pred_ids = torch.argmax(voice_prob, dim=-1)
155
  decoded_results.append(processor.decode(comb_pred_ids))
156
-
157
  return decoded_results
158
-
159
  def predict_lang_specific(data,lang_code):
160
  features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
161
  input_values = features.input_values.to(device)
@@ -180,69 +384,16 @@ def predict_lang_specific(data,lang_code):
180
  decoded_results.append(processor.decode(comb_pred_ids))
181
 
182
  return decoded_results
183
-
184
-
185
  predict(load_file_to_data('audio file path',sampling_rate=16_000)) # beware of the audio file sampling rate
186
-
187
  predict_lang_specific(load_file_to_data('audio file path',sampling_rate=16_000),'en') # beware of the audio file sampling rate
188
-
189
  ```
190
- ## Result
191
- | Common Voice Languages | Num. of data | Hour | WER | CER |
192
- |------------------------|--------------|--------|--------|-------|
193
- | ar | 21744 | 81.5 | 75.29 | 31.23 |
194
- | as | 394 | 1.1 | 95.37 | 46.05 |
195
- | br | 4777 | 7.4 | 93.79 | 41.16 |
196
- | ca | 301308 | 692.8 | 24.80 | 10.39 |
197
- | cnh | 1563 | 2.4 | 68.11 | 23.10 |
198
- | cs | 9773 | 39.5 | 67.86 | 12.57 |
199
- | cv | 1749 | 5.9 | 95.43 | 34.03 |
200
- | cy | 11615 | 106.7 | 67.03 | 23.97 |
201
- | de | 262113 | 822.8 | 27.03 | 6.50 |
202
- | dv | 4757 | 18.6 | 92.16 | 30.15 |
203
- | el | 3717 | 11.1 | 94.48 | 58.67 |
204
- | en | 580501 | 1763.6 | 34.87 | 14.84 |
205
- | eo | 28574 | 162.3 | 37.77 | 6.23 |
206
- | es | 176902 | 337.7 | 19.63 | 5.41 |
207
- | et | 5473 | 35.9 | 86.87 | 20.79 |
208
- | eu | 12677 | 90.2 | 44.80 | 7.32 |
209
- | fa | 12806 | 290.6 | 53.81 | 15.09 |
210
- | fi | 875 | 2.6 | 93.78 | 27.57 |
211
- | fr | 314745 | 664.1 | 33.16 | 13.94 |
212
- | fy-NL | 6717 | 27.2 | 72.54 | 26.58 |
213
- | ga-IE | 1038 | 3.5 | 92.57 | 51.02 |
214
- | hi | 292 | 2.0 | 90.95 | 57.43 |
215
- | hsb | 980 | 2.3 | 89.44 | 27.19 |
216
- | hu | 4782 | 9.3 | 97.15 | 36.75 |
217
- | ia | 5078 | 10.4 | 52.00 | 11.35 |
218
- | id | 3965 | 9.9 | 82.50 | 22.82 |
219
- | it | 70943 | 178.0 | 39.09 | 8.72 |
220
- | ja | 1308 | 8.2 | 99.21 | 62.06 |
221
- | ka | 1585 | 4.0 | 90.53 | 18.57 |
222
- | ky | 3466 | 12.2 | 76.53 | 19.80 |
223
- | lg | 1634 | 17.1 | 98.95 | 43.84 |
224
- | lt | 1175 | 3.9 | 92.61 | 26.81 |
225
- | lv | 4554 | 6.3 | 90.34 | 30.81 |
226
- | mn | 4020 | 11.6 | 82.68 | 30.14 |
227
- | mt | 3552 | 7.8 | 84.18 | 22.96 |
228
- | nl | 14398 | 71.8 | 57.18 | 19.01 |
229
- | or | 517 | 0.9 | 90.93 | 27.34 |
230
- | pa-IN | 255 | 0.8 | 87.95 | 42.03 |
231
- | pl | 12621 | 112.0 | 56.14 | 12.06 |
232
- | pt | 11106 | 61.3 | 53.24 | 16.32 |
233
- | rm-sursilv | 2589 | 5.9 | 78.17 | 23.31 |
234
- | rm-vallader | 931 | 2.3 | 73.67 | 21.76 |
235
- | ro | 4257 | 8.7 | 83.84 | 21.95 |
236
- | ru | 23444 | 119.1 | 61.83 | 15.18 |
237
- | sah | 1847 | 4.4 | 94.38 | 38.46 |
238
- | sl | 2594 | 6.7 | 84.21 | 20.54 |
239
- | sv-SE | 4350 | 20.8 | 83.68 | 30.79 |
240
- | ta | 3788 | 18.4 | 84.19 | 21.60 |
241
- | th | 4839 | 11.7 | 141.87 | 37.16 |
242
- | tr | 3478 | 22.3 | 66.77 | 15.55 |
243
- | tt | 13338 | 26.7 | 86.80 | 33.57 |
244
- | uk | 7271 | 39.4 | 70.23 | 14.34 |
245
- | vi | 421 | 1.7 | 96.06 | 66.25 |
246
- | zh-CN | 27284 | 58.7 | 89.67 | 23.96 |
247
- | zh-HK | 12678 | 92.1 | 81.77 | 18.82 |
248
- | zh-TW | 6402 | 56.6 | 85.08 | 29.07 |
1
+
2
  ---
3
  language:
4
  - multilingual
48
  - tt
49
  - uk
50
  - vi
51
+ license: apache-2.0
52
+ tags:
53
+ - audio
54
+ - automatic-speech-recognition
55
+ - hf-asr-leaderboard
56
+ - robust-speech-event
57
+ - speech
58
+ - xlsr-fine-tuning-week
59
+ datasets:
60
+ - common_voice
61
+ language_bcp47:
62
  - fy-NL
63
  - ga-IE
64
  - pa-IN
68
  - zh-CN
69
  - zh-HK
70
  - zh-TW
 
 
 
 
 
 
 
 
 
 
71
  model-index:
72
  - name: XLSR Wav2Vec2 for 56 languages by Voidful
73
  results:
74
  - task:
 
75
  type: automatic-speech-recognition
76
+ name: Speech Recognition
77
  dataset:
78
  name: Common Voice
79
  type: common_voice
80
  metrics:
81
+ - type: cer
 
82
  value: 23.21
83
+ name: Test CER
84
  ---
85
 
86
+ # Model Card for wav2vec2-xlsr-multilingual-56
87
+
88
+
89
+ # Model Details
90
+
91
+ ## Model Description
92
+
93
+ - **Developed by:** voidful
94
+ - **Shared by [Optional]:** Hugging Face
95
+ - **Model type:** automatic-speech-recognition
96
+ - **Language(s) (NLP):** multilingual (*56 languages, 1 model Multilingual ASR*)
97
+ - **License:** Apache-2.0
98
+ - **Related Models:**
99
+ - **Parent Model:** wav2vec
100
+ - **Resources for more information:**
101
+ - [GitHub Repo](https://github.com/voidful/wav2vec2-xlsr-multilingual-56)
102
+ - [Model Space](https://huggingface.co/spaces/Kamtera/Persian_Automatic_Speech_Recognition_and-more)
103
+
104
+
105
+ # Uses
106
+
107
+
108
+ ## Direct Use
109
+
110
+ This model can be used for the task of automatic speech recognition.
111
+
112
+ ## Downstream Use [Optional]
113
+
114
+ More information needed
115
+
116
+ ## Out-of-Scope Use
117
+
118
+ The model should not be used to intentionally create hostile or alienating environments for people.
119
+
120
+ # Bias, Risks, and Limitations
121
+
122
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
123
+
124
+
125
+ ## Recommendations
126
+
127
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
128
+
129
+
130
+ # Training Details
131
+
132
+ ## Training Data
133
+
134
+ See the [common_voice dataset card](https://huggingface.co/datasets/common_voice)
135
  Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on 56 languages using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
136
+
137
+ ## Training Procedure
138
+
139
+
140
+ ### Preprocessing
141
+
142
+ More information needed
143
+
144
+ ### Speeds, Sizes, Times
145
+
146
+
147
  When using this model, make sure that your speech input is sampled at 16kHz.
148
+
149
+
150
+ # Evaluation
151
+
152
+
153
+ ## Testing Data, Factors & Metrics
154
+
155
+ ### Testing Data
156
+
157
+ More information needed
158
+
159
+ ### Factors
160
+
161
+
162
+ ### Metrics
163
+
164
+ More information needed
165
+ ## Results
166
+ <details>
167
+ <summary> Click to expand </summary>
168
+
169
+ | Common Voice Languages | Num. of data | Hour | WER | CER |
170
+ |------------------------|--------------|--------|--------|-------|
171
+ | ar | 21744 | 81.5 | 75.29 | 31.23 |
172
+ | as | 394 | 1.1 | 95.37 | 46.05 |
173
+ | br | 4777 | 7.4 | 93.79 | 41.16 |
174
+ | ca | 301308 | 692.8 | 24.80 | 10.39 |
175
+ | cnh | 1563 | 2.4 | 68.11 | 23.10 |
176
+ | cs | 9773 | 39.5 | 67.86 | 12.57 |
177
+ | cv | 1749 | 5.9 | 95.43 | 34.03 |
178
+ | cy | 11615 | 106.7 | 67.03 | 23.97 |
179
+ | de | 262113 | 822.8 | 27.03 | 6.50 |
180
+ | dv | 4757 | 18.6 | 92.16 | 30.15 |
181
+ | el | 3717 | 11.1 | 94.48 | 58.67 |
182
+ | en | 580501 | 1763.6 | 34.87 | 14.84 |
183
+ | eo | 28574 | 162.3 | 37.77 | 6.23 |
184
+ | es | 176902 | 337.7 | 19.63 | 5.41 |
185
+ | et | 5473 | 35.9 | 86.87 | 20.79 |
186
+ | eu | 12677 | 90.2 | 44.80 | 7.32 |
187
+ | fa | 12806 | 290.6 | 53.81 | 15.09 |
188
+ | fi | 875 | 2.6 | 93.78 | 27.57 |
189
+ | fr | 314745 | 664.1 | 33.16 | 13.94 |
190
+ | fy-NL | 6717 | 27.2 | 72.54 | 26.58 |
191
+ | ga-IE | 1038 | 3.5 | 92.57 | 51.02 |
192
+ | hi | 292 | 2.0 | 90.95 | 57.43 |
193
+ | hsb | 980 | 2.3 | 89.44 | 27.19 |
194
+ | hu | 4782 | 9.3 | 97.15 | 36.75 |
195
+ | ia | 5078 | 10.4 | 52.00 | 11.35 |
196
+ | id | 3965 | 9.9 | 82.50 | 22.82 |
197
+ | it | 70943 | 178.0 | 39.09 | 8.72 |
198
+ | ja | 1308 | 8.2 | 99.21 | 62.06 |
199
+ | ka | 1585 | 4.0 | 90.53 | 18.57 |
200
+ | ky | 3466 | 12.2 | 76.53 | 19.80 |
201
+ | lg | 1634 | 17.1 | 98.95 | 43.84 |
202
+ | lt | 1175 | 3.9 | 92.61 | 26.81 |
203
+ | lv | 4554 | 6.3 | 90.34 | 30.81 |
204
+ | mn | 4020 | 11.6 | 82.68 | 30.14 |
205
+ | mt | 3552 | 7.8 | 84.18 | 22.96 |
206
+ | nl | 14398 | 71.8 | 57.18 | 19.01 |
207
+ | or | 517 | 0.9 | 90.93 | 27.34 |
208
+ | pa-IN | 255 | 0.8 | 87.95 | 42.03 |
209
+ | pl | 12621 | 112.0 | 56.14 | 12.06 |
210
+ | pt | 11106 | 61.3 | 53.24 | 16.32 |
211
+ | rm-sursilv | 2589 | 5.9 | 78.17 | 23.31 |
212
+ | rm-vallader | 931 | 2.3 | 73.67 | 21.76 |
213
+ | ro | 4257 | 8.7 | 83.84 | 21.95 |
214
+ | ru | 23444 | 119.1 | 61.83 | 15.18 |
215
+ | sah | 1847 | 4.4 | 94.38 | 38.46 |
216
+ | sl | 2594 | 6.7 | 84.21 | 20.54 |
217
+ | sv-SE | 4350 | 20.8 | 83.68 | 30.79 |
218
+ | ta | 3788 | 18.4 | 84.19 | 21.60 |
219
+ | th | 4839 | 11.7 | 141.87 | 37.16 |
220
+ | tr | 3478 | 22.3 | 66.77 | 15.55 |
221
+ | tt | 13338 | 26.7 | 86.80 | 33.57 |
222
+ | uk | 7271 | 39.4 | 70.23 | 14.34 |
223
+ | vi | 421 | 1.7 | 96.06 | 66.25 |
224
+ | zh-CN | 27284 | 58.7 | 89.67 | 23.96 |
225
+ | zh-HK | 12678 | 92.1 | 81.77 | 18.82 |
226
+ | zh-TW | 6402 | 56.6 | 85.08 | 29.07 |
227
+
228
+ </details>
229
+ # Model Examination
230
+
231
+ More information needed
232
+
233
+ # Environmental Impact
234
+
235
+
236
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
237
+
238
+ - **Hardware Type:** More information needed
239
+ - **Hours used:** More information needed
240
+ - **Cloud Provider:** More information needed
241
+ - **Compute Region:** More information needed
242
+ - **Carbon Emitted:** More information needed
243
+
244
+ # Technical Specifications [optional]
245
+
246
+ ## Model Architecture and Objective
247
+
248
+ More information needed
249
+
250
+ ## Compute Infrastructure
251
+
252
+ More information needed
253
+
254
+ ### Hardware
255
+
256
+ More information needed
257
+
258
+ ### Software
259
+ More information needed
260
+
261
+ # Citation
262
+
263
+
264
+ **BibTeX:**
265
+ ```
266
+ More information needed
267
+ ```
268
+
269
+ **APA:**
270
+ ```
271
+ More information needed
272
+ ```
273
+
274
+ # Glossary [optional]
275
+ More information needed
276
+
277
+ # More Information [optional]
278
+
279
+ More information needed
280
+
281
+ # Model Card Authors [optional]
282
+
283
+ voidful in collaboration with Ezi Ozoani and the Hugging Face team
284
+
285
+ # Model Card Contact
286
+
287
+ More information needed
288
+
289
+ # How to Get Started with the Model
290
+
291
+ Use the code below to get started with the model.
292
+
293
+ <details>
294
+ <summary> Click to expand </summary>
295
+
296
+
297
  ## Env setup:
298
  ```
299
  !pip install torchaudio
301
  !pip install asrp
302
  !wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk
303
  ```
304
+
305
  ## Usage
306
+
307
  ```
308
  import torchaudio
309
  from datasets import load_dataset, load_metric
320
  model_name = "voidful/wav2vec2-xlsr-multilingual-56"
321
  device = "cuda"
322
  processor_name = "voidful/wav2vec2-xlsr-multilingual-56"
323
+
324
  import pickle
325
  with open("lang_ids.pk", 'rb') as output:
326
  lang_ids = pickle.load(output)
327
 
328
  model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
329
  processor = Wav2Vec2Processor.from_pretrained(processor_name)
330
+
331
  model.eval()
332
+
333
  def load_file_to_data(file,sampling_rate=16_000):
334
  batch = {}
335
  speech, _ = torchaudio.load(file)
341
  batch["speech"] = speech.squeeze(0).numpy()
342
  batch["sampling_rate"] = '16000'
343
  return batch
344
+
345
+
346
  def predict(data):
347
  features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
348
  input_values = features.input_values.to(device)
357
  voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
358
  comb_pred_ids = torch.argmax(voice_prob, dim=-1)
359
  decoded_results.append(processor.decode(comb_pred_ids))
360
+
361
  return decoded_results
362
+
363
  def predict_lang_specific(data,lang_code):
364
  features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
365
  input_values = features.input_values.to(device)
384
  decoded_results.append(processor.decode(comb_pred_ids))
385
 
386
  return decoded_results
387
+
388
+
389
  predict(load_file_to_data('audio file path',sampling_rate=16_000)) # beware of the audio file sampling rate
390
+
391
  predict_lang_specific(load_file_to_data('audio file path',sampling_rate=16_000),'en') # beware of the audio file sampling rate
392
+
393
  ```
394
+
395
+ ```python
396
+ More information needed
397
+ ```
398
+ </details>
399
+