#### Whisper Tiny (EN)
- ID: openai/whisper-tiny.en
- Hugging Face: [model](https://huggingface.co/openai/whisper-tiny.en)
- Creator: openai
- Finetuned: No
- Model Size: 39 M Parameters
- Model Paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf)
- Training Data: The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages.
@@
#### S2T Medium ASR
- ID: facebook/s2t-medium-librispeech-asr
- Hugging Face: [model](https://huggingface.co/facebook/s2t-medium-librispeech-asr)
- Creator: facebook
- Finetuned: No
- Model Size: 71.2 M Parameters
- Model Paper: [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171)
- Training Data: [LibriSpeech ASR Corpus](https://www.openslr.org/12)
@@
#### Wav2Vec2 Base 960h
- ID: facebook/wav2vec2-base-960h
- Hugging Face: [model](https://huggingface.co/facebook/wav2vec2-base-960h)
- Creator: facebook
- Finetuned: No
- Model Size: 94.4 M Parameters
- Model Paper: [Wav2vec 2.0: Learning the structure of speech from raw audio](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
- Training Data: [LibriSpeech ASR Corpus](https://www.openslr.org/12) (960 hours)
@@
#### Whisper Large v2
- ID: openai/whisper-large-v2
- Hugging Face: [model](https://huggingface.co/openai/whisper-large-v2)
- Creator: openai
- Finetuned: No
- Model Size: 1.54 B Parameters
- Model Paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- Training Data: The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages.

(Evaluating this model might take a while due to its size.)
@@
#### HF Seamless M4T Medium
- ID: facebook/hf-seamless-m4t-medium
- Hugging Face: [model](https://huggingface.co/facebook/hf-seamless-m4t-medium)
- Creator: facebook
- Finetuned: No
- Model Size: 1.2 B Parameters
- Model Paper: [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf)
- Training Data: ?

(Evaluating this model might take a while due to its size.)
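
To give a rough sense of why the two largest models take longer to evaluate, the sketch below compares the parameter counts listed in the cards above and estimates the memory needed for the weights alone. The 4 bytes per parameter figure is an assumption (float32 weights); half-precision loading would roughly halve it, and activations and audio buffers are not counted. The helper name `weight_gib` is made up for this illustration.

```python
# Parameter counts copied from the "Model Size" lines of the cards above.
PARAMS = {
    "openai/whisper-tiny.en": 39e6,
    "facebook/s2t-medium-librispeech-asr": 71.2e6,
    "facebook/wav2vec2-base-960h": 94.4e6,
    "openai/whisper-large-v2": 1.54e9,
    "facebook/hf-seamless-m4t-medium": 1.2e9,
}

BYTES_PER_PARAM = 4  # assumption: weights stored as float32


def weight_gib(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the weights alone, in GiB (hypothetical helper)."""
    return n_params * bytes_per_param / 2**30


smallest = min(PARAMS.values())

# Print models from smallest to largest with size relative to the smallest one.
for model_id, n in sorted(PARAMS.items(), key=lambda kv: kv[1]):
    print(
        f"{model_id:40s} {n / 1e6:8.1f} M params  "
        f"~{weight_gib(n):5.2f} GiB  {n / smallest:5.1f}x"
    )
```

Parameter count is only a proxy for evaluation time. Actual runtime also depends on hardware, batch size, audio length, and decoding settings.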