#### Whisper Tiny (EN)
- ID: openai/whisper-tiny.en
- Hugging Face: [model](https://huggingface.co/openai/whisper-tiny.en)
- Creator: openai
- Finetuned: No
- Model Size: 39 M Parameters
- Model Paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf)
- Training Data: The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcripts. This non-English data represents 98 different languages.
@@
#### S2T Medium ASR
- ID: facebook/s2t-medium-librispeech-asr
- Hugging Face: [model](https://huggingface.co/facebook/s2t-medium-librispeech-asr)
- Creator: facebook
- Finetuned: No
- Model Size: 71.2 M Parameters
- Model Paper: [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171)
- Training Data: [LibriSpeech ASR Corpus](https://www.openslr.org/12)
@@
#### Wav2Vec Base 960h
- ID: facebook/wav2vec2-base-960h
- Hugging Face: [model](https://huggingface.co/facebook/wav2vec2-base-960h)
- Creator: facebook
- Finetuned: No
- Model Size: 94.4 M Parameters
- Model Paper: [Wav2vec 2.0: Learning the structure of speech from raw audio](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)
- Training Data: [LibriSpeech ASR Corpus](https://www.openslr.org/12) (960 hours)
@@
#### Whisper Large v2
- ID: openai/whisper-large-v2
- Hugging Face: [model](https://huggingface.co/openai/whisper-large-v2)
- Creator: openai
- Finetuned: No
- Model Size: 1.54 B Parameters
- Model Paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- Training Data: The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcripts. This non-English data represents 98 different languages. (Evaluating this model may take a while due to its size.)
@@
#### HF Seamless M4T Medium
- ID: facebook/hf-seamless-m4t-medium
- Hugging Face: [model](https://huggingface.co/facebook/hf-seamless-m4t-medium)
- Creator: facebook
- Finetuned: No
- Model Size: 1.2 B Parameters
- Model Paper: [SeamlessM4T: Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf)
- Training Data: ? (Evaluating this model may take a while due to its size.)
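The percentages in the Whisper training-data description are rounded. As a quick check, this sketch divides each quoted hour count by the 680,000-hour total. `SPLIT_HOURS` and `shares` are illustrative names, not part of any library or of this Space.

```python
# Hour counts quoted in the Whisper training-data description
# (shared by openai/whisper-tiny.en and openai/whisper-large-v2 above).
TOTAL_HOURS = 680_000
SPLIT_HOURS = {
    "English audio, English transcripts": 438_000,
    "non-English audio, English transcripts": 126_000,
    "non-English audio, non-English transcripts": 117_000,
}


def shares(split, total):
    """Return each subset's share of `total` as a percentage rounded to one decimal."""
    return {name: round(100 * hours / total, 1) for name, hours in split.items()}


if __name__ == "__main__":
    for name, pct in shares(SPLIT_HOURS, TOTAL_HOURS).items():
        print(f"{name}: {pct}%")
    # The subsets add up to slightly more than the stated total,
    # so the quoted hour counts are themselves rounded.
    print("sum of subsets:", sum(SPLIT_HOURS.values()))
```

This gives roughly 64.4%, 18.5%, and 17.2%, and the three subsets sum to 681,000 hours, so both the percentages and the hour counts in the description should be read as approximations.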
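The notes that Whisper Large v2 and Seamless M4T Medium may take a while to evaluate follow from their parameter counts. As a rough sketch, assuming the weights are stored as float32 (4 bytes per parameter) or float16 (2 bytes per parameter), the weights alone occupy the sizes computed below. `PARAMS` and `weight_gb` are hypothetical helpers for illustration only.

```python
# Parameter counts as listed above. Bytes per parameter is an assumption:
# 4 for float32 weights, 2 for float16/bfloat16 weights.
PARAMS = {
    "openai/whisper-tiny.en": 39e6,
    "facebook/s2t-medium-librispeech-asr": 71.2e6,
    "facebook/wav2vec2-base-960h": 94.4e6,
    "openai/whisper-large-v2": 1.54e9,
    "facebook/hf-seamless-m4t-medium": 1.2e9,
}


def weight_gb(n_params, bytes_per_param=4):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for model_id, n in PARAMS.items():
        print(f"{model_id}: ~{weight_gb(n):.2f} GB fp32, ~{weight_gb(n, 2):.2f} GB fp16")
```

Under these assumptions Whisper Large v2 needs about 6.16 GB for its fp32 weights versus about 0.16 GB for Whisper Tiny (EN). The estimate ignores activations and framework overhead, and actual evaluation time also depends on hardware and decoding settings.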