What libraries can I use for Automatic Speech Recognition?

The espnet, nemo, speechbrain, transformers, and transformers.js libraries are compatible with Automatic Speech Recognition.

What models can I use for Automatic Speech Recognition?

The openai/whisper-large-v3, facebook/wav2vec2-base-960h, and facebook/s2t-small-mustc-en-fr-st models can be used for Automatic Speech Recognition.

What datasets can I use for Automatic Speech Recognition?

The mozilla-foundation/common_voice_13_0, librispeech_asr, and openslr datasets can be used for Automatic Speech Recognition.

What metrics can I use for Automatic Speech Recognition?

The wer and cer metrics can be used for Automatic Speech Recognition.

Automatic Speech Recognition

Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing audio into text. It has many applications, such as voice user interfaces.

Inputs

Automatic Speech Recognition Model

Output

Transcript

Going along slushy country roads and speaking to damp audiences in...

About Automatic Speech Recognition

Use Cases

Virtual Speech Assistants

Many edge devices have an embedded virtual assistant to interact better with end users. These assistants rely on ASR models to recognize voice commands and perform various tasks. For instance, you can ask your phone to dial a phone number, answer a general question, or schedule a meeting.

Caption Generation

A caption generation model transcribes the audio of live-streamed or recorded videos to produce automatic captions. This can help with content accessibility. For example, an audience watching a video in a language that is not their native one can rely on captions to follow the content. Captions can also help with information retention in online classes, since learners can read along and take notes faster.
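As a rough illustration of the captioning step, the sketch below turns timestamped transcript chunks into SubRip (SRT) caption text. It assumes chunks shaped like the ones the transformers ASR pipeline returns when called with return_timestamps=True (a list of dicts with a "timestamp" pair in seconds and a "text" string), and it assumes both timestamps are present. The function names and the timestamps in the example are illustrative, not taken from a real transcription.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Build SRT text from chunks of the form
    {"timestamp": (start, end), "text": "..."} (assumed shape)."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        entries.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # Each entry already ends with a newline, so joining with "\n"
    # leaves one blank line between captions.
    return "\n".join(entries)


# Illustrative chunks; the timestamps are made up for the example.
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Going along slushy country roads"},
    {"timestamp": (2.5, 5.0), "text": " and speaking to damp audiences"},
]
print(chunks_to_srt(example_chunks))
```

The resulting text can be saved with an .srt extension and loaded by most video players.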

Task Variants

Multilingual ASR

Multilingual ASR models can convert audio inputs in multiple languages into transcripts. Some multilingual ASR models include language identification blocks to improve performance.

Multilingual ASR has become popular because maintaining a single model for all languages can simplify the production pipeline. Take a look at Whisper to get an idea of how 100+ languages can be processed by a single model.
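A minimal sketch of how one multilingual checkpoint can serve several languages, assuming the transformers library and a Whisper model: the language and task are passed through the pipeline's generate_kwargs argument, and leaving the language unset lets Whisper detect it. The helper names build_generate_kwargs and transcribe are hypothetical, chosen for this example.

```python
def build_generate_kwargs(language=None, task="transcribe"):
    """Build the generation options forwarded to a Whisper model.

    language: language name such as "french", or None to let the model
              detect the spoken language on its own.
    task:     "transcribe" keeps the spoken language, "translate"
              produces English text.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    kwargs = {"task": task}
    if language is not None:
        kwargs["language"] = language
    return kwargs


def transcribe(path, language=None, task="transcribe",
               model="openai/whisper-large-v3"):
    """Transcribe one audio file with a multilingual Whisper checkpoint."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model=model)
    result = pipe(path, generate_kwargs=build_generate_kwargs(language, task))
    return result["text"]


# Example call (downloads the model on first use):
# transcribe("sample.flac", language="french")
```

Loading the pipeline once and reusing it for every file is preferable in practice; it is created inside the function here only to keep the sketch short.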

Inference

The Hub contains more than 9,000 ASR models that you can use right away, either by trying out the widgets directly in the browser or by calling the models as a service using Inference Endpoints. Here is a simple code snippet that calls a model as a service:

import os
import requests

# Assumption: the access token is stored in the HF_TOKEN environment variable
API_TOKEN = os.environ["HF_TOKEN"]
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(filename):
    # Send the raw audio bytes and return the decoded JSON response
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return response.json()

data = query("sample1.flac")

You can also use libraries such as transformers, speechbrain, NeMo and espnet to run inference from your own code.

from transformers import pipeline

# The pipeline accepts a path to an audio file, so there is no need to read it first
pipe = pipeline("automatic-speech-recognition", "openai/whisper-large-v2")
pipe("sample.flac")
# {'text': "GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES IN DRAUGHTY SCHOOL ROOMS DAY AFTER DAY FOR A FORTNIGHT HE'LL HAVE TO PUT IN AN APPEARANCE AT SOME PLACE OF WORSHIP ON SUNDAY MORNING AND HE CAN COME TO US IMMEDIATELY AFTERWARDS"}

You can use huggingface.js to transcribe audio with JavaScript using models on the Hugging Face Hub.

import { HfInference } from "@huggingface/inference";

// HF_TOKEN must hold your Hugging Face access token
const inference = new HfInference(HF_TOKEN);
const result = await inference.automaticSpeechRecognition({
    data: await (await fetch("sample.flac")).blob(),
    model: "openai/whisper-large-v2",
});
console.log(result.text);

Solving ASR for your own data

We have some great news! You can fine-tune (transfer learning) a foundational speech model on a specific language without tonnes of data. Pretrained models such as Whisper, Wav2Vec2-MMS and HuBERT are available as starting points. OpenAI's Whisper is a large multilingual model trained on 100+ languages and 4 million hours of speech.

The following detailed blog post shows how to fine-tune a pre-trained Whisper checkpoint on labeled data for ASR. With the right data and strategy, you can fine-tune a high-performing model even on a free Google Colab instance. We suggest reading the blog post for more info!

Hugging Face Whisper Event

In December 2022, over 450 participants collaborated, fine-tuned and shared 600+ ASR Whisper models in 100+ different languages. You can compare these models on the event's speech recognition leaderboard.

These events help democratize ASR for all languages, including low-resource languages. In addition to the trained models, the event helps to build practical collaborative knowledge.

Useful Resources

Hugging Face Audio Course
Fine-tuning MetaAI's MMS Adapter Models for Multi-Lingual ASR
Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers
Boosting Wav2Vec2 with n-grams in 🤗 Transformers
ML for Audio Study Group - Intro to Audio and ASR Deep Dive
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
An ASR toolkit made by NVIDIA: NeMo with code and pretrained models useful for new ASR models. Watch the introductory video for an overview.
An introduction to SpeechT5, a multi-purpose speech recognition and synthesis model
Fine-tune Whisper For Multilingual ASR with 🤗Transformers
Automatic speech recognition task guide
Speech Synthesis, Recognition, and More With SpeechT5
Fine-Tune W2V2-Bert for low-resource ASR with 🤗 Transformers
Speculative Decoding for 2x Faster Whisper Inference

Compatible libraries

ESPnet, NeMo, speechbrain, Transformers, Transformers.js

Models for Automatic Speech Recognition

Browse Models (16,125)

openai/whisper-large-v3

Automatic Speech Recognition • Updated Feb 8 • 1.99M • 2.36k

Note A powerful ASR model by OpenAI.

facebook/wav2vec2-base-960h

Automatic Speech Recognition • Updated Nov 14, 2022 • 4.35M • 234

Note A good generic ASR model by MetaAI.

facebook/s2t-small-mustc-en-fr-st

Automatic Speech Recognition • Updated Jan 24, 2023 • 2.36k • 1

Note An end-to-end model that performs ASR and Speech Translation by MetaAI.

Datasets for Automatic Speech Recognition

Browse Datasets (468)

mozilla-foundation/common_voice_13_0

Viewer • Updated Jun 26, 2023 • 5.84k • 122

Note 18,000 hours of multilingual audio-text dataset in 108 languages.

librispeech_asr

Viewer • Updated Jan 18 • 15.9k • 95

Note An English dataset with 1,000 hours of data.

openslr

Viewer • Updated Jan 18 • 1.48k • 21

Note High quality, multi-speaker audio data and their transcriptions in various languages.

Spaces using Automatic Speech Recognition

hf-audio/whisper-large-v3

Note A powerful general-purpose speech recognition application.

sanchit-gandhi/whisper-jax

Note Fastest speech recognition application.

facebook/seamless_m4t

Note A high quality speech and text translation model by Meta.

Metrics for Automatic Speech Recognition

wer: Word error rate (WER) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. The WER is then derived from the Levenshtein distance, working at the word level instead of the phoneme level.

The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of the recognition errors, so further work is required to identify the main source(s) of error and to focus any research effort. A related line of study is the power law, a theory that states a correlation between perplexity and word error rate.

Word error rate can be computed as:

WER = (S + D + I) / N = (S + D + I) / (S + D + C)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N = S + D + C). This value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score.
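The WER formula can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation behind the wer metric: it assumes words are separated by whitespace and applies no text normalization (casing, punctuation), and the function names are chosen for this example. The minimum edit count from the Levenshtein distance equals S + D + I.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    # prev[j] is the distance between the reference prefix processed
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N, with N words in the reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

In practice, transcripts are usually normalized before scoring so that casing and punctuation differences are not counted as errors.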

cer: Character error rate (CER) is a common metric of the performance of an automatic speech recognition system. CER is similar to Word Error Rate (WER), but operates on characters instead of words. Please refer to the docs of WER for further information.

Character error rate can be computed as:

CER = (S + D + I) / N = (S + D + I) / (S + D + C)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, and N is the number of characters in the reference (N = S + D + C). CER's output is not always a number between 0 and 1, in particular when there is a high number of insertions. This value is often interpreted as the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system, with a CER of 0 being a perfect score.
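The same calculation at the character level can be sketched as follows. This is a minimal illustration, not the implementation behind the cer metric: it assumes every character, including spaces, counts as one symbol, and the function names are chosen for this example. The second call shows how a large number of insertions pushes the CER above 1.

```python
def char_edit_distance(ref, hyp):
    """Levenshtein distance between two strings, character by character."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: (S + D + I) / N, with N reference characters."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return char_edit_distance(reference, hypothesis) / len(reference)


# One substitution over five reference characters.
print(cer("hello", "hallo"))
# Six insertions over two reference characters, so the CER exceeds 1.
print(cer("hi", "hi there"))
```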