Is there a way to stream audio into this model and have it generate a stream of text?

#15 by roshanjames

Hi, I wrote the following code that records audio, saves it to a file, and then uses whisper-large-v2 to transcribe it.

from transformers import pipeline
import sounddevice as sd
import scipy.io.wavfile as wav

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2")

# Set the recording parameters
fs = 44100  # Sample rate
duration = 60  # Recording duration in seconds
channels = 2  # Number of channels

# Record the audio
audio = sd.rec(int(fs * duration), samplerate=fs, channels=channels)
print("Recording...", flush=True)
sd.wait()  # Wait until recording is finished
print("Recording stopped.", flush=True)
wav.write('audio.wav', fs, audio)

out = pipe('audio.wav')["text"]
print(out)

I have 2 questions:

  1. Instead of saving the audio to a file as I do above, is it possible to pass the audio (numpy array) to the model?

  2. In addition, can I actually stream audio to the model (instead of stopping the recording at some point and getting the transcription up to that point)?

Hey @roshanjames !

  1. It is indeed possible to pass a numpy array as input to the pipeline (see https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline.__call__.inputs). Just be sure to also pass the sampling rate, so that the recorded audio is resampled to the 16kHz Whisper expects, and make sure the numpy array is 1-dimensional - sounddevice returns a 2-D (frames, channels) array, so you'll need to down-mix / flatten it to mono first:
from transformers import pipeline
import sounddevice as sd

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2")

# Set the recording parameters
fs = 44100  # Sample rate
duration = 60  # Recording duration in seconds
channels = 2  # Number of channels

# Record the audio
audio = sd.rec(int(fs * duration), samplerate=fs, channels=channels)
print("Recording...", flush=True)
sd.wait()  # Wait until recording is finished
print("Recording stopped.", flush=True)

# sd.rec returns a 2-D (frames, channels) float32 array; down-mix to mono so the
# pipeline receives the 1-D array it expects
audio = audio.mean(axis=1)

# Pass the sampling rate so the pipeline resamples 44.1kHz -> 16kHz for Whisper
out = pipe({"raw": audio, "sampling_rate": fs})["text"]
print(out)
  2. Unfortunately, the Whisper model does not support streaming inference - we have to pass an entire audio sequence to the model for it to transcribe. What you could do instead is use a voice activity detection (VAD) model to detect when someone starts and stops speaking. When speech starts, you start recording; when it stops, you stop the recording, pass whatever audio you have to Whisper, and have it transcribe. Meanwhile, you keep listening and start recording again as soon as the VAD model picks up new speech. Repeat this for 'semi-live' inference; a rough sketch of this loop is below.

You can try this model to start: https://huggingface.co/pyannote/voice-activity-detection
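For reference, here is a minimal sketch of that idea, combining the pyannote VAD pipeline linked above with the Whisper pipeline. It assumes you have a 16kHz mono 16-bit recording saved as audio.wav and a Hugging Face access token for the (gated) pyannote model; treat it as a starting point rather than a drop-in solution.

import numpy as np
import scipy.io.wavfile as wav
from pyannote.audio import Pipeline
from transformers import pipeline

# Voice activity detection (the pyannote model is gated, so an access token is needed)
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                               use_auth_token="YOUR_HF_TOKEN")

# Whisper for transcription
asr = pipeline(task="automatic-speech-recognition",
               model="openai/whisper-large-v2")

# Load a 16kHz mono recording and convert int16 samples to float32 in [-1, 1]
fs, audio = wav.read("audio.wav")
audio = audio.astype(np.float32) / 32768.0

# Find the speech regions in the recording...
speech_regions = vad("audio.wav").get_timeline().support()

# ...and transcribe each region separately with Whisper
for region in speech_regions:
    start, end = int(region.start * fs), int(region.end * fs)
    text = asr({"raw": audio[start:end], "sampling_rate": fs})["text"]
    print(f"[{region.start:.1f}s - {region.end:.1f}s] {text}")

A fully 'semi-live' version would run this logic in a loop over freshly recorded chunks rather than a saved file, much like the snippet shared further down in this thread.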

Thank you very much, Sanchit!

One follow-up question (and I'm happy to file this under a different title if you think I should): I can't figure out how to set max_new_tokens to satisfy this warning. How do I get rid of it?

.../hugging-face/venv/lib/python3.9/site-packages/transformers/generation/utils.py:1387: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 448 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(

I ended up using a VAD, but a different one from the one you suggested above. My code is below in case it is helpful to anyone else thinking about this. It starts 2 threads: one listens to audio and puts the recorded chunks into a queue; the other calls the VAD, finds a gap in the voice stream, and sends the speech before that gap to Whisper. On my machine, the VAD + Whisper calls take about 3 seconds on CPU, so I chose to record in 10-second chunks.

from transformers import pipeline
from queue import Queue
import time
import threading
import sounddevice as sd
import scipy.io.wavfile as wav
import numpy as np
import torch
import datetime

# Create a queue with a maxsize of 2
q = Queue(maxsize=2)

# Set the recording parameters
fs = 16000     # Sample rate
duration = 10  # Recording duration in seconds
channels = 1   # Number of channels

model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad",
                              model="silero_vad")
get_speech_timestamps = utils[0]

whisper = pipeline(task="automatic-speech-recognition",
                   model="openai/whisper-medium.en")

def transcribe(remaining, current):
    audio = np.concatenate((remaining, current))
    segments = get_speech_timestamps(audio, model, sampling_rate=fs, threshold=0.9)
    # print(segments, flush=True)
    # We want to split at the 'end' timestamp of the second-last speech snippet,
    # so the (possibly unfinished) last snippet is carried over to the next chunk.
    end = -1
    if len(segments) > 1:
        end = segments[-2]["end"]
    elif len(segments) == 1:
        end = segments[-1]["end"]
    current = audio[:end]
    remaining = audio[end:]
    text = whisper({"raw": current, "sampling_rate": fs})["text"]
    return text, remaining

def time_to_string(t):
    dt = datetime.datetime.fromtimestamp(t)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

def recorder():
    t1 = time.time()
    t2 = t1
    while True:
        # Record the audio
        audio = sd.rec(int(fs * duration), samplerate=fs, channels=channels)
        t0 = time.time()
        print(f"{time_to_string(t0)}: Recording... (last_duration:{t2-t1:.2f}s, last_gap:{(t0-t2)*1000.:.2f}ms)", flush=True)
        t1 = t0
        sd.wait()  # Wait until recording is finished
        t2 = time.time()
        q.put(audio.flatten())

def transcriber():
    remaining = np.array([])
    while True:
        current = q.get()
        t1 = time.time()
        text, remaining = transcribe(remaining, current)
        t2 = time.time()
        print(f"|| {t2-t1:.3f}secs: {text}", flush=True)

t1 = threading.Thread(target=recorder)
t2 = threading.Thread(target=transcriber)

t1.start()
t2.start()

t1.join()
t2.join()

Hey @roshanjames ,

Very cool code snippet using VAD + Whisper!

In terms of setting the max length, you can pass max_new_tokens when you call the pipeline:

text = whisper({"raw": current, "sampling_rate": fs}, max_new_tokens=448)["text"]

Although it's not a problem if you don't specify it - the warning simply says that the model will generate up to a pre-defined maximum length (448 tokens) that you haven't set explicitly yourself.
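Depending on your transformers version, you can also forward the limit through generate_kwargs, the general mechanism for passing generation arguments to the pipeline call (a sketch with the same effect as above):

text = whisper({"raw": current, "sampling_rate": fs},
               generate_kwargs={"max_new_tokens": 448})["text"]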
