Running in Colab

#17
by rodrigoheck - opened

I am trying to put together a minimal replication of the streaming behavior. I eventually arrived at this code:

import time
import soundfile as sf
from src.simuleval_transcoder import SimulevalTranscoder
from src.simuleval_agent_directory import SimulevalAgentDirectory, AgentWithInfo
from src.transcoder_helpers import get_transcoder_output_events


agent_directory = SimulevalAgentDirectory()
pre_agent = agent_directory.build_agent_if_available("SeamlessStreaming", "vad_s2st_sc_main.yaml")
agent = AgentWithInfo(pre_agent, 'SeamlessStream', 's2t', 'en')
transcoder = SimulevalTranscoder(
    agent,
    sample_rate=16000,
    debug=True,
    buffer_limit=5
)
transcoder.start()
# Function to simulate streaming audio by yielding small chunks from a file
def stream_audio(file_path, chunk_size=320):
    with sf.SoundFile(file_path, 'r') as f:
        while True:
            data = f.read(chunk_size)
            if len(data) == 0:
                break
            yield data

# Stream the audio and transcribe
audio_file_path = "converted_file.wav"
for chunk in stream_audio(audio_file_path):
    # Process incoming bytes
    transcoder.process_incoming_bytes(chunk, dynamic_config={"targetLanguage": "en"})

    # Check for transcription output
    events = get_transcoder_output_events(transcoder)
    for event in events:
        print(event)
        if event['event'] == 'translation_text':
            print(event['payload'])  # Print the transcribed text
    time.sleep(0.02)

# Finalize
transcoder.close = True

But something appears to be wrong: although I can see that the GPU is being utilized,

get_transcoder_output_events(transcoder)

never returns any events. Am I doing something wrong?

I can also see from the debug folder that the saved audio files are gibberish, so apparently the audio chunks are not being processed correctly. Could it be related to this issue?

https://github.com/facebookresearch/seamless_communication/issues/237
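
In case it helps with debugging, here is a variant of my chunking function that yields raw 16-bit PCM bytes instead of float frames. I am not sure whether process_incoming_bytes expects bytes or float samples, so this is only a guess, and stream_audio_bytes is just my own name for it:

import soundfile as sf

# Variant of stream_audio that yields raw 16-bit PCM bytes instead of
# float frames -- a guess, in case process_incoming_bytes expects bytes.
def stream_audio_bytes(file_path, chunk_size=320):
    with sf.SoundFile(file_path, 'r') as f:
        while True:
            data = f.read(chunk_size, dtype='int16')  # read frames as 16-bit integers
            if len(data) == 0:
                break
            yield data.tobytes()  # serialize the numpy array to raw PCM bytes

The loop over the chunks stays the same; only the generator changes.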

AI at Meta org

Hi @rodrigoheck - we wanted to share this Colab notebook we prepared: https://fb.me/mt-neurips, which shows an example of simplified standalone streaming inference (scroll to the bottom). It simulates what happens in the HF demo, i.e. receiving an unsegmented audio stream and passing it to the streaming system in 320 ms chunks.
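
For illustration, the chunking itself just slices the waveform into fixed-size windows. Here is a minimal sketch assuming a 16 kHz mono waveform loaded as a numpy array; the names are illustrative, not taken from the notebook:

import numpy as np

SAMPLE_RATE = 16000
CHUNK_MS = 320
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 5120 samples per 320 ms chunk at 16 kHz

def iter_chunks(waveform: np.ndarray):
    # Yield successive 320 ms chunks of an unsegmented audio stream;
    # the last chunk may be shorter than 320 ms.
    for start in range(0, len(waveform), CHUNK_SAMPLES):
        yield waveform[start:start + CHUNK_SAMPLES]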

It also provides a visualization and sample audio of the output translation, which we overlay on top of the input audio, adding the appropriate delay/silence to reflect how it would sound if you were actually streaming it in real time.

(image: visualization from the notebook)
Here's the visualization from the notebook, for example. The top row is the input audio, while the later rows are the output audio (in chunks) and the output text, offset by the corresponding delays.
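
Conceptually, the overlay amounts to placing each output chunk on a silent timeline at the moment it was emitted. A minimal sketch with illustrative names (the notebook has the actual implementation):

import numpy as np

SAMPLE_RATE = 16000

def place_with_delay(output_chunk, delay_seconds, total_seconds):
    # Build a silent timeline the length of the full input, then drop the
    # output chunk in at its emission time (its delay), so playing the
    # timeline alongside the input reflects the real-time streaming delay.
    timeline = np.zeros(int(total_seconds * SAMPLE_RATE), dtype=np.float32)
    start = int(delay_seconds * SAMPLE_RATE)
    end = min(start + len(output_chunk), len(timeline))
    timeline[start:end] = output_chunk[: end - start]
    return timeline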
