
Speech to Speech Demo

In this example, we showcase how to build a speech-to-speech voice assistant pipeline using WebRTC with real-time transcripts. It uses a Pipecat pipeline with FastAPI on the backend and React on the frontend. The pipeline combines a WebRTC-based SmallWebRTCTransport, Riva ASR and TTS models, and the NVIDIA LLM service.
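For orientation, here is a minimal sketch of how such a pipeline is typically assembled with Pipecat. The class names, import paths, and constructor arguments are assumptions based on recent Pipecat releases, not taken from this repo's pipeline.py, which remains the reference:

# Minimal sketch of the ASR -> LLM -> TTS wiring (import paths and
# constructor arguments assumed; see pipeline.py for the actual code).
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.nim.llm import NimLLMService      # NVIDIA LLM (OpenAI-compatible)
from pipecat.services.riva.stt import RivaSTTService    # Riva ASR
from pipecat.services.riva.tts import RivaTTSService    # Riva TTS
from pipecat.transports.base_transport import TransportParams
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport


async def run_bot(webrtc_connection):
    transport = SmallWebRTCTransport(
        webrtc_connection=webrtc_connection,
        params=TransportParams(audio_in_enabled=True, audio_out_enabled=True),
    )
    stt = RivaSTTService(api_key=os.getenv("NVIDIA_API_KEY"))
    llm = NimLLMService(
        api_key=os.getenv("NVIDIA_API_KEY"),
        model=os.getenv("NVIDIA_LLM_MODEL", "meta/llama-3.1-8b-instruct"),
    )
    tts = RivaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))

    # The LLM keeps conversation state through a context aggregator pair.
    context = OpenAILLMContext(
        [{"role": "system", "content": "You are a helpful voice assistant."}]
    )
    agg = llm.create_context_aggregator(context)

    # Browser audio in -> ASR -> LLM -> TTS -> browser audio out.
    pipeline = Pipeline([
        transport.input(),
        stt,
        agg.user(),
        llm,
        tts,
        transport.output(),
        agg.assistant(),
    ])
    await PipelineRunner().run(PipelineTask(pipeline))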

Prerequisites

Setup API keys

  1. Copy and configure the environment file:

    cp env.example .env  # and add your credentials
    
  2. Ensure you have the required API keys (a quick verification sketch follows the list):

    • NVIDIA_API_KEY - Required for accessing NIM ASR, TTS and LLM models
    • (Optional) ZEROSHOT_TTS_NVIDIA_API_KEY - Required for zero-shot TTS
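If you want to sanity-check that the credentials are visible before starting the stack, here is a quick sketch using python-dotenv (an assumption; the demo loads .env in its own way):

# Verify the .env credentials load (sketch; requires `pip install python-dotenv`).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

assert os.getenv("NVIDIA_API_KEY"), "NVIDIA_API_KEY is missing from .env"
if not os.getenv("ZEROSHOT_TTS_NVIDIA_API_KEY"):
    print("Note: ZEROSHOT_TTS_NVIDIA_API_KEY not set; zero-shot TTS is unavailable.")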

Option 1: Deploy Using Docker

From the example/voice_agent_webrtc directory, run:

docker compose up --build -d

Then visit http://<machine-ip>:9000/ in your browser to start interacting with the application.

Note: To enable microphone access in Chrome, go to chrome://flags/, enable "Insecure origins treated as secure", add http://<machine-ip>:9000 to the list, and restart Chrome.

Option 2: Deploy Using Python Environment

Requirements

  • Python (>=3.12)
  • uv

All Python dependencies are listed in pyproject.toml and can be installed with uv.

Run

uv run pipeline.py

Then run the UI by following the steps in ui/README.md.

Using Coturn Server

If you want to share the demo widely or deploy it on a cloud platform, you will need to set up a coturn (TURN) server. Follow the instructions below to modify the example code to use coturn:

Deploy Coturn Server

Set HOST_IP_EXTERNAL to your machine's externally reachable IP and run the command below:

docker run -d --network=host instrumentisto/coturn -n --verbose --log-file=stdout --external-ip=<HOST_IP_EXTERNAL>  --listening-ip=<HOST_IP_EXTERNAL>  --lt-cred-mech --fingerprint --user=admin:admin --no-multicast-peers --realm=tokkio.realm.org --min-port=51000 --max-port=52000

Update pipeline.py

Add the following configuration to your pipeline.py file to use the coturn server:

# Import path may vary across Pipecat versions.
from pipecat.transports.network.webrtc_connection import IceServer

ice_servers = [
    IceServer(
        urls="<TURN_SERVER_URL>",
        username="<TURN_USERNAME>",
        credential="<TURN_PASSWORD>",
    )
]
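With the docker run command above, the TURN URL would typically be turn:<HOST_IP_EXTERNAL>:3478 (coturn's default listening port), with the admin:admin credentials configured there. The list is then handed to the WebRTC connection; a sketch, assuming SmallWebRTCConnection accepts an ice_servers argument in your Pipecat version:

# Sketch: pass the TURN configuration to the WebRTC connection
# (constructor argument assumed; check your Pipecat version).
from pipecat.transports.network.webrtc_connection import SmallWebRTCConnection

connection = SmallWebRTCConnection(ice_servers=ice_servers)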

Update ui/src/config.ts

Add the following configuration to your ui/src/config.ts file to use the coturn server:

export const RTC_CONFIG: ConstructorParameters<typeof RTCPeerConnection>[0] = {
  iceServers: [
    {
      urls: "<turn_server_url>",
      username: "<turn_server_username>",
      credential: "<turn_server_credential>",
    },
  ],
};

Bot customizations

Enabling Speculative Speech Processing

Speculative speech processing reduces bot response latency by acting on Riva ASR's early interim transcripts instead of waiting for final transcripts. This feature works only with Riva ASR.
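Conceptually, this means reacting to Pipecat's InterimTranscriptionFrame instead of waiting for the final TranscriptionFrame. The processor below is an illustrative sketch of that idea, not the demo's actual implementation:

# Illustrative sketch: promote growing interim ASR transcripts so downstream
# stages can start early (not the demo's actual implementation).
from pipecat.frames.frames import InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class SpeculativeTranscripts(FrameProcessor):
    def __init__(self):
        super().__init__()
        self._last_interim = ""

    async def process_frame(self, frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, InterimTranscriptionFrame):
            # Only act when the interim text has grown since the last one.
            if len(frame.text) > len(self._last_interim):
                self._last_interim = frame.text
                # Forward the interim text as if it were final so the LLM
                # can begin generating a response early.
                await self.push_frame(
                    TranscriptionFrame(frame.text, frame.user_id, frame.timestamp),
                    direction,
                )
            return  # swallow the raw interim frame
        if isinstance(frame, TranscriptionFrame):
            self._last_interim = ""  # reset on a final transcript
        await self.push_frame(frame, direction)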

Switching ASR, LLM, and TTS Models

You can easily customize ASR (Automatic Speech Recognition), LLM (Large Language Model), and TTS (Text-to-Speech) services by configuring environment variables. This allows you to switch between NIM cloud-hosted models and locally deployed models.

The following environment variables control the endpoints and models:

  • RIVA_ASR_URL: Address of the Riva ASR (speech-to-text) service (e.g., localhost:50051 for local, grpc.nvcf.nvidia.com:443 for the cloud endpoint).
  • RIVA_TTS_URL: Address of the Riva TTS (text-to-speech) service (e.g., localhost:50051 for local, grpc.nvcf.nvidia.com:443 for the cloud endpoint).
  • NVIDIA_LLM_URL: URL of the NVIDIA LLM service (e.g., http://<machine-ip>:8000/v1 for local, https://integrate.api.nvidia.com/v1 for the cloud endpoint).

You can set model, language, and voice using the RIVA_ASR_MODEL, RIVA_TTS_MODEL, NVIDIA_LLM_MODEL, RIVA_ASR_LANGUAGE, RIVA_TTS_LANGUAGE, and RIVA_TTS_VOICE_ID environment variables.

Update these variables in your Docker Compose configuration to match your deployment and desired models. For more details on available models and configuration options, refer to the NIM NVIDIA Magpie, NIM NVIDIA Parakeet, and NIM META Llama documentation.
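For instance, the service constructors can read these variables so the same code runs against either a local or a cloud deployment. A sketch (the server and base_url parameter names are assumptions; check the Riva and NIM service signatures in your Pipecat version):

# Sketch: select local vs. cloud endpoints purely via environment variables.
import os

from pipecat.services.nim.llm import NimLLMService
from pipecat.services.riva.stt import RivaSTTService
from pipecat.services.riva.tts import RivaTTSService

stt = RivaSTTService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    server=os.getenv("RIVA_ASR_URL", "grpc.nvcf.nvidia.com:443"),
    # RIVA_ASR_MODEL / RIVA_ASR_LANGUAGE would be passed through similarly.
)
llm = NimLLMService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    base_url=os.getenv("NVIDIA_LLM_URL", "https://integrate.api.nvidia.com/v1"),
    model=os.getenv("NVIDIA_LLM_MODEL", "meta/llama-3.1-8b-instruct"),
)
tts = RivaTTSService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    server=os.getenv("RIVA_TTS_URL", "grpc.nvcf.nvidia.com:443"),
    voice_id=os.getenv("RIVA_TTS_VOICE_ID"),
    # RIVA_TTS_MODEL / RIVA_TTS_LANGUAGE would be passed through similarly.
)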

Example: Switching to the Llama 3.3-70B Model

To use a larger LLM such as Llama 3.3-70B in your deployment, you need to update both the Docker Compose configuration and the environment variables for your Python application. Follow these steps (a combined compose sketch follows the list):

  • In your docker-compose.yml file, find the nvidia-llm service section.
  • Change the NIM image to the 70B model: nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
  • Update device_ids to allocate at least two GPUs (for example, ['2', '3']).
  • Update the environment variable under the python-app service to NVIDIA_LLM_MODEL=meta/llama-3.3-70b-instruct
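Put together, the compose changes look roughly like this (a sketch; the service names follow this repo's docker-compose.yml, and GPU ids depend on your machine):

services:
  nvidia-llm:
    image: nvcr.io/nim/meta/llama-3.3-70b-instruct:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']   # at least two GPUs for the 70B model
              capabilities: [gpu]

  python-app:
    environment:
      - NVIDIA_LLM_MODEL=meta/llama-3.3-70b-instruct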

Setting up Zero-shot Magpie Latest Model

Follow these steps to configure and use the latest Zero-shot Magpie TTS model:

  1. Update Docker Compose Configuration

Modify the riva-tts-magpie service in your docker-compose file with the following configuration:

riva-tts-magpie:
  image: <magpie-tts-zeroshot-image:version>  # Replace this with the actual image tag
  environment:
    - NGC_API_KEY=${ZEROSHOT_TTS_NVIDIA_API_KEY}
    - NIM_HTTP_API_PORT=9000
    - NIM_GRPC_API_PORT=50051
  ports:
    - "49000:50051"
  shm_size: 16GB
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
  • Ensure ZEROSHOT_TTS_NVIDIA_API_KEY is properly set in your .env file:
    ZEROSHOT_TTS_NVIDIA_API_KEY=
    
  2. Configure TTS Voice Settings

Update the following environment variables under the python-app service:

RIVA_TTS_VOICE_ID=Magpie-ZeroShot.Female-1
RIVA_TTS_MODEL=magpie_tts_ensemble-Magpie-ZeroShot
  3. Zero-shot Audio Prompt Configuration

To use a custom voice with zero-shot learning:

  • Add your audio prompt file to the workspace
  • Mount the audio file into your container by adding a volume in your docker-compose.yml under the python-app service:
    services:
      python-app:
        # ... existing code ...
        volumes:
          - ./audio_prompts:/app/audio_prompts
    
  • Set the ZERO_SHOT_AUDIO_PROMPT environment variable to the path relative to your application root:
    environment:
      - ZERO_SHOT_AUDIO_PROMPT=audio_prompts/voice_sample.wav  # Path relative to app root
    

Note: The zero-shot audio prompt is only required when using the Magpie Zero-shot model. For standard Magpie multilingual models, this configuration should be omitted.