{"guide": {"name": "conversational-chatbot", "category": "streaming", "pretty_category": "Streaming", "guide_index": 4, "absolute_index": 37, "pretty_name": "Conversational Chatbot", "content": "# Building Conversational Chatbots with Gradio\n\n## Introduction\n\nThe next generation of AI user interfaces is moving towards audio-native experiences. Users will be able to speak to chatbots and receive spoken responses in return. Several models have been built under this paradigm, including GPT-4o and [mini omni](https://github.com/gpt-omni/mini-omni).\n\nIn this guide, we'll walk you through building your own conversational chat application using mini omni as an example. You can try a demo of the finished app in the [omni-mini Space](https://huggingface.co/spaces/gradio/omni-mini).\n\n## Application Overview\n\nOur application will enable the following user experience:\n\n1. Users click a button to start recording their message\n2. The app detects when the user has finished speaking and stops recording\n3. The user's audio is passed to the mini omni model, which streams back a response\n4. After mini omni finishes speaking, the user's microphone is reactivated\n5. All previous spoken audio, from both the user and mini omni, is displayed in a chatbot component\n\nLet's dive into the implementation details.\n\n## Processing User Audio\n\nWe'll stream the user's audio from their microphone to the server and determine whether the user has stopped speaking on each new chunk of audio.\n\nHere's our `process_audio` function:\n\n```python\nimport gradio as gr\nimport numpy as np\nfrom utils import determine_pause\n\ndef process_audio(audio: tuple, state: AppState):\n    if state.stream is None:\n        # First chunk: initialize the stream from it\n        state.stream = audio[1]\n        state.sampling_rate = audio[0]\n    else:\n        state.stream = np.concatenate((state.stream, audio[1]))\n\n    pause_detected = determine_pause(state.stream, state.sampling_rate, state)\n    state.pause_detected = pause_detected\n\n    if state.pause_detected and state.started_talking:\n        # Stop recording so the stop_recording event can generate a response\n        return gr.Audio(recording=False), state\n    return None, state\n```\n\nThis function takes two inputs:\n1. The current audio chunk (a tuple of `(sampling_rate, numpy array of audio)`)\n2. The current application state\n\nWe'll use the following `AppState` dataclass to manage our application state:\n\n```python\nfrom dataclasses import dataclass, field\n\nimport numpy as np\n\n@dataclass\nclass AppState:\n    stream: np.ndarray | None = None\n    sampling_rate: int = 0\n    pause_detected: bool = False\n    started_talking: bool = False\n    stopped: bool = False\n    conversation: list = field(default_factory=list)\n```\n\n(Note that a dataclass field can't use a mutable default like `[]`; `field(default_factory=list)` gives each `AppState` instance its own list. We also track `started_talking`, which `process_audio` and `response` both check.)\n\nThe function concatenates new audio chunks to the existing stream and checks whether the user has stopped speaking. If a pause is detected, it returns an update to stop recording. Otherwise, it returns `None` to indicate no changes.\n\nThe implementation of the `determine_pause` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/eb027808c7bfe5179b46d9352e3fa1813a45f7c3/app.py#L98). A minimal sketch of the idea is shown below.
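\n\nAs an illustration only, here is a simplified, energy-based version of `determine_pause`. The one-second window, the RMS threshold, and the int16 normalization are simplifications of our own, not values taken from the omni-mini code:\n\n```python\nimport numpy as np\n\nSILENCE_RMS = 0.01  # rough silence threshold (an assumption; tune per microphone)\n\ndef determine_pause(stream: np.ndarray, sampling_rate: int, state: AppState) -> bool:\n    \"\"\"Return True once the user has spoken and then gone quiet for ~1 second.\"\"\"\n    if len(stream) < sampling_rate:\n        return False  # not enough audio yet to judge\n    # Normalized RMS energy of the most recent ~1 second (int16 full scale)\n    window = stream[-sampling_rate:].astype(np.float32) / 32768.0\n    rms = float(np.sqrt(np.mean(window ** 2)))\n    if rms >= SILENCE_RMS:\n        state.started_talking = True  # speech detected in the latest window\n        return False\n    # Quiet now -- it only counts as a pause if speech came first\n    return state.started_talking\n```\n\nThe real implementation linked above is more careful, but this captures the flow: accumulate audio, flag when speech starts, and report a pause once the signal goes quiet again.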
\n\n## Generating the Response\n\nAfter processing the user's audio, we need to generate and stream the chatbot's response. Here's our `response` function:\n\n```python\nimport io\nimport tempfile\nfrom pydub import AudioSegment\n\ndef response(state: AppState):\n    if not state.pause_detected and not state.started_talking:\n        return None, AppState()\n\n    audio_buffer = io.BytesIO()\n\n    # Wrap the accumulated numpy samples so we can export them as WAV\n    segment = AudioSegment(\n        state.stream.tobytes(),\n        frame_rate=state.sampling_rate,\n        sample_width=state.stream.dtype.itemsize,\n        channels=(1 if len(state.stream.shape) == 1 else state.stream.shape[1]),\n    )\n    segment.export(audio_buffer, format=\"wav\")\n\n    with tempfile.NamedTemporaryFile(suffix=\".wav\", delete=False) as f:\n        f.write(audio_buffer.getvalue())\n\n    state.conversation.append({\"role\": \"user\",\n                               \"content\": {\"path\": f.name,\n                                           \"mime_type\": \"audio/wav\"}})\n\n    output_buffer = b\"\"\n\n    for mp3_bytes in speaking(audio_buffer.getvalue()):\n        output_buffer += mp3_bytes\n        yield mp3_bytes, state\n\n    with tempfile.NamedTemporaryFile(suffix=\".mp3\", delete=False) as f:\n        f.write(output_buffer)\n\n    state.conversation.append({\"role\": \"assistant\",\n                               \"content\": {\"path\": f.name,\n                                           \"mime_type\": \"audio/mp3\"}})\n    yield None, AppState(conversation=state.conversation)\n```\n\nThis function:\n1. Converts the user's audio to a WAV file\n2. Adds the user's message to the conversation history\n3. Generates and streams the chatbot's response using the `speaking` function\n4. Saves the chatbot's response as an MP3 file\n5. Adds the chatbot's response to the conversation history\n\nNote: The implementation of the `speaking` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/main/app.py#L116). A sketch of the interface it must satisfy follows.
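\n\nThe only contract `response` relies on is that `speaking` accepts the user's WAV audio as bytes and yields the reply as a stream of MP3 byte chunks. As a stand-in for local UI testing (this is not the omni-mini implementation), the echo version below simply replays the user's audio in one-second MP3 slices:\n\n```python\nimport io\n\nfrom pydub import AudioSegment  # MP3 export requires ffmpeg\n\ndef speaking(wav_bytes: bytes):\n    \"\"\"Stand-in generator: consume WAV bytes, yield MP3 byte chunks.\"\"\"\n    audio = AudioSegment.from_file(io.BytesIO(wav_bytes), format=\"wav\")\n    for start_ms in range(0, len(audio), 1000):  # one-second slices\n        buf = io.BytesIO()\n        audio[start_ms:start_ms + 1000].export(buf, format=\"mp3\")\n        yield buf.getvalue()\n```\n\nSwap in the real model call once the rest of the app works; anything that yields MP3 chunks will stream through `output_audio` unchanged.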
\n\n## Building the Gradio App\n\nNow let's put it all together using Gradio's Blocks API:\n\n```python\nimport gradio as gr\n\ndef start_recording_user(state: AppState):\n    if not state.stopped:\n        return gr.Audio(recording=True)\n\nwith gr.Blocks() as demo:\n    with gr.Row():\n        with gr.Column():\n            input_audio = gr.Audio(\n                label=\"Input Audio\", sources=\"microphone\", type=\"numpy\"\n            )\n        with gr.Column():\n            chatbot = gr.Chatbot(label=\"Conversation\", type=\"messages\")\n            output_audio = gr.Audio(label=\"Output Audio\", streaming=True, autoplay=True)\n    state = gr.State(value=AppState())\n\n    # Stream microphone audio to the server in 0.5-second chunks\n    stream = input_audio.stream(\n        process_audio,\n        [input_audio, state],\n        [input_audio, state],\n        stream_every=0.5,\n        time_limit=30,\n    )\n    # When recording stops (a pause was detected), generate the reply\n    respond = input_audio.stop_recording(\n        response,\n        [state],\n        [output_audio, state]\n    )\n    respond.then(lambda s: s.conversation, [state], [chatbot])\n\n    # When the reply finishes playing, reopen the user's microphone\n    restart = output_audio.stop(\n        start_recording_user,\n        [state],\n        [input_audio]\n    )\n    cancel = gr.Button(\"Stop Conversation\", variant=\"stop\")\n    cancel.click(lambda: (AppState(stopped=True), gr.Audio(recording=False)), None,\n                 [state, input_audio], cancels=[respond, restart])\n\nif __name__ == \"__main__\":\n    demo.launch()\n```\n\nThis setup creates a user interface with:\n- An input audio component for recording user messages\n- A chatbot component to display the conversation history\n- An output audio component for the chatbot's responses\n- A button to stop and reset the conversation\n\nThe app streams user audio in 0.5-second chunks, processes it, generates responses, and updates the conversation history accordingly.\n\n## Conclusion\n\nThis guide demonstrates how to build a conversational chatbot application using Gradio and the mini omni model. You can adapt this framework to create various audio-based chatbot demos. To see the full application in action, visit the Hugging Face Spaces demo: https://huggingface.co/spaces/gradio/omni-mini\n\nFeel free to experiment with different models, audio processing techniques, or user interface designs to create your own unique conversational AI experiences!", "tags": ["AUDIO", "STREAMING", "CHATBOTS"], "spaces": [], "url": "/guides/conversational-chatbot/", "contributor": null}}