{"guide": {"name": "conversational-chatbot", "category": "streaming", "pretty_category": "Streaming", "guide_index": 4, "absolute_index": 37, "pretty_name": "Conversational Chatbot", "content": "# Building Conversational Chatbots with Gradio\n\n## Introduction\n\nThe next generation of AI user interfaces is moving towards audio-native experiences. Users will be able to speak to chatbots and receive spoken responses in return. Several models have been built under this paradigm, including GPT-4o and [mini omni](https://github.com/gpt-omni/mini-omni).\n\nIn this guide, we'll walk you through building your own conversational chat application using mini omni as an example. You can try a demo of the finished app in the [omni-mini Space](https://huggingface.co/spaces/gradio/omni-mini).\n\n## Application Overview\n\nOur application will enable the following user experience:\n\n1. Users click a button to start recording their message\n2. The app detects when the user has finished speaking and stops recording\n3. The user's audio is passed to the mini omni model, which streams back a response\n4. After mini omni finishes speaking, the user's microphone is reactivated\n5. All previous spoken audio, from both the user and mini omni, is displayed in a chatbot component\n\nLet's dive into the implementation details.\n\n## Processing User Audio\n\nWe'll stream the user's audio from their microphone to the server and determine whether the user has stopped speaking on each new chunk of audio.\n\nHere's our `process_audio` function:\n\n```python\nimport gradio as gr\nimport numpy as np\nfrom utils import determine_pause\n\ndef process_audio(audio: tuple, state: AppState):\n    if state.stream is None:\n        # First chunk: initialize the stream from it\n        state.stream = audio[1]\n        state.sampling_rate = audio[0]\n    else:\n        state.stream = np.concatenate((state.stream, audio[1]))\n\n    pause_detected = determine_pause(state.stream, state.sampling_rate, state)\n    state.pause_detected = pause_detected\n\n    if state.pause_detected and state.started_talking:\n        # Stop recording so the stop_recording event can generate a response\n        return gr.Audio(recording=False), state\n    return None, state\n```\n\nThis function takes two inputs:\n1. The current audio chunk (a tuple of `(sampling_rate, numpy array of audio)`)\n2. The current application state\n\nWe'll use the following `AppState` dataclass to manage our application state:\n\n```python\nfrom dataclasses import dataclass, field\n\nimport numpy as np\n\n@dataclass\nclass AppState:\n    stream: np.ndarray | None = None\n    sampling_rate: int = 0\n    pause_detected: bool = False\n    started_talking: bool = False\n    stopped: bool = False\n    conversation: list = field(default_factory=list)\n```\n\n(Note that a dataclass field can't use a mutable default like `[]`; `field(default_factory=list)` gives each `AppState` instance its own list. We also track `started_talking`, which `process_audio` and `response` both check.)\n\nThe function concatenates new audio chunks to the existing stream and checks whether the user has stopped speaking. If a pause is detected, it returns an update to stop recording. Otherwise, it returns `None` to indicate no changes.\n\nThe implementation of the `determine_pause` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/eb027808c7bfe5179b46d9352e3fa1813a45f7c3/app.py#L98). A minimal sketch of the idea is shown below.
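\n\nAs an illustration only, here is a simplified, energy-based version of `determine_pause`. The one-second window, the RMS threshold, and the int16 normalization are simplifications of our own, not values taken from the omni-mini code:\n\n```python\nimport numpy as np\n\nSILENCE_RMS = 0.01  # rough silence threshold (an assumption; tune per microphone)\n\ndef determine_pause(stream: np.ndarray, sampling_rate: int, state: AppState) -> bool:\n    \"\"\"Return True once the user has spoken and then gone quiet for ~1 second.\"\"\"\n    if len(stream) < sampling_rate:\n        return False  # not enough audio yet to judge\n    # Normalized RMS energy of the most recent ~1 second (int16 full scale)\n    window = stream[-sampling_rate:].astype(np.float32) / 32768.0\n    rms = float(np.sqrt(np.mean(window ** 2)))\n    if rms >= SILENCE_RMS:\n        state.started_talking = True  # speech detected in the latest window\n        return False\n    # Quiet now -- it only counts as a pause if speech came first\n    return state.started_talking\n```\n\nThe real implementation linked above is more careful, but this captures the flow: accumulate audio, flag when speech starts, and report a pause once the signal goes quiet again.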
\n\n## Generating the Response\n\nAfter processing the user's audio, we need to generate and stream the chatbot's response. Here's our `response` function:\n\n```python\nimport io\nimport tempfile\nfrom pydub import AudioSegment\n\ndef response(state: AppState):\n    if not state.pause_detected and not state.started_talking:\n        return None, AppState()\n\n    audio_buffer = io.BytesIO()\n\n    # Wrap the accumulated numpy samples so we can export them as WAV\n    segment = AudioSegment(\n        state.stream.tobytes(),\n        frame_rate=state.sampling_rate,\n        sample_width=state.stream.dtype.itemsize,\n        channels=(1 if len(state.stream.shape) == 1 else state.stream.shape[1]),\n    )\n    segment.export(audio_buffer, format=\"wav\")\n\n    with tempfile.NamedTemporaryFile(suffix=\".wav\", delete=False) as f:\n        f.write(audio_buffer.getvalue())\n\n    state.conversation.append({\"role\": \"user\",\n                               \"content\": {\"path\": f.name,\n                                           \"mime_type\": \"audio/wav\"}})\n\n    output_buffer = b\"\"\n\n    for mp3_bytes in speaking(audio_buffer.getvalue()):\n        output_buffer += mp3_bytes\n        yield mp3_bytes, state\n\n    with tempfile.NamedTemporaryFile(suffix=\".mp3\", delete=False) as f:\n        f.write(output_buffer)\n\n    state.conversation.append({\"role\": \"assistant\",\n                               \"content\": {\"path\": f.name,\n                                           \"mime_type\": \"audio/mp3\"}})\n    yield None, AppState(conversation=state.conversation)\n```\n\nThis function:\n1. Converts the user's audio to a WAV file\n2. Adds the user's message to the conversation history\n3. Generates and streams the chatbot's response using the `speaking` function\n4. Saves the chatbot's response as an MP3 file\n5. Adds the chatbot's response to the conversation history\n\nNote: The implementation of the `speaking` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/main/app.py#L116). A sketch of the interface it must satisfy follows.
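\n\nThe only contract `response` relies on is that `speaking` accepts the user's WAV audio as bytes and yields the reply as a stream of MP3 byte chunks. As a stand-in for local UI testing (this is not the omni-mini implementation), the echo version below simply replays the user's audio in one-second MP3 slices:\n\n```python\nimport io\n\nfrom pydub import AudioSegment  # MP3 export requires ffmpeg\n\ndef speaking(wav_bytes: bytes):\n    \"\"\"Stand-in generator: consume WAV bytes, yield MP3 byte chunks.\"\"\"\n    audio = AudioSegment.from_file(io.BytesIO(wav_bytes), format=\"wav\")\n    for start_ms in range(0, len(audio), 1000):  # one-second slices\n        buf = io.BytesIO()\n        audio[start_ms:start_ms + 1000].export(buf, format=\"mp3\")\n        yield buf.getvalue()\n```\n\nSwap in the real model call once the rest of the app works; anything that yields MP3 chunks will stream through `output_audio` unchanged.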
\n\n## Building the Gradio App\n\nNow let's put it all together using Gradio's Blocks API:\n\n```python\nimport gradio as gr\n\ndef start_recording_user(state: AppState):\n    if not state.stopped:\n        return gr.Audio(recording=True)\n\nwith gr.Blocks() as demo:\n    with gr.Row():\n        with gr.Column():\n            input_audio = gr.Audio(\n                label=\"Input Audio\", sources=\"microphone\", type=\"numpy\"\n            )\n        with gr.Column():\n            chatbot = gr.Chatbot(label=\"Conversation\", type=\"messages\")\n            output_audio = gr.Audio(label=\"Output Audio\", streaming=True, autoplay=True)\n    state = gr.State(value=AppState())\n\n    # Stream microphone audio to the server in 0.5-second chunks\n    stream = input_audio.stream(\n        process_audio,\n        [input_audio, state],\n        [input_audio, state],\n        stream_every=0.5,\n        time_limit=30,\n    )\n    # When recording stops (a pause was detected), generate the reply\n    respond = input_audio.stop_recording(\n        response,\n        [state],\n        [output_audio, state]\n    )\n    respond.then(lambda s: s.conversation, [state], [chatbot])\n\n    # When the reply finishes playing, reopen the user's microphone\n    restart = output_audio.stop(\n        start_recording_user,\n        [state],\n        [input_audio]\n    )\n    cancel = gr.Button(\"Stop Conversation\", variant=\"stop\")\n    cancel.click(lambda: (AppState(stopped=True), gr.Audio(recording=False)), None,\n                 [state, input_audio], cancels=[respond, restart])\n\nif __name__ == \"__main__\":\n    demo.launch()\n```\n\nThis setup creates a user interface with:\n- An input audio component for recording user messages\n- A chatbot component to display the conversation history\n- An output audio component for the chatbot's responses\n- A button to stop and reset the conversation\n\nThe app streams user audio in 0.5-second chunks, processes it, generates responses, and updates the conversation history accordingly.\n\n## Conclusion\n\nThis guide demonstrates how to build a conversational chatbot application using Gradio and the mini omni model. You can adapt this framework to create various audio-based chatbot demos. To see the full application in action, visit the Hugging Face Spaces demo: https://huggingface.co/spaces/gradio/omni-mini\n\nFeel free to experiment with different models, audio processing techniques, or user interface designs to create your own unique conversational AI experiences!", "tags": ["AUDIO", "STREAMING", "CHATBOTS"], "spaces": [], "url": "/guides/conversational-chatbot/", "contributor": null}}