Create your own transcription app
This tutorial will guide you through building a complete transcription application using Hugging Face Inference Endpoints. We’ll create an app that can transcribe audio files and generate intelligent summaries with action items - perfect for meeting notes, interviews, or any audio content.
This tutorial uses Python and Gradio, but you can adapt the approach to any language that can make HTTP requests. The models deployed on Inference Endpoints use standard APIs, so you can integrate them into web applications, mobile apps, or any other system.
Create your transcription endpoint
First, we need to create an Inference Endpoint for audio transcription. We’ll use OpenAI’s Whisper model for high-quality speech recognition.
Start by navigating to the Inference Endpoints UI. Once you have logged in, click the “New” button to create a new Inference Endpoint.
From there you’ll be directed to the Model Catalog. The catalog consists of popular models with tuned configurations that work as one-click deploys. You can filter by name, task, hardware price, and more.
Search for “whisper” to find transcription models, or you can create a custom endpoint with openai/whisper-large-v3. This model provides excellent transcription quality for multiple languages and handles various audio formats.
For transcription models, we recommend:
- GPU: NVIDIA L4 or A10G for good performance with audio processing
- Instance Size: x1 (sufficient for most transcription workloads)
- Auto-scaling: Enable scale-to-zero to save costs when not in use
Click “Create Endpoint” to deploy your transcription service.
Your endpoint will take about 5 minutes to initialize. Once it’s ready, you’ll see it in the “Running” state.
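If you prefer to script this step instead of clicking through the UI, the huggingface_hub library exposes the same operation. The snippet below is a minimal sketch, assuming huggingface_hub is installed; the endpoint name, vendor, region, and nvidia-l4 instance identifier are placeholders you may need to adjust, and catalog deployments may apply extra configuration that this call does not.
# Optional: create the transcription endpoint programmatically (sketch only).
# The name, vendor, region and instance identifier below are assumptions -
# adjust them to what is available on your account.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-whisper-endpoint",                 # hypothetical endpoint name
    repository="openai/whisper-large-v3",
    framework="pytorch",
    task="automatic-speech-recognition",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
    min_replica=0,                         # scale-to-zero when idle
    max_replica=1,
)
endpoint.wait()   # block until the endpoint reaches the "running" state
print(endpoint.url)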
Create your text generation endpoint
Now let’s repeat the process for a text generation model. For generating summaries and action items, we’ll create a second endpoint using the Qwen/Qwen3-1.7B model.
Follow the same process:
- Click the “New” button in the Inference Endpoints UI
- Search for qwen3 1.7b in the catalog
- Select the NVIDIA L4 with an x1 instance size (recommended for this model)
- Keep the default settings (scale-to-zero enabled, 1-hour timeout)
- Click “Create Endpoint”
This model is optimized for text generation tasks and will provide excellent summarization capabilities. Both endpoints will take about 3-5 minutes to initialize.
Test your endpoints
Once your endpoints are running, you can test them in the playground. The transcription endpoint will accept audio files and return text transcripts.
Test with a short audio sample to verify the transcription quality.
Get your endpoint details
You’ll need the endpoint details from your endpoints page:
- Base URL: https://<endpoint-name>.endpoints.huggingface.cloud/v1/
- Model name: the name of your endpoint
- Token: your HF token from your settings
You can validate these details by testing your endpoint from the command line with curl:
curl "<endpoint-url>" \
-X POST \
--data-binary '@<audio-file>' \
-H "Accept: application/json" \
-H "Content-Type: audio/flac" \
-H "Authorization: Bearer <hf-token>"
Building the transcription app
Now let’s build a transcription application step by step. We’ll break it down into logical blocks to create a complete solution that can transcribe audio and generate intelligent summaries.
Step 1: Set up dependencies and imports
We’ll use the requests library to connect to both endpoints and gradio to create the interface. Let’s install the required packages:
pip install gradio requests
Then, set up your imports in a new Python file:
import os
import gradio as gr
import requests
Step 2: Configure your endpoint connections
Set up the configuration to connect to both your transcription and summarization endpoints based on the details you collected in the previous steps.
# Configuration for both endpoints
TRANSCRIPTION_ENDPOINT = "https://your-whisper-endpoint.endpoints.huggingface.cloud/api/v1/audio/transcriptions"
SUMMARIZATION_ENDPOINT = "https://your-qwen-endpoint.endpoints.huggingface.cloud/v1/chat/completions"
HF_TOKEN = os.getenv("HF_TOKEN") # Your Hugging Face Hub token
# Headers for authentication
headers = {
"Authorization": f"Bearer {HF_TOKEN}"
}
Your endpoints are now configured to handle both audio transcription and text summarization.
You might also want to use os.getenv for your endpoint details.
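For example, a fully environment-driven configuration could look like this (the variable names other than HF_TOKEN are just suggestions):
# Read deployment-specific values from the environment (variable names are suggestions)
import os

TRANSCRIPTION_ENDPOINT = os.getenv("TRANSCRIPTION_ENDPOINT")
SUMMARIZATION_ENDPOINT = os.getenv("SUMMARIZATION_ENDPOINT")
HF_TOKEN = os.getenv("HF_TOKEN")

if not all([TRANSCRIPTION_ENDPOINT, SUMMARIZATION_ENDPOINT, HF_TOKEN]):
    raise RuntimeError("Set TRANSCRIPTION_ENDPOINT, SUMMARIZATION_ENDPOINT and HF_TOKEN")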
Step 3: Create the transcription function
Next, we’ll create a function to handle audio file uploads and transcription:
def transcribe_audio(audio_file_path):
"""Transcribe audio using direct requests to the endpoint"""
# Read audio file and prepare for upload
with open(audio_file_path, "rb") as audio_file:
# Read the audio file as binary data and represent it as a file object
files = {"file": audio_file.read()}
# Make the request to the transcription endpoint
response = requests.post(TRANSCRIPTION_ENDPOINT, headers=headers, files=files)
# Check if the request was successful
if response.status_code == 200:
result = response.json()
return result.get("text", "No transcription available")
else:
return f"Error: {response.status_code} - {response.text}"
The transcription endpoint expects a file upload in the files
parameter. Make sure to read the audio file as binary data and pass it correctly to the API.
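As a quick sanity check, you can call the function directly on a local file (the path below is a placeholder):
# Quick manual check of the transcription function (file path is a placeholder)
print(transcribe_audio("sample.flac"))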
Step 4: Create the summarization function
Now we’ll create a function to generate summaries from the transcribed text. We’ll do some simple prompt engineering to get the best results.
def generate_summary(transcript):
"""Generate summary using requests to the chat completions endpoint"""
# define a nice prompt to get the best results for our use case
prompt = f"""
Analyze this meeting transcript and provide:
1. A concise summary of key points
2. Action items with responsible parties
3. Important decisions made
Transcript: {transcript}
Format with clear sections:
## Summary
## Action Items
## Decisions Made
"""
# Prepare the payload using the Messages API format
payload = {
"model": "your-qwen-endpoint-name", # Use the name of your endpoint
"messages": [{"role": "user", "content": prompt}],
"max_tokens": 1000, # we can also set a max_tokens parameter to limit the length of the response
"temperature": 0.7, # we might want to set lower temperature for more deterministic results
"stream": False # we don't need streaming for this use case
}
# Headers for chat completions
chat_headers = {
"Accept": "application/json",
"Content-Type": "application/json",
"Authorization": f"Bearer {HF_TOKEN}"
}
# Make the request
response = requests.post(SUMMARIZATION_ENDPOINT, headers=chat_headers, json=payload)
response.raise_for_status()
# Parse the response
result = response.json()
return result["choices"][0]["message"]["content"]
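You can also sanity-check the prompt and payload with a short hard-coded transcript before wiring up the interface; the example text below is purely illustrative.
# Sanity check with a tiny, made-up transcript
sample_transcript = "Alice: Let's ship the beta on Friday. Bob: I'll finish the login fix by Thursday."
print(generate_summary(sample_transcript))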
Step 5: Wrap it all together
Now let’s build our Gradio interface. We’ll use the gr.Interface
class to create a simple interface that allows us to upload an audio file and see the transcript and summary.
First, we’ll create a main processing function that handles the complete workflow.
def process_meeting_audio(audio_file):
"""Main processing function that handles the complete workflow"""
if audio_file is None:
return "Please upload an audio file.", ""
try:
# Step 1: Transcribe the audio
transcript = transcribe_audio(audio_file)
# Step 2: Generate summary from transcript
summary = generate_summary(transcript)
return transcript, summary
except Exception as e:
return f"Error processing audio: {str(e)}", ""
Then, we can run that function in a Gradio interface. We’ll add some descriptions and a title to make it more user-friendly.
# Create Gradio interface
app = gr.Interface(
fn=process_meeting_audio,
inputs=gr.Audio(label="Upload Meeting Audio", type="filepath"),
outputs=[
gr.Textbox(label="Full Transcript", lines=10),
gr.Textbox(label="Meeting Summary", lines=8),
],
title="🎤 AI Meeting Notes",
description="Upload audio to get instant transcripts and summaries.",
)
That’s it! You can now run the app locally with python app.py
and test it out.
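If you are assembling the file from the snippets above rather than copying the complete script below, add a launch call at the end so python app.py actually starts the server:
# Start the Gradio server when the file is run directly
if __name__ == "__main__":
    app.launch()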
Click to view the complete script
import gradio as gr
import os
import requests
# Configuration for both endpoints
TRANSCRIPTION_ENDPOINT = "https://your-whisper-endpoint.endpoints.huggingface.cloud/api/v1/audio/transcriptions"
SUMMARIZATION_ENDPOINT = "https://your-qwen-endpoint.endpoints.huggingface.cloud/v1/chat/completions"
HF_TOKEN = os.getenv("HF_TOKEN") # Your Hugging Face Hub token
# Headers for authentication
headers = {
"Authorization": f"Bearer {HF_TOKEN}"
}
def transcribe_audio(audio_file_path):
"""Transcribe audio using direct requests to the endpoint"""
# Read audio file and prepare for upload
with open(audio_file_path, "rb") as audio_file:
files = {"file": audio_file.read()}
# Make the request to the transcription endpoint
response = requests.post(TRANSCRIPTION_ENDPOINT, headers=headers, files=files)
if response.status_code == 200:
result = response.json()
return result.get("text", "No transcription available")
else:
return f"Error: {response.status_code} - {response.text}"
def generate_summary(transcript):
"""Generate summary using requests to the chat completions endpoint"""
prompt = f"""
Analyze this meeting transcript and provide:
1. A concise summary of key points
2. Action items with responsible parties
3. Important decisions made
Transcript: {transcript}
Format with clear sections:
## Summary
## Action Items
## Decisions Made
"""
# Prepare the payload using the Messages API format
payload = {
"model": "your-qwen-endpoint-name", # Use the name of your endpoint
"messages": [{"role": "user", "content": prompt}],
"max_tokens": 1000,
"temperature": 0.7,
"stream": False
}
# Headers for chat completions
chat_headers = {
"Accept": "application/json",
"Content-Type": "application/json",
"Authorization": f"Bearer {HF_TOKEN}"
}
# Make the request
response = requests.post(SUMMARIZATION_ENDPOINT, headers=chat_headers, json=payload)
response.raise_for_status()
# Parse the response
result = response.json()
return result["choices"][0]["message"]["content"]
def process_meeting_audio(audio_file):
"""Main processing function that handles the complete workflow"""
if audio_file is None:
return "Please upload an audio file.", ""
try:
# Step 1: Transcribe the audio
transcript = transcribe_audio(audio_file)
# Step 2: Generate summary from transcript
summary = generate_summary(transcript)
return transcript, summary
except Exception as e:
return f"Error processing audio: {str(e)}", ""
# Create Gradio interface
app = gr.Interface(
fn=process_meeting_audio,
inputs=gr.Audio(label="Upload Meeting Audio", type="filepath"),
outputs=[
gr.Textbox(label="Full Transcript", lines=10),
gr.Textbox(label="Meeting Summary", lines=8),
],
title="🎤 AI Meeting Notes",
description="Upload audio to get instant transcripts and summaries.",
)
if __name__ == "__main__":
app.launch()
Deploy your transcription app
Now, let’s deploy it to Hugging Face Spaces so everyone can use it!
- Create a new Space: Go to huggingface.co/new-space
- Choose Gradio SDK and make it public
- Upload your files: Upload app.py and any requirements (see the example requirements file below)
- Add your token: In the Space settings, add HF_TOKEN as a secret
- Configure hardware: Consider a GPU for faster processing
- Launch: Your app will be live at https://huggingface.co/spaces/your-username/your-space-name
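A minimal requirements file for the Space can list just the two packages installed earlier; pin versions if you want reproducible builds.
# requirements.txt
gradio
requests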
Your transcription app is now ready to handle meeting notes, interviews, podcasts, and any other audio content that needs to be transcribed and summarized!
Next steps
Great work! You’ve now built a complete transcription application with intelligent summarization.
Here are some ways to extend your transcription app:
- Multi-language support: Add language detection and support for multiple languages
- Speaker identification: Use a model from the Hub with speaker diarization capabilities
- Custom prompts: Allow users to customize the summary format and style (see the sketch after this list)
- Implement text-to-speech: Use a model from the Hub to convert your summary back into an audio file!
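As one way to prototype the custom-prompts idea, you could add a free-form textbox to the interface and feed its contents into the prompt. The function and labels below are illustrative, not part of the tutorial code, and reuse transcribe_audio and generate_summary from earlier steps.
# Illustrative sketch: let users add their own instructions to the summary prompt
def process_with_style(audio_file, style_instructions):
    if audio_file is None:
        return "Please upload an audio file.", ""
    transcript = transcribe_audio(audio_file)
    # Simplest wiring: append the user's instructions after the transcript so the
    # model sees them; a cleaner option is an extra parameter on generate_summary.
    summary = generate_summary(f"{transcript}\n\nAdditional instructions: {style_instructions}")
    return transcript, summary

styled_app = gr.Interface(
    fn=process_with_style,
    inputs=[
        gr.Audio(label="Upload Meeting Audio", type="filepath"),
        gr.Textbox(label="Summary style (optional)", placeholder="e.g. bullet points only, in French"),
    ],
    outputs=[
        gr.Textbox(label="Full Transcript", lines=10),
        gr.Textbox(label="Meeting Summary", lines=8),
    ],
)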