---
title: Raynos AI Audio Transcription
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.39.0
app_file: app.py
pinned: false
license: apache-2.0
models:
  - openai/whisper-base
  - google/gemma-2b-it
---
# Raynos AI - Real-time Audio Transcription & JSON Extraction

## 🎯 Overview
Raynos AI is an advanced audio transcription application that combines OpenAI's Whisper model with Google's Gemma model to provide:
- Real-time Audio Transcription: Convert speech to text using state-of-the-art Whisper models
- Structured JSON Extraction: Automatically extract key information (names, locations, dates, events) from transcriptions
- Multiple Input Methods: Support for microphone recording, file upload, and streaming
- Flexible Transcription Engines: Choose between local Whisper or cloud-based Deepgram
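At its core this is a two-stage pipeline: Whisper produces the transcript, then a language model turns it into structured JSON. The sketch below is a minimal illustration of that flow, not the app's exact code; the prompt wording and generation settings are assumptions.

```python
import whisper
from transformers import pipeline

# Minimal two-stage sketch: transcribe with Whisper, then ask Gemma for JSON.
# Model choices mirror the metadata above; the prompt itself is an assumption.
asr = whisper.load_model("base")
extractor = pipeline("text-generation", model="google/gemma-2b-it")

def transcribe_and_extract(audio_path: str) -> str:
    text = asr.transcribe(audio_path)["text"]
    prompt = (
        "Extract person names, locations, dates, and events from the text "
        "below and return them as JSON.\n\n" + text
    )
    return extractor(prompt, max_new_tokens=256)[0]["generated_text"]
```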
## 🚀 Features

### Audio Processing

- 🎤 Live Microphone Recording: Real-time audio capture and transcription (see the input sketch after this list)
- 📁 File Upload: Process pre-recorded audio files (MP3, WAV, AAC, etc.)
- 🔄 Streaming Mode: Continuous transcription for long recordings
- 📱 Mobile Support: Optimized for mobile device audio input
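A single Gradio `Audio` component can cover both the microphone and file-upload paths. A minimal sketch (the handler here is a placeholder, not the app's real callback):

```python
import gradio as gr

def handle(audio_path):
    # Placeholder handler: the real app would transcribe the file here.
    return f"Received audio at: {audio_path}"

demo = gr.Interface(
    fn=handle,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
)

if __name__ == "__main__":
    demo.launch()
```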
### Transcription Options

- Whisper Models: Choose from tiny, base, small, medium, or large models (see the sketch after this list)
- Deepgram Integration: Optional cloud-based transcription (requires API key)
- Language Support: Auto-detect or specify language
- Buffer Control: Adjustable buffer duration for optimal performance
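For the local engine, these options map directly onto Whisper's own API. A sketch, assuming the standard `openai-whisper` package:

```python
import whisper

# Size names are Whisper's own: tiny, base, small, medium, large.
model = whisper.load_model("small")

# language=None lets Whisper auto-detect; pass e.g. "en" to force a language.
result = model.transcribe("meeting.wav", language=None)
print(result["text"])
```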
### JSON Extraction

- Smart Information Extraction: Automatically identifies and structures:
  - Person names
  - Locations (cities, countries, addresses)
  - Dates and times
  - Events and activities
  - Key topics and themes
- Temporal Context: Links extracted information to timestamps (see the example output below)
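To make the output concrete, here is an illustrative extraction result written as a Python dict; the exact field names and schema are an assumption, not a guaranteed contract:

```python
# Hypothetical output shape for one transcription segment.
example_extraction = {
    "persons": ["Alice Johnson"],
    "locations": ["Berlin, Germany"],
    "dates": ["2024-03-15"],
    "events": ["quarterly planning meeting"],
    "topics": ["budget review"],
    "timestamp": "00:01:23",  # temporal context: where in the audio this was said
}
```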
## 🛠️ Configuration

### Environment Variables (Optional)

- `DEEPGRAM_API_KEY`: Enable Deepgram cloud transcription
- `CUDA_VISIBLE_DEVICES`: Control GPU usage
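A sketch of how the app might read these at startup (the fallback logic is an assumption):

```python
import os

# Use Deepgram only when an API key is present; otherwise fall back to
# local Whisper. CUDA_VISIBLE_DEVICES is read by CUDA/PyTorch directly,
# so setting it to "" before launch forces CPU-only execution.
use_deepgram = os.getenv("DEEPGRAM_API_KEY") is not None
```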
### Model Selection

The app automatically selects appropriate models based on available hardware (sketched below):

- GPU Available: Uses larger, more accurate models
- CPU Only: Falls back to smaller, faster models
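A minimal sketch of such a fallback, assuming PyTorch for the hardware check; the specific size choices are illustrative:

```python
import torch
import whisper

# Prefer a larger, more accurate model when a GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_size = "small" if device == "cuda" else "tiny"
model = whisper.load_model(model_size, device=device)
```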
## 📊 Technical Details

### Models Used

- Transcription: OpenAI Whisper (various sizes)
- Extraction: Google Gemma-2B-IT (optional, for JSON extraction)

### Audio Processing

- Sample Rate: 16 kHz
- Format: Mono channel
- Chunk Size: 1024 samples
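These specs imply a small preprocessing step before transcription. A sketch, assuming NumPy audio arrays; the naive linear-interpolation resampler keeps the example self-contained, whereas a real app would likely use `librosa` or `torchaudio`:

```python
import numpy as np

TARGET_SR = 16_000  # 16 kHz, as listed above
CHUNK_SIZE = 1024   # samples per streaming chunk, as listed above

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz."""
    if audio.ndim == 2:  # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:  # naive linear-interpolation resample
        n_target = int(len(audio) * TARGET_SR / sr)
        audio = np.interp(
            np.linspace(0.0, len(audio) - 1, n_target),
            np.arange(len(audio)),
            audio,
        )
    return audio.astype(np.float32)
```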
## 🎮 Usage

1. Select Input Method:
   - Desktop: Use microphone or upload file
   - Mobile: Use mobile audio streaming
2. Configure Settings:
   - Choose transcription engine (Whisper/Deepgram)
   - Select model size (accuracy vs. speed trade-off)
   - Set language (auto-detect or specific)
3. Start Transcription:
   - Click "Start Streaming" for live audio
   - Or "Process File" for uploaded audio
4. View Results:
   - Real-time transcription display
   - Structured JSON output with extracted information
## 📝 Notes
- First run may take time to download models
- GPU recommended for best performance
- Larger models provide better accuracy but require more resources
## 🤝 Contributing
This is an open-source project. Contributions are welcome!
## 📄 License
Apache License 2.0
Built with ❤️ using Gradio and Hugging Face