LiKenun committed on
Commit
5bebd85
·
1 Parent(s): 39d9406

Add documentation

README.md CHANGED
@@ -1,8 +1,8 @@
1
  ---
2
- title: Ai Building Blocks
3
  emoji: 👀
4
  colorFrom: purple
5
- colorTo: pink
6
  sdk: gradio
7
  sdk_version: 5.49.1
8
  app_file: app.py
@@ -11,4 +11,181 @@ license: wtfpl
11
  short_description: A gallery of building blocks for building AI applications
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: AI Building Blocks
3
  emoji: 👀
4
  colorFrom: purple
5
+ colorTo: blue
6
  sdk: gradio
7
  sdk_version: 5.49.1
8
  app_file: app.py
 
11
  short_description: A gallery of building blocks for building AI applications
12
  ---
13
 
14
+ # AI Building Blocks
15
+
16
+ A gallery of building blocks for building AI applications, presented as a Gradio web interface with a separate tab for each AI task.
17
+
18
+ ## Features
19
+
20
+ This application provides the following AI building blocks:
21
+
22
+ - **Text-to-image Generation**: Generate images from text prompts using the Hugging Face Inference API
23
+ - **Image-to-text (Image Captioning)**: Generate text descriptions of images using BLIP models
24
+ - **Image Classification**: Classify recyclable items using the Trash-Net model
25
+ - **Text-to-speech (TTS)**: Convert text to speech audio
26
+ - **Automatic Speech Recognition (ASR)**: Transcribe audio to text using Whisper models
27
+ - **Chatbot**: Have conversations with an AI chatbot, with support for both modern chat models and seq2seq models
28
+
29
+ ## Prerequisites
30
+
31
+ - Python 3.10 or higher (required by Gradio 5.x)
32
+ - PyTorch with hardware acceleration (strongly recommended - see [PyTorch Installation](#pytorch-installation))
33
+ - CUDA-capable GPU (optional, but recommended for better performance)
34
+
35
+ ## Installation
36
+
37
+ 1. Clone this repository:
38
+ ```bash
39
+ git clone <repository-url>
40
+ cd ai-building-blocks
41
+ ```
42
+
43
+ 2. Create a virtual environment:
44
+ ```bash
45
+ python -m venv .venv
46
+ source .venv/bin/activate # On Windows: .venv\Scripts\activate
47
+ ```
48
+
49
+ 3. Install PyTorch with CUDA support (see [PyTorch Installation](#pytorch-installation) below).
50
+
51
+ 4. Install the remaining dependencies:
52
+ ```bash
53
+ pip install -r requirements.txt
54
+ ```
55
+
56
+ ## PyTorch Installation
57
+
58
+ PyTorch is not included in `requirements.txt` because installation varies based on your hardware and operating system. **It is strongly recommended to install PyTorch with hardware acceleration support** for optimal performance.
59
+
60
+ For official installation instructions with CUDA support, please visit:
61
+ - **Official PyTorch Installation Guide**: https://pytorch.org/get-started/locally/
62
+
63
+ Select your platform, package manager, Python version, and CUDA version to get the appropriate installation command. For example:
64
+
65
+ - **CUDA 12.1** (recommended for modern NVIDIA GPUs):
66
+ ```bash
67
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
68
+ ```
69
+
70
+ - **CUDA 11.8**:
71
+ ```bash
72
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
73
+ ```
74
+
75
+ - **CPU only** (not recommended for production):
76
+ ```bash
77
+ pip install torch torchvision torchaudio
78
+ ```
79
+
80
+ ## Configuration
81
+
82
+ Create a `.env` file in the project root directory with the following environment variables:
83
+
84
+ ### Required Environment Variables
85
+
86
+ ```env
87
+ # Hugging Face API Token (required for Inference API access)
88
+ # Get your token from: https://huggingface.co/settings/tokens
89
+ HF_TOKEN=your_huggingface_token_here
90
+
91
+ # Model IDs for each building block
92
+ TEXT_TO_IMAGE_MODEL=model_id_for_text_to_image
93
+ IMAGE_TO_TEXT_MODEL=model_id_for_image_captioning
94
+ IMAGE_CLASSIFICATION_MODEL=model_id_for_image_classification
95
+ TEXT_TO_SPEECH_MODEL=model_id_for_text_to_speech
96
+ AUDIO_TRANSCRIPTION_MODEL=model_id_for_speech_recognition
97
+ CHAT_MODEL=model_id_for_chatbot
98
+ ```
99
+
100
+ ### Optional Environment Variables
101
+
102
+ ```env
103
+ # Request timeout in seconds (default: 45)
104
+ REQUEST_TIMEOUT=45
105
+ ```
106
+
107
+ ### Example `.env` File
108
+
109
+ ```env
110
+ HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
111
+
112
+ # Example model IDs (adjust based on your needs)
113
+ TEXT_TO_IMAGE_MODEL=black-forest-labs/FLUX.1-dev
114
+ IMAGE_CLASSIFICATION_MODEL=prithivMLmods/Trash-Net
115
+ IMAGE_TO_TEXT_MODEL=Salesforce/blip-image-captioning-large
116
+ TEXT_TO_SPEECH_MODEL=kakao-enterprise/vits-ljs
117
+ AUDIO_TRANSCRIPTION_MODEL=openai/whisper-large-v3
118
+ CHAT_MODEL=Qwen/Qwen2.5-1.5B-Instruct
119
+
120
+ REQUEST_TIMEOUT=45
121
+ ```
122
+
123
+ **Note**: `.env` should already be listed in `.gitignore`. Never force-add it (for example with `git add --force`), or you risk committing your Hugging Face token.
124
+
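The application reads these values with `os.getenv` at call time. As a quick sanity check for a new `.env`, a short script along the following lines works, assuming `python-dotenv` is installed; the script itself is illustrative and not part of the repository:

```python
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

load_dotenv()  # pull variables from .env into the process environment

required = [
    "HF_TOKEN", "TEXT_TO_IMAGE_MODEL", "IMAGE_TO_TEXT_MODEL",
    "IMAGE_CLASSIFICATION_MODEL", "TEXT_TO_SPEECH_MODEL",
    "AUDIO_TRANSCRIPTION_MODEL", "CHAT_MODEL",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required variables are set.")
```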
125
+ ## Running the Application
126
+
127
+ 1. Activate your virtual environment (if not already activated):
128
+ ```bash
129
+ source .venv/bin/activate # On Windows: .venv\Scripts\activate
130
+ ```
131
+
132
+ 2. Run the application:
133
+ ```bash
134
+ python app.py
135
+ ```
136
+
137
+ 3. Open your web browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`).
138
+
139
+ 4. The Gradio interface will display multiple tabs, each corresponding to a different AI building block.
140
+
141
+ ## Project Structure
142
+
143
+ ```
144
+ ai-building-blocks/
145
+ ├── app.py # Main application entry point
146
+ ├── text_to_image.py # Text-to-image generation module
147
+ ├── image_to_text.py # Image captioning module
148
+ ├── image_classification.py # Image classification module
149
+ ├── text_to_speech.py # Text-to-speech module
150
+ ├── automatic_speech_recognition.py # Speech recognition module
151
+ ├── chatbot.py # Chatbot module
152
+ ├── utils.py # Utility functions
153
+ ├── requirements.txt # Python dependencies
154
+ ├── .env # Environment variables (create this)
155
+ └── README.md # This file
156
+ ```
157
+
158
+ ## Hardware Acceleration
159
+
160
+ This application is designed to leverage hardware acceleration when available:
161
+
162
+ - **NVIDIA CUDA**: Automatically detected and used if available
163
+ - **AMD ROCm**: Supported via CUDA compatibility
164
+ - **Intel XPU**: Automatically detected if available
165
+ - **Apple Silicon (MPS)**: Automatically detected and used on Apple devices
166
+ - **CPU**: Falls back to CPU if no GPU acceleration is available
167
+
168
+ The application automatically selects the best available device. For optimal performance, especially with local models (image-to-text, text-to-speech, chatbot), a CUDA-capable GPU is strongly recommended. This is _untested_ on other hardware. 😉
169
+
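For reference, the device selection in `utils.py` (`get_pytorch_device`) boils down to a priority check along these lines; this is a simplified sketch, not the verbatim source:

```python
import torch

def pick_device() -> str:
    """Pick the best available accelerator, mirroring get_pytorch_device()."""
    if torch.cuda.is_available():                            # NVIDIA CUDA and AMD ROCm builds
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():   # Intel GPUs
        return "xpu"
    if torch.backends.mps.is_available():                    # Apple Silicon
        return "mps"
    return "cpu"                                             # no accelerator found

print(pick_device())
```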
170
+ ## Troubleshooting
171
+
172
+ ### PyTorch Not Detecting GPU
173
+
174
+ If PyTorch is not detecting your GPU:
175
+
176
+ 1. Verify CUDA is installed: `nvidia-smi`
177
+ 2. Ensure PyTorch was installed with CUDA support (see [PyTorch Installation](#pytorch-installation))
178
+ 3. Check PyTorch CUDA availability: `python -c "import torch; print(torch.cuda.is_available())"`
179
+
180
+ ### Missing Environment Variables
181
+
182
+ Ensure all required environment variables are set in your `.env` file. Missing variables will cause the application to fail when trying to use the corresponding feature.
183
+
184
+ ### Model Loading Errors
185
+
186
+ If you encounter errors loading models:
187
+
188
+ 1. Verify your `HF_TOKEN` is valid and has access to the models. Some models are gated.
189
+ 2. Check that model IDs in your `.env` file are correct.
190
+ 3. Ensure you have sufficient disk space for model downloads.
191
+ 4. For local models, ensure you have sufficient RAM or VRAM.
app.py CHANGED
@@ -10,11 +10,28 @@ from text_to_speech import create_text_to_speech_tab
10
 
11
 
12
  class App:
13
 
14
  def __init__(self, client: InferenceClient):
15
  self.client = client
16
 
17
  def run(self):
18
  with gr.Blocks(title="AI Building Blocks") as demo:
19
  gr.Markdown("# AI Building Blocks")
20
  gr.Markdown("A gallery of building blocks for building AI applications")
 
10
 
11
 
12
  class App:
13
+ """Main application class for the AI Building Blocks Gradio interface.
14
+
15
+ This class orchestrates the entire application by creating the Gradio UI
16
+ and integrating all the individual building block tabs.
17
+ """
18
 
19
  def __init__(self, client: InferenceClient):
20
+ """Initialize the App with an InferenceClient instance.
21
+
22
+ Args:
23
+ client: Hugging Face InferenceClient instance for making API calls
24
+ to Hugging Face's inference endpoints.
25
+ """
26
  self.client = client
27
 
28
  def run(self):
29
+ """Launch the Gradio application with all building block tabs.
30
+
31
+ Creates a Gradio Blocks interface with multiple tabs, each representing
32
+ a different AI building block. The application will block until the
33
+ interface is closed.
34
+ """
35
  with gr.Blocks(title="AI Building Blocks") as demo:
36
  gr.Markdown("# AI Building Blocks")
37
  gr.Markdown("A gallery of building blocks for building AI applications")
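Based on the docstrings above, wiring the application together looks roughly like the following; this is a minimal sketch, and the actual entry point in `app.py` may construct the client differently:

```python
from os import getenv

from huggingface_hub import InferenceClient

from app import App  # module layout as shown in the project structure

client = InferenceClient(token=getenv("HF_TOKEN"))  # token from the .env configuration
App(client).run()  # blocks until the Gradio interface is closed
```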
automatic_speech_recognition.py CHANGED
@@ -5,6 +5,28 @@ import gradio as gr
5
  from utils import save_audio_to_temp_file, get_model_sample_rate, request_audio
6
 
7
  def automatic_speech_recognition(client: InferenceClient, audio: tuple[int, bytes]) -> str:
8
  temp_file_path = None
9
  try:
10
  model_id = getenv("AUDIO_TRANSCRIPTION_MODEL")
@@ -21,7 +43,17 @@ def automatic_speech_recognition(client: InferenceClient, audio: tuple[int, byte
21
 
22
 
23
  def create_asr_tab(client: InferenceClient):
24
- """Create the automatic speech recognition tab."""
25
  gr.Markdown("Transcribe audio to text.")
26
  audio_transcription_url_input = gr.Textbox(label="Audio URL")
27
  audio_transcription_audio_request_button = gr.Button("Get Audio")
 
5
  from utils import save_audio_to_temp_file, get_model_sample_rate, request_audio
6
 
7
  def automatic_speech_recognition(client: InferenceClient, audio: tuple[int, bytes]) -> str:
8
+ """Transcribe audio to text using Hugging Face Inference API.
9
+
10
+ This function converts speech audio into text transcription. The audio is
11
+ resampled to match the model's expected sample rate, saved to a temporary
12
+ file, and then sent to the Inference API for transcription.
13
+
14
+ Args:
15
+ client: Hugging Face InferenceClient instance for API calls.
16
+ audio: Tuple containing:
17
+ - int: Sample rate of the input audio (e.g., 44100 Hz)
18
+ - bytes: Raw audio data as bytes
19
+
20
+ Returns:
21
+ String containing the transcribed text from the audio.
22
+
23
+ Note:
24
+ - The model ID is determined by the AUDIO_TRANSCRIPTION_MODEL environment variable.
25
+ - Audio is automatically resampled to match the model's expected sample rate.
26
+ - Audio is saved as a WAV file for InferenceClient compatibility.
27
+ - Automatically cleans up temporary files after transcription.
28
+ - Intended for openai/whisper-large-v3 or similar ASR models.
29
+ """
30
  temp_file_path = None
31
  try:
32
  model_id = getenv("AUDIO_TRANSCRIPTION_MODEL")
 
43
 
44
 
45
  def create_asr_tab(client: InferenceClient):
46
+ """Create the automatic speech recognition tab in the Gradio interface.
47
+
48
+ This function sets up all UI components for automatic speech recognition, including:
49
+ - URL input textbox for fetching audio files from the web
50
+ - Button to retrieve audio from URL
51
+ - Audio input component for uploading or recording audio
52
+ - Transcribe button and output textbox
53
+
54
+ Args:
55
+ client: Hugging Face InferenceClient instance to pass to the automatic_speech_recognition function.
56
+ """
57
  gr.Markdown("Transcribe audio to text.")
58
  audio_transcription_url_input = gr.Textbox(label="Audio URL")
59
  audio_transcription_audio_request_button = gr.Button("Get Audio")
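A usage sketch for `automatic_speech_recognition`, showing how the `(sample_rate, bytes)` tuple described in the docstring can be built outside of Gradio; reading a mono 16-bit WAV with the `soundfile` package is an assumption made for illustration:

```python
from os import getenv

import soundfile as sf
from huggingface_hub import InferenceClient

from automatic_speech_recognition import automatic_speech_recognition

client = InferenceClient(token=getenv("HF_TOKEN"))

# Read a mono WAV as int16 PCM; .tobytes() yields the raw bytes the function expects.
data, sample_rate = sf.read("sample.wav", dtype="int16")
transcript = automatic_speech_recognition(client, (sample_rate, data.tobytes()))
print(transcript)
```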
chatbot.py CHANGED
@@ -9,7 +9,26 @@ _tokenizer = None
9
  _is_seq2seq = None
10
 
11
  def get_chatbot():
12
- """Get or create the chatbot model instance. Supports both causal LM and seq2seq models."""
13
  global _chatbot, _tokenizer, _is_seq2seq
14
  if _chatbot is None:
15
  model_id = getenv("CHAT_MODEL")
@@ -46,6 +65,33 @@ def get_chatbot():
46
 
47
  @spaces_gpu
48
  def chat(message: str, conversation_history: list[dict] | None) -> tuple[str, list[dict]]:
49
  model, tokenizer, is_seq2seq = get_chatbot()
50
 
51
  # Initialize conversation history if this is the first message
@@ -129,7 +175,19 @@ def chat(message: str, conversation_history: list[dict] | None) -> tuple[str, li
129
 
130
 
131
  def create_chatbot_tab():
132
- """Create the chatbot tab."""
133
  gr.Markdown("Have a conversation with an AI chatbot.")
134
  chatbot_history = gr.State(value=None) # Store the conversation history.
135
  chatbot_output = gr.Chatbot(label="Conversation")
@@ -137,7 +195,23 @@ def create_chatbot_tab():
137
  chatbot_send_button = gr.Button("Send")
138
 
139
  def chat_interface(message: str, history: list | None, conversation_state: list[dict] | None):
140
- """Handle chatbot interaction with Gradio chat format."""
141
  if not message.strip():
142
  return history, conversation_state, ""
143
  response, updated_conversation = chat(message, conversation_state) # Get response from chatbot.
 
9
  _is_seq2seq = None
10
 
11
  def get_chatbot():
12
+ """Get or create the chatbot model instance.
13
+
14
+ This function implements a singleton pattern to load and cache the chatbot
15
+ model and tokenizer. It supports both causal language models (like GPT-style
16
+ models) and sequence-to-sequence models (like BlenderBot). The model type
17
+ is automatically detected from the model configuration.
18
+
19
+ Returns:
20
+ Tuple containing:
21
+ - Model: The loaded transformer model (AutoModelForCausalLM or AutoModelForSeq2SeqLM)
22
+ - Tokenizer: The corresponding tokenizer
23
+ - bool: Whether the model is a seq2seq model (True) or causal LM (False)
24
+
25
+ Note:
26
+ - The model ID is determined by the CHAT_MODEL environment variable.
27
+ - Models are loaded with safetensors for secure loading.
28
+ - Automatically selects the best available device (CUDA/XPU/MPS/CPU).
29
+ - Sets pad_token to eos_token if pad_token is not configured.
30
+ - Model is cached globally after first load for performance.
31
+ """
32
  global _chatbot, _tokenizer, _is_seq2seq
33
  if _chatbot is None:
34
  model_id = getenv("CHAT_MODEL")
 
65
 
66
  @spaces_gpu
67
  def chat(message: str, conversation_history: list[dict] | None) -> tuple[str, list[dict]]:
68
+ """Generate a chatbot response given a user message and conversation history.
69
+
70
+ This function handles conversation with AI chatbots, supporting both modern
71
+ chat models with chat templates (like Qwen, Mistral) and older models
72
+ without templates (like BlenderBot). It manages conversation history and
73
+ formats inputs appropriately based on the model type.
74
+
75
+ Args:
76
+ message: The user's current message as a string.
77
+ conversation_history: Optional list of previous conversation messages.
78
+ Each message is a dict with "role" ("user" or "assistant") and "content".
79
+ If None, starts a new conversation.
80
+
81
+ Returns:
82
+ Tuple containing:
83
+ - str: The assistant's response message
84
+ - list[dict]: Updated conversation history including the new exchange
85
+
86
+ Note:
87
+ - Supports models with chat templates (uses apply_chat_template)
88
+ - Falls back to manual formatting for models without templates
89
+ - Handles both causal LM and seq2seq model architectures
90
+ - Uses sampling with temperature=0.7 for varied responses
91
+ - Generates up to 256 new tokens
92
+ - Automatically manages conversation context and history
93
+ - Extracts only newly generated text for causal LMs with chat templates
94
+ """
95
  model, tokenizer, is_seq2seq = get_chatbot()
96
 
97
  # Initialize conversation history if this is the first message
 
175
 
176
 
177
  def create_chatbot_tab():
178
+ """Create the chatbot tab in the Gradio interface.
179
+
180
+ This function sets up all UI components for the conversational chatbot,
181
+ including:
182
+ - Chatbot component for displaying conversation history
183
+ - Text input box for user messages
184
+ - Send button and Enter key submission support
185
+ - Internal state management for conversation history
186
+
187
+ It also wires up event handlers for both button clicks and Enter key presses,
188
+ and manages the conversion between Gradio's chat format and the internal
189
+ conversation history format.
190
+ """
191
  gr.Markdown("Have a conversation with an AI chatbot.")
192
  chatbot_history = gr.State(value=None) # Store the conversation history.
193
  chatbot_output = gr.Chatbot(label="Conversation")
 
195
  chatbot_send_button = gr.Button("Send")
196
 
197
  def chat_interface(message: str, history: list | None, conversation_state: list[dict] | None):
198
+ """Handle chatbot interaction with Gradio chat format.
199
+
200
+ This function serves as the bridge between Gradio's chat interface format
201
+ and the internal chatbot API. It converts formats, handles empty messages,
202
+ and manages state updates.
203
+
204
+ Args:
205
+ message: The user's message string from the input box.
206
+ history: Gradio's chat history format (list of [user_msg, bot_msg] pairs).
207
+ conversation_state: Internal conversation history format (list of dicts).
208
+
209
+ Returns:
210
+ Tuple containing:
211
+ - Updated Gradio chat history
212
+ - Updated internal conversation state
213
+ - Empty string (to clear the input field)
214
+ """
215
  if not message.strip():
216
  return history, conversation_state, ""
217
  response, updated_conversation = chat(message, conversation_state) # Get response from chatbot.
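Going by the `chat()` docstring, a multi-turn exchange outside the UI would look something like this sketch, which assumes `CHAT_MODEL` is set in the environment; the first call passes `None` to start a new conversation:

```python
from chatbot import chat

history = None  # None starts a fresh conversation
reply, history = chat("Hello! What can you do?", history)
print(reply)

reply, history = chat("Summarize that in one sentence.", history)
print(reply)

# history is now a list of {"role": "user" | "assistant", "content": "..."} dicts
```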
image_classification.py CHANGED
@@ -9,6 +9,28 @@ from utils import save_image_to_temp_file, request_image
9
 
10
 
11
  def image_classification(client: InferenceClient, image: Image) -> DataFrame:
12
  try:
13
  temp_file_path = save_image_to_temp_file(image) # Needed because InferenceClient does not accept PIL Images directly.
14
  classifications = client.image_classification(temp_file_path, model=getenv("IMAGE_CLASSIFICATION_MODEL"))
@@ -27,7 +49,17 @@ def image_classification(client: InferenceClient, image: Image) -> DataFrame:
27
 
28
 
29
  def create_image_classification_tab(client: InferenceClient):
30
- """Create the image classification tab."""
31
  gr.Markdown("Classify a recyclable item as one of: cardboard, glass, metal, paper, plastic, or other using [Trash-Net](https://huggingface.co/prithivMLmods/Trash-Net).")
32
  image_classification_url_input = gr.Textbox(label="Image URL")
33
  image_classification_image_request_button = gr.Button("Get Image")
 
9
 
10
 
11
  def image_classification(client: InferenceClient, image: Image) -> DataFrame:
12
+ """Classify an image using Hugging Face Inference API.
13
+
14
+ This function classifies a recyclable item image into categories:
15
+ cardboard, glass, metal, paper, plastic, or other. The image is saved
16
+ to a temporary file since InferenceClient requires a file path rather than
17
+ a PIL Image object directly.
18
+
19
+ Args:
20
+ client: Hugging Face InferenceClient instance for API calls.
21
+ image: PIL Image object to classify.
22
+
23
+ Returns:
24
+ Pandas DataFrame with two columns:
25
+ - Label: The classification label (e.g., "cardboard", "glass")
26
+ - Probability: The confidence score as a percentage string (e.g., "95.23%")
27
+
28
+ Note:
29
+ - The model ID is determined by the IMAGE_CLASSIFICATION_MODEL environment variable.
30
+ - Uses Trash-Net model for recyclable item classification.
31
+ - Automatically cleans up temporary files after classification.
32
+ - Temporary file is created with format preservation if possible.
33
+ """
34
  try:
35
  temp_file_path = save_image_to_temp_file(image) # Needed because InferenceClient does not accept PIL Images directly.
36
  classifications = client.image_classification(temp_file_path, model=getenv("IMAGE_CLASSIFICATION_MODEL"))
 
49
 
50
 
51
  def create_image_classification_tab(client: InferenceClient):
52
+ """Create the image classification tab in the Gradio interface.
53
+
54
+ This function sets up all UI components for image classification, including:
55
+ - URL input textbox for fetching images from the web
56
+ - Button to retrieve image from URL
57
+ - Image preview component
58
+ - Classify button and output dataframe showing labels and probabilities
59
+
60
+ Args:
61
+ client: Hugging Face InferenceClient instance to pass to the image_classification function.
62
+ """
63
  gr.Markdown("Classify a recyclable item as one of: cardboard, glass, metal, paper, plastic, or other using [Trash-Net](https://huggingface.co/prithivMLmods/Trash-Net).")
64
  image_classification_url_input = gr.Textbox(label="Image URL")
65
  image_classification_image_request_button = gr.Button("Get Image")
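A small usage sketch for `image_classification`, based on its documented signature and return value; the file name and token handling are illustrative assumptions:

```python
from os import getenv

from PIL import Image
from huggingface_hub import InferenceClient

from image_classification import image_classification

client = InferenceClient(token=getenv("HF_TOKEN"))
image = Image.open("bottle.jpg")  # any local photo of a recyclable item

df = image_classification(client, image)
print(df)  # columns: Label, Probability (formatted as percentages, e.g. "95.23%")
```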
image_to_text.py CHANGED
@@ -8,6 +8,25 @@ from utils import get_pytorch_device, spaces_gpu, request_image
8
 
9
  @spaces_gpu
10
  def image_to_text(image: Image) -> list[str]:
11
  image_to_text_model_id = getenv("IMAGE_TO_TEXT_MODEL")
12
  pytorch_device = get_pytorch_device()
13
  processor = AutoProcessor.from_pretrained(image_to_text_model_id)
@@ -24,7 +43,14 @@ def image_to_text(image: Image) -> list[str]:
24
 
25
 
26
  def create_image_to_text_tab():
27
- """Create the image-to-text captioning tab."""
28
  gr.Markdown("Generate a text description of an image.")
29
  image_to_text_url_input = gr.Textbox(label="Image URL")
30
  image_to_text_image_request_button = gr.Button("Get Image")
 
8
 
9
  @spaces_gpu
10
  def image_to_text(image: Image) -> list[str]:
11
+ """Generate text captions for an image using BLIP model.
12
+
13
+ This function uses a BLIP (Bootstrapping Language-Image Pre-training) model
14
+ to generate multiple caption candidates for the input image. The model is
15
+ loaded, inference is performed, and then cleaned up to free GPU memory.
16
+
17
+ Args:
18
+ image: PIL Image object to generate captions for.
19
+
20
+ Returns:
21
+ List of string captions describing the image.
22
+
23
+ Note:
24
+ - The model ID is determined by the IMAGE_TO_TEXT_MODEL environment variable.
25
+ - Uses safetensors for secure model loading.
26
+ - Automatically selects the best available device (CUDA/XPU/MPS/CPU).
27
+ - Cleans up model and GPU memory after inference.
28
+ - Uses beam search with 3 beams, max length 20, min length 5.
29
+ """
30
  image_to_text_model_id = getenv("IMAGE_TO_TEXT_MODEL")
31
  pytorch_device = get_pytorch_device()
32
  processor = AutoProcessor.from_pretrained(image_to_text_model_id)
 
43
 
44
 
45
  def create_image_to_text_tab():
46
+ """Create the image-to-text captioning tab in the Gradio interface.
47
+
48
+ This function sets up all UI components for image captioning, including:
49
+ - URL input textbox for fetching images from the web
50
+ - Button to retrieve image from URL
51
+ - Image preview component
52
+ - Caption button and output list
53
+ """
54
  gr.Markdown("Generate a text description of an image.")
55
  image_to_text_url_input = gr.Textbox(label="Image URL")
56
  image_to_text_image_request_button = gr.Button("Get Image")
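The generation parameters named in the docstring (beam search with 3 beams, max length 20, min length 5) correspond to a BLIP captioning call along these lines; the snippet below is a standalone sketch using `transformers` classes directly, not the exact code in `image_to_text.py`:

```python
from os import getenv

import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

model_id = getenv("IMAGE_TO_TEXT_MODEL", "Salesforce/blip-image-captioning-large")
processor = AutoProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")  # illustrative input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, num_beams=3, max_length=20, min_length=5)
print(processor.batch_decode(output_ids, skip_special_tokens=True))
```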
text_to_image.py CHANGED
@@ -6,11 +6,30 @@ from huggingface_hub import InferenceClient
6
 
7
 
8
  def text_to_image(client: InferenceClient, prompt: str) -> Image:
9
  return client.text_to_image(prompt, model=getenv("TEXT_TO_IMAGE_MODEL"))
10
 
11
 
12
  def create_text_to_image_tab(client: InferenceClient):
13
- """Create the text-to-image generation tab."""
14
  gr.Markdown("Generate an image from a text prompt.")
15
  text_to_image_prompt = gr.Textbox(label="Prompt")
16
  text_to_image_generate_button = gr.Button("Generate")
 
6
 
7
 
8
  def text_to_image(client: InferenceClient, prompt: str) -> Image:
9
+ """Generate an image from a text prompt using Hugging Face Inference API.
10
+
11
+ Args:
12
+ client: Hugging Face InferenceClient instance for API calls.
13
+ prompt: Text description of the desired image.
14
+
15
+ Returns:
16
+ PIL Image object representing the generated image.
17
+
18
+ Note:
19
+ The model to use is determined by the TEXT_TO_IMAGE_MODEL environment variable.
20
+ """
21
  return client.text_to_image(prompt, model=getenv("TEXT_TO_IMAGE_MODEL"))
22
 
23
 
24
  def create_text_to_image_tab(client: InferenceClient):
25
+ """Create the text-to-image generation tab in the Gradio interface.
26
+
27
+ This function sets up all UI components for text-to-image generation,
28
+ including input textbox, generate button, and output image display.
29
+
30
+ Args:
31
+ client: Hugging Face InferenceClient instance to pass to the text_to_image function.
32
+ """
33
  gr.Markdown("Generate an image from a text prompt.")
34
  text_to_image_prompt = gr.Textbox(label="Prompt")
35
  text_to_image_generate_button = gr.Button("Generate")
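Correspondingly, a minimal call to `text_to_image`; the prompt and output path are illustrative:

```python
from os import getenv

from huggingface_hub import InferenceClient

from text_to_image import text_to_image

client = InferenceClient(token=getenv("HF_TOKEN"))
image = text_to_image(client, "a watercolor fox in a snowy forest")  # returns a PIL Image
image.save("fox.png")
```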
text_to_speech.py CHANGED
@@ -7,6 +7,27 @@ from utils import spaces_gpu
7
 
8
  @spaces_gpu
9
  def text_to_speech(text: str) -> tuple[int, bytes]:
10
  narrator = pipeline(
11
  "text-to-speech",
12
  getenv("TEXT_TO_SPEECH_MODEL"),
@@ -19,7 +40,11 @@ def text_to_speech(text: str) -> tuple[int, bytes]:
19
 
20
 
21
  def create_text_to_speech_tab():
22
- """Create the text-to-speech tab."""
23
  gr.Markdown("Generate speech from text.")
24
  text_to_speech_text = gr.Textbox(label="Text")
25
  text_to_speech_generate_button = gr.Button("Generate")
 
7
 
8
  @spaces_gpu
9
  def text_to_speech(text: str) -> tuple[int, bytes]:
10
+ """Convert text to speech audio using a TTS (Text-to-Speech) model.
11
+
12
+ This function uses a transformer pipeline to generate speech audio from
13
+ text input. The model is loaded, inference is performed, and then cleaned
14
+ up to free GPU memory.
15
+
16
+ Args:
17
+ text: Input text string to convert to speech.
18
+
19
+ Returns:
20
+ Tuple containing:
21
+ - int: Sampling rate of the generated audio (e.g., 22050 Hz)
22
+ - bytes: Raw audio data as bytes
23
+
24
+ Note:
25
+ - The model ID is determined by the TEXT_TO_SPEECH_MODEL environment variable.
26
+ - Uses safetensors for secure model loading.
27
+ - Automatically selects the best available device (CUDA/XPU/MPS/CPU).
28
+ - Cleans up model and GPU memory after inference.
29
+ - Returns audio in format compatible with Gradio Audio component.
30
+ """
31
  narrator = pipeline(
32
  "text-to-speech",
33
  getenv("TEXT_TO_SPEECH_MODEL"),
 
40
 
41
 
42
  def create_text_to_speech_tab():
43
+ """Create the text-to-speech tab in the Gradio interface.
44
+
45
+ This function sets up all UI components for text-to-speech generation,
46
+ including input textbox, generate button, and output audio player.
47
+ """
48
  gr.Markdown("Generate speech from text.")
49
  text_to_speech_text = gr.Textbox(label="Text")
50
  text_to_speech_generate_button = gr.Button("Generate")
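A rough sketch of the pipeline usage the docstring describes; writing the result to disk with `soundfile` is an assumption for illustration, since inside the app the `(sampling_rate, audio)` pair is handed straight to a Gradio Audio component:

```python
from os import getenv

import soundfile as sf
from transformers import pipeline

tts = pipeline("text-to-speech", getenv("TEXT_TO_SPEECH_MODEL", "kakao-enterprise/vits-ljs"))
result = tts("Hello from the AI building blocks demo.")

# The pipeline returns a dict with "audio" (a numpy array) and "sampling_rate".
sf.write("speech.wav", result["audio"].squeeze(), result["sampling_rate"])
```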
utils.py CHANGED
@@ -17,15 +17,48 @@ try:
17
  except ImportError:
18
  # For local development, use a no-op decorator because spaces is not available.
19
  def spaces_gpu(func):
20
  return func
21
 
22
  def get_pytorch_device() -> str:
23
  return ("cuda" if torch.cuda.is_available() # Nvidia CUDA and AMD ROCm
24
  else "xpu" if torch.xpu.is_available() # Intel XPU
25
  else "mps" if torch.mps.is_available() # Apple Silicon
26
  else "cpu") # gl bro 🫠
27
 
28
  def request_image(url: str) -> Image:
29
  try:
30
  response = requests.get(url, timeout=int(getenv("REQUEST_TIMEOUT", "45")))
31
  response.raise_for_status()
@@ -38,6 +71,33 @@ def request_image(url: str) -> Image:
38
  raise gr.Error(f"Failed to fetch image from URL: {str(e)}")
39
 
40
  def request_audio(url: str) -> tuple[int, np.ndarray]:
41
  try:
42
  response = requests.get(url, timeout=int(getenv("REQUEST_TIMEOUT", "45")))
43
  response.raise_for_status()
@@ -53,6 +113,25 @@ def request_audio(url: str) -> tuple[int, np.ndarray]:
53
  raise gr.Error(f"Failed to load audio file: {str(e)}")
54
 
55
  def save_image_to_temp_file(image: Image) -> str:
56
  image_format = image.format if image.format else 'PNG'
57
  format_extension = image_format.lower() if image_format else 'png'
58
  temp_file = NamedTemporaryFile(delete=False, suffix=f".{format_extension}")
@@ -62,6 +141,24 @@ def save_image_to_temp_file(image: Image) -> str:
62
  return temp_path
63
 
64
  def get_model_sample_rate(model_id: str) -> int:
65
  try:
66
  processor = AutoProcessor.from_pretrained(model_id)
67
  return processor.feature_extractor.sampling_rate
@@ -69,6 +166,31 @@ def get_model_sample_rate(model_id: str) -> int:
69
  return 16000 # Fallback value as most ASR models use 16kHz
70
 
71
  def resample_audio(target_sample_rate: int, audio: tuple[int, bytes | np.ndarray]) -> np.ndarray:
72
  sample_rate, audio_data = audio
73
 
74
  # Convert audio data to a numpy array if it’s bytes
@@ -86,6 +208,28 @@ def resample_audio(target_sample_rate: int, audio: tuple[int, bytes | np.ndarray
86
  return audio_array
87
 
88
  def save_audio_to_temp_file(target_sample_rate: int, audio: tuple[int, bytes | np.ndarray]) -> str:
89
  audio_array = resample_audio(target_sample_rate, audio)
90
  temp_file = NamedTemporaryFile(delete=False, suffix='.wav')
91
  temp_path = temp_file.name
 
17
  except ImportError:
18
  # For local development, use a no-op decorator because spaces is not available.
19
  def spaces_gpu(func):
20
+ """No-op decorator for local development when spaces module is not available."""
21
  return func
22
 
23
  def get_pytorch_device() -> str:
24
+ """Determine the best available PyTorch device for computation.
25
+
26
+ Checks for available hardware accelerators in priority order:
27
+ 1. CUDA (Nvidia GPUs and AMD ROCm)
28
+ 2. XPU (Intel GPUs)
29
+ 3. MPS (Apple Silicon/Metal Performance Shaders)
30
+ 4. CPU (fallback)
31
+
32
+ Returns:
33
+ String device name: "cuda", "xpu", "mps", or "cpu"
34
+ """
35
  return ("cuda" if torch.cuda.is_available() # Nvidia CUDA and AMD ROCm
36
  else "xpu" if torch.xpu.is_available() # Intel XPU
37
  else "mps" if torch.mps.is_available() # Apple Silicon
38
  else "cpu") # gl bro 🫠
39
 
40
  def request_image(url: str) -> Image:
41
+ """Fetch an image from a URL and return it as a PIL Image.
42
+
43
+ Downloads an image from the provided URL and converts it to a PIL Image
44
+ object for processing. Handles various HTTP errors and timeouts gracefully.
45
+
46
+ Args:
47
+ url: HTTP/HTTPS URL pointing to an image file.
48
+
49
+ Returns:
50
+ PIL Image object loaded from the URL.
51
+
52
+ Raises:
53
+ gr.Error: If the image cannot be fetched due to:
54
+ - HTTP errors (4xx, 5xx status codes)
55
+ - Network timeouts
56
+ - Other request exceptions
57
+
58
+ Note:
59
+ - Timeout is configurable via REQUEST_TIMEOUT environment variable (default: 45 seconds)
60
+ - Supports common image formats (JPEG, PNG, GIF, WebP, etc.)
61
+ """
62
  try:
63
  response = requests.get(url, timeout=int(getenv("REQUEST_TIMEOUT", "45")))
64
  response.raise_for_status()
 
71
  raise gr.Error(f"Failed to fetch image from URL: {str(e)}")
72
 
73
  def request_audio(url: str) -> tuple[int, np.ndarray]:
74
+ """Fetch an audio file from a URL and return it as audio data.
75
+
76
+ Downloads an audio file from the provided URL and loads it using librosa,
77
+ which supports many audio formats. Returns the audio data in a format
78
+ compatible with Gradio's Audio component.
79
+
80
+ Args:
81
+ url: HTTP/HTTPS URL pointing to an audio file.
82
+
83
+ Returns:
84
+ Tuple containing:
85
+ - int: Sample rate of the audio in Hz (e.g., 44100, 22050)
86
+ - np.ndarray: Audio waveform data as a numpy array (float32, normalized)
87
+
88
+ Raises:
89
+ gr.Error: If the audio cannot be fetched or loaded due to:
90
+ - HTTP errors (4xx, 5xx status codes)
91
+ - Network timeouts
92
+ - Unsupported audio formats
93
+ - Other request or audio loading exceptions
94
+
95
+ Note:
96
+ - Timeout is configurable via REQUEST_TIMEOUT environment variable (default: 45 seconds)
97
+ - Supports many audio formats (MP3, WAV, FLAC, OGG, M4A, etc.)
98
+ - Audio is loaded at its native sample rate (sr=None)
99
+ - Returns normalized float32 audio data suitable for processing
100
+ """
101
  try:
102
  response = requests.get(url, timeout=int(getenv("REQUEST_TIMEOUT", "45")))
103
  response.raise_for_status()
 
113
  raise gr.Error(f"Failed to load audio file: {str(e)}")
114
 
115
  def save_image_to_temp_file(image: Image) -> str:
116
+ """Save a PIL Image to a temporary file on disk.
117
+
118
+ Creates a temporary file with an appropriate extension based on the image's
119
+ format and saves the image to it. This is needed because some APIs (like
120
+ Hugging Face InferenceClient) require file paths rather than PIL Image objects.
121
+
122
+ Args:
123
+ image: PIL Image object to save.
124
+
125
+ Returns:
126
+ String path to the temporary file where the image was saved.
127
+
128
+ Note:
129
+ - Preserves the original image format if available
130
+ - Falls back to PNG format if image.format is None
131
+ - Temporary file is not automatically deleted (caller is responsible for cleanup)
132
+ - File extension is determined from the image format
133
+ - Useful for APIs that require local file paths rather than in-memory objects
134
+ """
135
  image_format = image.format if image.format else 'PNG'
136
  format_extension = image_format.lower() if image_format else 'png'
137
  temp_file = NamedTemporaryFile(delete=False, suffix=f".{format_extension}")
 
141
  return temp_path
142
 
143
  def get_model_sample_rate(model_id: str) -> int:
144
+ """Get the expected sample rate for an audio processing model.
145
+
146
+ Retrieves the sample rate configuration from a Hugging Face model's
147
+ feature extractor. This is useful for ensuring audio is resampled to
148
+ match the model's expected input format.
149
+
150
+ Args:
151
+ model_id: Hugging Face model identifier (e.g., "openai/whisper-large-v3").
152
+
153
+ Returns:
154
+ Integer sample rate in Hz that the model expects (e.g., 16000).
155
+ Defaults to 16000 Hz if the sample rate cannot be determined.
156
+
157
+ Note:
158
+ - Most ASR models use 16kHz sample rate
159
+ - Uses AutoProcessor to access the model's feature extractor configuration
160
+ - Returns a sensible default (16kHz) if the model config cannot be loaded
161
+ """
162
  try:
163
  processor = AutoProcessor.from_pretrained(model_id)
164
  return processor.feature_extractor.sampling_rate
 
166
  return 16000 # Fallback value as most ASR models use 16kHz
167
 
168
  def resample_audio(target_sample_rate: int, audio: tuple[int, bytes | np.ndarray]) -> np.ndarray:
169
+ """Resample audio data to a target sample rate.
170
+
171
+ Converts audio data to the target sample rate using librosa's resampling.
172
+ Handles both bytes and numpy array input formats, converting bytes to
173
+ float32 numpy arrays as needed.
174
+
175
+ Args:
176
+ target_sample_rate: Desired output sample rate in Hz (e.g., 16000).
177
+ audio: Tuple containing:
178
+ - int: Current sample rate of the audio
179
+ - bytes | np.ndarray: Audio data (can be raw bytes or numpy array)
180
+
181
+ Returns:
182
+ Numpy array (float32) containing the resampled audio waveform.
183
+ If sample rates match, returns the audio data unchanged.
184
+
185
+ Raises:
186
+ ValueError: If audio_data is neither bytes nor np.ndarray.
187
+
188
+ Note:
189
+ - Converts bytes to float32 by assuming int16 PCM format
190
+ - Normalizes int16 values to [-1.0, 1.0] range
191
+ - Only resamples if source and target sample rates differ
192
+ - Uses librosa's high-quality resampling algorithm
193
+ """
194
  sample_rate, audio_data = audio
195
 
196
  # Convert audio data to a numpy array if it’s bytes
 
208
  return audio_array
209
 
210
  def save_audio_to_temp_file(target_sample_rate: int, audio: tuple[int, bytes | np.ndarray]) -> str:
211
+ """Resample audio to target sample rate and save to a temporary WAV file.
212
+
213
+ This function resamples audio data to match a target sample rate and saves
214
+ it as a WAV file. This is useful for preparing audio for APIs that require
215
+ specific sample rates and file formats.
216
+
217
+ Args:
218
+ target_sample_rate: Target sample rate in Hz for the output file (e.g., 16000).
219
+ audio: Tuple containing:
220
+ - int: Current sample rate of the input audio
221
+ - bytes | np.ndarray: Audio data to process
222
+
223
+ Returns:
224
+ String path to the temporary WAV file where the audio was saved.
225
+
226
+ Note:
227
+ - Automatically resamples audio if sample rates don't match
228
+ - Saves audio as WAV format (16-bit PCM)
229
+ - Temporary file is not automatically deleted (caller is responsible for cleanup)
230
+ - Audio is normalized and converted to float32 before saving
231
+ - Useful for preparing audio for Hugging Face InferenceClient APIs
232
+ """
233
  audio_array = resample_audio(target_sample_rate, audio)
234
  temp_file = NamedTemporaryFile(delete=False, suffix='.wav')
235
  temp_path = temp_file.name
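The bytes-to-float conversion described in the `resample_audio` docstring (int16 PCM normalized to [-1.0, 1.0]) is typically implemented as below; the exact scaling constant in `utils.py` may differ slightly:

```python
import numpy as np

def pcm16_bytes_to_float32(raw: bytes) -> np.ndarray:
    """Interpret raw bytes as little-endian int16 PCM and normalize to [-1.0, 1.0]."""
    samples = np.frombuffer(raw, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0

# Two samples at the extremes of the int16 range:
raw = np.array([-32768, 32767], dtype=np.int16).tobytes()
print(pcm16_bytes_to_float32(raw))  # [-1.         0.99996948]
```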