---
title: multimodel-rag-chat-with-videos
app_file: main_demo.py
sdk: gradio
sdk_version: 5.17.1
---

# Re-architecting a Multimodal RAG System: A Pipeline Journey

I ported it locally and isolated each concept into its own runnable Python step. It is now simplified, refactored, and bug-fixed. I also migrated from Prediction Guard to Hugging Face.

## Interactive Video Chat Demo and Multimodal RAG System Architecture

A multimodal AI system should be able to understand both text and video content.


## Step 1 - Learn Gradio (UI) (30 mins)

Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.

Key Concepts:

- `fn`: The function wrapped by the UI.
- `inputs`: The Gradio components used for input (should match the function's arguments).
- `outputs`: The Gradio components used for output (should match the function's return values).

πŸ“– Gradio Documentation

Gradio includes 30+ built-in components.

πŸ’‘ Tip: For inputs and outputs, you can pass either:

  • The component name as a string (e.g., "textbox")
  • An instance of the component class (e.g., gr.Textbox())
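
A minimal sketch putting these pieces together (the `greet` function is a made-up example, not part of this repo):

```python
import gradio as gr

# Hypothetical example function; not part of this repo.
def greet(name: str) -> str:
    return f"Hello, {name}!"

demo = gr.Interface(
    fn=greet,              # the function wrapped by the UI
    inputs="textbox",      # component name as a string...
    outputs=gr.Textbox(),  # ...or an instance of the component class
)

if __name__ == "__main__":
    demo.launch()
```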

### Sharing Your Demo

```python
demo.launch(share=True)  # Share your demo with just one extra parameter.
```

## Gradio Advanced Features

### Gradio.Blocks

Gradio provides `gr.Blocks`, a flexible way to design web apps with custom layouts and complex interactions (see the sketch after this list):

- Arrange components freely on the page.
- Handle multiple data flows.
- Use outputs as inputs for other components.
- Dynamically update components based on user interaction.
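
A minimal `gr.Blocks` sketch (the layout and the uppercasing handler are illustrative assumptions):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Arrange components freely: two textboxes side by side in a row.
    with gr.Row():
        inp = gr.Textbox(label="Input")
        out = gr.Textbox(label="Output")
    btn = gr.Button("Transform")
    # Wire the click event: inp's value flows into the handler, the result lands in out.
    btn.click(fn=lambda s: s.upper(), inputs=inp, outputs=out)

if __name__ == "__main__":
    demo.launch()
```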

### Gradio.ChatInterface

- Always set `type="messages"` in `gr.ChatInterface` (as in the sketch below).
- The default (`type="tuples"`) is deprecated and will be removed in a future version.
- For more UI flexibility, use `gr.Chatbot`.
- `gr.ChatInterface` supports Markdown (not tested yet).
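
A minimal `gr.ChatInterface` sketch (the echo handler is a placeholder; a real one would query the RAG pipeline):

```python
import gradio as gr

# Placeholder chat handler; a real one would retrieve video context first.
def respond(message, history):
    return f"You said: {message}"

demo = gr.ChatInterface(fn=respond, type="messages")

if __name__ == "__main__":
    demo.launch()
```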

## Step 2 - Learn the BridgeTower Embedding Model (Multimodal Learning) (15 mins)

Developed in collaboration with Intel, the BridgeTower model maps image-caption pairs into 512-dimensional vectors.

### Measuring Similarity

- Cosine Similarity → Measures how close two embeddings are in vector space (efficient and commonly used).
- Euclidean Distance → `cv2.norm` with `cv2.NORM_L2` computes the L2 distance between two images (both measures are sketched below).
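
Both measures in a few lines of NumPy (the random vectors are stand-ins for real BridgeTower embeddings):

```python
import numpy as np

# Stand-ins for two 512-D BridgeTower embeddings.
a = np.random.rand(512)
b = np.random.rand(512)

# Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean (L2) distance: smaller means closer in embedding space.
l2_dist = np.linalg.norm(a - b)
```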

### Converting to 2D for Visualization

- UMAP reduces the 512-D embeddings to 2-D for display purposes (sketch below).
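
A short sketch using the `umap-learn` package (the random batch is a placeholder for real embeddings):

```python
import numpy as np
import umap

# Placeholder batch of 100 embeddings, 512-D each.
embeddings = np.random.rand(100, 512)

# Project to 2-D; the result can be scatter-plotted directly.
points_2d = umap.UMAP(n_components=2).fit_transform(embeddings)
print(points_2d.shape)  # (100, 2)
```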

## Step 3 - Preprocessing Videos for Multimodal RAG

### Case 1: WEBVTT → Extracting Text Segments from Video

- Converts video + text into structured metadata.  
- Splits content into multiple segments.  

### Case 2: Whisper (Small) → Video Only

- Extracts **audio** → `model.transcribe()` (see the sketch after this list).
- Applies the `getSubs()` helper function to retrieve **WEBVTT** subtitles.
- Uses **Case 1** processing.
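
A hedged sketch of the Whisper step (the audio path is made up; `getSubs()` is the project's helper and is not shown here):

```python
import whisper

# Load the small Whisper checkpoint and transcribe the extracted audio track.
model = whisper.load_model("small")
result = model.transcribe("extracted_audio.mp3")

# result["segments"] carries timestamped text; getSubs() (project helper)
# converts these segments into WEBVTT subtitles for the Case 1 pipeline.
print(result["text"][:100])
```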

### Case 3: LVLM → Video + Silent/Music Extraction

- Uses **LLaVA (an LVLM)** for **frame-based captioning**.
- Encodes each frame as a **Base64 image** (helper sketched below).
- Extracts context and captions from video frames.
- Uses **Case 1** processing.
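
A small helper for the Base64 step (the function name and frame path are assumptions for illustration):

```python
import base64

# Hypothetical helper: read a saved video frame and encode it as Base64
# so it can be passed to the LLaVA captioning call.
def encode_frame(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```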

## Step 4 - What is LLaVA?

LLaVA (Large Language-and-Vision Assistant) is a large multimodal model that connects a vision encoder to a language model. It doesn't just see images: it understands them, reads the text embedded in them, and reasons about their context.
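
For reference, a hedged captioning sketch via the Transformers `image-to-text` pipeline; the checkpoint id, prompt format, and frame path are assumptions rather than this repo's exact setup:

```python
from transformers import pipeline

# Assumed LLaVA checkpoint; swap in the model this project actually serves.
pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")

out = pipe(
    "frame_0001.jpg",  # a video frame extracted earlier (hypothetical path)
    prompt="USER: <image>\nDescribe this frame in one sentence. ASSISTANT:",
    generate_kwargs={"max_new_tokens": 60},
)
print(out[0]["generated_text"])
```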

## Step 5 - What is a Vector Store?

A vector store is a specialized database designed to:

- Store and manage high-dimensional vector data efficiently
- Perform similarity-based searches, where K=1 returns the single most similar result
- In LanceDB specifically, store multiple data types:
  - Text content (captions)
  - Image file paths
  - Metadata
  - Vector embeddings

```python
# Build (or overwrite) the LanceDB table from paired transcript segments and frames.
_ = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,      # caption/transcript segments from both videos
    image_paths=vid1_img_path + vid2_img_path,  # paths to the extracted frames
    embedding=BridgeTowerEmbeddings(),          # multimodal embedding model
    metadatas=vid1_metadata + vid2_metadata,    # per-segment metadata
    connection=db,                              # open LanceDB connection
    table_name=TBL_NAME,
    mode="overwrite",
)
```
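
To query the table afterwards, a hedged sketch that assumes `MultimodalLanceDB` follows the standard LangChain `VectorStore` interface (the constructor arguments and query string are assumptions):

```python
# Reconnect to the same table and fetch the single most similar entry (K=1).
vectorstore = MultimodalLanceDB(
    connection=db,
    embedding=BridgeTowerEmbeddings(),
    table_name=TBL_NAME,
)
results = vectorstore.similarity_search("an astronaut walking on the moon", k=1)
print(results[0].page_content, results[0].metadata)
```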

## Gotchas and Solutions

- **Image Processing**: When working with Base64-encoded images, convert them to `PIL.Image` format before processing with BridgeTower (conversion sketched below).
- **Model Selection**: Uses `BridgeTowerForContrastiveLearning` instead of Prediction Guard due to API access limitations.
- **Model Size**: The BridgeTower model requires a ~3.5 GB download.
- **Image Downloads**: Some Flickr images may be unavailable; implement robust error handling.
- **Token Decoding**: The BridgeTower contrastive learning model works with embeddings, not token predictions.
- **Whisper**: Install from `git+https://github.com/openai/whisper.git`.
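
For the first gotcha, a small conversion sketch (the helper name is made up):

```python
import base64
import io
from PIL import Image

# Hypothetical helper: decode a Base64 string into an RGB PIL image
# before handing it to the BridgeTower processor.
def b64_to_pil(b64_string: str) -> Image.Image:
    return Image.open(io.BytesIO(base64.b64decode(b64_string))).convert("RGB")
```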

Install ffmpeg using brew:

```bash
brew install ffmpeg
brew link ffmpeg
```

## Learning and Skills

**Technical Skills:**

- Basic machine learning and deep learning
- Vector embeddings and similarity search
- Multimodal data processing

**Framework & Library Expertise:**

- Hugging Face Transformers
- Gradio UI development
- LangChain integration (basic)
- PyTorch basics
- LanceDB vector storage

**AI/ML Concepts:**

- Multimodal RAG system architecture
- Vector embeddings and similarity search
- Large Language Models (LLaVA)
- Image-text pair processing
- Dimensionality reduction techniques

**Multimedia Processing:**

- Video frame extraction
- Audio transcription (Whisper)
- Image processing (PIL)
- Base64 encoding/decoding
- WebVTT handling

**System Design:**

- Client-server architecture
- API endpoint design
- Data pipeline construction
- Vector store implementation
- Multimodal system integration

## Hugging Face

Remote: `hf_origin`, branch: `hf_main`

```yaml
title: Hg Demo
emoji: 😻
colorFrom: gray
colorTo: red
sdk: gradio
sdk_version: 5.18.0
app_file: app.py
pinned: false
license: mit
short_description: 'A space to keep AI work for demo '
```