Re-Architecting a Multimodal RAG System: Pipeline Journey

I ported the system locally and isolated each concept into a runnable Python step. It is now simplified, refactored, and bug-fixed. I also migrated from Prediction Guard to Hugging Face.

Interactive Video Chat Demo and Multimodal RAG System Architecture

A multimodal AI system should be able to understand both text and video content.


Step 1 - Learn Gradio (UI) (30 mins)

Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.

Key Concepts:

  • fn: The function wrapped by the UI.
  • inputs: The Gradio components used for input (should match function arguments).
  • outputs: The Gradio components used for output (should match return values).

πŸ“– Gradio Documentation

Gradio includes 30+ built-in components.

πŸ’‘ Tip: For inputs and outputs, you can pass either:

  • The component name as a string (e.g., "textbox")
  • An instance of the component class (e.g., gr.Textbox()); both forms appear in the sketch below.
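A minimal sketch tying fn, inputs, and outputs together (the greeting function is illustrative):

```python
import gradio as gr

def greet(name: str) -> str:
    # fn: the plain Python function the UI wraps
    return f"Hello, {name}!"

demo = gr.Interface(
    fn=greet,
    inputs="textbox",       # component given as a string...
    outputs=gr.Textbox(),   # ...or as a class instance
)

if __name__ == "__main__":
    demo.launch()
```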

Sharing Your Demo

```python
demo.launch(share=True)  # share your demo with just one extra parameter
```

Gradio Advanced Features

Gradio.Blocks

Gradio provides gr.Blocks, a flexible way to design web apps with custom layouts and complex interactions:

  • Arrange components freely on the page.
  • Handle multiple data flows.
  • Use outputs as inputs for other components.
  • Dynamically update components based on user interaction (see the sketch below).
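A minimal gr.Blocks sketch of these ideas (the layout and functions are illustrative):

```python
import gradio as gr

with gr.Blocks() as demo:
    with gr.Row():                       # arrange components freely
        inp = gr.Textbox(label="Input")
        echo = gr.Textbox(label="Echo")
    count = gr.Number(label="Character count")
    btn = gr.Button("Run")

    # one data flow: the button click fills the echo box
    btn.click(fn=lambda s: s, inputs=inp, outputs=echo)
    # a second flow: the echo output feeds another component
    echo.change(fn=lambda s: len(s), inputs=echo, outputs=count)

demo.launch()
```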

Gradio.ChatInterface

  • Always set type="messages" in gr.ChatInterface.
  • The default (type="tuples") is deprecated and will be removed in future versions.
  • For more UI flexibility, use gr.Chatbot directly.
  • gr.ChatInterface supports Markdown (not tested yet). A minimal sketch follows below.
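A minimal chat sketch with type="messages" (the echo-style responder is illustrative):

```python
import gradio as gr

def respond(message, history):
    # history arrives as a list of {"role": ..., "content": ...} dicts
    return f"You said: **{message}**"  # Markdown in the reply

demo = gr.ChatInterface(fn=respond, type="messages")
demo.launch()
```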

Step 2 - Learn the BridgeTower Embedding Model (Multimodal Learning) (15 mins)

BridgeTower, developed in collaboration with Intel, maps image-caption pairs into a shared 512-dimensional embedding space.
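A hedged sketch of extracting contrastive embeddings with the Hugging Face BridgeTower checkpoint (the checkpoint name and image path are assumptions, not necessarily this project's exact setup):

```python
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

CKPT = "BridgeTower/bridgetower-large-itm-mlm-itc"  # assumed checkpoint, ~3.5 GB
processor = BridgeTowerProcessor.from_pretrained(CKPT)
model = BridgeTowerForContrastiveLearning.from_pretrained(CKPT)

image = Image.open("frame_000.jpg")  # illustrative frame path
inputs = processor(images=image, text="a cat on a sofa", return_tensors="pt")
outputs = model(**inputs)

# contrastive projections for the text, the image, and the fused pair
print(outputs.text_embeds.shape, outputs.image_embeds.shape, outputs.cross_embeds.shape)
```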

Measuring Similarity

  • Cosine Similarity → Measures how close two embeddings are in vector space (efficient & commonly used).
  • Euclidean Distance → Computed with cv2.NORM_L2 (or NumPy); a smaller distance means more similar. Both are shown in the sketch below.
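A minimal NumPy sketch of both measures:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction, 0 = orthogonal (unrelated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # the same value cv2.norm(a, b, cv2.NORM_L2) returns; smaller = closer
    return float(np.linalg.norm(a - b))
```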

Converting to 2D for Visualization

  • UMAP reduces 512-D embeddings to 2-D for display purposes (see the sketch below).
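A minimal sketch with the umap-learn package (random data stands in for real embeddings):

```python
import numpy as np
import umap  # pip install umap-learn

embeddings = np.random.rand(100, 512)          # placeholder for BridgeTower vectors
reducer = umap.UMAP(n_components=2, random_state=42)
points_2d = reducer.fit_transform(embeddings)  # shape: (100, 2), ready to plot
```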

Step 3 - Preprocess Videos for Multimodal RAG

Case 1: WEBVTT β†’ Extracting Text Segments from Video

- Converts video + text into structured metadata.  
- Splits content into multiple segments (see the parsing sketch below).
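A minimal Case 1 sketch using the webvtt-py package (file names and the metadata shape are illustrative):

```python
import webvtt  # pip install webvtt-py

segments = []
for caption in webvtt.read("video1.vtt"):
    # each caption carries timestamps plus its text; combined with the
    # video path this becomes one structured metadata segment
    segments.append({
        "video": "video1.mp4",
        "start": caption.start,
        "end": caption.end,
        "text": caption.text,
    })
```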

Case 2: Whisper (Small) β†’ Video Only

- Extracts **audio** β†’ `model.transcribe()`.  
- Applies `getSubs()` helper function to retrieve **WEBVTT** subtitles.  
- Uses **Case 1** processing (see the transcription sketch below).
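A minimal Case 2 sketch (the file name is illustrative; getSubs() is the project's own helper and is omitted here):

```python
import whisper  # pip install git+https://github.com/openai/whisper.git

model = whisper.load_model("small")
result = model.transcribe("video2.mp4")  # ffmpeg pulls the audio track

# timestamped segments, ready to be written out as WebVTT
# and handed to the Case 1 pipeline
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])
```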

Case 3: LVLM → Video Only (Silent or Music-Only Audio)

- Uses **LLaVA** (an LVLM) for **frame-based captioning**.  
- Encodes each frame as a **Base64 image**.  
- Extracts context and captions from video frames.  
- Uses **Case 1** processing (see the frame-extraction sketch below).
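A minimal Case 3 sketch of the frame-extraction and Base64 step with OpenCV (the sampling rate and paths are illustrative):

```python
import base64
import cv2  # pip install opencv-python

cap = cv2.VideoCapture("video3.mp4")
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 1  # guard against a zero FPS reading
frames_b64, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % fps == 0:  # roughly one frame per second
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames_b64.append(base64.b64encode(buf).decode("utf-8"))
    idx += 1
cap.release()
```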

Step 4 - What is LLaVA?

LLaVA (Large Language-and-Vision Assistant) is a large multimodal model that connects a vision encoder to a language model. It doesn't just see images: it understands them, reads the text embedded in them, and reasons about their context.
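A hedged captioning sketch with the llava-hf checkpoint on Hugging Face (the checkpoint, prompt format, and frame path are assumptions, not necessarily this project's exact setup):

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

CKPT = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(CKPT)
model = LlavaForConditionalGeneration.from_pretrained(CKPT)

# llava-1.5 chat-style prompt with an <image> placeholder
prompt = "USER: <image>\nDescribe this video frame. ASSISTANT:"
image = Image.open("frame_000.jpg")  # illustrative frame path

inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```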

Step 5 - What is a Vector Store?

A vector store is a specialized database designed to:

  • Store and manage high-dimensional vector data efficiently

  • Perform similarity-based (k-nearest-neighbor) searches, where k=1 returns the single most similar result

  • In LanceDB specifically, store multiple data types:
      - Text content (captions)
      - Image file paths
      - Metadata
      - Vector embeddings

```python
# Build (or overwrite) a LanceDB table of caption/frame pairs
# embedded with BridgeTower.
_ = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,
    image_paths=vid1_img_path + vid2_img_path,
    embedding=BridgeTowerEmbeddings(),
    metadatas=vid1_metadata + vid2_metadata,
    connection=db,
    table_name=TBL_NAME,
    mode="overwrite",
)
```
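A hedged retrieval sketch, assuming MultimodalLanceDB follows LangChain's VectorStore interface; keep the returned store instead of discarding it as `_`:

```python
# Same call as above, but the return value is kept for querying.
vectorstore = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,
    image_paths=vid1_img_path + vid2_img_path,
    embedding=BridgeTowerEmbeddings(),
    metadatas=vid1_metadata + vid2_metadata,
    connection=db,
    table_name=TBL_NAME,
    mode="overwrite",
)

# k=1 returns the single most similar caption/frame pair
docs = vectorstore.similarity_search("a person riding a bicycle", k=1)
print(docs[0].page_content)  # matched caption
print(docs[0].metadata)      # associated frame path and other metadata
```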

Gotchas and Solutions

  • Image Processing: When working with Base64-encoded images, convert them to PIL.Image format before processing with BridgeTower (see the sketch below).
  • Model Selection: Used BridgeTowerForContrastiveLearning instead of PredictionGuard due to API access limitations.
  • Model Size: The BridgeTower model requires a ~3.5 GB download.
  • Image Downloads: Some Flickr images may be unavailable; implement robust error handling.
  • Token Decoding: The BridgeTower contrastive-learning model works with embeddings, not token predictions.
  • Whisper Install: Install from git+https://github.com/openai/whisper.git.
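For the first gotcha, a minimal conversion sketch:

```python
import base64
import io
from PIL import Image

def b64_to_pil(b64_string: str) -> Image.Image:
    # decode the Base64 frame back into a PIL image
    # so it can go through the BridgeTower processor
    return Image.open(io.BytesIO(base64.b64decode(b64_string))).convert("RGB")
```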

Install ffmpeg using brew:

```bash
brew install ffmpeg
brew link ffmpeg
```

Learning and Skills

Technical Skills:

  • Basic machine learning and deep learning
  • Vector embeddings and similarity search
  • Multimodal data processing

Framework & Library Expertise:

  • Hugging Face Transformers
  • Gradio UI development
  • LangChain integration (basic)
  • PyTorch basics
  • LanceDB vector storage

AI/ML Concepts:

  • Multimodal RAG system architecture
  • Vector embeddings and similarity search
  • Large Language Models (LLaVA)
  • Image-text pair processing
  • Dimensionality reduction techniques

Multimedia Processing:

  • Video frame extraction
  • Audio transcription (Whisper)
  • Image processing (PIL)
  • Base64 encoding/decoding
  • WebVTT handling

System Design:

  • Client-server architecture
  • API endpoint design
  • Data pipeline construction
  • Vector store implementation
  • Multimodal system integration

Hugging Face

The demo runs as a Hugging Face Space (remote: hf_origin, branch: hf_main) with this front-matter configuration:

  • title: Hg Demo
  • emoji: 😻
  • colorFrom: gray
  • colorTo: red
  • sdk: gradio
  • sdk_version: 5.18.0
  • app_file: app.py
  • pinned: false
  • license: mit
  • short_description: 'A space to keep AI work for demo'
