---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🖼️
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---
# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation
This Space implements a multimodal AI web app for my AI Solutions class.
The app compares two image captioning models on the same image, analyzes the emotional "vibe" of each caption, and evaluates model performance using NLP metrics.
The goal is to explore how Vision-Language Models (VLMs) and text-based models (LLM-style components) can work together in a single pipeline, and to provide a clear interface for testing and analysis.
## 🧠 What This App Does
Given an image and a user-provided ground truth caption, the app:
1. Generates captions with two image captioning models:
   - Model 1: BLIP image captioning
   - Model 2: ViT-GPT2 image captioning
2. Detects the emotional "vibe" of each caption using a zero-shot text classifier with labels such as:
   - Peaceful / Calm
   - Happy / Joy
   - Sad / Sorrow
   - Angry / Upset
   - Fear / Scared
   - Action / Violence
3. Evaluates the captions against the ground truth using NLP techniques:
   - Semantic similarity via `sentence-transformers` (cosine similarity)
   - ROUGE-L via the `evaluate` library (word-overlap accuracy)
4. Displays all results in a Gradio interface:
   - Captions for each model
   - Vibe labels + confidence scores
   - A summary block with similarity and ROUGE-L scores
This makes it easy to see not just what the models say, but also how close they are to a human caption and how the wording affects the emotional tone.
## Models & Libraries Used
### Vision-Language Models (VLMs) for captioning
- BLIP image captioning model
- ViT-GPT2 image captioning model
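For reference, a minimal loading sketch using the `transformers` `image-to-text` pipeline. The exact checkpoint IDs are assumptions (`Salesforce/blip-image-captioning-base` and `nlpconnect/vit-gpt2-image-captioning` are the common public checkpoints); the actual `app.py` may pin different revisions or call the model classes directly.

```python
from transformers import pipeline

# Checkpoint IDs are assumed; the Space may pin other revisions.
blip = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vit_gpt2 = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption_both(image):
    """Caption the same PIL image with both models."""
    blip_caption = blip(image)[0]["generated_text"]
    vit_caption = vit_gpt2(image)[0]["generated_text"]
    return blip_caption, vit_caption
```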
### Text / NLP Components
- Zero-shot text classifier for vibe detection
- `sentence-transformers/all-MiniLM-L6-v2` for semantic similarity
- `evaluate` library for ROUGE-L
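A sketch of the text side of the pipeline. The zero-shot checkpoint is not named in this README, so `facebook/bart-large-mnli` (a common default) is assumed; the label strings and function names are illustrative only.

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
import evaluate

# Zero-shot checkpoint is an assumption; the README does not name it.
vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
VIBE_LABELS = ["peaceful/calm", "happy/joy", "sad/sorrow",
               "angry/upset", "fear/scared", "action/violence"]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")  # requires the rouge_score package

def detect_vibe(caption):
    """Return the top vibe label and its confidence score."""
    result = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
    return result["labels"][0], result["scores"][0]

def score_caption(candidate, reference):
    """Embedding cosine similarity plus ROUGE-L word overlap vs. the ground truth."""
    emb = embedder.encode([candidate, reference], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    rouge_l = rouge.compute(predictions=[candidate], references=[reference])["rougeL"]
    return similarity, rouge_l
```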
### Framework / UI
- Gradio for the web interface
- Deployed as a Hugging Face Space (this repo)
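And a minimal `gr.Interface` wiring that glues the helpers sketched above into one run; the real app's layout, labels, and output components may differ.

```python
import gradio as gr

def analyze(image, ground_truth):
    # One end-to-end run: caption, classify vibe, score against ground truth.
    lines = []
    for name, caption in zip(["BLIP", "ViT-GPT2"], caption_both(image)):
        vibe, confidence = detect_vibe(caption)
        similarity, rouge_l = score_caption(caption, ground_truth)
        lines.append(
            f"{name}: {caption}\n"
            f"  vibe: {vibe} ({confidence:.2f}) | "
            f"cosine similarity: {similarity:.3f} | ROUGE-L: {rouge_l:.3f}"
        )
    return "\n\n".join(lines)

demo = gr.Interface(
    fn=analyze,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Ground truth caption")],
    outputs=gr.Textbox(label="Results"),
)

if __name__ == "__main__":
    demo.launch()
```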
## 🖼️ How to Use the App
1. Upload an image
   - Use one of the provided example images or upload your own.
2. Enter a ground truth caption
   - Type a short sentence that, in your own words, best describes the image.
3. Click "Submit". The app will:
   - Run both captioning models
   - Classify the vibe of each caption
   - Compute similarity and ROUGE-L vs. your ground truth
4. Review the outputs
   - Compare how each model describes the scene
   - Check if the vibe matches what you expect
   - Look at the metrics to see which caption is closer to your description