---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🚀
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---

# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation

This Space implements a multimodal AI web app for my AI Solutions class.
The app compares two image captioning models on the same image, analyzes the emotional "vibe" of each caption, and evaluates model performance using NLP metrics.

The goal is to explore how Vision-Language Models (VLMs) and text-based models (LLM-style components) can work together in a single pipeline, and to provide a clear interface for testing and analysis.


## 🧠 What This App Does

Given an image and a user-provided ground truth caption, the app:

1. Generates captions with two image captioning models (see the code sketches below):
   - Model 1: BLIP image captioning
   - Model 2: ViT-GPT2 image captioning
2. Detects the emotional "vibe" of each caption using a zero-shot text classifier with labels such as:
   - Peaceful / Calm
   - Happy / Joy
   - Sad / Sorrow
   - Angry / Upset
   - Fear / Scared
   - Action / Violence
3. Evaluates the captions against the ground truth using NLP techniques:
   - Semantic similarity via sentence-transformers (cosine similarity)
   - ROUGE-L via the evaluate library (word-overlap score based on the longest common subsequence)
4. Displays all results in a Gradio interface (a wiring sketch appears at the end of the next section):
   - Captions for each model
   - Vibe labels + confidence scores
   - A summary block with similarity and ROUGE-L scores

This makes it easy to see not just what the models say, but also how close they are to a human caption and how the wording affects the emotional tone.
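To make step 1 concrete, here is a minimal captioning sketch. This README does not pin exact checkpoints, so the commonly used Salesforce/blip-image-captioning-base and nlpconnect/vit-gpt2-image-captioning models are assumed here; app.py may use different variants.

```python
from transformers import pipeline
from PIL import Image

# Checkpoint names are assumptions (widely used Hub models);
# the Space's app.py may pin different BLIP / ViT-GPT2 variants.
blip_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vit_gpt2_captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption_image(image: Image.Image) -> dict:
    """Generate one caption per model for the same image."""
    return {
        "BLIP": blip_captioner(image)[0]["generated_text"],
        "ViT-GPT2": vit_gpt2_captioner(image)[0]["generated_text"],
    }
```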
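Step 2 (vibe detection) can be sketched with the transformers zero-shot pipeline. The facebook/bart-large-mnli checkpoint is an assumption; the Space may use a different zero-shot classifier, and the label list below simply mirrors the vibes named above.

```python
from transformers import pipeline

# facebook/bart-large-mnli is an assumption -- a common default for
# zero-shot classification, not necessarily the model used by this Space.
vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

VIBE_LABELS = [
    "peaceful / calm",
    "happy / joy",
    "sad / sorrow",
    "angry / upset",
    "fear / scared",
    "action / violence",
]

def detect_vibe(caption: str) -> tuple[str, float]:
    """Return the top vibe label and its confidence score for a caption."""
    result = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
    return result["labels"][0], result["scores"][0]
```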
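And step 3 (evaluation) roughly looks like this, using sentence-transformers for cosine similarity and the evaluate library for ROUGE-L. The function name is illustrative, not the Space's actual code.

```python
import evaluate
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")  # requires the rouge_score package

def score_caption(candidate: str, ground_truth: str) -> dict:
    """Compare a generated caption against the user's ground truth caption."""
    embeddings = embedder.encode([candidate, ground_truth], convert_to_tensor=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
    scores = rouge.compute(predictions=[candidate], references=[ground_truth])
    return {"cosine_similarity": cosine, "rougeL": scores["rougeL"]}
```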


๐Ÿ” Models & Libraries Used

  • Vision-Language Models (VLMs) for captioning

    • BLIP image captioning model
    • ViT-GPT2 image captioning model
  • Text / NLP Components

    • Zero-shot text classifier for vibe detection
    • sentence-transformers/all-MiniLM-L6-v2 for semantic similarity
    • evaluate library for ROUGE-L
  • Framework / UI

    • Gradio for the web interface
    • Deployed as a Hugging Face Space (this repo)
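The pieces above can be wired together in Gradio roughly as follows. This is a sketch, not the Space's actual app.py: it reuses the illustrative helpers (caption_image, detect_vibe, score_caption) from the sketches in the previous section, and the input/output layout is an assumption.

```python
import gradio as gr

def analyze(image, ground_truth):
    """Run both captioners, classify vibes, and score each caption vs. the ground truth."""
    lines = []
    for model_name, caption in caption_image(image).items():
        vibe, confidence = detect_vibe(caption)
        metrics = score_caption(caption, ground_truth)
        lines.append(
            f"{model_name}: {caption}\n"
            f"  vibe: {vibe} ({confidence:.2f})\n"
            f"  cosine similarity: {metrics['cosine_similarity']:.3f} | "
            f"ROUGE-L: {metrics['rougeL']:.3f}"
        )
    return "\n\n".join(lines)

demo = gr.Interface(
    fn=analyze,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Ground truth caption")],
    outputs=gr.Textbox(label="Results"),
    title="Multimodal Image Captioning & Vibe Evaluation",
)

if __name__ == "__main__":
    demo.launch()
```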

๐Ÿ–ผ๏ธ How to Use the App

  1. Upload an image

    • Use one of the provided example images or upload your own.
  2. Enter a ground truth caption

    • Type a short sentence that, in your own words, best describes the image.
  3. Click โ€œSubmitโ€

    • The app will:
      • Run both captioning models
      • Classify the vibe of each caption
      • Compute similarity and ROUGE-L vs. your ground truth
  4. Review the outputs

    • Compare how each model describes the scene
    • Check if the vibe matches what you expect
    • Look at the metrics to see which caption is closer to your description