---
title: Unit 8 Final Project - End-to-End AI Solution Implementation
emoji: 🚀
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: false
short_description: Multimodal image captioning & vibe evaluation.
---

# Assignment 8 – Multimodal Image Captioning & Vibe Evaluation

This Space implements a multimodal AI web app for my AI Solutions class.
The app compares two image captioning models on the same image, analyzes the emotional "vibe" of each caption, and evaluates model performance using NLP metrics.

The goal is to explore how Vision-Language Models (VLMs) and text-based models (LLM-style components) can work together in a single pipeline, and to provide a clear interface for testing and analysis.


## 🧠 What This App Does

Given an image and a user-provided ground truth caption, the app:

1. Generates captions with two image captioning models (see the code sketches below):
   - Model 1: BLIP image captioning
   - Model 2: ViT-GPT2 image captioning
2. Detects the emotional "vibe" of each caption using a zero-shot text classifier with labels such as:
   - Peaceful / Calm
   - Happy / Joy
   - Sad / Sorrow
   - Angry / Upset
   - Fear / Scared
   - Action / Violence
3. Evaluates the captions against the ground truth using NLP techniques:
   - Semantic similarity via sentence-transformers (cosine similarity)
   - ROUGE-L via the evaluate library (word-overlap score based on the longest common subsequence)
4. Displays all results in a Gradio interface (a wiring sketch appears at the end of the next section):
   - Captions for each model
   - Vibe labels + confidence scores
   - A summary block with similarity and ROUGE-L scores

This makes it easy to see not just what the models say, but also how close they are to a human caption and how the wording affects the emotional tone.
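To make step 1 concrete, here is a minimal captioning sketch. This README does not pin exact checkpoints, so the commonly used Salesforce/blip-image-captioning-base and nlpconnect/vit-gpt2-image-captioning models are assumed here; app.py may use different variants.

```python
from transformers import pipeline
from PIL import Image

# Checkpoint names are assumptions (widely used Hub models);
# the Space's app.py may pin different BLIP / ViT-GPT2 variants.
blip_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vit_gpt2_captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption_image(image: Image.Image) -> dict:
    """Generate one caption per model for the same image."""
    return {
        "BLIP": blip_captioner(image)[0]["generated_text"],
        "ViT-GPT2": vit_gpt2_captioner(image)[0]["generated_text"],
    }
```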
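Step 2 (vibe detection) can be sketched with the transformers zero-shot pipeline. The facebook/bart-large-mnli checkpoint is an assumption; the Space may use a different zero-shot classifier, and the label list below simply mirrors the vibes named above.

```python
from transformers import pipeline

# facebook/bart-large-mnli is an assumption -- a common default for
# zero-shot classification, not necessarily the model used by this Space.
vibe_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

VIBE_LABELS = [
    "peaceful / calm",
    "happy / joy",
    "sad / sorrow",
    "angry / upset",
    "fear / scared",
    "action / violence",
]

def detect_vibe(caption: str) -> tuple[str, float]:
    """Return the top vibe label and its confidence score for a caption."""
    result = vibe_classifier(caption, candidate_labels=VIBE_LABELS)
    return result["labels"][0], result["scores"][0]
```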
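And step 3 (evaluation) roughly looks like this, using sentence-transformers for cosine similarity and the evaluate library for ROUGE-L. The function name is illustrative, not the Space's actual code.

```python
import evaluate
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
rouge = evaluate.load("rouge")  # requires the rouge_score package

def score_caption(candidate: str, ground_truth: str) -> dict:
    """Compare a generated caption against the user's ground truth caption."""
    embeddings = embedder.encode([candidate, ground_truth], convert_to_tensor=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
    scores = rouge.compute(predictions=[candidate], references=[ground_truth])
    return {"cosine_similarity": cosine, "rougeL": scores["rougeL"]}
```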


๐Ÿ” Models & Libraries Used

  • Vision-Language Models (VLMs) for captioning

    • BLIP image captioning model
    • ViT-GPT2 image captioning model
  • Text / NLP Components

    • Zero-shot text classifier for vibe detection
    • sentence-transformers/all-MiniLM-L6-v2 for semantic similarity
    • evaluate library for ROUGE-L
  • Framework / UI

    • Gradio for the web interface
    • Deployed as a Hugging Face Space (this repo)
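The pieces above can be wired together in Gradio roughly as follows. This is a sketch, not the Space's actual app.py: it reuses the illustrative helpers (caption_image, detect_vibe, score_caption) from the sketches in the previous section, and the input/output layout is an assumption.

```python
import gradio as gr

def analyze(image, ground_truth):
    """Run both captioners, classify vibes, and score each caption vs. the ground truth."""
    lines = []
    for model_name, caption in caption_image(image).items():
        vibe, confidence = detect_vibe(caption)
        metrics = score_caption(caption, ground_truth)
        lines.append(
            f"{model_name}: {caption}\n"
            f"  vibe: {vibe} ({confidence:.2f})\n"
            f"  cosine similarity: {metrics['cosine_similarity']:.3f} | "
            f"ROUGE-L: {metrics['rougeL']:.3f}"
        )
    return "\n\n".join(lines)

demo = gr.Interface(
    fn=analyze,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Ground truth caption")],
    outputs=gr.Textbox(label="Results"),
    title="Multimodal Image Captioning & Vibe Evaluation",
)

if __name__ == "__main__":
    demo.launch()
```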

๐Ÿ–ผ๏ธ How to Use the App

  1. Upload an image

    • Use one of the provided example images or upload your own.
  2. Enter a ground truth caption

    • Type a short sentence that, in your own words, best describes the image.
  3. Click โ€œSubmitโ€

    • The app will:
      • Run both captioning models
      • Classify the vibe of each caption
      • Compute similarity and ROUGE-L vs. your ground truth
  4. Review the outputs

    • Compare how each model describes the scene
    • Check if the vibe matches what you expect
    • Look at the metrics to see which caption is closer to your description