
Hugging Face VLM Integration

This document explains how to use Hugging Face Vision-Language Models (VLMs) in your PromptAid Vision application.

🚀 Quick Start

1. Environment Setup

Add your Hugging Face API key to your .env file:

# .env
HF_API_KEY=your_huggingface_api_key_here

2. Available Models

The following Hugging Face models are automatically registered when you start the application (their Hugging Face model IDs are sketched after this list):

  • LLaVA 1.5 7B (LLAVA_1_5_7B) - Advanced vision-language model
  • BLIP-2 (BLIP2_OPT_2_7B) - Image captioning specialist
  • InstructBLIP (INSTRUCTBLIP_VICUNA_7B) - Instruction-following vision model
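
These registry names map to Hugging Face model IDs. The LLaVA ID is confirmed by the sample response later in this document; the other two are assumptions based on the model names, so check huggingface_service.py for the exact values:

MODEL_IDS = {
    "LLAVA_1_5_7B": "llava-hf/llava-1.5-7b-hf",                     # appears in the sample response below
    "BLIP2_OPT_2_7B": "Salesforce/blip2-opt-2.7b",                  # assumed Hugging Face repo id
    "INSTRUCTBLIP_VICUNA_7B": "Salesforce/instructblip-vicuna-7b",  # assumed Hugging Face repo id
}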

3. How It Works

The Hugging Face services follow the same pattern as the GPT-4 and Gemini services; a condensed sketch of the flow follows the list below:

  1. Image Processing: Converts uploaded images to base64 format
  2. Prompt Enhancement: Automatically adds metadata extraction instructions
  3. API Call: Sends request to Hugging Face Inference API
  4. Response Parsing: Extracts caption and metadata from JSON response
  5. Fallback Handling: Gracefully handles parsing errors
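
A condensed sketch of that flow, assuming httpx for the async HTTP call; the payload shape and helper name are illustrative rather than the exact code in huggingface_service.py:

import base64
import time

import httpx

async def generate_caption_sketch(image_bytes: bytes, prompt: str, api_key: str, model_id: str) -> dict:
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")                    # 1. image processing
    enhanced_prompt = (                                                          # 2. prompt enhancement
        f"{prompt}\n\nAlso return title, source, type, countries and epsg as JSON."
    )
    start = time.time()
    async with httpx.AsyncClient(timeout=120) as client:                         # 3. API call
        resp = await client.post(
            f"https://api-inference.huggingface.co/models/{model_id}",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"inputs": {"image": image_b64, "prompt": enhanced_prompt}},
        )
        resp.raise_for_status()
    try:
        parsed = resp.json()                                                     # 4. response parsing
        caption, metadata = parsed["caption"], parsed.get("metadata", {})
    except (ValueError, KeyError, TypeError):                                    # 5. fallback to raw text
        caption, metadata = resp.text, {}
    return {"caption": caption, "metadata": metadata, "processing_time": round(time.time() - start, 2)}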

🔧 Technical Details

Service Architecture

from typing import Any, Dict

class HuggingFaceService(VLMService):
    """Base class for all Hugging Face VLM services."""

    async def generate_caption(self, image_bytes: bytes, prompt: str) -> Dict[str, Any]:
        # 1. Build the enhanced prompt with metadata-extraction instructions
        # 2. Prepare the API payload
        # 3. Make the async HTTP request
        # 4. Parse the response and extract metadata
        # 5. Return the structured result
        ...

Model-Specific Services

class LLaVAService(HuggingFaceService):
    """LLaVA 1.5 7B - Advanced vision-language understanding"""
    
class BLIP2Service(HuggingFaceService):
    """BLIP-2 - Specialized image captioning"""
    
class InstructBLIPService(HuggingFaceService):
    """InstructBLIP - Instruction-following vision model"""

API Endpoints

The services call Hugging Face's Inference API; vision and text models share the same endpoint pattern:

  • https://api-inference.huggingface.co/models/{model_id}

Response Format

All services return the same structured format:

{
  "caption": "Detailed analysis of the crisis map...",
  "metadata": {
    "title": "Flood Emergency in Coastal Region",
    "source": "PDC",
    "type": "FLOOD",
    "countries": ["US", "MX"],
    "epsg": "4326"
  },
  "confidence": null,
  "processing_time": 2.45,
  "raw_response": {
    "model": "llava-hf/llava-1.5-7b-hf",
    "response": {...},
    "parsed_successfully": true
  }
}
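
Because every service returns this shape, downstream code can treat results uniformly; a small illustrative helper using the field names shown above:

def summarize_result(result: dict) -> str:
    """Build a one-line summary from a structured VLM result."""
    meta = result.get("metadata", {})
    parsed_ok = result.get("raw_response", {}).get("parsed_successfully", False)
    return (
        f"{meta.get('title', 'untitled')} | type={meta.get('type', '?')} | "
        f"countries={meta.get('countries', [])} | parsed={parsed_ok} | "
        f"{result.get('processing_time', 0)} s"
    )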

🧪 Testing

Run Integration Test

cd py_backend
python test_hf_integration.py

Test Individual Services

from app.services.huggingface_service import LLaVAService
from app.services.vlm_service import vlm_manager

# Register service
llava_service = LLaVAService("your_api_key")
vlm_manager.register_service(llava_service)

# Test caption generation
result = await vlm_manager.generate_caption(
    image_bytes=your_image_bytes,
    prompt="Describe this crisis map",
    model_name="LLAVA_1_5_7B"
)
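
Because generate_caption is a coroutine, the await above must run inside an event loop. A minimal wrapper, assuming the service has been registered as shown (the image path is illustrative):

import asyncio
from pathlib import Path

from app.services.vlm_service import vlm_manager

async def main() -> None:
    image_bytes = Path("sample_map.png").read_bytes()  # any local test image
    result = await vlm_manager.generate_caption(
        image_bytes=image_bytes,
        prompt="Describe this crisis map",
        model_name="LLAVA_1_5_7B",
    )
    print(result["caption"])

asyncio.run(main())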

📊 Performance Considerations

Timeouts

  • Default: 120 seconds for large models
  • Configurable: Adjust in huggingface_service.py

Model Selection

  • LLaVA: Best for complex analysis, slower
  • BLIP-2: Fastest, good for basic captioning
  • InstructBLIP: Balanced performance and capability

Error Handling

  • API Failures: Automatic fallback to error messages
  • Parsing Errors: Graceful degradation to raw text
  • Network Issues: Configurable retry logic (a back-off sketch follows this list)
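
The retry strategy itself is not prescribed here; one common pattern is exponential back-off around the POST, sketched below with httpx (purely illustrative, not the project's built-in logic):

import asyncio

import httpx

async def post_with_retries(client: httpx.AsyncClient, url: str, payload: dict, retries: int = 3) -> httpx.Response:
    """Retry transient failures with exponential back-off."""
    for attempt in range(retries):
        try:
            resp = await client.post(url, json=payload)
            resp.raise_for_status()
            return resp
        except (httpx.HTTPStatusError, httpx.TransportError):
            if attempt == retries - 1:
                raise                          # give up after the final attempt
            await asyncio.sleep(2 ** attempt)  # wait 1 s, 2 s, ... between attempts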

πŸ” Debugging

Enable Debug Logging

The services include comprehensive debug logging:

# Check console output for:
# - API request details
# - Response parsing
# - Metadata extraction
# - Error handling
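
If the debug output is not visible, raising the log level usually surfaces it; a minimal setup assuming the standard logging module (the logger name mirrors the service's module path and is an assumption):

import logging

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(name)s %(levelname)s %(message)s")
# Narrow the output to the Hugging Face service module only (logger name assumed):
logging.getLogger("app.services.huggingface_service").setLevel(logging.DEBUG)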

Common Issues

  1. API Key Invalid: Check HF_API_KEY in .env
  2. Model Loading: Some models may take time to load on first request
  3. Rate Limiting: Hugging Face enforces rate limits on the free tier
  4. Model Compatibility: Ensure model supports vision tasks

🚀 Usage in Frontend

The Hugging Face models are automatically available in your frontend:

// In your upload form, users can select:
const modelOptions = [
  { value: 'GPT4O', label: 'GPT-4 Vision' },
  { value: 'GEMINI15', label: 'Google Gemini' },
  { value: 'LLAVA_1_5_7B', label: 'LLaVA 1.5 7B' },      // ← New!
  { value: 'BLIP2_OPT_2_7B', label: 'BLIP-2' },           // ← New!
  { value: 'INSTRUCTBLIP_VICUNA_7B', label: 'InstructBLIP' } // ← New!
];

πŸ“ Configuration

Environment Variables

# Required
HF_API_KEY=your_api_key

# Optional (for advanced configuration)
HF_TIMEOUT=120
HF_MAX_TOKENS=800
HF_TEMPERATURE=0.7
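
The backend presumably exposes these through its settings object (settings.HF_API_KEY is used in the registration snippet below); a plain os.environ sketch with the defaults listed above:

import os

HF_API_KEY = os.environ["HF_API_KEY"]                       # required; raises KeyError if missing
HF_TIMEOUT = float(os.getenv("HF_TIMEOUT", "120"))          # request timeout in seconds
HF_MAX_TOKENS = int(os.getenv("HF_MAX_TOKENS", "800"))      # maximum tokens to generate
HF_TEMPERATURE = float(os.getenv("HF_TEMPERATURE", "0.7"))  # sampling temperature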

Service Registration

Services are automatically registered in caption.py:

if settings.HF_API_KEY:
    llava_service = LLaVAService(settings.HF_API_KEY)
    vlm_manager.register_service(llava_service)
    # ... other services
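
The remaining registrations follow the same two-line pattern; an illustrative way to register all three services at once, using the class names shown earlier:

if settings.HF_API_KEY:
    for service_cls in (LLaVAService, BLIP2Service, InstructBLIPService):
        vlm_manager.register_service(service_cls(settings.HF_API_KEY))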

🎯 Next Steps

  1. Test Integration: Run the test script
  2. Verify Models: Check frontend model selection
  3. Monitor Performance: Watch API response times
  4. Customize Prompts: Adjust metadata extraction instructions
  5. Add New Models: Extend with additional Hugging Face models

📚 Resources