Hugging Face VLM Integration
This document explains how to use Hugging Face Vision-Language Models (VLMs) in your PromptAid Vision application.
Quick Start
1. Environment Setup
Add your Hugging Face API key to your .env file:
# .env
HF_API_KEY=your_huggingface_api_key_here
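For context, here is a minimal sketch of how the key might be loaded at startup, assuming python-dotenv is installed; the variable names are illustrative, not the app's actual settings code:

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads HF_API_KEY from .env into the environment
hf_api_key = os.environ.get("HF_API_KEY")
if not hf_api_key:
    raise RuntimeError("HF_API_KEY is not set; Hugging Face models will be unavailable")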
2. Available Models
The following Hugging Face models are automatically registered when you start the application:
- LLaVA 1.5 7B (LLAVA_1_5_7B) - Advanced vision-language model
- BLIP-2 (BLIP2_OPT_2_7B) - Image captioning specialist
- InstructBLIP (INSTRUCTBLIP_VICUNA_7B) - Instruction-following vision model
3. How It Works
The Hugging Face services follow the same pattern as the GPT-4 and Gemini services (a sketch of the first two steps follows this list):
- Image Processing: Converts uploaded images to base64 format
- Prompt Enhancement: Automatically adds metadata extraction instructions
- API Call: Sends request to Hugging Face Inference API
- Response Parsing: Extracts caption and metadata from JSON response
- Fallback Handling: Gracefully handles parsing errors
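A rough sketch of the first two steps, using a hypothetical build_payload helper; the exact payload shape the services send is an assumption:

import base64

def build_payload(image_bytes: bytes, prompt: str) -> dict:
    # Image Processing: convert the uploaded image to base64
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    # Prompt Enhancement: append metadata extraction instructions
    enhanced_prompt = (
        prompt
        + "\n\nAlso return a JSON object with title, source, type, "
        + "countries, and epsg fields."
    )
    return {"inputs": {"image": image_b64, "text": enhanced_prompt}}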
Technical Details
Service Architecture
from typing import Any, Dict

class HuggingFaceService(VLMService):
    """Base class for all Hugging Face VLM services"""

    async def generate_caption(self, image_bytes: bytes, prompt: str) -> Dict[str, Any]:
        # 1. Build enhanced prompt with metadata extraction
        # 2. Prepare API payload
        # 3. Make async HTTP request
        # 4. Parse response and extract metadata
        # 5. Return structured result
        ...
Model-Specific Services
class LLaVAService(HuggingFaceService):
    """LLaVA 1.5 7B - Advanced vision-language understanding"""

class BLIP2Service(HuggingFaceService):
    """BLIP-2 - Specialized image captioning"""

class InstructBLIPService(HuggingFaceService):
    """InstructBLIP - Instruction-following vision model"""
API Endpoints
The services use Hugging Face's Inference API (a call sketch follows this list):
- Vision Models: https://api-inference.huggingface.co/models/{model_id}
- Text Models: https://api-inference.huggingface.co/models/{model_id}
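A hedged sketch of the async HTTP call against that endpoint, using httpx; the header and payload details are assumptions rather than the exact service code:

import httpx

async def query_model(model_id: str, payload: dict, api_key: str) -> dict:
    url = f"https://api-inference.huggingface.co/models/{model_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(url, headers=headers, json=payload)
        response.raise_for_status()  # surface API failures as exceptions
        return response.json()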
Response Format
All services return the same structured format:
{
  "caption": "Detailed analysis of the crisis map...",
  "metadata": {
    "title": "Flood Emergency in Coastal Region",
    "source": "PDC",
    "type": "FLOOD",
    "countries": ["US", "MX"],
    "epsg": "4326"
  },
  "confidence": null,
  "processing_time": 2.45,
  "raw_response": {
    "model": "llava-hf/llava-1.5-7b-hf",
    "response": {...},
    "parsed_successfully": true
  }
}
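If you want static typing over this result, one illustrative way to model it (not part of the actual codebase):

from typing import Any, Dict, List, Optional, TypedDict

class CaptionMetadata(TypedDict):
    title: str
    source: str
    type: str
    countries: List[str]
    epsg: str

class CaptionResult(TypedDict):
    caption: str
    metadata: CaptionMetadata
    confidence: Optional[float]
    processing_time: float
    raw_response: Dict[str, Any]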
Testing
Run Integration Test
cd py_backend
python test_hf_integration.py
Test Individual Services
import asyncio

from app.services.huggingface_service import LLaVAService
from app.services.vlm_service import vlm_manager

async def main():
    # Register service
    llava_service = LLaVAService("your_api_key")
    vlm_manager.register_service(llava_service)

    # Test caption generation (your_image_bytes: bytes read from an image file)
    result = await vlm_manager.generate_caption(
        image_bytes=your_image_bytes,
        prompt="Describe this crisis map",
        model_name="LLAVA_1_5_7B",
    )
    print(result["caption"])

asyncio.run(main())
Performance Considerations
Timeouts
- Default: 120 seconds for large models
- Configurable: Adjust in huggingface_service.py
Model Selection
- LLaVA: Best for complex analysis, slower
- BLIP-2: Fastest, good for basic captioning
- InstructBLIP: Balanced performance and capability
Error Handling
- API Failures: Automatic fallback to error messages
- Parsing Errors: Graceful degradation to raw text
- Network Issues: Configurable retry logic (see the sketch below)
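As a sketch of what this behaviour can look like (the real service logic may differ; the function names here are hypothetical):

import asyncio
import json
from typing import Optional

async def call_with_retries(send, payload: dict, retries: int = 3) -> dict:
    # Network Issues: simple exponential backoff between attempts
    last_error: Optional[Exception] = None
    for attempt in range(retries):
        try:
            return await send(payload)
        except Exception as err:  # API failure or network issue
            last_error = err
            await asyncio.sleep(2 ** attempt)
    # API Failures: fall back to an error message
    return {"caption": f"Error: {last_error}", "metadata": {}}

def parse_response(raw_text: str) -> dict:
    # Parsing Errors: degrade gracefully to raw text
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return {"caption": raw_text, "metadata": {}}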
Debugging
Enable Debug Logging
The services include comprehensive debug logging:
# Check console output for:
# - API request details
# - Response parsing
# - Metadata extraction
# - Error handling
Common Issues
- API Key Invalid: Check HF_API_KEY in .env
- Model Loading: Some models may take time to load on first request
- Rate Limiting: Hugging Face has rate limits for free tier
- Model Compatibility: Ensure model supports vision tasks
Usage in Frontend
The Hugging Face models are automatically available in your frontend:
// In your upload form, users can select:
const modelOptions = [
{ value: 'GPT4O', label: 'GPT-4 Vision' },
{ value: 'GEMINI15', label: 'Google Gemini' },
{ value: 'LLAVA_1_5_7B', label: 'LLaVA 1.5 7B' }, // ← New!
{ value: 'BLIP2_OPT_2_7B', label: 'BLIP-2' }, // ← New!
{ value: 'INSTRUCTBLIP_VICUNA_7B', label: 'InstructBLIP' } // ← New!
];
Configuration
Environment Variables
# Required
HF_API_KEY=your_api_key
# Optional (for advanced configuration)
HF_TIMEOUT=120
HF_MAX_TOKENS=800
HF_TEMPERATURE=0.7
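A small sketch of reading the optional variables with their defaults, assuming plain environment lookups (the app may use a settings object instead):

import os

HF_TIMEOUT = float(os.environ.get("HF_TIMEOUT", "120"))  # seconds
HF_MAX_TOKENS = int(os.environ.get("HF_MAX_TOKENS", "800"))
HF_TEMPERATURE = float(os.environ.get("HF_TEMPERATURE", "0.7"))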
Service Registration
Services are automatically registered in caption.py:
if settings.HF_API_KEY:
llava_service = LLaVAService(settings.HF_API_KEY)
vlm_manager.register_service(llava_service)
# ... other services
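The "# ... other services" line presumably registers the remaining models the same way; a sketch, assuming the constructors share the same signature:

blip2_service = BLIP2Service(settings.HF_API_KEY)
vlm_manager.register_service(blip2_service)

instructblip_service = InstructBLIPService(settings.HF_API_KEY)
vlm_manager.register_service(instructblip_service)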
Next Steps
- Test Integration: Run the test script
- Verify Models: Check frontend model selection
- Monitor Performance: Watch API response times
- Customize Prompts: Adjust metadata extraction instructions
- Add New Models: Extend with additional Hugging Face models