docs: update README with API verification details and troubleshooting
README.md
@@ -57,6 +57,8 @@ cd video-intelligence-platform
 pip install -r requirements.txt
 ```
 
+> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). Loading the models takes ~2.2GB of RAM (SigLIP2 ~1.5GB + Grounding DINO ~657MB), so a machine with ≥8GB RAM is recommended.
+
 ### 3. Get a Gemini API key (free)
 - Go to https://aistudio.google.com/apikey
 - Create a free API key
@@ -86,6 +88,29 @@ python app.py --search "red car" --api-key YOUR_KEY
 | Text Embeddings | Gemini text-embedding-004 | API | Cloud |
 | Query/RAG | Gemini 2.0 Flash | API | Cloud |
 
+## 🔧 API Verification (Apr 2026)
+
+All model APIs were verified against **transformers 5.6.2** and **google-genai 1.73.1**:
+
+### SigLIP2 (`google/siglip2-so400m-patch14-384`)
+- `AutoModel` / `AutoProcessor` resolve to `SiglipModel` / `SiglipProcessor`
+- `model.get_image_features(**inputs)` returns a `BaseModelOutputWithPooling` (`.pooler_output` has shape `[B, 1152]`)
+- Text input **must** use `padding="max_length"` (the padding the model was trained with)
+- Uses sigmoid (not softmax) for similarity scores
+
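Not part of the diff, but the calls above can be sketched as follows; this is a minimal illustration rather than code from the repo (the blank image and query strings are invented):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)          # resolves to SiglipModel
processor = AutoProcessor.from_pretrained(ckpt)  # resolves to SiglipProcessor

image = Image.new("RGB", (384, 384))  # stand-in for a real video frame
texts = ["a red car", "a pedestrian"]

# padding="max_length" is mandatory -- SigLIP is trained on fixed-length text
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# sigmoid, not softmax: each text is scored independently against the image
probs = torch.sigmoid(out.logits_per_image)  # shape [1, 2]
```

Because the scores are independent sigmoids, they need not sum to 1, which is why a single frame can match several attribute queries at once.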
+### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)
+- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
+- The processor accepts text as `str`, `list[str]`, or `list[list[str]]` and converts internally
+- `post_process_grounded_object_detection` takes a `threshold` kwarg (not `box_threshold`); `input_ids` is optional
+- Each result dict contains both `"text_labels"` and `"labels"` keys
+- `target_sizes` expects `(height, width)` tuples
+
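A hedged sketch of the detection call described above (it assumes a `transformers` version recent enough that `input_ids` is optional; the blank image and queries are invented):

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

ckpt = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(ckpt)  # -> GroundingDinoProcessor
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt)

image = Image.new("RGB", (640, 480))   # stand-in for a real frame
text = "a red car. a pedestrian."      # lowercase, period-separated queries
inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# threshold (not box_threshold); target_sizes takes (height, width) tuples
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.3,
    text_threshold=0.3,
    target_sizes=[(image.height, image.width)],
)
boxes = results[0]["boxes"]          # absolute xyxy pixel coordinates
labels = results[0]["text_labels"]   # matched query phrase per box
```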
+### Gemini (`google-genai` SDK)
+- Uses `google.genai` (NOT the deprecated `google.generativeai`)
+- `genai.Client(api_key=...)`, then `client.models.generate_content(...)` / `client.models.embed_content(...)`
+- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
+- Embedding is **text-only**; images and video cannot be embedded directly
+
 ## 🌳 How the Akinator Tree Works
 
 When a search returns too many results (>10), the system:
@@ -128,6 +153,28 @@ The platform is designed for future fine-tuning on TPU:
 | [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
 | [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |
 
+## ⚠️ Troubleshooting
+
+### "Could not import module 'AutoProcessor'"
+Your `transformers` version is too old; SigLIP2 requires `>= 4.49`:
+```bash
+pip install -U transformers
+# Also clear any stale model cache:
+rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
+rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
+```
+
+### Out of memory during model loading
+SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB of RAM for weights alone. If your system has < 8GB RAM:
+- Set `device="cpu"` in config (the default)
+- Close other memory-heavy applications
+- Consider loading only one model at a time
+
+### Gemini rate limiting
+The free tier allows ~15 requests/minute, so the pipeline adds a 4-second delay between captioning calls. For longer videos, consider:
+- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
+- Using a paid Gemini API tier
+
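The delay-plus-sampling behavior described above can be sketched as a simple throttle; `caption_frames` and its parameters are hypothetical names for illustration, not the pipeline's actual API:

```python
import time

def caption_frames(frames, caption_fn, caption_every_n=1, min_interval=4.0):
    """Caption every n-th frame, waiting min_interval seconds between calls
    to stay under the free tier's ~15 requests/minute limit."""
    captions = {}
    last_call = 0.0
    for i, frame in enumerate(frames):
        if i % caption_every_n:
            continue  # skip frames between captioning points
        wait = min_interval - (time.time() - last_call)
        if wait > 0:
            time.sleep(wait)  # enforce the inter-call delay
        last_call = time.time()
        captions[i] = caption_fn(frame)
    return captions

# Caption every 2nd of 4 frames (tiny interval so the demo runs fast)
out = caption_frames(["f0", "f1", "f2", "f3"], lambda f: f.upper(),
                     caption_every_n=2, min_interval=0.01)
# out == {0: "F0", 2: "F2"}
```

Raising `caption_every_n` trades caption coverage for wall-clock time: at 4 s/call, a 1 fps frame stream captioned every 5th frame needs ~48 calls (about 3.2 minutes of API time) per hour of video.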
 ## 📁 Project Structure
 
 ```