docs: update README with API verification details and troubleshooting
README.md
@@ -57,6 +57,8 @@ cd video-intelligence-platform
 pip install -r requirements.txt
 ```
 
+> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). Loading the models takes ~2.2GB of RAM (SigLIP2 ~1.5GB + Grounding DINO ~657MB), so a machine with ≥8GB RAM is recommended.
+
 ### 3. Get a Gemini API key (free)
 - Go to https://aistudio.google.com/apikey
 - Create a free API key
@@ -86,6 +88,29 @@ python app.py --search "red car" --api-key YOUR_KEY
 | Text Embeddings | Gemini text-embedding-004 | API | Cloud |
 | Query/RAG | Gemini 2.0 Flash | API | Cloud |
 
+## 🔧 API Verification (Apr 2026)
+
+All model APIs were verified against **transformers 5.6.2** and **google-genai 1.73.1**:
+
+### SigLIP2 (`google/siglip2-so400m-patch14-384`)
+- `AutoModel` / `AutoProcessor` resolve to `SiglipModel` / `SiglipProcessor`
+- `model.get_image_features(**inputs)` returns a `BaseModelOutputWithPooling` (`.pooler_output` has shape `[B, 1152]`)
+- Text input **must** use `padding="max_length"` (the padding the model was trained with)
+- Uses sigmoid (not softmax) for similarity scores
+
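Not part of the diff, but the calls above can be sketched as follows; this is a minimal illustration rather than code from the repo (the blank image and query strings are invented):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt)          # resolves to SiglipModel
processor = AutoProcessor.from_pretrained(ckpt)  # resolves to SiglipProcessor

image = Image.new("RGB", (384, 384))  # stand-in for a real video frame
texts = ["a red car", "a pedestrian"]

# padding="max_length" is mandatory -- SigLIP is trained on fixed-length text
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# sigmoid, not softmax: each text is scored independently against the image
probs = torch.sigmoid(out.logits_per_image)  # shape [1, 2]
```

Because the scores are independent sigmoids, they need not sum to 1, which is why a single frame can match several attribute queries at once.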
+### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)
+- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
+- The processor accepts text as `str`, `list[str]`, or `list[list[str]]` and converts internally
+- `post_process_grounded_object_detection` takes a `threshold` kwarg (not `box_threshold`); `input_ids` is optional
+- Each result dict contains both `"text_labels"` and `"labels"` keys
+- `target_sizes` expects `(height, width)` tuples
+
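A hedged sketch of the detection call described above (it assumes a `transformers` version recent enough that `input_ids` is optional; the blank image and queries are invented):

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

ckpt = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(ckpt)  # -> GroundingDinoProcessor
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt)

image = Image.new("RGB", (640, 480))   # stand-in for a real frame
text = "a red car. a pedestrian."      # lowercase, period-separated queries
inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# threshold (not box_threshold); target_sizes takes (height, width) tuples
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.3,
    text_threshold=0.3,
    target_sizes=[(image.height, image.width)],
)
boxes = results[0]["boxes"]          # absolute xyxy pixel coordinates
labels = results[0]["text_labels"]   # matched query phrase per box
```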
+### Gemini (`google-genai` SDK)
+- Uses `google.genai` (NOT the deprecated `google.generativeai`)
+- `genai.Client(api_key=...)`, then `client.models.generate_content(...)` / `client.models.embed_content(...)`
+- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
+- Embedding is **text-only**; images and video cannot be embedded directly
+
 ## 🌳 How the Akinator Tree Works
 
 When a search returns too many results (>10), the system:
@@ -128,6 +153,28 @@ The platform is designed for future fine-tuning on TPU:
 | [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
 | [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |
 
+## ⚠️ Troubleshooting
+
+### "Could not import module 'AutoProcessor'"
+Your `transformers` version is too old; SigLIP2 requires `>= 4.49`:
+```bash
+pip install -U transformers
+# Also clear any stale model cache:
+rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
+rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
+```
+
+### Out of memory during model loading
+SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB of RAM for weights alone. If your system has < 8GB RAM:
+- Set `device="cpu"` in config (the default)
+- Close other memory-heavy applications
+- Consider loading only one model at a time
+
+### Gemini rate limiting
+The free tier allows ~15 requests/minute, so the pipeline adds a 4-second delay between captioning calls. For longer videos, consider:
+- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
+- Using a paid Gemini API tier
+
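The delay-plus-sampling behavior described above can be sketched as a simple throttle; `caption_frames` and its parameters are hypothetical names for illustration, not the pipeline's actual API:

```python
import time

def caption_frames(frames, caption_fn, caption_every_n=1, min_interval=4.0):
    """Caption every n-th frame, waiting min_interval seconds between calls
    to stay under the free tier's ~15 requests/minute limit."""
    captions = {}
    last_call = 0.0
    for i, frame in enumerate(frames):
        if i % caption_every_n:
            continue  # skip frames between captioning points
        wait = min_interval - (time.time() - last_call)
        if wait > 0:
            time.sleep(wait)  # enforce the inter-call delay
        last_call = time.time()
        captions[i] = caption_fn(frame)
    return captions

# Caption every 2nd of 4 frames (tiny interval so the demo runs fast)
out = caption_frames(["f0", "f1", "f2", "f3"], lambda f: f.upper(),
                     caption_every_n=2, min_interval=0.01)
# out == {0: "F0", 2: "F2"}
```

Raising `caption_every_n` trades caption coverage for wall-clock time: at 4 s/call, a 1 fps frame stream captioned every 5th frame needs ~48 calls (about 3.2 minutes of API time) per hour of video.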
 ## 📁 Project Structure
 
 ```