notRaphael committed on
Commit 1eb45d2 · verified · 1 Parent(s): 2781fa9

docs: update README with API verification details and troubleshooting

Files changed (1)
  1. README.md +47 -0
README.md CHANGED
@@ -57,6 +57,8 @@ cd video-intelligence-platform
  pip install -r requirements.txt
  ```

  ### 3. Get a Gemini API key (free)
  - Go to https://aistudio.google.com/apikey
  - Create a free API key
@@ -86,6 +88,29 @@ python app.py --search "red car" --api-key YOUR_KEY
  | Text Embeddings | Gemini text-embedding-004 | API | Cloud |
  | Query/RAG | Gemini 2.0 Flash | API | Cloud |

  ## 🌳 How the Akinator Tree Works

  When a search returns too many results (>10), the system:
@@ -128,6 +153,28 @@ The platform is designed for future fine-tuning on TPU:
  | [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
  | [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |

  ## 📁 Project Structure

  ```
 
  pip install -r requirements.txt
  ```

+ > **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with ≥8GB RAM is recommended.
+
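The version requirement in the note can be checked before any model is loaded. The following is a minimal stdlib-only sketch, not part of the project: the helper names `parse_release`, `meets_minimum`, and `installed_transformers` are hypothetical, and pre-release suffixes such as `.dev0` are deliberately ignored.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.49"  # minimum version stated in the note above


def parse_release(version):
    """Leading numeric components only, e.g. '4.49.0.dev0' -> (4, 49, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric piece such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_TRANSFORMERS):
    """True if `installed` is at least `minimum` (pre-release tags ignored)."""
    have, need = parse_release(installed), parse_release(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))  # pad so (4, 49) compares like (4, 49, 0)
    need += (0,) * (width - len(need))
    return have >= need


def installed_transformers():
    """Installed transformers version string, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers()
if found is None:
    print("transformers is not installed")
elif meets_minimum(found):
    print(f"transformers {found} satisfies >= {MIN_TRANSFORMERS}")
else:
    print(f"transformers {found} is too old; run: pip install -U transformers")
```

If the `packaging` library is available, its `Version` class is a more complete way to compare versions; the tuple comparison here only avoids the extra dependency.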
  ### 3. Get a Gemini API key (free)
  - Go to https://aistudio.google.com/apikey
  - Create a free API key
 
  | Text Embeddings | Gemini text-embedding-004 | API | Cloud |
  | Query/RAG | Gemini 2.0 Flash | API | Cloud |

+ ## 🔧 API Verification (Apr 2026)
+
+ All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**:
+
+ ### SigLIP2 (`google/siglip2-so400m-patch14-384`)
+ - `AutoModel` / `AutoProcessor` → resolves to `SiglipModel` / `SiglipProcessor`
+ - `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
+ - Text input **must** use `padding="max_length"` (training requirement)
+ - Uses sigmoid (not softmax) for similarity scores
+
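The last SigLIP2 point affects how scores should be read. The sketch below uses made-up logits for one frame against three captions (the numbers and the 0.5 cutoff are illustrative assumptions, not values from the model or the project) to show that sigmoid scores are independent per pair, while softmax forces the scores to compete and sum to 1.

```python
import math


def sigmoid(x):
    """Independent probability for a single image-text pair."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Normalized distribution over all candidates (shown for contrast only)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical image-text logits: two captions match the frame, one does not.
logits = [2.0, 1.5, -4.0]

sigmoid_scores = [sigmoid(v) for v in logits]
softmax_scores = softmax(logits)

# With sigmoid, each caption is judged on its own, so a fixed cutoff works
# and several captions (or none) can pass it.
THRESHOLD = 0.5  # illustrative cutoff, not a project setting
matches = [i for i, score in enumerate(sigmoid_scores) if score >= THRESHOLD]

print("sigmoid:", [round(s, 3) for s in sigmoid_scores])
print("softmax:", [round(s, 3) for s in softmax_scores])
print("captions above threshold:", matches)
```

Because sigmoid scores do not sum to 1, a frame that matches nothing can score low against every query, which softmax would hide.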
+ ### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)
+ - `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolves to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
+ - Processor accepts text as `str`, `list[str]`, or `list[list[str]]` and auto-converts internally
+ - `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional
+ - Returns dict with both `"text_labels"` and `"labels"` keys
+ - `target_sizes` expects `(height, width)` tuples
+
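The `(height, width)` order is easy to get wrong because PIL's `Image.size` is `(width, height)`. The sketch below shows the swap and why it matters. Both helpers are hypothetical, and `scale_box` is a simplified stand-in for rescaling a normalized center-format box, not the library's own post-processing code.

```python
def pil_size_to_target_size(pil_size):
    """PIL's Image.size is (width, height); target_sizes wants (height, width)."""
    width, height = pil_size
    return (height, width)


def scale_box(box_cxcywh, target_size):
    """Convert one normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    height, width = target_size
    cx, cy, w, h = box_cxcywh
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )


frame_size = (1920, 1080)  # what image.size would report for a 1080p frame
target_size = pil_size_to_target_size(frame_size)

centered_box = (0.5, 0.5, 0.5, 0.5)  # box covering the middle of the frame
correct = scale_box(centered_box, target_size)
swapped = scale_box(centered_box, frame_size)  # passing (width, height) by mistake

print("target_sizes entry:", target_size)
print("correct box:", correct)
print("box with swapped sizes:", swapped)
```

With a real image, `target_sizes=[image.size[::-1]]` performs the same swap in one expression.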
+ ### Gemini (`google-genai` SDK)
+ - Uses `google.genai` (NOT deprecated `google.generativeai`)
+ - `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
+ - `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
+ - Embedding is **text-only**: it cannot embed images/video directly
+
  ## 🌳 How the Akinator Tree Works

  When a search returns too many results (>10), the system:
 
  | [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
  | [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |

+ ## ⚠️ Troubleshooting
+
+ ### "Could not import module 'AutoProcessor'"
+ This means your `transformers` version is too old. SigLIP2 requires `>= 4.49`:
+ ```bash
+ pip install -U transformers
+ # Also clear stale cache:
+ rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
+ rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
+ ```
+
+ ### Out of Memory during model loading
+ SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:
+ - Set `device="cpu"` in config (default)
+ - Close other memory-heavy applications
+ - Consider using only one model at a time
+
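The benefit of loading one model at a time follows from simple arithmetic on the quoted weight sizes. This sketch assumes decimal units (1GB = 1000MB) and counts weights only; activations, the Python runtime, and video frames add to the real footprint. The function names are hypothetical.

```python
# Approximate weight sizes quoted above, in MB.
WEIGHTS_MB = {"SigLIP2": 1500, "Grounding DINO": 657}


def both_loaded_gb(weights_mb):
    """Weights resident at once when every model stays loaded."""
    return sum(weights_mb.values()) / 1000


def one_at_a_time_gb(weights_mb):
    """Peak weights when models are loaded and released one after another."""
    return max(weights_mb.values()) / 1000


print(f"all models loaded: ~{both_loaded_gb(WEIGHTS_MB):.2f} GB")
print(f"one model at a time: ~{one_at_a_time_gb(WEIGHTS_MB):.2f} GB")
```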
+ ### Gemini rate limiting
+ The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider:
+ - Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
+ - Using a paid Gemini API tier
+
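The 4-second delay matches the quoted limit, since 60 seconds / 15 requests = 4 seconds per request. The sketch below estimates how much waiting that adds for a video and how `caption_every_n` changes it. The function names are hypothetical, and it assumes captioning starts at the first frame and that the delay applies only between consecutive calls.

```python
import math


def min_delay_seconds(requests_per_minute):
    """Smallest spacing between calls that stays within the rate limit."""
    return 60.0 / requests_per_minute


def captioning_calls(total_frames, caption_every_n):
    """Number of frames captioned when every n-th frame is sent, starting at frame 0."""
    return math.ceil(total_frames / caption_every_n)


def estimated_wait_seconds(total_frames, caption_every_n, requests_per_minute=15):
    """Total time spent in delays between captioning calls (excludes API latency)."""
    calls = captioning_calls(total_frames, caption_every_n)
    gaps = max(calls - 1, 0)  # a delay sits between calls, not after the last one
    return gaps * min_delay_seconds(requests_per_minute)


for n in (1, 5):
    wait = estimated_wait_seconds(total_frames=300, caption_every_n=n)
    print(f"caption_every_n={n}: {captioning_calls(300, n)} calls, ~{wait:.0f}s of delay")
```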
  ## 📁 Project Structure

  ```