NVisionAI committed · verified
Commit 2cb21d2 · 1 Parent(s): 126cc14

Add standalone API documentation

Files changed (1):
  1. API.md (added, +208 −0)
# kimodo-api

REST API microservice wrapper around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), a text-to-motion diffusion model that generates 77-joint SOMA skeleton motion from natural language prompts.

---
6
+
7
+ ## Installation
8
+
9
+ ```bash
10
+ docker pull ghcr.io/eyalenav/kimodo-api:latest
11
+ ```
12
+
13
+ ### Run
14
+
15
+ ```bash
16
+ docker run --rm \
17
+ --gpus '"device=0"' \
18
+ -p 9551:9551 \
19
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
20
+ -e HUGGINGFACE_TOKEN=hf_... \
21
+ ghcr.io/eyalenav/kimodo-api:latest
22
+ ```
23
+
24
+ > **First run:** downloads Llama-3-8B-Instruct (~16 GB) and Kimodo weights. Subsequent starts are fast (weights cached in `/root/.cache/huggingface`).
25
+
26
+ ---
## API Reference

### `GET /health`

Check server status.

**Request**
```
GET http://localhost:9551/health
```

**Response**
```json
{
  "status": "ok"
}
```
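Because the first start may spend several minutes downloading weights, a client can poll `/health` before issuing generation requests. Below is a minimal polling sketch in Python; the endpoint and response shape come from this document, while the retry interval and overall timeout are illustrative choices.

```python
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:9551",
                     timeout_s: float = 600.0,
                     poll_interval_s: float = 5.0) -> None:
    """Poll GET /health until the server reports {"status": "ok"}."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok and r.json().get("status") == "ok":
                return
        except requests.RequestException:
            pass  # server not accepting connections yet, e.g. still downloading weights
        time.sleep(poll_interval_s)
    raise TimeoutError(f"kimodo-api at {base_url} not ready after {timeout_s}s")

wait_until_ready()
```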

---

### `POST /generate`

Generate a motion clip from a text prompt.

**Request**
```
POST http://localhost:9551/generate
Content-Type: application/json
```

```json
{
  "prompt": "person pushing through a crowd aggressively",
  "num_frames": 120,
  "fps": 30
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural language motion description |
| `num_frames` | int | `120` | Number of frames to generate |
| `fps` | int | `30` | Frames per second (metadata only) |

**Response**

Binary NPZ file (`application/octet-stream`).

The NPZ contains:

| Key | Shape | Description |
|---|---|---|
| `poses` | `(T, 77, 3)` | Joint rotations (axis-angle) per frame |
| `trans` | `(T, 3)` | Root translation per frame |
| `betas` | `(16,)` | SMPL body shape parameters |

**Example**
```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person falling to the ground after being pushed"}' \
  --output output_motion.npz
```
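The `poses` array stores per-joint axis-angle rotations, which many downstream tools expect as rotation matrices. The sketch below shows one way to do the conversion with SciPy; SciPy is an assumption here (the API itself does not require it), and the file name matches the curl example above.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Load the NPZ produced by the curl example above.
data = np.load("output_motion.npz")
poses = data["poses"]                  # (T, 77, 3) axis-angle per joint
T, J, _ = poses.shape

# Convert every joint's axis-angle vector to a 3x3 rotation matrix.
rotmats = Rotation.from_rotvec(poses.reshape(-1, 3)).as_matrix()
rotmats = rotmats.reshape(T, J, 3, 3)  # (T, 77, 3, 3)

print(f"{T} frames, {J} joints, root translation shape: {data['trans'].shape}")
```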

---

### `POST /generate_bvh`

Generate motion and return it in BVH (Biovision Hierarchy) format.

**Request**
```
POST http://localhost:9551/generate_bvh
Content-Type: application/json
```

```json
{
  "prompt": "two people fighting, punches thrown",
  "num_frames": 150
}
```

**Response**

BVH text file (`text/plain`).

**Example**
```bash
curl -X POST http://localhost:9551/generate_bvh \
  -H "Content-Type: application/json" \
  -d '{"prompt": "drunk person stumbling and falling"}' \
  --output output_motion.bvh
```
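For parity with the Python NPZ client at the end of this document, here is a sketch of the same call against `/generate_bvh`. The helper name and default file name are ours; the endpoint, request fields, and plain-text response come from the spec above.

```python
import requests

def generate_bvh(prompt: str, num_frames: int = 150,
                 base_url: str = "http://localhost:9551",
                 out_path: str = "output_motion.bvh") -> str:
    """Request a BVH clip from POST /generate_bvh and write it to disk."""
    response = requests.post(
        f"{base_url}/generate_bvh",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()
    with open(out_path, "w") as f:
        f.write(response.text)  # BVH is plain text per the response spec above
    return out_path

print(generate_bvh("two people fighting, punches thrown"))
```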

---

## Hardware Requirements

| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to a specific GPU (e.g. `"0"`) |
| `PORT` | No | Override the default port `9551` |

---
## Integration with VisionAI-Flywheel

`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:

```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```

Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
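When running the full stack, it can help to verify each service before kicking off a pipeline run. A minimal sketch follows; note that only `kimodo-api`'s port and `/health` route are documented in this file, and the `render-api` and `cosmos-transfer` base URLs below are hypothetical placeholders.

```python
import requests

# Only the kimodo-api entry is documented here; the other two base URLs
# are HYPOTHETICAL placeholders for illustration.
SERVICES = {
    "kimodo-api": "http://localhost:9551",
    "render-api": "http://localhost:9552",       # hypothetical
    "cosmos-transfer": "http://localhost:9553",  # hypothetical
}

for name, base_url in SERVICES.items():
    try:
        r = requests.get(f"{base_url}/health", timeout=5)
        print(f"{name}: {'ok' if r.ok else f'HTTP {r.status_code}'}")
    except requests.ConnectionError:
        print(f"{name}: unreachable")
```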

---

## Example: Full Python client

```python
import requests
import numpy as np
import io

def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from a text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()

    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],  # (T, 77, 3)
        "trans": npz["trans"],  # (T, 3)
        "betas": npz["betas"],  # (16,)
    }

# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```
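If the decoded arrays need to be handed to another tool in the pipeline, they can be re-serialized with the same keys the API uses. A short sketch reusing the `generate_motion` helper defined above; the output file name is illustrative.

```python
import numpy as np

motion = generate_motion("security guard running toward an incident")

# Write back out with the same keys as the /generate NPZ response,
# so downstream consumers can read this file interchangeably.
np.savez("motion_clip.npz",
         poses=motion["poses"], trans=motion["trans"], betas=motion["betas"])
```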

---

## License

Apache 2.0

> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image.