NVisionAI committed · verified
Commit 2cb21d2 · 1 Parent(s): 126cc14

Add standalone API documentation

Files changed (1):
  1. API.md (added, +208 −0)
# kimodo-api

REST API microservice wrapper around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), a text-to-motion diffusion model that generates 77-joint SOMA skeleton motion from natural language prompts.

---
6
+
7
+ ## Installation
8
+
9
+ ```bash
10
+ docker pull ghcr.io/eyalenav/kimodo-api:latest
11
+ ```
12
+
13
+ ### Run
14
+
15
+ ```bash
16
+ docker run --rm \
17
+ --gpus '"device=0"' \
18
+ -p 9551:9551 \
19
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
20
+ -e HUGGINGFACE_TOKEN=hf_... \
21
+ ghcr.io/eyalenav/kimodo-api:latest
22
+ ```
23
+
24
+ > **First run:** downloads Llama-3-8B-Instruct (~16 GB) and Kimodo weights. Subsequent starts are fast (weights cached in `/root/.cache/huggingface`).
25
+
26
+ ---
## API Reference

### `GET /health`

Check server status.

**Request**
```
GET http://localhost:9551/health
```

**Response**
```json
{
  "status": "ok"
}
```
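Because the first start may spend several minutes downloading weights, a client can poll `/health` before issuing generation requests. Below is a minimal polling sketch in Python; the endpoint and response shape come from this document, while the retry interval and overall timeout are illustrative choices.

```python
import time
import requests

def wait_until_ready(base_url: str = "http://localhost:9551",
                     timeout_s: float = 600.0,
                     poll_interval_s: float = 5.0) -> None:
    """Poll GET /health until the server reports {"status": "ok"}."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok and r.json().get("status") == "ok":
                return
        except requests.RequestException:
            pass  # server not accepting connections yet, e.g. still downloading weights
        time.sleep(poll_interval_s)
    raise TimeoutError(f"kimodo-api at {base_url} not ready after {timeout_s}s")

wait_until_ready()
```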

---

### `POST /generate`

Generate a motion clip from a text prompt.

**Request**
```
POST http://localhost:9551/generate
Content-Type: application/json
```

```json
{
  "prompt": "person pushing through a crowd aggressively",
  "num_frames": 120,
  "fps": 30
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural language motion description |
| `num_frames` | int | `120` | Number of frames to generate |
| `fps` | int | `30` | Frames per second (metadata only) |

**Response**

Binary NPZ file (`application/octet-stream`).

The NPZ contains:

| Key | Shape | Description |
|---|---|---|
| `poses` | `(T, 77, 3)` | Joint rotations (axis-angle) per frame |
| `trans` | `(T, 3)` | Root translation per frame |
| `betas` | `(16,)` | SMPL body shape parameters |

**Example**
```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person falling to the ground after being pushed"}' \
  --output output_motion.npz
```
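The `poses` array stores per-joint axis-angle rotations, which many downstream tools expect as rotation matrices. The sketch below shows one way to do the conversion with SciPy; SciPy is an assumption here (the API itself does not require it), and the file name matches the curl example above.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Load the NPZ produced by the curl example above.
data = np.load("output_motion.npz")
poses = data["poses"]                  # (T, 77, 3) axis-angle per joint
T, J, _ = poses.shape

# Convert every joint's axis-angle vector to a 3x3 rotation matrix.
rotmats = Rotation.from_rotvec(poses.reshape(-1, 3)).as_matrix()
rotmats = rotmats.reshape(T, J, 3, 3)  # (T, 77, 3, 3)

print(f"{T} frames, {J} joints, root translation shape: {data['trans'].shape}")
```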

---

### `POST /generate_bvh`

Generate motion and return it in BVH (Biovision Hierarchy) format.

**Request**
```
POST http://localhost:9551/generate_bvh
Content-Type: application/json
```

```json
{
  "prompt": "two people fighting, punches thrown",
  "num_frames": 150
}
```

**Response**

BVH text file (`text/plain`).

**Example**
```bash
curl -X POST http://localhost:9551/generate_bvh \
  -H "Content-Type: application/json" \
  -d '{"prompt": "drunk person stumbling and falling"}' \
  --output output_motion.bvh
```
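For parity with the Python NPZ client at the end of this document, here is a sketch of the same call against `/generate_bvh`. The helper name and default file name are ours; the endpoint, request fields, and plain-text response come from the spec above.

```python
import requests

def generate_bvh(prompt: str, num_frames: int = 150,
                 base_url: str = "http://localhost:9551",
                 out_path: str = "output_motion.bvh") -> str:
    """Request a BVH clip from POST /generate_bvh and write it to disk."""
    response = requests.post(
        f"{base_url}/generate_bvh",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()
    with open(out_path, "w") as f:
        f.write(response.text)  # BVH is plain text per the response spec above
    return out_path

print(generate_bvh("two people fighting, punches thrown"))
```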

---

## Hardware Requirements

| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to a specific GPU (e.g. `"0"`) |
| `PORT` | No | Override the default port `9551` |

---
## Integration with VisionAI-Flywheel

`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:

```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```

Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
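When running the full stack, it can help to verify each service before kicking off a pipeline run. A minimal sketch follows; note that only `kimodo-api`'s port and `/health` route are documented in this file, and the `render-api` and `cosmos-transfer` base URLs below are hypothetical placeholders.

```python
import requests

# Only the kimodo-api entry is documented here; the other two base URLs
# are HYPOTHETICAL placeholders for illustration.
SERVICES = {
    "kimodo-api": "http://localhost:9551",
    "render-api": "http://localhost:9552",       # hypothetical
    "cosmos-transfer": "http://localhost:9553",  # hypothetical
}

for name, base_url in SERVICES.items():
    try:
        r = requests.get(f"{base_url}/health", timeout=5)
        print(f"{name}: {'ok' if r.ok else f'HTTP {r.status_code}'}")
    except requests.ConnectionError:
        print(f"{name}: unreachable")
```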

---

## Example: Full Python client

```python
import requests
import numpy as np
import io

def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from a text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()

    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],  # (T, 77, 3)
        "trans": npz["trans"],  # (T, 3)
        "betas": npz["betas"],  # (16,)
    }

# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```
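If the decoded arrays need to be handed to another tool in the pipeline, they can be re-serialized with the same keys the API uses. A short sketch reusing the `generate_motion` helper defined above; the output file name is illustrative.

```python
import numpy as np

motion = generate_motion("security guard running toward an incident")

# Write back out with the same keys as the /generate NPZ response,
# so downstream consumers can read this file interchangeably.
np.savez("motion_clip.npz",
         poses=motion["poses"], trans=motion["trans"], betas=motion["betas"])
```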

---

## License

Apache 2.0

> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image.