---
license: apache-2.0
tags:
- motion-generation
- text-to-motion
- human-motion
- surveillance
- synthetic-data
- docker
- rest-api
- kimodo
- nvidia
pipeline_tag: text-to-video
---

# kimodo-api 🏃

A **REST API wrapper** around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), the state-of-the-art text-to-motion diffusion model trained on 700 hours of commercial mocap data.

This image turns Kimodo into a microservice you can call from any pipeline; no local Python environment is needed.

## Quick Start

```bash
docker pull ghcr.io/eyalenav/kimodo-api:latest

docker run --rm --gpus '"device=0"' -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```

> ⚠️ First run downloads Llama-3-8B-Instruct (~16 GB) for the text encoder. Requires a Hugging Face token with access to `meta-llama/Meta-Llama-3-8B-Instruct`.

## API

### `POST /generate`

Generate a motion clip from a text prompt.

```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person pushing through a crowd aggressively"}'
```

**Response:** a binary NPZ file in the SOMA 77-joint skeleton format, compatible with BVH export.

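For scripted use, the same call works from Python. The sketch below is illustrative only: it assumes the response body is the raw NPZ bytes (as in the `curl` example above) and makes no assumptions about the array keys inside the archive, which depend on the Kimodo export format; the filename `motion.npz` is arbitrary.

```python
# Minimal client sketch: POST a prompt, save the binary NPZ response, and peek inside.
# Assumes the server from the Quick Start is listening on localhost:9551.
import io

import numpy as np
import requests

resp = requests.post(
    "http://localhost:9551/generate",
    json={"prompt": "person pushing through a crowd aggressively"},
    timeout=600,  # diffusion sampling can take a while on a single GPU
)
resp.raise_for_status()

# Persist the clip for downstream tools (e.g. BVH export or render-api).
with open("motion.npz", "wb") as f:
    f.write(resp.content)

# Inspect the archive without assuming specific key names.
with np.load(io.BytesIO(resp.content)) as archive:
    for key in archive.files:
        print(key, archive[key].shape, archive[key].dtype)
```
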
### `GET /health`

```bash
curl http://localhost:9551/health
# {"status": "ok"}
```

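Because the first start has to download the Llama-3 text encoder, the service may take a while before it answers. A small helper like the one below (hypothetical, not shipped with the image) can block until `/health` reports `ok` before you start sending `/generate` requests.

```python
# Poll /health until the service is ready or a deadline passes.
import time

import requests

def wait_until_healthy(base_url: str = "http://localhost:9551", timeout_s: float = 1800) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok and r.json().get("status") == "ok":
                return
        except requests.RequestException:
            pass  # container still starting up or downloading weights
        time.sleep(10)
    raise TimeoutError("kimodo-api did not become healthy in time")

wait_until_healthy()
```
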
## Requirements

| Resource | Minimum |
|---|---|
| GPU | RTX 3090 / A100 / RTX 6000 Ada |
| VRAM | 24 GB |
| RAM | 32 GB |
| Disk | 50 GB (model weights) |

## What's inside

- **Kimodo**: NVIDIA's kinematic motion diffusion model (77-joint SOMA skeleton)
- **LLM2Vec** text encoder backed by **Llama-3-8B-Instruct**
- **FastAPI** server on port 9551 (a rough sketch of the endpoint shape follows this list)
- Health check + graceful startup
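
For orientation, here is a rough sketch of the shape of such a FastAPI wrapper. This is not the code shipped in the image: `run_kimodo` is a placeholder standing in for the actual Kimodo sampling code, and only the two endpoints documented above are mirrored.

```python
# Illustrative sketch only; the real server code ships inside the image.
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

def run_kimodo(prompt: str) -> bytes:
    """Placeholder for the actual Kimodo diffusion call; would return NPZ bytes."""
    raise NotImplementedError

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/generate")
def generate(req: GenerateRequest):
    npz_bytes = run_kimodo(req.prompt)
    return Response(content=npz_bytes, media_type="application/octet-stream")
```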

## Part of VisionAI-Flywheel

This service is one component of a full synthetic surveillance data pipeline:

```
[kimodo-api] → NPZ motion
      ↓
[render-api] → SOMA mesh render (MP4)
      ↓
[cosmos-transfer] → Sim2Real photorealistic video
      ↓
[NVIDIA VSS] → VLM annotation → fine-tuning dataset
```

🔗 Full pipeline: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)

## License

Apache 2.0; see [LICENSE](https://github.com/EyalEnav/VisionAI-Flywheel/blob/main/LICENSE).

> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) and downloaded at runtime. They are not bundled in this image.