Scaling Instructable Agents Across Many Simulated Worlds
Paper • 2404.10179
an encoder-decoder model that compresses videos into discrete embeddings (tokens), and a transformer model that translates text embeddings into video tokens.
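
As a rough illustration of the first component described above, the following sketch maps continuous embeddings to discrete tokens by nearest-neighbor lookup in a codebook, then maps the tokens back. It assumes a video encoder has already produced one embedding per frame or patch, and it assumes the discretization is a vector-quantization step, which the text does not confirm. The names `quantize`, `dequantize`, `codebook`, and `frames`, and all numeric values, are hypothetical. The text-to-video-token transformer is not shown.

```python
def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codebook vector."""
    tokens = []
    for vec in embeddings:
        # Squared Euclidean distance from this embedding to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token (the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 4-entry codebook of 2-D vectors and 4 encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]]
frames = [[0.9, 1.2], [0.1, -0.1], [-0.8, 0.7], [0.2, -1.1]]

tokens = quantize(frames, codebook)
recon = dequantize(tokens, codebook)
print(tokens)  # [1, 0, 2, 3]
```

In a setup like this, a second model would only need to predict the integer sequence `tokens` from a text embedding, and the decoder would turn the predicted tokens back into video.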